Dynamic Malware Analysis Based on API Sequence Semantic Fusion
Figure 1. Overall architecture of the Mal-ASSF framework.
Figure 2. Schematic diagram of the API2Vec training process.
Figure 3. Schematic diagram of API implicit semantic feature mapping.
Figure 4. Schematic diagram of the BiLSTM model.
Figure 5. Schematic diagram of the calculation process of the attention mechanism.
Figure 6. Multiclass confusion matrix of the Alibaba Cloud dataset.
Figure 7. Accuracy with different sequence lengths.
Figure 8. Comparison of the PR and ROC curves of different deep learning models.
Abstract
1. Introduction
- API2Vec is a class of neural network models that maps each unique API to a vector in a continuous space in which the linguistic context of the API is preserved. We use it to obtain a reduced-dimensionality representation of the API call sequences. To model the correlation between APIs, we use a bidirectional long short-term memory network (BiLSTM) to capture the behavioral features of segments of different lengths.
- Each API function is represented by an (operation, type) pair, which corresponds to the verb and the object in the function name. To fully exploit the implicit semantic information of API functions, we build the core model on TextRNN and the self-attention mechanism: the sequence features and the semantic features of the APIs are fused, and the model adaptively focuses on suspected malicious segments of the sequences.
- To evaluate the effectiveness of Mal-ASSF, we conduct systematic experiments on a large dataset of malware families. We build a confusion matrix to analyze the performance of Mal-ASSF under different classification tasks. We compare different sequence lengths, with and without deduplication of adjacent API calls. We compare Mal-ASSF with machine learning and other deep learning methods for classifying malicious code. We conduct ablation experiments to verify the effectiveness of each module in Mal-ASSF. The experimental results show that Mal-ASSF achieves higher detection accuracy than related work, especially for malware family classification and newer malware samples.
2. Related Work
3. Methodology
3.1. Overview
3.2. Detailed Architecture
3.2.1. Data Preparation
3.2.2. Vectorized Representation Based on API2Vec
- We first encode each API with one-hot encoding to obtain a high-dimensional sparse feature vector.
- We then train a shallow neural network to obtain the hidden-layer weights of each API. Multiplying the one-hot input by the hidden-layer weights yields the output vector, which is the word vector of the API and uniquely identifies it. We use the Skip-Gram model for weight training because the association between APIs is relatively weak. For example, with 295 distinct APIs, one-hot encoding yields 295-dimensional vectors, whereas API2Vec can represent each API with a 32-dimensional vector.
- Finally, the fixed-dimension word vector of each API name is computed and stored in a dictionary.
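The three steps above can be sketched in plain Python. This is a toy illustration rather than the authors' implementation: the four-API vocabulary, the window size, and the randomly initialised weight matrix `W` are assumptions, and a real Skip-Gram model would learn `W` from the (center, context) pairs.

```python
import random

# Toy vocabulary (assumption); the text mentions 295 distinct APIs.
vocab = ["SetErrorMode", "LdrGetDllHandle", "LdrGetProcedureAddress", "GetSystemDirectoryA"]
api_to_id = {api: i for i, api in enumerate(vocab)}

def one_hot(api):
    """Step 1: sparse encoding whose length equals the vocabulary size."""
    vec = [0.0] * len(vocab)
    vec[api_to_id[api]] = 1.0
    return vec

def skipgram_pairs(sequence, window=1):
    """(center, context) pairs on which a Skip-Gram model would be trained."""
    pairs = []
    for i, center in enumerate(sequence):
        for j in range(max(0, i - window), min(len(sequence), i + window + 1)):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

# Step 2: hidden-layer weights W (vocab_size x dim). They are random here;
# training would adjust them. dim = 32 follows the example in the text.
dim = 32
rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def embed(api):
    """One-hot vector times W, which reduces to selecting one row of W."""
    x = one_hot(api)
    return [sum(x[i] * W[i][d] for i in range(len(vocab))) for d in range(dim)]

# Step 3: store the fixed-dimension vector of every API in a dictionary.
api2vec = {api: embed(api) for api in vocab}
```

Because the input is one-hot, the matrix product in `embed` is equivalent to a row lookup, which is why the trained weights can be stored directly as a dictionary.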
3.2.3. API Implicit Semantic Sequence Feature Extraction
3.2.4. Sequence Feature Extraction Based on BiLSTM
3.2.5. Key API Sequence Recognition Based on the Attention Mechanism
- (1) The dot product is used to calculate the correlation between the two sides, as shown in Equation (4).
- (2) The Softmax function is introduced to numerically transform the scores calculated in the first stage. As shown in Equation (5), the scores are normalized so that the weights of important elements are highlighted.
- (3) The result of the second stage is the weight coefficient of the corresponding value. The attention value of each element is then obtained by weighted summation, as shown in Equation (6).
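The three stages can be illustrated with a minimal dot-product attention sketch. The names `query`, `keys`, and `values`, and the toy vectors, are assumptions for illustration; the code does not reproduce the exact notation of Equations (4) to (6).

```python
import math

def attention(query, keys, values):
    """Dot-product attention for one query over a list of key/value vectors."""
    # Stage 1: similarity score between the query and each key (dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: Softmax normalisation; subtracting the max keeps exp() stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Stage 3: weighted sum of the value vectors gives the attention value.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)
```

In this toy input the first and third keys have the same dot product with the query, so they receive equal weights, and the second key receives a smaller one.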
4. Experiment
- A confusion matrix is built to analyze the performance of Mal-ASSF under different classification tasks.
- The sequence length and whether adjacent duplicate API calls are removed may affect the performance of Mal-ASSF. Therefore, this paper compares different sequence lengths, with and without deduplication.
- Mal-ASSF is compared with other methods for classifying malicious code based on API sequences, including deep learning models such as TextCNN, CNN-LSTM, and TextCNN-LSTM, as well as machine learning methods such as K-nearest neighbors (KNN), support vector machine (SVM), decision tree (DT), random forest (RF), and gradient boosting (XGB). For the comparison methods, we run publicly available implementations on the same dataset and evaluate the experimental results.
- Ablation experiments are performed to verify the effectiveness of each module in Mal-ASSF. The performance of Mal-ASSF is compared with that of a model using only one-hot encoding for word embedding, a model using one-way LSTM, and a model without an attention mechanism.
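The two preprocessing choices examined in the second experiment, deduplicating adjacent API calls and fixing the sequence length, might look like the following sketch. The padding token `<PAD>` and the target length are assumptions; the source does not specify how short sequences are handled.

```python
from itertools import groupby

def dedup_adjacent(seq):
    """Collapse each run of identical consecutive API calls into one call."""
    return [api for api, _ in groupby(seq)]

def fix_length(seq, length, pad="<PAD>"):
    """Truncate or right-pad a sequence to exactly `length` items."""
    return seq[:length] + [pad] * max(0, length - len(seq))

calls = [
    "LdrGetDllHandle",
    "LdrGetProcedureAddress",
    "LdrGetProcedureAddress",
    "LdrGetProcedureAddress",
    "LdrGetDllHandle",
]
deduped = dedup_adjacent(calls)
fixed = fix_length(deduped, 5)
```

Only consecutive repeats are merged, so the second `LdrGetDllHandle` survives and the order of distinct behaviors is preserved.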
4.1. Dataset
4.2. Results and Analysis
4.2.1. Confusion Matrix
4.2.2. Effect of the API Sequence Length on the Performance of the Mal-ASSF Model
4.2.3. Comparison with API Frequency-Based Machine Learning Models
4.2.4. Comparison with Other Deep Learning Models
4.2.5. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Operation Set |
---|
‘allocate’, ‘accept’, ‘bind’, ‘close’, ‘compress’, ‘connect’, ‘control’, ‘copy’, ‘crack’, ‘decode’, ‘decrypt’, ‘delete’, ‘download’, ‘draw’, ‘encode’, ‘exec’, ‘exit’, ‘export’, ‘free’, ‘find’, ‘get’, ‘hash’, ‘initialize’, ‘listen’, ‘ls’, ‘load’, ‘lookup’, ‘make’, ‘map’, ‘move’, ‘open’, ‘put’, ‘query’, ‘read’, ‘recv’, ‘register’, ‘remove’, ‘save’, ‘search’, ‘send’, ‘select’, ‘set’, ‘shutdown’, ‘socket’, ‘start’, ‘suspend’, ‘unhook’, ‘unload’, ‘write’ |
Type Set |
---|
‘system’, ‘network’, ‘process’, ‘file’, ‘registry’, ‘service’, ‘ui’, ‘crypto’, ‘ole’, ‘exception’, ‘none’, ‘certificate’, ‘misc’, ‘netapi’, ‘resource’, ‘iexplore’ |
API | Operation | Type |
---|---|---|
RegOpenKeyExW | open | registry |
RegQueryValueExW | query | registry |
NtCreateThreadEx | create | process |
GetSystemInfo | get | system |
DeleteFileW | delete | file |
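One possible way to derive the (operation, type) pairs shown in the table is sketched below. The CamelCase tokenisation and the `TYPE_HINTS` keyword list are hypothetical; the source does not describe how the mapping is implemented. `OPERATIONS` is a small subset of the operation set, plus ‘create’, which appears in the example table.

```python
import re

# Subset of the operation vocabulary, for illustration only.
OPERATIONS = {"open", "query", "create", "get", "delete", "read", "write", "set", "load", "close"}

# Hypothetical keyword-to-type hints; not taken from the source.
TYPE_HINTS = [("Reg", "registry"), ("File", "file"), ("Thread", "process"), ("System", "system")]

def split_camel(name):
    """Split an API name such as 'RegOpenKeyExW' into ['Reg', 'Open', 'Key', 'Ex', 'W']."""
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])", name)

def api_to_pair(name):
    """Return (operation, type); operation is None and type is 'none' when nothing matches."""
    tokens = split_camel(name)
    operation = next((t.lower() for t in tokens if t.lower() in OPERATIONS), None)
    api_type = next((typ for key, typ in TYPE_HINTS if key in tokens), "none")
    return operation, api_type

pairs = {api: api_to_pair(api) for api in
         ["RegOpenKeyExW", "RegQueryValueExW", "NtCreateThreadEx", "GetSystemInfo", "DeleteFileW"]}
```

With these assumed hints, the five APIs in the table map to the same pairs as listed; a production mapping would need a curated lookup for names that do not follow the verb-object pattern.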
Label | Category Attribution | Quantity |
---|---|---|
0 | Normal software | 4978 |
1 | Ransomware | 502 |
2 | Mining program | 1196 |
3 | DDoS attack | 820 |
4 | Worm virus | 100 |
5 | Infectious virus | 4279 |
6 | Backdoor | 515 |
File_id | Label | API | Tid | Index |
---|---|---|---|---|
5 | 0 | SetErrorMode | 2500 | 0 |
5 | 0 | LdrGetDllHandle | 2500 | 1 |
5 | 0 | LdrGetProcedureAddress | 2500 | 2 |
5 | 0 | GetSystemDirectoryA | 2500 | 3 |
…… | …… | …… | …… | …… |
5 | 0 | SetErrorMode | 2596 | 0 |
5 | 0 | LdrGetDllHandle | 2596 | 1 |
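Records in this layout have to be assembled into one API sequence per file before they can be fed to a model. The sketch below groups rows by `file_id` and orders the calls by thread ID and then by index. Concatenating threads in ascending `tid` order is an assumption, not something the source states.

```python
from collections import defaultdict

# Rows follow the column order of the sample table: file_id, label, api, tid, index.
rows = [
    (5, 0, "LdrGetDllHandle", 2500, 1),
    (5, 0, "SetErrorMode", 2500, 0),
    (5, 0, "SetErrorMode", 2596, 0),
    (5, 0, "LdrGetProcedureAddress", 2500, 2),
    (5, 0, "LdrGetDllHandle", 2596, 1),
]

def build_sequences(rows):
    """Return {file_id: (label, [api, ...])} with calls ordered by tid, then index."""
    per_file = defaultdict(list)
    labels = {}
    for file_id, label, api, tid, index in rows:
        per_file[file_id].append((tid, index, api))
        labels[file_id] = label
    return {
        fid: (labels[fid], [api for _, _, api in sorted(calls)])
        for fid, calls in per_file.items()
    }

sequences = build_sequences(rows)
```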
Model | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
KNN | 86.46% | 87.85% | 85.76% | 0.8681 |
SVM | 88.46% | 89.85% | 85.76% | 0.8775 |
DT | 89.78% | 89.17% | 89.13% | 0.8914 |
RF | 90.45% | 88.95% | 91.54% | 0.9023 |
XGB | 91.84% | 91.95% | 91.54% | 0.9177 |
Mal-ASSF | 94.49% | 94.01% | 94.19% | 0.9402 |
Model Composition | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
TextCNN | 88.69% | 88.59% | 88.69% | 0.8864 |
CNN-LSTM | 90.23% | 89.64% | 90.23% | 0.8993 |
TextCNN-LSTM | 92.13% | 92.17% | 92.13% | 0.9215 |
Mal-ASSF | 94.49% | 94.01% | 94.19% | 0.9402 |
Ablation Part | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|
one-hot | 87.22% | 87.63% | 87.22% | 0.8721 |
no-semantic | 91.12% | 91.21% | 91.12% | 0.9116 |
no-BiLSTM | 92.75% | 92.55% | 92.75% | 0.9265 |
no-attention | 92.13% | 92.17% | 92.13% | 0.9215 |
Mal-ASSF | 94.49% | 94.01% | 94.19% | 0.9402 |
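The accuracy, precision, recall, and F1 columns in the tables above can all be derived from a multiclass confusion matrix. The sketch below uses macro averaging over classes. The averaging scheme is an assumption, since the source does not state which one is used, and the matrix values are made up for illustration.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)

    def macro(xs):
        return sum(xs) / n

    return accuracy, macro(precisions), macro(recalls), macro(f1s)

# Made-up 3-class counts, for illustration only.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 0, 10]]
accuracy, precision, recall, f1 = metrics_from_confusion(cm)
```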
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, S.; Wu, J.; Zhang, M.; Yang, W. Dynamic Malware Analysis Based on API Sequence Semantic Fusion. Appl. Sci. 2023, 13, 6526. https://doi.org/10.3390/app13116526