Abstract
Machine learning is one of the main approaches to malware detection in the literature, since machine learning models are more adaptive than signature-based solutions. One of the main challenges in applying machine learning to malware detection is concept drift, a change in the data distribution over time. To tackle drift, online models are applied that can be updated dynamically, either passively or by actively detecting change. However, these models require new instances to be labelled before they can be used for updates. Labels are usually scarce and cannot be obtained immediately, and class imbalance in the data makes it difficult to build an effective model. Previous studies indicate that concept drift has a lower impact on benign instances, so we test the effectiveness of anomaly detection models for detecting malware in the presence of concept drift. Anomaly detection models need only benign instances for training and may therefore be less affected by the scarcity of labelled malicious instances. The results show that anomaly detection models achieve better results than supervised online models under heavy data imbalance and label scarcity.
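The idea of training on benign instances only can be illustrated with a minimal sketch. It is not one of the models evaluated in the chapter: it swaps in a simple per-feature z-score detector, and the feature vectors, the function names, and the 0.99 quantile are all hypothetical choices made for illustration. The detector builds a profile from benign samples alone and flags anything that deviates too far from it, so no labelled malware is needed at training time.

```python
import random
import statistics


def fit_benign_profile(benign):
    """Estimate per-feature mean and standard deviation from benign samples only."""
    columns = list(zip(*benign))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def anomaly_score(sample, profile):
    """Largest absolute z-score of the sample across all features."""
    means, stdevs = profile
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stdevs))


def choose_threshold(benign, profile, quantile=0.99):
    """Score at the given quantile of the benign training scores (assumed value)."""
    scores = sorted(anomaly_score(s, profile) for s in benign)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def is_malware(sample, profile, threshold):
    """Flag a sample as anomalous (treated here as malware) above the threshold."""
    return anomaly_score(sample, profile) > threshold


# Synthetic, hypothetical data: benign features near 0, malicious features near 6.
rng = random.Random(0)
benign_train = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]
benign_test = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
malware_test = [[rng.gauss(6.0, 1.0) for _ in range(4)] for _ in range(20)]

# Training uses benign samples only; malware appears only at test time.
profile = fit_benign_profile(benign_train)
threshold = choose_threshold(benign_train, profile)

false_alarms = sum(is_malware(s, profile, threshold) for s in benign_test)
detections = sum(is_malware(s, profile, threshold) for s in malware_test)
print(f"false alarms: {false_alarms} of {len(benign_test)}")
print(f"detections: {detections} of {len(malware_test)}")
```

In this toy setting the two classes are well separated, which real malware features generally are not. The sketch only shows why label scarcity for the malicious class does not block training, not how well such a detector would perform under drift.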
Supported by the Spanish National Cybersecurity Institute (INCIBE).
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Escudero García, D., DeCastro-García, N. (2023). Application of Anomaly Detection Models to Malware Detection in the Presence of Concept Drift. In: García Bringas, P., et al. Hybrid Artificial Intelligent Systems. HAIS 2023. Lecture Notes in Computer Science, vol 14001. Springer, Cham. https://doi.org/10.1007/978-3-031-40725-3_2
DOI: https://doi.org/10.1007/978-3-031-40725-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40724-6
Online ISBN: 978-3-031-40725-3
eBook Packages: Computer Science, Computer Science (R0)