CN109753801B

CN109753801B - Intelligent terminal malicious software dynamic detection method based on system call

Info

Publication number: CN109753801B
Application number: CN201910087194.6A
Authority: CN
Inventors: 景小荣; 王宁; 陈怡西; 王丹
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Beijing Zhongfangxin Technology Co ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2022-04-22
Anticipated expiration: 2039-01-29
Also published as: CN109753801A

Abstract

The invention relates to a system-call-based method for dynamically detecting malware on smart terminals. System call features of a sample under test are extracted while the software is automatically installed and run. The log of dynamic behavior is then preprocessed by data dimensionality reduction, vectorization, and redundancy removal, and classified by machine learning with a support vector machine (SVM) model. Before learning, a Markov matrix of the system call sequence is constructed and the result is converted into a format the SVM can read. Finally, a mixed training set of benign and malicious software is used as the SVM input for validation training, yielding a trained classification model for detection. By using the Markov matrix to reconstruct the system call sequence, the invention departs from traditional security detection approaches: it detects malicious code automatically and greatly reduces the dimensionality of the training data, while remaining simple to implement and widely applicable.

Description

Dynamic detection method of intelligent terminal malware based on system call

Technical Field

The invention relates to the technical field of information security processing, and in particular to a system-call-based method for dynamically detecting malware on smart terminals.

Background

With the rapid growth and continuous innovation of the global mobile Internet, new smart terminals that combine sensors, processing chips, and software systems with data collection, processing, and communication capabilities keep emerging and are gradually entering consumers' daily lives. A smart terminal is a product that can connect to the Internet; such devices usually run an operating system and can be customized with functions to suit different user needs. The development of new smart terminals in many fields has driven changes ranging from application services to industrial ecosystems and business models. As their functions have diversified, their numbers have also grown rapidly. The Internet Data Center (IDC) predicted that 28 billion devices would be connected to the Internet of Things by 2020. According to the "2018 Global Mobile Phone Tracking Quarterly Report", Android phones accounted for about 85% of global sales in 2018, so the market share of Android smart terminals is far ahead of that of other smart terminals.

The high market share of Android smart terminals stems mainly from the following factors. First, openness: developers can customize the system to users' needs, and users can write the applications they want after teaching themselves a programming language. This is the most important factor. Second, high hardware compatibility and strong cross-platform support: Android manufacturers keep launching a wide variety of Android products to attract customers, such as smart bracelets, tablets, televisions, and even smart home devices and car accessories, and upper-layer applications can be ported directly to these devices, continually enriching the user experience. Third, cost-effectiveness: at the same price, Android smart terminals offer a clear value advantage over terminals running other operating systems. Fourth, a rich supply of applications: applications are an important part of an operating system's development, the barrier to Android application development is low, and the wide variety of application software is one reason Android smart terminals are so popular.

However, quality problems with smart terminal products are reported frequently. Malicious programs take the form of executable files, program modules, or program fragments; they are installed and run without the user's knowledge or authorization and perform actions that violate national laws and regulations in order to steal private data or money, which greatly undermines consumer confidence in smart terminal products. According to surveys, various software and hardware quality problems lead to a high rate of user loss, which seriously restricts the development of the industry. In July 2017, the security company Palo Alto Networks reported SpyDealer, a new malicious Android application prevalent in China and targeting Chinese users. It spreads through wireless networks controlled by attackers, and a user is infected upon connecting to such a network. The application can steal sensitive user information from more than 40 popular applications such as WeChat and Weibo, causing serious privacy losses. In 2017, ransomware named WannaCry broke out worldwide. According to statistics, 360 Fenghuo Lab captured more than 500,000 mobile ransomware samples within just nine months, and monitoring data from 360 show nearly 5 million ransomware attacks in China over the past year, with economic losses conservatively estimated at more than 300 million yuan. According to the "Mobile Security Report for the First Half of 2018" released by Tencent Mobile Manager and Tencent Security Joint Lab, 4.687 million new Android virus packages appeared in the first half of 2018, including nearly 27,000 new payment-related virus packages, and nearly 61.0684 million users were infected. Although the growth of new Trojans has slowed, the threat remains significant because of the large existing base.

In recent years, techniques for detecting malware on smart terminals have continued to develop. For example, the patent application "A method and device for detecting malicious applications on the Android platform" calls the FlowDroid tool to extract the static data flow features of the Android application under test, processes those features with SUSI to generate a feature vector of the application's data flows, and feeds the vector into a pre-trained deep belief network detection model to determine whether the application is malicious. The approach can accurately detect malicious applications on the Android platform, avoids the path coverage problem of dynamic taint tracking, and overcomes the need in static data flow analysis to model the application's execution flow accurately and to identify the target components of inter-component communication accurately. However, its computational complexity is high, and it must process a large amount of useless and redundant information.

A patent filed by Guangdong University of Technology on March 16, 2018, titled "A deep-learning-based Android platform malware detection method", decompiles the application's APK to obtain the corresponding bytecode file, extracts instruction sequences from the bytecode file, represents each instruction as a vector, and derives a time series of the instruction sequence. That time series is the input to a recurrent neural network whose output is a one-hot vector. Training the recurrent neural network on a large number of input-output pairs yields a malware recognizer, which is then used to detect and identify malware. The neural network can be trained continuously, so a recognition model is obtained more quickly, and after training on a large number of samples the recognizer achieves high detection accuracy and speed. However, the method relies on static analysis and is not suited to a dynamic system runtime environment.

Summary of the Invention

In view of the above problems in the prior art, the technical problem to be solved by the invention is to learn and detect Android applications with a support vector machine (SVM) algorithm, thereby reducing the complexity of malicious application detection, saving system resources, and, in handling high-dimensional features and automated classification, further improving malware detection accuracy.

To solve the above technical problem, the invention proposes a system-call-based method for dynamically detecting malware on smart terminals. In feature extraction, a sandbox test environment for dynamic information is set up, the sample under test is installed and run, its dynamic behavior is simulated and triggered automatically, a detection tool records a log of that behavior, and the sample information is extracted. In feature selection, the dynamic behavior log is preprocessed by data dimensionality reduction, vectorization, and redundancy removal. In machine learning classification, a Markov matrix of the system call sequence is constructed, the result is converted into a format the SVM can read, and a mixed training set of benign and malicious software is used as the SVM input for validation training, yielding a trained classification model. Finally, in sample detection, the trained model is applied to a mixed set of malicious and benign samples to obtain the detection result.

Specifically, a system-call-based method for dynamically detecting malware on smart terminals comprises the following steps. A feature extraction module uses a detection tool to record a log of dynamic behavior, extracts the dynamic behavior log information of the sample under test, and outputs the system call sequence in chronological order. A feature selection module preprocesses the dynamic behavior log information by data dimensionality reduction, vectorization, and redundancy removal; the dimensionality reduction step assigns each system call a quantitative score according to its importance and processes the system call sequence in descending order of score. A machine learning module constructs a Markov matrix of the system call sequence, converts the result into a format the support vector machine (SVM) can read, and uses a mixed training set of benign and malicious software samples as the SVM input for validation training to obtain a trained classification model. The trained classification model is then applied to a mixed set of malicious and benign software samples, and the detection result is output.

The invention further provides that the feature extraction module obtains super administrator (root) permission and enables super administrator mode, obtains the process ID (PID) of the software sample under test, extracts the dynamic features of its system calls, simulates user operation of the sample software, and kills the process after each sample has run for a predetermined time. The sample log file is saved in text (TXT) format, garbled content is removed, and the timestamps and parameters of the system calls are stripped, giving a system call sequence in chronological order.
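The log clean-up described above can be sketched in Python as follows. This is a minimal illustration, not the patent's script: it assumes the log was produced by strace with optional `-f` PID prefixes and `-t`/`-tt` wall-clock timestamps, and the function name `extract_syscall_sequence` is hypothetical. Lines that do not look like a system call (signal notices, resumed calls, garbled bytes) are simply dropped.

```python
import re

# Assumed line shape: optional "[pid N]" or bare PID, optional HH:MM:SS[.usec]
# timestamp, then the system call name followed by "(".
_SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"          # PID prefix (strace -f or -o)
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"    # timestamp (strace -t or -tt)
    r"([a-z_][a-z0-9_]*)\("                   # system call name
)

def extract_syscall_sequence(log_text):
    """Return the system call names in log order, without time or arguments."""
    sequence = []
    for line in log_text.splitlines():
        match = _SYSCALL_RE.match(line.strip())
        if match:  # non-matching lines (signals, garbled text) are skipped
            sequence.append(match.group(1))
    return sequence

example_log = "\n".join([
    '12:00:01.000001 openat(AT_FDCWD, "/data/a", O_RDONLY) = 3',
    '[pid  4321] 12:00:01.000002 read(3, "abc", 3) = 3',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '4321 close(3) = 0',
    '<... futex resumed> ) = 0',
])
print(extract_syscall_sequence(example_log))
```

Logs written with other strace options (for example relative timestamps) would need a different pattern.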

In data dimensionality reduction, the information gain (IG) method is used to score the system calls quantitatively. To choose the threshold, varying numbers of system calls are selected in descending order of score and vectorized. The unselected calls are not processed individually; instead they are grouped by attribute so that the "useless" commands are merged. A reasonable threshold is then chosen by experiment.

Data dimensionality reduction specifically comprises computing the information gain IG(T) from the entropy of the set to be classified and the conditional entropy of the system call, using the formula IG(T) = H(C) - H(C|T). A larger IG(T) means the system call is more important to the classification result. The system calls are vectorized according to their IG(T) values: the system calls of the sample software are divided into an "important" part and a "useless" part, the "important" part is converted directly into feature vectors, and the "useless" part is merged by attribute, giving a vectorized call sequence.

Removing redundant information from the software specifically comprises applying N-gram (language model) processing to the vectorized call sequences in the logs and scoring the N-gram subsequences quantitatively with the TF-IDF algorithm. First, the frequency with which subsequence $t_j$ occurs in dynamic behavior log $d_i$ is computed as

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}$$

where $n_{i,j}$ is the number of times subsequence $t_j$ occurs in dynamic behavior log $d_i$ and the denominator is the total number of subsequence occurrences in $d_i$. Second, the inverse document frequency $IDF_j$ of subsequence $t_j$ is computed as

$$IDF_j = \log\frac{|D|}{1 + |\{i : t_j \in d_i\}|}$$

where $|D|$ is the total number of dynamic behavior logs in the sample library, $|\{i : t_j \in d_i\}|$ is the number of logs in which subsequence $t_j$ occurs, $t_j \in d_i$ means that $t_j$ occurs in log $d_i$, and the $+1$ prevents the denominator from becoming 0. After the redundant information is removed and the "important data" is retained, the subsequences are merged to obtain the de-redundant system call sequence.
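The N-gram and TF-IDF scoring above can be illustrated with a short Python sketch. It assumes that the score of a subsequence is the product TF × IDF and that the logarithm is the natural logarithm; the source states neither explicitly, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(sequence, n=3):
    """Slide a window of length n one step at a time: len(sequence) - n + 1 items."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def tf_idf_scores(logs, n=3):
    """Return, for each log, a dict mapping each n-gram to its TF-IDF score."""
    per_log_counts = [Counter(ngrams(seq, n)) for seq in logs]
    # Document frequency: number of logs in which each n-gram occurs.
    doc_freq = Counter()
    for counts in per_log_counts:
        doc_freq.update(counts.keys())
    total_logs = len(logs)
    scores = []
    for counts in per_log_counts:
        total = sum(counts.values())  # all n-gram occurrences in this log
        scores.append({
            gram: (count / total) * math.log(total_logs / (1 + doc_freq[gram]))
            for gram, count in counts.items()
        })
    return scores

logs = [["a", "b", "c", "d"], ["a", "b", "c"], ["x", "y", "z"]]
scores = tf_idf_scores(logs)
```

With the +1 in the denominator, a subsequence that occurs in every log receives a slightly negative IDF, so such ubiquitous subsequences rank lowest and are natural candidates for removal.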

The invention also proposes a system-call-based system for dynamically detecting malware on smart terminals, comprising a feature extraction module, a feature selection module, and a machine learning classification module. The feature extraction module uses a detection tool to record a log of dynamic behavior, extracts the dynamic behavior log information of the sample under test, and outputs the system call sequence in chronological order. The feature selection module preprocesses the dynamic behavior log information by data dimensionality reduction, vectorization, and redundancy removal; the dimensionality reduction step assigns each system call a quantitative score according to its importance and processes the system call sequence in descending order of score. The machine learning classification module constructs a Markov matrix of the system call sequence, converts the result into a format the support vector machine (SVM) can read, and uses a mixed training set of benign and malicious software samples as the SVM input for validation training to obtain a trained classification model. The trained classification model is then applied to a mixed set of malicious and benign software samples, and the detection result is output.

The invention extracts the system call features of the sample under test while the software is automatically installed and run, preprocesses the dynamic behavior log by data dimensionality reduction, vectorization, and redundancy removal, and performs machine learning classification with a support vector machine model. Before learning, a Markov matrix of the system call sequence is constructed and the result is converted into a format the SVM can read; a mixed training set of benign and malicious software is then used as the SVM input for validation training, yielding a trained classification model for detection. By using the Markov matrix to reconstruct the system call sequence, the invention departs from traditional security detection approaches: it detects malicious code automatically and greatly reduces the dimensionality of the training data, while remaining simple to implement and widely applicable.
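As a rough illustration of the Markov matrix step, the Python sketch below builds a matrix from a system call sequence and flattens it into one line of the sparse text format used by LIBSVM (`label index:value ...`, with 1-based indices). This section does not define the matrix entries, so the sketch assumes a first-order transition-probability matrix with row normalization; the function names are hypothetical, and every token in the sequence is assumed to appear in `states`.

```python
def markov_matrix(sequence, states):
    """Row-normalized first-order transition matrix over the given states."""
    index = {state: i for i, state in enumerate(states)}
    size = len(states)
    counts = [[0] * size for _ in range(size)]
    # Count each transition from one call to the call that follows it.
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def to_libsvm_line(label, matrix):
    """Flatten the matrix row by row and keep only the non-zero entries."""
    flat = [value for row in matrix for value in row]
    features = " ".join(
        f"{i}:{value:.6g}" for i, value in enumerate(flat, start=1) if value != 0.0
    )
    return f"{label} {features}".rstrip()

states = ["open", "read", "close"]
matrix = markov_matrix(["open", "read", "read", "close"], states)
line = to_libsvm_line(1, matrix)
```

Because the matrix size depends only on the number of retained system calls, every sample maps to a feature vector of the same length regardless of how long its trace is.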

Description of the Drawings

Figure 1 is a schematic diagram of the malware dynamic detection model;

Figure 2 is a flowchart of the malware feature extraction module;

Figure 3 is a schematic diagram of the implementation of the N-gram algorithm provided by the invention;

Figure 4 is a schematic diagram of the principle of the SVM algorithm provided by the invention.

Detailed Description

The implementation of the invention is described in detail below with reference to the accompanying drawings.

Figure 1 is a schematic diagram of the detection system model of the invention. To detect malicious applications on the Android system, an Android malicious application detection system is proposed on the basis of the system-call-based dynamic detection method. The model comprises a feature extraction module, a feature processing module, and an SVM classification algorithm module.

The feature extraction module runs batch simulated operation on the collected set of benign and malicious software samples and records a behavior log for each program while it runs. Garbled content is removed from each sample's log file, the log is saved in text (TXT) format, and the timestamps and parameters of the system calls are stripped, giving a system call sequence in chronological order. The feature processing module processes the sample features by dimensionality reduction, vectorization, and redundancy removal. The SVM classification algorithm module trains and classifies with the SVM algorithm to detect malicious Android applications.

Figure 2 is a flowchart of the malware feature extraction module. Super administrator (root) permission is first obtained on the Android emulator or Android device. The root user is the only super administrator in the system and has permissions equivalent to those of the operating system itself. Root Genius (Root精灵) or another third-party tool can be used for one-click rooting.

Download the super administrator software (Android Debug Bridge Device, ADBD), install it on the phone, and enable super administrator mode. Otherwise, when the user-space system call tracer (strace) is started, the device issues a warning that permission has not been granted (PermissionNotAllowed).

The invention extracts features with Python scripts in the following steps. First, obtain super administrator (root) permission on the emulator or Android device, download and install the super administrator software, and enable super administrator mode. Second, obtain the process ID (PID) of the sample under test and use the strace command to extract the dynamic features of its system calls. Third, use the Monkey tool to simulate user operation of the sample software, let the program run automatically for a period of time, and then kill the process; the running time must be the same for every sample under test. Finally, save the sample's log file in text (TXT) format, remove garbled content, and strip the timestamps and parameters of the system calls, giving a system call sequence in chronological order.

The process ID (PID) of the sample under test can be obtained as follows: decompile the application package, read the decompiled Android manifest file (AndroidManifest.xml) to get the software's package name, and look up the sample's PID by package name (for example with the Android Debug Bridge, ADB). The dynamic features of the system calls can be extracted with the strace command; if needed, an ADB command can be used to place the matching version of the strace executable in the device's system bin directory. The Monkey tool then operates the sample software, with the same running time for every sample under test, after which the process is killed. The sample log file is saved in text (TXT) format, garbled content is removed, the timestamps and parameters of the system calls are stripped, and the system call sequence is output in chronological order.
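The tracing session described above might be driven by commands along the following lines. This Python sketch only builds the command lists (suitable for `subprocess.run`) and does not execute them. It assumes a rooted device with `su`, `pidof`, and `strace` available on it; the helper names, the log path, and the event count are hypothetical, and the caller is expected to let each sample run for the same fixed time before stopping it.

```python
def pid_lookup_command(package):
    """Command that prints the PID of a running package (assumes pidof exists)."""
    return ["adb", "shell", "pidof", package]

def trace_session_commands(package, pid, event_count=500,
                           log_path="/sdcard/trace.txt"):
    """Commands for one tracing session, in the order they would be issued."""
    return {
        # Attach strace to the running process as root and write the log on the device.
        "trace": ["adb", "shell",
                  f"su -c 'strace -f -tt -p {pid} -o {log_path}'"],
        # Drive the app with pseudo-random UI events from Monkey.
        "exercise": ["adb", "shell", "monkey", "-p", package,
                     "-v", str(event_count)],
        # Stop the app once the fixed running time has elapsed.
        "stop": ["adb", "shell", "am", "force-stop", package],
        # Copy the trace log from the device to the host.
        "fetch": ["adb", "pull", log_path],
    }

commands = trace_session_commands("com.example.app", 1234)
```

Keeping the commands as data makes it easy to log exactly what was run for each sample, which helps when checking that all samples were exercised under the same conditions.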

The feature processing module comprises a data dimensionality reduction part, a feature vectorization part, and a redundancy removal part. In outline, the system call features are first scored quantitatively with the information gain algorithm and the data dimensionality is reduced by discarding useless features and keeping useful ones; the retained features are then vectorized; finally, redundancy is removed from the feature vectors so that a more accurate SVM classification model can be obtained.

Data dimensionality reduction part: malware detection is based on feature extraction, and the types and dimensionality of the extracted features play a decisive role. Too many feature dimensions make detection too slow, while too few features lower detection accuracy. To improve accuracy, researchers have had to sacrifice efficiency and extract as many kinds of multi-dimensional features as possible, so how to filter this large volume of features and reduce their dimensionality is one of the difficulties.

The data dimensionality reduction part first scores the system calls quantitatively according to their importance and then, based on each system call's score, keeps the "important" system calls and removes the "useless" ones, thereby reducing dimensionality.

The invention scores system calls by information gain: the larger the information gain, the more discriminative the system call. The information gain IG(T) that a system call T brings to the system is defined as the difference between the entropy H(C) of the set C of system calls to be classified and the conditional entropy H(C|T) of a system call T:

$$IG(T) = H(C) - H(C \mid T)$$

where C denotes the set of system calls and T denotes one system call in it. H(C|T) represents the uncertainty of system call T in set C and covers two cases: the conditional entropy when system call T appears (denoted $t$), written $H(C \mid t)$, and the conditional entropy when system call T does not appear (denoted $\bar{t}$), written $H(C \mid \bar{t})$. Therefore

$$H(C \mid T) = P(t)\,H(C \mid t) + P(\bar{t})\,H(C \mid \bar{t})$$

where $P(t)$ and $P(\bar{t})$ are the probabilities that system call T appears and does not appear, respectively. To compute the value of IG(T), first compute the information entropy of the system call set C:

$$H(C) = -\sum_{m=1}^{K} P(C_m)\,\log P(C_m)$$

where $P(C_m)$ is the probability that the m-th system call in set C appears and K is the total number of system calls in set C. Then compute the conditional entropies when system call T appears and when it does not appear:

$$H(C \mid t) = -\sum_{m=1}^{K} P(C_m \mid t)\,\log P(C_m \mid t)$$

$$H(C \mid \bar{t}) = -\sum_{m=1}^{K} P(C_m \mid \bar{t})\,\log P(C_m \mid \bar{t})$$

where $P(C_m \mid t)$ and $P(C_m \mid \bar{t})$ are the probabilities that the m-th system call in set C appears when system call T appears and when it does not appear, respectively. Combining these formulas gives the information gain IG(T); the larger IG(T) is, the more important system call T is to the classification result. Performing this calculation for every system call in set C gives the information gain of each system call, which serves as its score.
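The information gain calculation can be sketched in Python as follows. The sketch makes two assumptions that the text does not state: the categories $C_m$ are read as the class labels of the samples (benign or malicious), which is the usual reading in feature selection, and the logarithm is taken to base 2. Each sample is represented as a pair of (set of system calls observed, label); the function names are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(samples, syscall):
    """IG(T) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)] for one system call."""
    labels = [label for _, label in samples]
    present = [label for calls, label in samples if syscall in calls]
    absent = [label for calls, label in samples if syscall not in calls]
    p_t = len(present) / len(labels)
    conditional = p_t * entropy(present) + (1 - p_t) * entropy(absent)
    return entropy(labels) - conditional

samples = [
    ({"sendto", "read"}, "malware"),
    ({"sendto"}, "malware"),
    ({"read"}, "benign"),
    ({"write"}, "benign"),
]
ig_sendto = information_gain(samples, "sendto")
ig_read = information_gain(samples, "read")
```

In this toy data, `sendto` separates the two classes perfectly and so receives the maximum score, while `read` appears equally often in both classes and receives a score of zero.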

The classification threshold that separates "important" from "useless" system calls is chosen by simulation. First, the 20, 40, 60, 80, 100, 120, and 140 highest-scoring system calls are selected in turn and converted into feature vectors; for each selection, the Markov matrix is constructed and the result is converted into a format the SVM can read for machine learning. This gives the detection rate when 20, 40, 60, 80, 100, 120, and 140 system calls are retained. A suitable threshold is chosen from the resulting detection rate trend: system calls scoring above the threshold are classified as "important" and those scoring below it as "useless". Because the system calls cover several attributes of the application software, the unselected calls are not processed individually but are grouped by attribute, merging the "useless" commands. The "important" part is converted directly into feature vectors, and the "useless" part is merged according to system call attribute, which completes the data vectorization.
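The split into "important" and "useless" system calls can be sketched as below. The Python function keeps the `keep` highest-scoring calls as they are and replaces every other call with the name of its attribute group. The attribute groups (`"process"`, `"other"`) and the function name are hypothetical; the source does not list the attributes it uses.

```python
def reduce_sequence(sequence, scores, keep, attribute_of):
    """Keep the top-`keep` calls by score; merge the rest by attribute group."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    important = set(ranked[:keep])
    return [
        call if call in important else attribute_of.get(call, "other")
        for call in sequence
    ]

scores = {"sendto": 0.9, "openat": 0.5, "getpid": 0.1, "gettid": 0.05}
attribute_of = {"getpid": "process", "gettid": "process"}  # hypothetical grouping
sequence = ["openat", "getpid", "sendto", "gettid", "brk"]

# Sweep a few candidate thresholds, as in the simulation described above.
reduced = {keep: reduce_sequence(sequence, scores, keep, attribute_of)
           for keep in (1, 2)}
```

In a full experiment, each candidate value of `keep` would be followed by matrix construction, SVM training, and a detection-rate measurement, and the threshold would be read off the resulting trend.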

After the threshold is selected, the features are vectorized to generate the corresponding vector space. Dimensionality reduction by information gain yields the useful system calls, but their number is still large, so a de-redundancy method is used to eliminate redundant features. This embodiment may combine the TF-IDF (term frequency-inverse document frequency) algorithm with the N-gram model to screen out and remove redundant information in the software, as described below:

An N-gram language model (where N denotes the parameter of the model) is used in the de-redundancy processing, and FIG. 3 is a schematic diagram of the specific implementation of the N-gram language model algorithm in this embodiment. Assume that the dynamic behavior log d_i contains four types of system calls (denoted S1, S2, S3, S4) and that the length of the system call sequence in d_i is L_i. The general process is as follows. First, a sliding window of fixed length N = 3 is selected (testing showed this to be the best value of N), that is, the window covers three system calls. The window slides forward by one unit length at a time until it reaches the end of the system call sequence, which yields L_i - N + 1 system call subsequences. Applying the N-gram algorithm to the dynamic behavior logs of all samples yields

$$\sum_{i=1}^{|D|} (L_i - N + 1)$$

system call subsequences, where D denotes the sample library. Next, the TF-IDF algorithm is used to score each system call subsequence quantitatively. Specifically, the formula

$$IDF_j = \log \frac{|D|}{|\{i : t_j \in d_i\}| + 1}$$

gives the inverse document frequency IDF_j of the subsequence t_j, where |D| is the total number of dynamic behavior logs in the sample library; |{i : t_j ∈ d_i}| is the total number of logs in which the subsequence t_j occurs; t_j ∈ d_i indicates that the subsequence t_j occurs in the dynamic behavior log d_i; and the +1 prevents the denominator from becoming 0. Finally, an appropriate threshold is selected through experimental simulation, the redundant information is removed, and the subsequences are merged after the "important data" is retained. This yields the system call vector containing the important data, which is used for SVM learning and classification.
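The sliding window and the inverse document frequency can be sketched as follows. This is a minimal illustration under stated assumptions: the natural logarithm is used because the text does not state the base, and the function names and toy logs are hypothetical. The threshold-based filtering is left out because the text does not say which end of the score range counts as redundant.

```python
import math


def ngrams(sequence, n=3):
    """Slide a window of n calls one step at a time.
    A sequence of length L yields L - n + 1 subsequences (none if L < n)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def idf_scores(logs, n=3):
    """IDF_j = log(|D| / (number of logs containing t_j + 1))."""
    total_logs = len(logs)
    logs_containing = {}
    for sequence in logs:
        # set() so that a subsequence is counted once per log.
        for gram in set(ngrams(sequence, n)):
            logs_containing[gram] = logs_containing.get(gram, 0) + 1
    return {gram: math.log(total_logs / (count + 1))
            for gram, count in logs_containing.items()}


# Toy logs (assumed) over the four call types S1..S4.
toy_logs = [
    ["S1", "S2", "S3", "S4", "S1", "S2"],
    ["S1", "S2", "S3", "S1"],
    ["S1", "S2", "S3", "S3"],
]
windows = ngrams(toy_logs[0])
scores = idf_scores(toy_logs)
```

The first toy log has length 6, so the window of size 3 produces 6 - 3 + 1 = 4 subsequences. The subsequence (S1, S2, S3) occurs in every log and therefore scores lower than one that occurs in a single log.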

Before machine learning training and testing, the present invention builds a Markov matrix from the normal software and malware samples, converts the result into a format that the SVM can recognize, and uses it as the SVM input to train the classification model for subsequent testing. The general steps are as follows.

First, the feature processing module outputs the system call sequence obtained after feature processing. Training directly on the samples is too complex, and other processing methods do not restore the relationships between system calls well. A Markov matrix is therefore used to represent the system call sequence so that its original structure is preserved as far as possible. The system call sequence of each application is treated as a discrete Markov chain, and the result is finally converted into a format that the SVM can recognize.
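One way to picture this step is a first-order transition matrix estimated from consecutive pairs of calls and then flattened into one feature vector per application. The sketch below rests on assumptions: the text does not name the SVM input format, so the sparse LIBSVM text format (`label index:value ...`, indices starting at 1) is used as an example, and the function names are hypothetical.

```python
def transition_matrix(sequence, states):
    """Estimate P(next = b | current = a) from consecutive pairs.
    Every call in `sequence` must be listed in `states`."""
    index = {state: i for i, state in enumerate(states)}
    size = len(states)
    counts = [[0] * size for _ in range(size)]
    for current, following in zip(sequence, sequence[1:]):
        counts[index[current]][index[following]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that never occurs as a predecessor keeps an all-zero row.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


def to_libsvm_line(label, matrix):
    """Flatten the matrix row by row into 'label index:value ...',
    omitting zero entries as the sparse format allows."""
    flat = [p for row in matrix for p in row]
    features = " ".join(f"{i}:{p:.4f}" for i, p in enumerate(flat, start=1) if p > 0)
    return f"{label} {features}"


# Toy sequence (assumed) over two states.
matrix = transition_matrix(["A", "B", "A", "A", "B"], states=["A", "B"])
line = to_libsvm_line(1, matrix)
```

Because the matrix has a fixed size of (number of states) squared, every application maps to a vector of the same length regardless of how long its log is.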

Because there are only two detection outcomes, normal sample and malicious sample, malware detection is essentially a binary classification problem. According to existing research, the SVM algorithm is well suited to binary classification with small samples and has advantages for small-sample, high-dimensional recognition problems, which makes it applicable to software detection. The present invention therefore uses the obtained Markov chains and applies the SVM classification algorithm to perform classification.

The principle of detection with the SVM algorithm is shown in FIG. 4. The black and white dots represent malicious and normal samples, respectively. Every hyperplane between the two dashed lines separates the two classes, but only the hyperplane drawn as a solid line achieves the maximum margin and thus gives the optimal classifier; that is, the solid line is the optimal decision hyperplane. The three samples lying on the two dashed lines mark the boundaries of the classes, are an important basis for finding the optimal separating hyperplane, and are called support vectors. The purpose of the SVM algorithm is to find this optimal hyperplane.
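The maximum-margin idea can be written as the standard hard-margin linear SVM problem. The symbols below are introduced here for illustration and do not appear in the original: x_i is the feature vector of sample i, y_i is its class encoded as -1 (normal) or +1 (malicious), and w and b define the separating hyperplane.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n
```

Under this formulation the distance between the two dashed boundaries is 2/||w||, so minimizing ||w|| maximizes the margin, and the support vectors are the samples for which the constraint holds with equality.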

In the present invention, the following method may be used. First, the mixed sample set of normal software and malware is divided into a training set and a test set at a ratio of 3:1. Second, the mixed training set is fed to the feature extraction module, and after the system call features are extracted, the data dimensionality reduction, de-redundancy, and Markov matrix construction described above are performed. Next, each application in the mixed training set is labeled according to its true nature: normal samples are labeled "0" and malicious samples are labeled "1". Based on the two-dimensional distribution of the training set in the plane, the SVM algorithm finds an optimal dividing line by straight-line separation, which completes the training of the model. Finally, the test set is used as the SVM input for detection, which gives the detection result of normal software or malware.
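The 3:1 split and the 0/1 labeling can be sketched as below. Splitting each class separately so that both sets keep the same class balance is an assumption of this sketch; the text only states the overall ratio. The function name, the fixed seed, and the sample identifiers are hypothetical.

```python
import random


def split_3_to_1(normal_samples, malicious_samples, seed=0):
    """Label normal samples 0 and malicious samples 1, then put three
    quarters of each class into the training set and the rest into the
    test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for samples, label in ((normal_samples, 0), (malicious_samples, 1)):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = (len(shuffled) * 3) // 4
        train += [(sample, label) for sample in shuffled[:cut]]
        test += [(sample, label) for sample in shuffled[cut:]]
    return train, test


# Toy identifiers (assumed): 8 normal and 4 malicious applications.
normal = [f"normal_{i}" for i in range(8)]
malicious = [f"malware_{i}" for i in range(4)]
train_set, test_set = split_3_to_1(normal, malicious)
```

Each element of `train_set` and `test_set` is a `(sample, label)` pair, which is the shape a later feature extraction and SVM training step would consume.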

The detailed implementation steps of the system call-based dynamic detection method for intelligent terminal malware can be summarized as follows: obtain administrator privileges; install the software automatically; simulate user operation with the Monkey tool; capture the software's behavior log with Strace; preprocess the log (data dimensionality reduction, feature vectorization, de-redundancy); build the Markov chain of the system call sequence; use the SVM to train on the malicious samples and the normal samples, respectively; test.
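The step from a raw Strace log to a bare, time-ordered system call sequence can be sketched as follows. The sketch assumes log lines of the usual `name(arguments) = result` shape, optionally preceded by a process ID or a wall-clock timestamp; the exact Strace options used by the authors are not stated, and the sample log below is invented for illustration.

```python
import re

# Optional PID, optional HH:MM:SS[.ffffff] timestamp, then the call name
# immediately followed by "(".
CALL_PATTERN = re.compile(
    r"^(?:\d+\s+)?(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?([a-z_][a-z0-9_]*)\("
)


def syscall_sequence(log_text):
    """Return the system call names in log order, dropping timestamps,
    arguments, return values, and lines that are not system calls
    (signal notices, exit markers, garbled lines)."""
    calls = []
    for line in log_text.splitlines():
        match = CALL_PATTERN.match(line.strip())
        if match:
            calls.append(match.group(1))
    return calls


# Invented sample log.
sample_log = """12:00:01 openat(AT_FDCWD, "/data/app/x.apk", O_RDONLY) = 3
12:00:01 read(3, "PK", 2) = 2
--- SIGCHLD {si_signo=SIGCHLD} ---
12:00:02 sendto(5, "hi", 2, 0, NULL, 0) = 2
+++ exited with 0 +++"""
sequence = syscall_sequence(sample_log)
```

Lines that do not start with a call name, such as the signal notice and the exit marker in the sample, are skipped, so only the three system calls remain in the sequence.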

Claims

1. A system call-based dynamic detection method for intelligent terminal malware, characterized by comprising the following steps: a feature extraction module records a log of dynamic behavior using a detection tool, extracts the dynamic behavior log information of the sample to be tested, and outputs a system call sequence in time order; a feature selection module preprocesses the dynamic behavior log information, the preprocessing including data dimensionality reduction, vectorization, and redundancy removal, wherein the data dimensionality reduction quantitatively scores the system calls according to their importance and processes the system call sequence in order of score from high to low, and wherein, after the redundant information is removed and the important part is retained, the subsequences are merged to obtain the de-redundant system call sequence; a machine learning module constructs a Markov matrix of the system call sequence, converts the result into a format that a support vector machine (SVM) can recognize, and uses a mixed training set of normal software samples and malware samples as the input of the SVM for effectiveness training to obtain a trained classification model; and the trained classification model is used to detect a mixed set of malware samples and normal software samples, and the detection result is output;

specifically, the data dimensionality reduction comprises: calculating the information gain IG(T) from the entropy H(C) of the system call set to be classified and the conditional entropy H(C|T) of the system call according to the formula IG(T) = H(C) - H(C|T), whereby the IG values of all system calls are obtained, a larger IG(T) value indicating that the system call is more important to the classification result; vectorizing the system calls according to their IG(T) values; selecting a threshold through simulation and dividing the system calls into an important part and a useless part; directly performing feature vectorization on the important part, and merging the useless part according to attributes before vectorization; and finally obtaining the vectorized system call sequence of each dynamic behavior log;

specifically, removing the redundant information in the software comprises: performing N-gram language model processing on the vectorized system call sequence to obtain a plurality of system call subsequences, and taking the subsequences as the objects of quantitative scoring; calculating, according to the formula

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}$$

the frequency of occurrence of the subsequence t_j in the dynamic behavior log d_i, where n_{i,j} is the number of occurrences of the subsequence t_j in the dynamic behavior log d_i; calculating, according to the formula

$$IDF_j = \log \frac{|D|}{|\{i : t_j \in d_i\}| + 1}$$

the inverse document frequency IDF_j of the subsequence t_j, where |D| is the total number of dynamic behavior logs in the sample library, |{i : t_j ∈ d_i}| is the total number of logs in which the subsequence t_j occurs, t_j ∈ d_i indicates that the subsequence t_j occurs in the dynamic behavior log d_i, and the +1 prevents the denominator from becoming 0; and removing the redundant information while retaining the "important" part, and then merging the subsequences to obtain the de-redundant system call sequence.

2. The method according to claim 1, characterized in that the feature extraction module obtains Root (super administrator) privileges to enable super administrator mode, obtains the process identification number (PID) of the software sample to be tested, extracts the dynamic features of the system calls, simulates operation of the sample software, kills the process after each sample has run for a predetermined time, records and saves the sample log file in plain-text TXT format, and removes garbled content as well as the time and parameter information of the system calls to obtain a system call sequence output in time order.

3. A system call-based dynamic detection system for intelligent terminal malware, characterized by comprising a feature extraction module, a feature selection module, and a machine learning classification module, wherein: the feature extraction module records a log of dynamic behavior using a detection tool, extracts the dynamic behavior log information of the sample to be tested, and outputs a system call sequence in time order; the feature selection module preprocesses the dynamic behavior log information, the preprocessing including data dimensionality reduction, vectorization, and redundancy removal, wherein the data dimensionality reduction quantitatively scores the system calls according to their importance and processes the system call sequence in order of score from high to low; the machine learning classification module constructs a Markov matrix of the system call sequence, converts the result into a format that a support vector machine (SVM) can recognize, and uses a mixed training set of normal software samples and malware samples as the input of the SVM for effectiveness training to obtain a trained classification model; and the trained classification model is used to detect a mixed set of malware samples and normal software samples, and the detection result is output;

specifically, for the dynamic behavior logs generated when the APK runs in the simulator, all system calls appearing in all logs are recorded as a set C, and the information gain IG(T) is calculated from the entropy H(C) of the system call set to be classified and the conditional entropy H(C|T) of the system call according to the formula IG(T) = H(C) - H(C|T), a larger IG(T) value indicating that the system call is more important to the classification result; the system calls are vectorized according to their IG(T) values, and the system calls in the sample software are divided into an important part and a useless part; feature vectorization is performed directly on the important part, and the useless part is merged according to attributes, to obtain the vectorized call sequence;

redundant information in the sample software is removed, specifically by performing N-gram processing on the vectorized system call sequence in the log to obtain a plurality of system call subsequences, taking the subsequences as the objects of quantitative scoring, and calculating, according to the formula

$$IDF_j = \log \frac{|D|}{|\{i : t_j \in d_i\}| + 1}$$

the inverse document frequency IDF_j of the subsequence t_j, where |D| is the total number of dynamic behavior logs in the sample library, |{i : t_j ∈ d_i}| is the total number of logs in which the subsequence t_j occurs, t_j ∈ d_i indicates that the subsequence t_j occurs in the dynamic behavior log d_i, and the +1 prevents the denominator from becoming 0; the redundant information is removed while the "important" part is retained, and the subsequences are then merged to obtain the de-redundant system call sequence.

4. The system according to claim 3, characterized in that the feature extraction module obtains Root (super administrator) privileges to enable super administrator mode, obtains the process identification number (PID) of the software sample to be tested, extracts the dynamic features of the system calls, simulates operation of the sample software, kills the process after each sample has run for a predetermined time, records and saves the sample log file in plain-text TXT format, and removes garbled content as well as the time and parameter information of the system calls to obtain a system call sequence output in time order.