CN109543406B

CN109543406B - Android malicious software detection method based on XGboost machine learning algorithm

Info

Publication number: CN109543406B
Application number: CN201811150736.1A
Authority: CN
Inventors: 王雪敬; 凌捷; 孙玉; 孙宇平
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2018-09-29
Filing date: 2018-09-29
Publication date: 2023-04-11
Anticipated expiration: 2038-09-29
Also published as: CN109543406A

Abstract

The invention relates to an Android malware detection method based on the XGBoost machine learning algorithm. First, Permission, Intent, Component and API call features are extracted by decompiling the apk file and quantized to form a feature matrix. The parallelism and strong robustness of the ant colony algorithm are then used to optimize the parameters of the XGBoost classifier, yielding the optimal target value and the optimal parameter combination of XGBoost. Compared with the traditional XGBoost algorithm, the improved XGBoost machine learning algorithm proposed by this invention has higher classification accuracy in Android malware detection, improves the correct rate of malware detection, and reduces the probability of the Android system being attacked due to detection errors.

Description

An Android malware detection method based on XGBoost machine learning algorithm

Technical Field

The present invention relates to the technical field of malware detection on the Android platform, and in particular to an Android malware detection method based on the XGBoost machine learning algorithm.

Background Art

The Android system was officially released by Google on November 5, 2007. As an operating system based on the Linux kernel, its open-source and free nature quickly made it the smart mobile device operating system with the largest market share. However, while it is popular among app developers and users, it has also become the preferred target of malicious attackers. The rapid growth of Android malware poses a serious threat to user security and privacy: malware steals users' private data, causes property losses, and exploits system vulnerabilities to obtain higher privileges and do greater harm. With the continued advance of the mobile payment industry and the popularity of the "Internet+" concept, mobile payment has developed rapidly and mobile payment viruses keep emerging, seriously endangering users' property. A method that detects malware quickly and effectively is therefore needed.

Currently, there are three main approaches to detecting Android malware: static detection, dynamic detection, and a combination of the two.

Static detection decompiles the application's installation package through reverse engineering, without running the application, and extracts relevant features such as permission information, API calls and instruction features. These features characterize the operations the program may perform at run time and are used to decide whether the application is malware. Static detection mostly uses machine learning algorithms to classify the extracted feature information. However, the classification accuracy of such static detection methods is not high and the malware detection accuracy is low, which increases the probability of the Android system being attacked because of detection errors.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art and provide an Android malware detection method based on the XGBoost machine learning algorithm that has high classification accuracy, a high malware detection accuracy, and a greatly reduced probability of the Android system being attacked because of detection errors.

To achieve the above purpose, the technical solution provided by the present invention is:

An Android malware detection method based on the XGBoost machine learning algorithm extracts Permission, Intent, Component and API call features by decompiling apk files and quantizes them into a feature matrix. The ant colony optimization algorithm is used to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, the optimal target value is obtained together with the optimal XGBoost parameter combination, namely the shrinkage step size shrinkage and the minimum sample weight threshold in child nodes min_child_weight. Finally, the optimized XGBoost algorithm is applied to the Android malware detection model.

Furthermore, the specific steps of the Android malware detection method based on the XGBoost machine learning algorithm are as follows:

S1: Use apktool to decompile the apk file to obtain AndroidManifest.xml and classes.dex;

S2: Extract the Permission, Intent, Component and API call features;

S3: Quantize the features; the output is a one-hot vector in which a feature that is present is marked as 1 and a feature that is absent is marked as 0;

S4: Collect all feature vectors into a feature vector set, reduce the dimension of the set with a feature selection algorithm, and select the optimal feature subset;

S5: Use the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution; after multiple iterations, obtain the optimal target value and the optimal XGBoost parameter combination of shrinkage step size shrinkage and minimum sample weight threshold min_child_weight in child nodes;

S6: Randomly select 10% of the optimized feature vectors as the test set, and input the remaining 90% as the training set into the optimized XGBoost ensemble learning framework for learning;

S7: Evaluate the classification results by true positive rate, false positive rate and classification accuracy to determine whether the Android malware detection model generated by the ant-colony-optimized XGBoost algorithm meets the detection requirements.
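Steps S3, S6 and S7 can be illustrated with a minimal Python sketch. The function names, the example feature vocabulary and the fixed random seed are assumptions made for illustration and do not come from the patent. What the text calls a one-hot vector is treated here as a binary presence vector with one position per known feature.

```python
import random

def quantize(app_features, vocabulary):
    """Step S3: 1 if the app has the feature, otherwise 0."""
    return [1 if feat in app_features else 0 for feat in vocabulary]

def split_90_10(samples, seed=0):
    """Step S6: randomly hold out 10% of the samples as the test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(0.1 * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def evaluate(y_true, y_pred):
    """Step S7: true positive rate, false positive rate, accuracy (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr, (tp + tn) / len(pairs)
```

For example, `evaluate([1, 1, 0, 0], [1, 0, 0, 1])` counts one sample in each cell of the confusion matrix, so all three rates come out as 0.5.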

Furthermore, the specific steps of using the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework are as follows:

A. Set the upper and lower limits of the XGBoost classifier parameters shrinkage (shrinkage step size) and min_child_weight (minimum sample weight threshold in child nodes), the maximum number of iterations MaxIter, the ant colony size M, and the information evaporation coefficient Rho;

B. Initialize the population, that is, initialize shrinkage and min_child_weight as the position vector of each ant;

C. Perform the ant colony search;

D. Perform XGBoost training;

E. Use the XGBoost classifier to calculate the objective function value and pheromone value of each ant, and find the current optimal ant;

F. Determine whether the termination condition is met: if the number of iterations is greater than MaxIter, output the optimal value of the ant colony and the corresponding shrinkage and min_child_weight values, and execute step G; otherwise, increase the number of iterations by 1 and execute step C;

G. Update the pheromones;

H. Use the output shrinkage and min_child_weight in the Android malware detection model.
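The loop formed by steps A to H can be sketched as follows. This is a simplified, ant-colony-style search written under several assumptions: the move rule (drift toward the ant with the most pheromone plus Gaussian noise), the pheromone rule, and the `toy_accuracy` objective are all hypothetical stand-ins, and the pheromone update is placed inside the loop. In a real run the objective would train XGBoost with the candidate parameters and return its classification accuracy.

```python
import random

def aco_search(objective, bounds, n_ants=10, max_iter=20, rho=0.3, seed=0):
    """Maximize objective inside the box given by bounds; return (position, value)."""
    rng = random.Random(seed)
    # Step B: each ant's position is a (shrinkage, min_child_weight) vector.
    ants = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_ants)]
    fitness = [objective(a) for a in ants]
    pheromone = fitness[:]  # assumption: pheromone starts at the fitness value
    for _ in range(max_iter):
        # The ant carrying the most pheromone attracts the others.
        leader = ants[max(range(n_ants), key=lambda j: pheromone[j])]
        for j in range(n_ants):
            # Step C: propose a move, clipped to the limits set in step A.
            cand = []
            for p, q, (lo, hi) in zip(ants[j], leader, bounds):
                step = rng.random() * (q - p) + rng.gauss(0.0, 0.05 * (hi - lo))
                cand.append(min(hi, max(lo, p + step)))
            # Steps D-E: score the candidate (stand-in for XGBoost training).
            value = objective(cand)
            if value > fitness[j]:
                ants[j], fitness[j] = cand, value
            # Step G: evaporate old pheromone, deposit according to fitness.
            pheromone[j] = (1.0 - rho) * pheromone[j] + fitness[j]
    best = max(range(n_ants), key=lambda j: fitness[j])
    # Step H: the best position would configure the final detection model.
    return ants[best], fitness[best]

def toy_accuracy(position):
    """Hypothetical stand-in for XGBoost accuracy, peaking at (0.1, 3.0)."""
    shrinkage, min_child_weight = position
    return 1.0 - (shrinkage - 0.1) ** 2 - ((min_child_weight - 3.0) / 10.0) ** 2
```

A call such as `aco_search(toy_accuracy, [(0.01, 0.5), (1.0, 10.0)])` returns the best (shrinkage, min_child_weight) pair found and its score. Because every ant keeps a move only when it improves its score, the returned value never falls below the best initial ant.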

Furthermore, the ant colony optimization algorithm is specifically as follows:

Ant colony position initialization:

The classification accuracy of XGBoost is taken as the objective function value max{F(s₁,w₁), F(s₂,w₂), ..., F(s_m,w_m)}, denoted max fitness = max{F(X)}, X = {x₁, x₂, ..., x_m}, where x_i denotes the i-th ant. The steps for generating the initial population with a chaotic sequence are as follows:

1) Generate a D-dimensional random vector:

2) Logistic mapping: using the above formula as the initial iterate, the Logistic mapping equation is as follows:

In the formula, μ = 1, i = 1, 2, ..., N, d = 1, 2, ..., D;

3) Map the chaotic space to the search space of the optimization variables:

In the formula, max^d is the upper limit value and min^d is the lower limit value;
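The three initialization steps can be sketched in Python as below. The equations themselves are not reproduced in this text, so the sketch assumes the standard logistic map z ← μ·z·(1 − z) and a linear mapping min^d + z·(max^d − min^d); both forms, the function name and the seed are assumptions. The text gives μ = 1, whereas the sketch defaults to μ = 4, the value at which the logistic map is usually described as chaotic; that default is also an assumption and can be overridden.

```python
import random

def chaotic_population(n_ants, bounds, mu=4.0, seed=0):
    """Generate n_ants positions inside bounds from a logistic-map sequence."""
    rng = random.Random(seed)
    # 1) D-dimensional random vector with components in (0, 1).
    z = [rng.uniform(0.01, 0.99) for _ in bounds]
    population = []
    for _ in range(n_ants):
        # 2) Logistic map applied independently to each dimension.
        z = [mu * zd * (1.0 - zd) for zd in z]
        # 3) Map the chaotic value from [0, 1] into [min_d, max_d].
        population.append([lo + zd * (hi - lo) for zd, (lo, hi) in zip(z, bounds)])
    return population
```

With `bounds = [(0.01, 0.5), (1.0, 10.0)]`, each returned position is a candidate (shrinkage, min_child_weight) pair inside the limits set in step A.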

Ant movement rules:

After the ant colony is initialized, its objective function is calculated. Each ant j has a position vector at iteration k. By definition, the larger the objective function value, the greater the pheromone concentration at that position. The ant with the largest current objective value is saved, together with its maximum pheromone value.

Select local search or global search:

The probability of ant transfer is defined as follows:

In the formula, S is the standard deviation of the fitness function, calculated as follows:

In the formula, m is the number of ants and F_ave is the average fitness value;

From the above formula, the closer an ant is to the reference position, the greater its transfer probability. The search proceeds as follows:

If P(x_i) ≤ P0, where P0 is a constant and 0 < P0 < 1, the ant searches locally in its neighbourhood, with the following movement formula:

The movement formula relates the position after moving to the position before moving through the moving step length a, which is defined as follows:

If P(x_i) > P0, the ant searches in the solution space;

Pheromone update:

According to the value of the individual position function, the pheromone is updated as follows:

In the formula, ρ is the information evaporation coefficient.
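The quantities named above can be sketched in Python, with the caveat that the formulas are not reproduced in this text. The sketch therefore assumes a population standard deviation for S and an evaporate-then-deposit rule for the pheromone; both forms and all function names are assumptions. The transfer probability is taken as an input because its defining formula is not available here.

```python
import math

def fitness_std(fitness):
    """S: standard deviation of the fitness values over the m ants."""
    m = len(fitness)
    f_ave = sum(fitness) / m  # F_ave, the average fitness value
    return math.sqrt(sum((f - f_ave) ** 2 for f in fitness) / m)

def choose_search(p_transfer, p0):
    """Local search if P(x_i) <= P0, otherwise search the solution space."""
    return "local" if p_transfer <= p0 else "global"

def update_pheromone(tau, fitness_value, rho):
    """Assumed rule: evaporate a fraction rho, then deposit the fitness value."""
    return (1.0 - rho) * tau + fitness_value
```

Under these assumptions a larger ρ makes old pheromone fade faster, so the colony reacts more quickly to newly found good positions.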

Compared with the prior art, the principles and advantages of this solution are as follows:

With the traditional XGBoost machine learning algorithm, classification performance in Android malware detection depends on parameter selection. This solution applies the ant colony algorithm to optimize the parameters of XGBoost and quickly find the optimal parameters, so that the XGBoost algorithm achieves good classification performance. Applied to the Android malware detection model, it provides higher classification accuracy, greatly improves the accuracy of malware detection, and thereby reduces the probability of the Android system being attacked because of detection errors.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is the detection flow chart of the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention;

FIG. 2 is a flow chart of feature extraction in the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention;

FIG. 3 is a flow chart of applying the ant colony algorithm to optimize XGBoost parameters in the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention.

DETAILED DESCRIPTION

The present invention is further described below in conjunction with a specific embodiment:

The Android malware detection method based on the XGBoost machine learning algorithm described in this embodiment is as follows:

XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm proposed by Tianqi Chen in 2015. In the XGBoost ensemble learning framework, the parameters that most directly affect classification performance are the shrinkage step size (shrinkage) and the minimum sample weight threshold in child nodes (min_child_weight). Too small a shrinkage causes the algorithm to overfit, while a large shrinkage prevents the algorithm from converging. For min_child_weight, too small a value causes the algorithm to overfit, while too large a value degrades the algorithm's classification performance on linearly inseparable data.
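As a point of reference, the two tuned parameters map onto the XGBoost library's parameter dictionary as sketched below. In the library, the shrinkage step size is named `eta` (alias `learning_rate`) and the child-node weight threshold is `min_child_weight`. The helper name, the binary objective and the range checks are assumptions chosen for a malware/benign classifier and are not stated in the patent.

```python
def xgboost_params(shrinkage, min_child_weight):
    """Build an XGBoost parameter dict from the two optimized values."""
    if not 0.0 < shrinkage <= 1.0:
        raise ValueError("shrinkage (eta) is expected in (0, 1]")
    if min_child_weight < 0:
        raise ValueError("min_child_weight must be non-negative")
    return {
        "objective": "binary:logistic",  # assumption: two classes, malware vs benign
        "eta": shrinkage,                # shrinkage step size
        "min_child_weight": min_child_weight,
    }
```

A dictionary of this shape is what `xgboost.train(params, dtrain)` accepts as its first argument; the training call itself is left out so that the sketch runs without the library installed.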

Therefore, this embodiment extracts Permission, Intent, Component and API call features by decompiling the apk file and quantizes them into a feature matrix, then uses the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, the optimal target value is obtained together with the optimal XGBoost parameter combination of shrinkage step size shrinkage and minimum sample weight threshold min_child_weight in child nodes. Finally, the optimized XGBoost algorithm is applied to the Android malware detection model. As shown in Figure 1, the method follows steps S1 to S7 set out above.

For step S2, the extraction of the Permission, Intent, Component and API call features follows the process shown in Figure 2.
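Part of step S2 can be sketched with the Python standard library, assuming the AndroidManifest.xml produced by apktool is plain-text XML. The function names and the choice of tags are assumptions: permissions are read from `uses-permission`, intents from `action`, and components from `activity`, `service`, `receiver` and `provider`. API call features come from classes.dex and are not covered by this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace that Android uses for attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")

def names(root, tag):
    """Collect the android:name attribute of every <tag> element."""
    found = (el.get(ANDROID_NS + "name") for el in root.iter(tag))
    return {n for n in found if n}

def manifest_features(manifest_xml):
    """Return Permission, Intent and Component feature sets from manifest text."""
    root = ET.fromstring(manifest_xml)
    components = set()
    for tag in COMPONENT_TAGS:
        components |= names(root, tag)
    return {
        "permission": names(root, "uses-permission"),
        "intent": names(root, "action"),
        "component": components,
    }
```

The resulting sets can then be checked against a fixed feature vocabulary to produce the 0/1 vectors of step S3.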

As shown in Figure 3, the parameters of the XGBoost ensemble learning framework are optimized with the ant colony optimization algorithm following steps A to H set out above. The ant colony optimization algorithm itself, comprising ant colony position initialization, the ant movement rules, the selection between local and global search, and the pheromone update, is as described above.

This embodiment first extracts Permission, Intent, Component and API call features by decompiling the apk file and quantizes them into a feature matrix. It then uses the parallelism and strong robustness of the ant colony algorithm to optimize the XGBoost classifier parameters, obtaining the optimal target value and the optimal parameter combination of XGBoost. Compared with the traditional XGBoost algorithm, the improved XGBoost machine learning algorithm proposed in this embodiment has higher classification accuracy in Android malware detection, improves the accuracy of malware detection, and reduces the probability of the Android system being attacked because of detection errors.

The embodiments described above are only preferred embodiments of the present invention and are not intended to limit its scope of implementation. All changes made according to the shape and principle of the present invention should therefore be covered by the protection scope of the present invention.

Claims

1. An Android malware detection method based on the XGBoost machine learning algorithm, characterized in that Permission, Intent, Component and API call features are extracted by decompiling an apk file and quantized to form a feature matrix; an ant colony optimization algorithm is used to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution; after multiple iterations, the optimal target value is obtained together with the optimal XGBoost parameter combination of shrinkage step size shrinkage and minimum sample weight threshold min_child_weight in child nodes; and finally the optimized XGBoost algorithm is applied to an Android malware detection model;

the method specifically comprises the following steps:

S1: decompiling the apk file using apktool to obtain AndroidManifest.xml and classes.dex;

S2: extracting the Permission, Intent, Component and API call features;

S3: quantizing the features, wherein the output value is a one-hot vector in which a feature that is present is marked as 1 and a feature that is absent is marked as 0;

S4: forming a feature vector set from all feature vectors, reducing the dimension of the feature vector set using a feature selection algorithm, and selecting the optimal feature subset;

S5: optimizing the parameters of the XGBoost ensemble learning framework using the ant colony optimization algorithm, quickly finding the global optimal solution, obtaining the optimal target value after multiple iterations, and obtaining the optimal XGBoost parameter combination of shrinkage step size shrinkage and minimum sample weight threshold min_child_weight in child nodes;

S6: randomly selecting 10% of the optimized feature vectors as a test set, and inputting the remaining 90% as a training set into the optimized XGBoost ensemble learning framework for learning;

S7: evaluating the classification results by true positive rate, false positive rate and classification accuracy, and determining whether the Android malware detection model generated by the XGBoost algorithm optimized with the ant colony algorithm meets the detection requirements.

2. The Android malware detection method based on the XGBoost machine learning algorithm according to claim 1, wherein the specific steps of using the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework are as follows:

A. setting the upper and lower limits of the XGBoost classifier parameters shrinkage (shrinkage step size) and min_child_weight (minimum sample weight threshold in child nodes), the maximum number of iterations MaxIter, the ant colony size M and the information evaporation coefficient Rho;

B. initializing the population, namely initializing shrinkage and min_child_weight as the position vector of each ant;

C. executing the ant colony search;

D. performing XGBoost training;

E. calculating the objective function value and the pheromone value of each ant using the XGBoost classifier, and finding the current optimal ant;

F. determining whether the termination condition is met: if the number of iterations is greater than MaxIter, outputting the ant colony optimal value and the corresponding shrinkage and min_child_weight values and executing step G; otherwise, adding 1 to the number of iterations and executing step C;

G. updating the pheromone;

H. using the output shrinkage and min_child_weight in the Android malware detection model.