
CN109543406B - Android malicious software detection method based on XGboost machine learning algorithm - Google Patents

Info

Publication number
CN109543406B
CN109543406B CN201811150736.1A CN201811150736A CN109543406B CN 109543406 B CN109543406 B CN 109543406B CN 201811150736 A CN201811150736 A CN 201811150736A CN 109543406 B CN109543406 B CN 109543406B
Authority
CN
China
Prior art keywords
xgboost
algorithm
child
optimal
ant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811150736.1A
Other languages
Chinese (zh)
Other versions
CN109543406A (en)
Inventor
王雪敬
凌捷
孙玉
孙宇平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201811150736.1A
Publication of CN109543406A
Application granted
Publication of CN109543406B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Virology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to an Android malware detection method based on the XGBoost machine learning algorithm. First, Permission, Intent, Component and API call features are extracted by decompiling the apk file and quantized into a feature matrix. The parallelism and strong robustness of the ant colony algorithm are then exploited to optimize the parameters of the XGBoost classifier, so as to reach the optimal objective and obtain the optimal XGBoost parameter combination. Compared with the traditional XGBoost algorithm, the improved XGBoost machine learning algorithm proposed by the invention has higher classification accuracy in Android malware detection, improves the malware detection accuracy, and reduces the probability of the Android system being attacked because of detection errors.

Description

An Android malware detection method based on the XGBoost machine learning algorithm

Technical Field

The present invention relates to the technical field of malware detection on the Android platform, and in particular to an Android malware detection method based on the XGBoost machine learning algorithm.

Background Art

The Android system was officially released by Google on November 5, 2007. As an operating system based on the Linux kernel, its open-source and free nature allowed Android to become, very quickly, the smart mobile device operating system with the largest market share. However, while it is popular among app developers and users, it has also become the preferred target of malicious attackers. The rapid growth of Android malware already poses a serious threat to user security and privacy: malware steals users' private data, causes property losses, and exploits system vulnerabilities to obtain higher privileges and do greater harm. As the mobile payment industry continues to advance and the "Internet+" concept gains popularity, mobile payment is developing rapidly and mobile payment viruses keep emerging, seriously endangering the safety of users' property. A method that can detect malware quickly and effectively is therefore needed.

Currently, there are three main approaches to detecting Android malware: static detection, dynamic detection, and a combination of the two.

Static detection decompiles the application's installation package through reverse engineering, without running the Android application, and extracts relevant features such as permission information, API calls, and instruction features. These features characterize the operations the program may perform at run time and are used to decide whether the application is malware. Static detection mostly uses machine learning algorithms to classify the extracted feature information. However, the classification accuracy of this kind of static detection is not high and the malware detection accuracy is low, which increases the probability that the Android system is attacked because of detection errors.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the prior art and provide an Android malware detection method based on the XGBoost machine learning algorithm that has higher classification accuracy, a higher malware detection accuracy, and a greatly reduced probability of the Android system being attacked because of detection errors.

To achieve the above purpose, the technical solution provided by the present invention is as follows:

An Android malware detection method based on the XGBoost machine learning algorithm extracts Permission, Intent, Component and API call features by decompiling the apk file and quantizes them into a feature matrix. An ant colony optimization algorithm is then used to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, the optimal objective value is obtained together with the optimal XGBoost parameter combination, namely the shrinkage step size (shrinkage) and the minimum sample weight threshold in a child node (min_child_weight). Finally, the optimized XGBoost algorithm is applied to the Android malware detection model.

Further, the specific steps of the Android malware detection method based on the XGBoost machine learning algorithm are as follows:

S1: Use apktool to decompile the apk file to obtain AndroidManifest.xml and classes.dex;

S2: Extract the Permission, Intent, Component and API call features;

S3: Quantize the features. The output is a one-hot vector: if a feature is present it is marked 1, otherwise it is marked 0;

S4: Form a feature vector set from all feature vectors, reduce the dimensionality of the set with a feature selection algorithm, and select the optimal feature subset;

S5: Use the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, obtain the optimal objective value and the optimal XGBoost parameter combination of shrinkage and min_child_weight;

S6: Randomly select 10% of the optimized feature vectors as the test set, and input the remaining 90% as the training set into the optimized XGBoost ensemble learning framework for learning;

S7: Evaluate the classification results in terms of true positive rate, false positive rate, and classification accuracy, and determine whether the Android malware detection model generated by the ant-colony-optimized XGBoost algorithm meets the detection requirements.
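
The feature quantization of step S3 can be illustrated with a minimal Python sketch. The `quantize` helper, the feature strings, and the two sample names are hypothetical and are not taken from the patent; the sketch only shows how the presence or absence of each feature becomes a 0/1 row of the feature matrix.

```python
def quantize(apps):
    """Map each app's extracted feature set to a 0/1 row over a shared vocabulary."""
    # The vocabulary is every feature seen in any app, in a fixed (sorted) order.
    vocabulary = sorted(set().union(*apps.values()))
    # 1 if the app has the feature, 0 otherwise.
    matrix = {
        name: [1 if feature in features else 0 for feature in vocabulary]
        for name, features in apps.items()
    }
    return vocabulary, matrix


# Hypothetical features, as they might come out of step S2.
apps = {
    "sample_a.apk": {"permission:INTERNET", "intent:BOOT_COMPLETED"},
    "sample_b.apk": {"permission:INTERNET", "api:sendTextMessage"},
}
vocabulary, matrix = quantize(apps)
```

Each row has one column per feature in the vocabulary, so all apps share the same column layout, which is what a downstream feature selection step (S4) would operate on.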

Further, the specific steps of using the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework are as follows:

A. Set the upper and lower limits of the XGBoost classifier parameters shrinkage and min_child_weight, the maximum number of iterations MaxIter, the ant colony size M, and the information evaporation coefficient Rho;

B. Initialize the population, that is, initialize shrinkage and min_child_weight as the position vector of each ant;

C. Perform the ant colony search;

D. Perform XGBoost training;

E. Use the XGBoost classifier to calculate the objective function value and pheromone value of each ant, and find the current optimal ant;

F. Check the termination condition: if the number of iterations is greater than MaxIter, output the optimal value of the ant colony and the corresponding shrinkage and min_child_weight values and execute step G; otherwise, increase the number of iterations by 1 and execute step C;

G. Update the pheromone;

H. Use the output shrinkage and min_child_weight values in the Android malware detection model.
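
The overall shape of steps A to H can be sketched as a small search loop. This is a simplified illustration under stated assumptions, not the patented procedure: the function name `aco_search`, the uniform random initialization, the Gaussian search move, and the toy objective are all hypothetical stand-ins, and pheromone bookkeeping (step G) is left out. In the method itself the objective would be the classification accuracy of an XGBoost model trained with the candidate parameters.

```python
import random


def aco_search(objective, bounds, max_iter=30, colony_size=8, seed=0):
    """Return (best_position, best_value) for a maximization problem."""
    rng = random.Random(seed)
    # B: one position vector (shrinkage, min_child_weight) per ant.
    ants = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(colony_size)]
    values = [objective(ant) for ant in ants]
    best = max(range(colony_size), key=values.__getitem__)
    best_pos, best_val = list(ants[best]), values[best]
    for _ in range(max_iter):  # F: stop after MaxIter rounds
        for j in range(colony_size):
            # C: placeholder search move, a small random step clipped to the bounds.
            cand = [
                min(hi, max(lo, x + rng.gauss(0.0, 0.1 * (hi - lo))))
                for x, (lo, hi) in zip(ants[j], bounds)
            ]
            val = objective(cand)  # D, E: train and score the candidate
            if val > values[j]:
                ants[j], values[j] = cand, val
            if values[j] > best_val:
                best_pos, best_val = list(ants[j]), values[j]
    return best_pos, best_val  # H: parameters handed to the detection model


# Toy objective with a known maximum at (0.1, 3.0), used only for illustration.
def toy_objective(position):
    shrinkage, min_child_weight = position
    return -((shrinkage - 0.1) ** 2) - ((min_child_weight - 3.0) ** 2) / 100.0


best_pos, best_val = aco_search(toy_objective, bounds=[(0.01, 0.5), (1.0, 10.0)])
```

Passing the objective as a callable keeps the search loop independent of how a candidate is scored, so the toy function can later be swapped for a real train-and-evaluate routine.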

Further, the ant colony optimization algorithm is specifically as follows:

Ant colony position initialization:

Take the classification accuracy of XGBoost as the objective function value

max{F(s_1, w_1), F(s_2, w_2), ..., F(s_m, w_m)}, denoted max fitness = max{F(X)}, X = {x_1, x_2, ..., x_m}, where x_i denotes an ant. The steps for generating the initial population with a chaotic sequence are as follows:

1) Generate a D-dimensional random vector:

Figure BDA0001817899010000041

2) Logistic mapping. Using the above vector as the initial iterate, the logistic map equation is as follows:

Figure BDA0001817899010000042

where μ = 1, i = 1, 2, ..., N, d = 1, 2, ..., D;

3) Map the chaotic space to the search space of the optimization variables:

Figure BDA0001817899010000043

where max_d is the upper limit and min_d is the lower limit;
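
Because the formula images are not reproduced in the text, the three initialization steps can only be sketched under assumptions. The sketch below assumes the standard logistic map z ← μ·z·(1 − z) and a linear mapping min_d + (max_d − min_d)·z into the search range; both are assumptions, not confirmed forms of the patent's equations. The function name `chaotic_population` is hypothetical. The default μ = 4 is the usual fully chaotic setting of the logistic map and is also an assumption: the text states μ = 1, for which the standard map simply decays toward 0, so μ is left as a parameter.

```python
import random


def chaotic_population(n_ants, bounds, mu=4.0, seed=0):
    """Generate n_ants position vectors from a logistic-map sequence."""
    rng = random.Random(seed)
    # 1) Random D-dimensional start vector with components in (0, 1).
    z = [rng.uniform(0.01, 0.99) for _ in bounds]
    population = []
    for _ in range(n_ants):
        # 3) Map the chaotic variable from [0, 1] onto [min_d, max_d].
        population.append([lo + (hi - lo) * zd for zd, (lo, hi) in zip(z, bounds)])
        # 2) Logistic map applied component-wise to get the next vector.
        z = [mu * zd * (1.0 - zd) for zd in z]
    return population


# Hypothetical limits for (shrinkage, min_child_weight).
population = chaotic_population(5, bounds=[(0.01, 0.5), (1.0, 10.0)])
```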

Ant movement rules:

After the ant colony is initialized, the objective function of each ant is calculated. Let
Figure BDA0001817899010000044
be the position vector of the j-th ant at the k-th iteration. By definition, the larger the objective function value at a position, the higher the pheromone concentration there. The ant with the largest current objective value is saved as
Figure BDA0001817899010000045
together with its maximum pheromone value
Figure BDA0001817899010000046

Choosing between local search and global search:

The transfer probability of an ant is defined as follows:

Figure BDA0001817899010000047

where S is the standard deviation of the fitness function, calculated as follows:

Figure BDA0001817899010000048

where m is the number of ants and F_ave is the average fitness value;

From the above formula, the closer an ant is to
Figure BDA0001817899010000049
the greater its transfer probability. The search proceeds as follows:

If P(x_i) ≤ P_0, where P_0 is a constant and 0 < P_0 < 1, the ant searches locally in its neighborhood, and the movement formula is as follows:

Figure BDA00018178990100000410

where
Figure BDA00018178990100000411
is the position after the move,
Figure BDA00018178990100000412
is the position before the move, and a is the movement step size, defined as follows:

Figure BDA0001817899010000051

If P(x_i) > P_0, the ant searches the solution space;

Pheromone update:

The pheromone is updated according to the function value at each individual's position, as follows:

Figure BDA0001817899010000052

where ρ is the information evaporation coefficient.

Compared with the prior art, the principle and advantages of this solution are as follows:

With the traditional XGBoost machine learning algorithm, parameter selection affects classification performance in Android malware detection. This solution applies the ant colony algorithm to optimize the XGBoost parameters and quickly find the optimal values, so that the XGBoost algorithm achieves good classification performance. Applied to the Android malware detection model, it yields higher classification accuracy, greatly improves the malware detection accuracy, and thereby reduces the probability of the Android system being attacked because of detection errors.

Brief Description of the Drawings

FIG. 1 is the detection flow chart of the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention;

FIG. 2 is the flow chart of feature extraction in the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention;

FIG. 3 is the flow chart of applying the ant colony algorithm to optimize the XGBoost parameters in the Android malware detection method based on the XGBoost machine learning algorithm according to the present invention.

Detailed Description

The present invention is further described below with reference to a specific embodiment:

The Android malware detection method based on the XGBoost machine learning algorithm described in this embodiment is as follows:

XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm proposed by Tianqi Chen in 2015. In the XGBoost ensemble learning framework, the parameters that most directly affect classification performance are the shrinkage step size (shrinkage) and the minimum sample weight threshold in a child node (min_child_weight). Too small a shrinkage causes the algorithm to overfit, while a larger shrinkage prevents it from converging. Too small a min_child_weight also causes overfitting, while too large a min_child_weight degrades the algorithm's classification performance on linearly inseparable data.
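
For readers using the open-source XGBoost library, the two parameters discussed here correspond to `eta` (alias `learning_rate`) and `min_child_weight`. The following sketch builds a parameter dictionary of the kind that library accepts. The helper name `xgboost_params` and the choice of the `binary:logistic` objective for a malware/benign classifier are assumptions made for illustration; the patent does not specify them.

```python
def xgboost_params(shrinkage, min_child_weight):
    """Build an XGBoost parameter dict from one candidate (shrinkage, min_child_weight)."""
    # eta must lie in [0, 1]; zero is excluded here because it would stop learning.
    if not 0.0 < shrinkage <= 1.0:
        raise ValueError("shrinkage must be in (0, 1]")
    # min_child_weight must be non-negative.
    if min_child_weight < 0:
        raise ValueError("min_child_weight must be >= 0")
    return {
        "objective": "binary:logistic",  # assumed: two classes, malware vs. benign
        "eta": shrinkage,                # the shrinkage step size
        "min_child_weight": min_child_weight,
    }


params = xgboost_params(0.1, 3)
# With the library installed, such a dict is typically passed as
# xgboost.train(params, dtrain, num_boost_round=...), where dtrain is a DMatrix.
```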

Therefore, in this embodiment, Permission, Intent, Component and API call features are extracted by decompiling the apk file and quantized into a feature matrix. The ant colony optimization algorithm is then used to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, the optimal objective value is obtained together with the optimal XGBoost parameter combination of shrinkage and min_child_weight. Finally, the optimized XGBoost algorithm is applied to the Android malware detection model. As shown in FIG. 1, the specific steps are as follows:

S1: Use apktool to decompile the apk file to obtain AndroidManifest.xml and classes.dex;

S2: Extract the Permission, Intent, Component and API call features. The specific process is shown in FIG. 2;

S3: Quantize the features. The output is a one-hot vector: if a feature is present it is marked 1, otherwise it is marked 0;

S4: Form a feature vector set from all feature vectors, reduce the dimensionality of the set with a feature selection algorithm, and select the optimal feature subset;

S5: Use the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework and quickly find the global optimal solution. After multiple iterations, obtain the optimal objective value and the optimal XGBoost parameter combination of shrinkage and min_child_weight;

S6: Randomly select 10% of the optimized feature vectors as the test set, and input the remaining 90% as the training set into the optimized XGBoost ensemble learning framework for learning;

S7: Evaluate the classification results in terms of true positive rate, false positive rate, and classification accuracy, and determine whether the Android malware detection model generated by the ant-colony-optimized XGBoost algorithm meets the detection requirements.
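
Steps S6 and S7 can be illustrated with a short sketch of the 90/10 random split and the three evaluation measures. The helper names `split_90_10` and `evaluate` and the example labels are hypothetical; label 1 is assumed to mean malware and 0 benign.

```python
import random


def split_90_10(rows, seed=0):
    """Randomly hold out 10% of the rows as the test set (S6)."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = max(1, round(0.1 * len(rows)))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def evaluate(y_true, y_pred):
    """True positive rate, false positive rate, and accuracy (S7)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


train, test = split_90_10(list(range(20)))
scores = evaluate([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
```

In the example labels there are 3 true positives, 1 false negative, 1 false positive, and 3 true negatives, so the true positive rate is 3/4, the false positive rate is 1/4, and the accuracy is 6/8.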

As shown in FIG. 3, the specific steps of using the ant colony optimization algorithm to optimize the parameters of the XGBoost ensemble learning framework are as follows:

A. Set the upper and lower limits of the XGBoost classifier parameters shrinkage and min_child_weight, the maximum number of iterations MaxIter, the ant colony size M, and the information evaporation coefficient Rho;

B. Initialize the population, that is, initialize shrinkage and min_child_weight as the position vector of each ant;

C. Perform the ant colony search;

D. Perform XGBoost training;

E. Use the XGBoost classifier to calculate the objective function value and pheromone value of each ant, and find the current optimal ant;

F. Check the termination condition: if the number of iterations is greater than MaxIter, output the optimal value of the ant colony and the corresponding shrinkage and min_child_weight values and execute step G; otherwise, increase the number of iterations by 1 and execute step C;

G. Update the pheromone;

H. Use the output shrinkage and min_child_weight values in the Android malware detection model.

The specific ant colony optimization algorithm is as follows:

Ant colony position initialization:

Take the classification accuracy of XGBoost as the objective function value

max{F(s_1, w_1), F(s_2, w_2), ..., F(s_m, w_m)}, denoted max fitness = max{F(X)}, X = {x_1, x_2, ..., x_m}, where x_i denotes an ant. The steps for generating the initial population with a chaotic sequence are as follows:

1) Generate a D-dimensional random vector:

Figure BDA0001817899010000071

2) Logistic mapping. Using the above vector as the initial iterate, the logistic map equation is as follows:

Figure BDA0001817899010000081

where μ = 1, i = 1, 2, ..., N, d = 1, 2, ..., D;

3) Map the chaotic space to the search space of the optimization variables:

Figure BDA0001817899010000082

where max_d is the upper limit and min_d is the lower limit;

Ant movement rules:

After the ant colony is initialized, the objective function of each ant is calculated. Let
Figure BDA0001817899010000083
be the position vector of the j-th ant at the k-th iteration. By definition, the larger the objective function value at a position, the higher the pheromone concentration there. The ant with the largest current objective value is saved as
Figure BDA0001817899010000084
together with its maximum pheromone value
Figure BDA0001817899010000085

Choosing between local search and global search:

The transfer probability of an ant is defined as follows:

Figure BDA0001817899010000086

where S is the standard deviation of the fitness function, calculated as follows:

Figure BDA0001817899010000087

where m is the number of ants and F_ave is the average fitness value;
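
As a worked illustration of the quantity S, the sketch below computes the standard deviation of the ants' fitness values around the average fitness F_ave. Since the formula image is not reproduced in the text, the population form (dividing by m rather than m − 1) is an assumption, and the function name `fitness_std` and the sample accuracies are hypothetical.

```python
import math


def fitness_std(fitness):
    """Standard deviation of the fitness values over m ants (population form, assumed)."""
    m = len(fitness)
    f_ave = sum(fitness) / m  # average fitness value
    return math.sqrt(sum((f - f_ave) ** 2 for f in fitness) / m)


# Hypothetical classification accuracies of three ants.
s = fitness_std([0.9, 0.8, 0.7])
```

For these three values the average is 0.8 and the squared deviations sum to 0.02, so S is the square root of 0.02/3, roughly 0.082.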

From the above formula, the closer an ant is to
Figure BDA0001817899010000088
the greater its transfer probability. The search proceeds as follows:

If P(x_i) ≤ P_0, where P_0 is a constant and 0 < P_0 < 1, the ant searches locally in its neighborhood, and the movement formula is as follows:

Figure BDA0001817899010000089

where
Figure BDA00018178990100000810
is the position after the move,
Figure BDA00018178990100000811
is the position before the move, and a is the movement step size, defined as follows:

Figure BDA00018178990100000812

If P(x_i) > P_0, the ant searches the solution space;
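
The choice between the two search modes can be sketched as follows. Only the branching on P(x_i) against P_0 comes from the text. The local move used here (a uniform random perturbation of at most one step size per coordinate) and the global move (a fresh uniform sample from the search range) are assumptions standing in for the movement formulas, which are not reproduced. The function name `move` and all numeric values are hypothetical.

```python
import random


def move(position, p, p0, step, bounds, rng):
    """Return the ant's next position: local search if p <= p0, else global search."""
    if p <= p0:
        # Local search: stay near the current position (assumed perturbation form).
        candidate = [x + step * rng.uniform(-1.0, 1.0) for x in position]
    else:
        # Global search: sample anywhere in the solution space.
        candidate = [rng.uniform(lo, hi) for lo, hi in bounds]
    # Keep every coordinate inside its limits.
    return [min(hi, max(lo, x)) for x, (lo, hi) in zip(candidate, bounds)]


rng = random.Random(0)
bounds = [(0.01, 0.5), (1.0, 10.0)]
start = [0.2, 5.0]
local_pos = move(start, p=0.1, p0=0.5, step=0.01, bounds=bounds, rng=rng)
global_pos = move(start, p=0.9, p0=0.5, step=0.01, bounds=bounds, rng=rng)
```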

Pheromone update:

The pheromone is updated according to the function value at each individual's position, as follows:

Figure BDA0001817899010000091

where ρ is the information evaporation coefficient.
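
A common form of pheromone update in ant colony methods evaporates the old value by the factor (1 − ρ) and then adds a deposit that grows with the objective value. The sketch below uses that form, τ ← (1 − ρ)·τ + F(x), as an assumption, since the patent's own formula image is not reproduced. The function name `update_pheromone` and the numbers are hypothetical.

```python
def update_pheromone(pheromone, fitness, rho):
    """Evaporate each ant's pheromone by rho, then deposit its objective value (assumed form)."""
    return [(1.0 - rho) * tau + f for tau, f in zip(pheromone, fitness)]


# Two ants with pheromone 1.0 and 2.0 and objective values 0.5 and 0.9.
new_pheromone = update_pheromone([1.0, 2.0], [0.5, 0.9], rho=0.3)
```

With ρ = 0.3 the first ant goes from 1.0 to 0.7 + 0.5 = 1.2 and the second from 2.0 to 1.4 + 0.9 = 2.3, so positions with larger objective values keep a higher pheromone concentration.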

This embodiment first extracts Permission, Intent, Component and API call features by decompiling the apk file and quantizes them into a feature matrix. It then exploits the parallelism and strong robustness of the ant colony algorithm to optimize the XGBoost classifier parameters, so as to reach the optimal objective and obtain the optimal XGBoost parameter combination. Compared with the traditional XGBoost algorithm, the improved XGBoost machine learning algorithm proposed in this embodiment has higher classification accuracy in Android malware detection, improves the malware detection accuracy, and reduces the probability of the Android system being attacked because of detection errors.

The embodiments described above are only preferred embodiments of the present invention and do not limit its scope of implementation. All changes made according to the shape and principle of the present invention shall therefore fall within the protection scope of the present invention.

Claims (2)

1. An Android malicious software detection method based on an XGboost machine learning algorithm is characterized in that Permission, intent, component and APIcall characteristics are extracted by decompiling an apk file, a characteristic matrix is formed in a quantized mode, an ant colony optimization algorithm is used for carrying out parameter optimization on an XGboost integrated learning framework, a global optimal solution is quickly found, an optimal target value is obtained after multiple iterations, an optimal parameter combination contraction step length shrinkage of the XGboost and a minimum sample weight threshold value min _ child _ weight in a child node are obtained, and finally the optimized XGboost algorithm is applied to an Android malicious software detection model;
the method specifically comprises the following steps:
s1: decompiling the apk file by using the apktool to obtain android manifest.xml and classes.dex;
s2: extracting the Permission, intent, component and API call characteristics;
s3: quantizing the features, wherein the output value is a one-hot vector, if the features exist, the vector is marked as 1, otherwise, the vector is marked as 0;
s4: forming a feature vector set by all feature vectors, reducing the dimension of the feature vector set by adopting a feature selection algorithm, and selecting an optimal feature subset;
s5: performing parameter optimization on the XGboost integrated learning framework by using an ant colony optimization algorithm, quickly finding out a global optimal solution, obtaining an optimal target value after multiple iterations, and obtaining an optimal parameter combination shrinkage step length shrinkage of the XGboost and a minimum sample weight threshold value min _ child _ weight in a child node;
s6: randomly extracting 10% of the optimized feature vectors as a test set, and inputting the rest 90% of the optimized feature vectors as a training set into an optimized XGboost integrated learning frame for optimized learning;
s7: and evaluating the classification result from the true rate, the false positive rate and the classification precision, and judging whether the Android malicious software detection model generated by the XGboost algorithm optimized based on the ant colony algorithm meets the detection requirement.
2. The Android malicious software detection method based on the XGBoost machine learning algorithm as claimed in claim 1, wherein the specific steps of optimizing the parameters of the XGBoost ensemble learning framework with the ant colony optimization algorithm are as follows:
A. setting the upper and lower limits of the XGBoost classifier parameters, namely the shrinkage step size shrinkage and the minimum sample weight threshold min_child_weight in a child node, together with the maximum number of iterations MaxIter, the ant colony size M and the pheromone evaporation coefficient Rho;
B. initializing the population, namely initializing shrinkage and min_child_weight as the position vector of each ant;
C. executing the ant colony search;
D. performing XGBoost training;
E. calculating the objective function value and the pheromone value of each ant using the XGBoost classifier, and finding the current optimal ant;
F. judging whether the termination condition is met: if the number of iterations is greater than MaxIter, outputting the ant colony optimal value and the corresponding shrinkage and min_child_weight values and executing step G; otherwise, adding 1 to the number of iterations and executing step C;
G. updating the pheromone;
H. using the output shrinkage and min_child_weight in the detection model of the Android malicious software.
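The search loop of steps A to H can be sketched as below. The claim does not fix the movement or pheromone update rules, so this sketch assumes a simple continuous ant colony variant (local or global move chosen from a pheromone-based transition probability, greedy acceptance, evaporation with coefficient Rho). The names `aco_search`, `toy_accuracy`, `p0` and the parameter bounds are hypothetical. `toy_accuracy` is only a stand-in for steps D and E; in practice it would train an XGBoost classifier with `eta` set to the shrinkage value and `min_child_weight` set accordingly, and return a validation score.

```python
import random

def toy_accuracy(params):
    """Hypothetical stand-in for steps D-E (XGBoost training and scoring)."""
    shrinkage, min_child_weight = params
    return 1.0 - ((shrinkage - 0.3) ** 2 + ((min_child_weight - 3.0) / 10.0) ** 2)

def aco_search(objective, bounds, max_iter=30, n_ants=10, rho=0.8, p0=0.2, seed=0):
    """Maximize a positive objective over the box `bounds` (step A settings)."""
    rng = random.Random(seed)

    def clip(pos):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(pos, bounds)]

    # Step B: each ant's position is a (shrinkage, min_child_weight) pair.
    ants = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_ants)]
    scores = [objective(a) for a in ants]
    tau = list(scores)  # assumed: initial pheromone equals the objective value
    best = max(range(n_ants), key=lambda i: scores[i])
    best_pos, best_score = list(ants[best]), scores[best]

    for it in range(1, max_iter + 1):  # Step F: stop after MaxIter iterations
        tau_best = max(tau)
        for i in range(n_ants):
            # Step C: ants with weak pheromone jump globally, others search locally.
            p = (tau_best - tau[i]) / tau_best if tau_best > 0 else 1.0
            scale = 0.1 / it if p < p0 else 0.5
            cand = clip([v + (2 * rng.random() - 1) * scale * (hi - lo)
                         for v, (lo, hi) in zip(ants[i], bounds)])
            score = objective(cand)  # Steps D-E: evaluate the candidate
            if score > scores[i]:  # keep the move only if it improves the ant
                ants[i], scores[i] = cand, score
            # Step G: evaporate old pheromone, deposit the current objective value.
            tau[i] = (1 - rho) * tau[i] + scores[i]
            if scores[i] > best_score:
                best_pos, best_score = list(ants[i]), scores[i]
    return best_pos, best_score  # Step H: parameters for the detection model

bounds = [(0.01, 1.0), (1.0, 10.0)]  # assumed limits for the two parameters
best_pos, best_score = aco_search(toy_accuracy, bounds)
print(best_pos, best_score)
```

Because each ant only accepts improving moves, the best score never decreases from one iteration to the next; the price is that every candidate costs one full training run when a real classifier replaces the stand-in.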
CN201811150736.1A 2018-09-29 2018-09-29 Android malicious software detection method based on XGboost machine learning algorithm Expired - Fee Related CN109543406B (en)

Publications (2)

Publication Number Publication Date
CN109543406A (en) 2019-03-29
CN109543406B (en) 2023-04-11

Family

ID=65841391

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230411