Background
Depending on how data are acquired, Android malware detection methods can be broadly divided into dynamic detection methods and static detection methods.
1. Dynamic Android malware detection acquires behavioral data by running the software and uses that data to judge whether the software is malicious. The software is first executed on a physical Android device or in a sandbox to collect behavioral data such as user-layer API calls, kernel-layer API calls, and network traffic. The data are then analyzed with a machine learning algorithm to obtain a detection result. This approach is resistant to evasion techniques such as code obfuscation and dynamic loading. However, it has inherent drawbacks such as low code coverage and high resource consumption, which lower detection accuracy and make it difficult to meet industrial requirements for efficient, rapid detection.
2. Static analysis detects malware by directly analyzing the static data contained in an APK file. It analyzes files in the installation package such as the source code and AndroidManifest.xml, mines the characteristics of malicious code, and constructs a detector. Compared with dynamic analysis, static analysis achieves high code coverage and can provide multi-layer analysis data, which supports accurate detection results. It has therefore become a focus of research at the intersection of machine learning and malicious APP analysis, and the present invention also focuses on static analysis. According to how features are constructed during analysis, existing static analysis methods can be divided into detection methods based on statistical features and analysis methods based on behavior semantics. The characteristics and problems of the two are analyzed in turn.
(1) Detection based on statistical features: a method that uses statistical techniques to screen for features that effectively distinguish malicious software from normal software is called a statistical-feature detection method.
(2) Detection based on behavior semantics: to cope with continuously evolving malicious APPs, researchers have proposed a large number of malware detection methods based on semantic analysis. These methods aim to learn the behavioral intent of the software and to represent the software at a high-level behavior-semantic level.
In addition, models put into production may age. The reason is that the statistical properties of the target variable that the model tries to predict change unpredictably over time; the types of change include abrupt, gradual, and recurring change. As time passes, the prediction accuracy of the model decreases, and most Android malware detection models age quickly.
In conclusion, based on the analysis of existing work, existing static malware detection methods have the following problems: (1) Methods based on statistical features generally take the structured data already present in the installation package as analysis material. Because they rely only on surface-level statistical features, they cannot establish a high-level representation of behavioral intent and have difficulty coping with continuously evolving and mutating malware. (2) Most existing methods based on behavior semantics depend on data such as the function call graph or control flow graph of the program. Graph construction and analysis are highly complex, so detection efficiency is low. In addition, these methods leave room for improvement in the comprehensiveness of semantic construction. (3) Existing static malware detection methods suffer from model aging, so the accuracy of a malware detection model gradually decreases after it is put into production.
Disclosure of Invention
The invention aims to provide an anti-aging, high-efficiency malicious APP detection method that constructs API (application programming interface) association confidence, addressing the above problems in malicious APP detection research.
The design principle of the invention is as follows: first, the API package names in the APK file are abstracted by hierarchy level to counter the bias that continuous Android system upgrades introduce into malware analysis; second, the high-level behavior semantics of the software are extracted by calculating the association confidence between APIs; finally, classifiers are constructed separately for APKs from different release periods, and representative classifiers are selected to learn the behavior patterns among API combinations and complete malware detection.
The technical scheme of the invention is realized by the following steps:
Step 1, preprocessing;
Step 1.1, decompressing the APK file to obtain the DEX file;
Step 1.2, decompiling the DEX file to obtain the Smali code of each APK;
Step 2, API abstraction;
Step 2.1, extracting all developer-defined methods from the Smali code of the APK, and generating a method dictionary for each APK;
Step 2.2, in each method, retrieving the API call instructions, including invoke-virtual, invoke-direct, invoke-static and invoke-super, and acquiring the information of all APIs called by these instructions;
Step 2.3, representing each specific API by its hierarchically organized API package name, and deleting the higher-level attributes of the package name that carry fine-grained meaning;
Step 3, feature extraction: for all abstract APIs in the abstracted API set, calculating the association confidence between every pair of abstract APIs to generate the confidence matrix of the APK;
Step 4, classifier prediction;
Step 4.1, acquiring the APKs released within a fixed sliding time window, and inputting the corresponding confidence matrices;
Step 4.2, constructing a new classifier model in each sliding time window;
Step 4.3, detecting whether the new classifier model has aged: if so, clustering all historical models, extracting a representative model from each cluster, and proceeding to step 4.5; if not, proceeding to step 4.4;
Step 4.4, measuring model similarity to judge whether the new model is similar to a historical model: if so, deleting the old model and storing the new model in the historical model list; if not, only saving the new model to the historical model list;
Step 4.5, outputting the prediction result from the representative models in the historical model list by weighted majority voting.
Advantageous effects
Compared with the MaMaDroid method (Onwuzurike L, Mariconti E, Andriotis P, et al. MaMaDroid: detecting Android malware by building Markov chains of behavioral models [J]. ACM Transactions on Privacy and Security (TOPS), 2019, 22(2): 14.), the method can effectively detect continuously evolving malware and has a degree of anti-aging capability. In addition, constructing a function call graph requires analyzing the global call relations of all APIs in the APK file; the method of the invention avoids this time-consuming operation and therefore achieves higher detection efficiency and detection stability.
Detailed Description
In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.
Details of the data sources are shown in Table 1. The first database is Drebin, which contains 5560 samples from 179 families, dated between 2010 and 2012. In addition, 5945 normal samples with the same age distribution as the Drebin samples were obtained from the android database. The second database is AMD, which contains 2453 malicious samples distributed between 2010 and 2016 and belonging to 71 malicious families. In addition, 20519 normal applications with a contemporaneous distribution were acquired on the android platform. During actual APK processing, some APK samples could not be converted into feature vectors because of package decompression or file parsing errors. Of the 11505 samples in the Drebin database, 11332 were processed successfully, comprising 5883 normal samples and 5448 malicious samples. Of the 41362 samples in the AMD database, a total of 40533 were successfully preprocessed (malicious samples: 20583, normal samples: 19950).
Table 1 Overview of the experimental data sets
The experiments were carried out on a single computer configured as follows: Intel(R) Core(TM) i7-6700 CPU at 3.40 GHz, 8 GB of memory, 64-bit Windows 7. The programming tools were Python 3.6, scikit-learn 0.22, and Androguard 3.3.5.
Malware detection is a binary classification problem, so the experiments use the evaluation metrics commonly used for classification problems, listed in Table 2. In the table, FP is the number of applications wrongly classified as malicious; FN is the number of applications wrongly classified as normal; TP is the number of applications correctly classified as malicious; and TN is the number of applications correctly classified as normal.
Table 2 Evaluation metrics used in the malware detection experiments
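As an illustration of how the four counts defined above combine, the following minimal sketch computes accuracy, precision, recall and the F-value. It assumes that Table 2 uses the usual textbook definitions of these metrics; the function name `evaluation_metrics` is hypothetical.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts.

    Assumption: the standard definitions are used; "malicious" is the
    positive class, as in the TP/FP/TN/FN definitions of the text.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-value: harmonic mean of precision and recall.
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_value": f_value,
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from the experiments.
    print(evaluation_metrics(tp=8, fp=2, tn=9, fn=1))
```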
The specific process of the experiment is as follows:
Step 1, preprocessing;
Step 1.1, decompressing the APK file to obtain the DEX file;
Step 1.2, decompiling the DEX file to obtain the Smali code of each APK;
Step 2, API abstraction;
Step 2.1, extracting all developer-defined methods from the Smali code of the APK, and generating a method dictionary for each APK;
Step 2.2, in each method, retrieving the API call instructions, including invoke-virtual, invoke-direct, invoke-static and invoke-super, and obtaining the information of all APIs called by these instructions;
Step 2.3, representing each specific API by its hierarchically organized API package name, and deleting the higher-level attributes of the package name that carry fine-grained meaning;
Taking "android.telephony.gsm" as an example, level one, "android", indicates that the API belongs to the Android system functions; level two, "telephony", indicates that the API provides functionality related to telephone operation; level three, "gsm", indicates that the API provides system services using GSM-specific telephony functions. The API can therefore be further abstracted by deleting the higher-level attributes that carry fine-grained meaning. In the actual API abstraction, the first two levels of attributes are preserved, which reduces the 443 packages of the Android system APIs to 73; together with custom APIs and obfuscated APIs, this gives a total of 75 abstract APIs.
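Steps 2.1 to 2.3 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the regular expression, the small `SYSTEM_ROOTS` set, and the rule that treats class paths made only of one- or two-character segments as obfuscated are all assumptions introduced here, as are the labels "self-defined" and "obfuscated".

```python
import re

# Matches the four invoke instructions named in step 2.2 and captures the
# target class path, e.g. "android/telephony/gsm/SmsManager".
INVOKE_RE = re.compile(
    r"^\s*invoke-(?:virtual|direct|static|super)\s+\{[^}]*\},\s*L([^;]+);->"
)

# Assumption: a small illustrative set of system package roots. The text
# abstracts 443 system packages to 73; that full list is not reproduced here.
SYSTEM_ROOTS = {"android", "java", "javax", "dalvik"}


def abstract_api(class_path, levels=2):
    """Keep the first `levels` package levels of a system API class path."""
    parts = class_path.split("/")
    packages = parts[:-1]  # drop the class name itself
    if packages and packages[0] in SYSTEM_ROOTS:
        return ".".join(packages[:levels])
    # Assumption: very short path segments indicate obfuscated code.
    if all(len(part) <= 2 for part in parts):
        return "obfuscated"
    return "self-defined"


def method_dictionary(smali_text):
    """Map each method signature to the abstract APIs it invokes."""
    methods, current = {}, None
    for line in smali_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".method"):
            current = stripped.split()[-1]  # last token is name + descriptor
            methods[current] = []
        elif stripped.startswith(".end method"):
            current = None
        elif current is not None:
            match = INVOKE_RE.match(line)
            if match:
                methods[current].append(abstract_api(match.group(1)))
    return methods


if __name__ == "__main__":
    print(abstract_api("android/telephony/gsm/SmsManager"))  # android.telephony
```

Each value in the returned dictionary is the list of abstract APIs called by one method, which is the per-method unit that the feature extraction step operates on.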
Step 3, feature extraction: for all abstract APIs in the abstracted API set, calculating the association confidence between every pair of abstract APIs to generate the confidence matrix of the APK;
The strength of association between two abstract APIs is measured by confidence: $\mathrm{confidence}(X \rightarrow Y) = \delta(X \cup Y) / \delta(X)$, where $\delta(X) = |\{t_i \mid X \in t_i,\ t_i \in T\}|$. Here X and Y are two abstract APIs, $\delta(X)$ is the number of occurrences of X, and $\delta(X \cup Y)$ is the number of times X and Y occur together. In malware analysis, each abstract API is referred to as an item, and the item set can be represented as $I = [I_1, I_2, I_3, I_4, \ldots, I_d]$ with $d = 75$. If an APK sample contains k methods, the sample can be represented as $T = [t_1, t_2, t_3, t_4, \ldots, t_k]$. Rule associations are established by focusing on rules between two items, e.g., {java…} -> {java.net} and {android.net} -> {…xml}. The strength of each rule is calculated by confidence, and association rules are generated over all 75 abstract APIs. The correspondence used to calculate the association confidence between abstract APIs is shown in Fig. 2. The resulting confidence matrix serves as the vector representation of the Android application, as shown in Fig. 3.
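The confidence calculation described above can be sketched as follows, treating each method of an APK as one transaction and counting in how many transactions each abstract API, and each pair of abstract APIs, appears. The function name `confidence_matrix` is hypothetical, and setting the diagonal to 0 is an assumption made here because the text only discusses rules between two distinct items.

```python
from itertools import permutations


def confidence_matrix(transactions, items):
    """Return a d x d matrix with entry [i][j] = confidence(items[i] -> items[j]).

    transactions: one collection of abstract API names per method (the t_i).
    items: the ordered list of abstract APIs (the item set I, d = len(items)).
    """
    index = {item: i for i, item in enumerate(items)}
    d = len(items)
    support = [0] * d                      # delta(X)
    joint = [[0] * d for _ in range(d)]    # delta(X u Y)

    for transaction in transactions:
        # Count each item at most once per transaction.
        present = sorted({index[api] for api in transaction if api in index})
        for i in present:
            support[i] += 1
        for i, j in permutations(present, 2):
            joint[i][j] += 1

    # confidence(X -> Y) = delta(X u Y) / delta(X); 0 when X never occurs.
    return [
        [joint[i][j] / support[i] if support[i] else 0.0 for j in range(d)]
        for i in range(d)
    ]


if __name__ == "__main__":
    # Illustrative item names and transactions, not data from the experiments.
    items = ["java.net", "android.net", "android.os"]
    methods = [
        ["java.net", "android.net"],
        ["java.net"],
        ["android.net", "android.os"],
        ["java.net", "android.net", "android.os"],
    ]
    for row in confidence_matrix(methods, items):
        print(row)
```

With the full item set of 75 abstract APIs this yields a 75 x 75 matrix per APK, which can be flattened into a feature vector for the classifiers.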
Step 4, classifier prediction; the prediction process of the classifier is shown in Fig. 4.
Step 4.1, acquiring the APKs released within the sliding time window and inputting the corresponding confidence matrices; the sliding time window can be set to 10 days.
Step 4.2, constructing a new classifier model in each sliding time window;
Step 4.3, detecting whether the new classifier model has aged: if so, clustering all historical models, extracting a representative model from each cluster, and proceeding to step 4.5; if not, proceeding to step 4.4;
Model aging can be detected by checking whether the classification accuracy of the model is below 90%: if it is, the model has aged; otherwise the model is normal. For model clustering, every historical model predicts the APKs released in the time window, and each prediction is recorded as 1 if correct and 0 if incorrect. If the historical model list contains M models and the time window contains b APKs, there are M × b prediction results. The prediction results are clustered with the expectation-maximization algorithm, and models whose prediction results fall in the same cluster are regarded as belonging to the same class.
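The aging check and the M × b table of prediction results can be sketched as follows. The function names are hypothetical, and the sketch stops before the clustering itself: the rows of the table would be passed to an expectation-maximization clusterer, for which scikit-learn's `GaussianMixture` is one possible choice (an assumption about tooling, not something the text specifies).

```python
AGING_THRESHOLD = 0.90  # accuracy below 90% marks the model as aged


def is_aged(predictions, labels, threshold=AGING_THRESHOLD):
    """Return True if the model's accuracy on the window is below the threshold."""
    if not labels:
        raise ValueError("the time window contains no labelled APKs")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels) < threshold


def correctness_matrix(history_predictions, labels):
    """Build the M x b table: 1 where a historical model was right, else 0.

    history_predictions: one list of predicted labels per historical model.
    labels: true labels of the b APKs released in the time window.
    """
    return [
        [int(p == y) for p, y in zip(predictions, labels)]
        for predictions in history_predictions
    ]


if __name__ == "__main__":
    # Illustrative labels (1 = malicious, 0 = normal), not experimental data.
    labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    new_model_predictions = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # 8 of 10 correct
    print(is_aged(new_model_predictions, labels))
```

Each row of the table describes on which samples one historical model succeeds or fails, so models with similar rows behave similarly on the current window and are candidates for the same cluster.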
Step 4.4, measuring model similarity to judge whether the new model is similar to a historical model: if so, deleting the old model and storing the new model in the historical model list; if not, only saving the new model to the historical model list;
The similarity is calculated with the Q statistic: $Q_{i,j} = (N_{11} N_{00} - N_{01} N_{10}) / (N_{11} N_{00} + N_{01} N_{10})$, where $N_{ab}$ is the number of APK confidence matrices whose classification by classifier $c_i$ is a and whose classification by classifier $c_j$ is b; 1 denotes a correct classification and 0 an incorrect one. Q ranges from -1 to 1. When the two models tend to be correct or wrong on the same samples, Q is positive; when they tend to give opposite results on the same samples, Q is negative. Whether two models are similar is decided by a threshold θ. If Q ≥ θ, the new classifier model is similar to the historical model, so the new model is stored in the historical model list and the similar historical model is deleted. If Q < θ, the new model is simply added to the historical model list. θ can be set to 0.5.
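The Q statistic and the resulting update of the historical model list can be sketched as follows. Each model is represented here by its vector of per-sample correctness (1 = correct, 0 = incorrect) on the same set of samples. The function names are hypothetical, and returning 0 when the denominator is zero is an assumption, since the text does not cover that case.

```python
def q_statistic(correct_i, correct_j):
    """Q statistic between two classifiers from per-sample correctness vectors."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_i, correct_j):
        if a and b:
            n11 += 1          # both correct
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only c_i correct
        else:
            n01 += 1          # only c_j correct
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as "not similar"
    return (n11 * n00 - n01 * n10) / denominator


def update_history(history, new_model, new_correct, theta=0.5):
    """Drop historical models similar to the new one, then store the new one.

    history: list of (model, correctness_vector) pairs.
    """
    kept = [
        (model, correct)
        for model, correct in history
        if q_statistic(new_correct, correct) < theta
    ]
    kept.append((new_model, new_correct))
    return kept


if __name__ == "__main__":
    # Illustrative correctness vectors; model names are placeholders.
    history = [("old_a", [1, 1, 0, 0, 1, 1]), ("old_b", [0, 0, 1, 1, 0, 1])]
    updated = update_history(history, "new", [1, 1, 0, 0, 1, 0])
    print([name for name, _ in updated])
```

In this example "old_a" agrees with the new model on most samples and is replaced, while "old_b" disagrees on every sample and is kept alongside the new model.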
Step 4.5, outputting the prediction result from the representative models in the historical model list by weighted majority voting.
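Weighted majority voting over the representative models can be sketched as follows. The text does not state how the weights are chosen; the example uses each model's recent accuracy as its weight, which is an assumption, as is breaking ties in favor of the normal class. The function name is hypothetical.

```python
def weighted_majority_vote(votes, weights):
    """Combine per-model votes (1 = malicious, 0 = normal) into one label.

    votes: one predicted label per representative model.
    weights: one non-negative weight per model (assumed here to be the
    model's accuracy on a recent time window).
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    score = {0: 0.0, 1: 0.0}
    for vote, weight in zip(votes, weights):
        score[vote] += weight
    # Assumption: a tie is resolved as normal (0).
    return 1 if score[1] > score[0] else 0


if __name__ == "__main__":
    # One strong model outvotes two weak ones: 0.95 > 0.4 + 0.4.
    print(weighted_majority_vote([1, 0, 0], [0.95, 0.4, 0.4]))  # 1
```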
Test results: performance experiments were carried out on the public Drebin and AMD data sets using the anti-aging, high-efficiency malicious APP detection method that constructs API (application programming interface) association confidence, and they demonstrate the effectiveness of the method. The performance of the method on the two databases under different classification algorithms is shown in Tables 3 and 4.
Table 3 Test results on the AMD database under different algorithms
Table 4 Test results on the Drebin database under different algorithms
To address the detection of continuously evolving malware, the method is compared in detail with the MaMaDroid method in terms of detection accuracy and detection efficiency, which further shows that the method is more efficient in analysis while maintaining detection accuracy. The detection accuracy comparison is shown in Table 5, which lists the results of 36 F-value tests for the two methods in malware classification. The time efficiency results are shown in Table 6; in the groups carrying the additional marker, abnormal samples whose analysis took more than 20 minutes were removed from the experiment.
Table 5 F-value test results of the comparative malware classification experiments
Note: in the table, "old" denotes older normal APP samples from 2014, and "new" denotes newer normal APP samples from 2017. The numbers 2013 to 2017 denote malicious APP samples from the corresponding year.
Table 6 Detection efficiency test results of the comparative experiments
Note: in the table, A denotes the experimental results of the MaMaDroid method, and B denotes the experimental results of the present invention.
The above detailed description is provided to illustrate the invention and the accompanying claims. It is only exemplary and is not intended to limit the scope of the invention; any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the scope of the invention.