CN114969734A

CN114969734A - A ransomware variant detection method based on API call sequence

Info

Publication number: CN114969734A
Application number: CN202210526872.6A
Authority: CN
Inventors: 李博; 刘振龙; 刘陈; 刘旭东
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2022-05-16
Filing date: 2022-05-16
Publication date: 2022-08-30
Anticipated expiration: 2042-05-16
Also published as: CN114969734B

Abstract

The invention provides a ransomware variant detection method based on API call sequences, in the field of information security. First, a ransomware family classification unit based on API call sequences is set up: a Cuckoo sandbox is deployed for the input ransomware samples and a ransomware dataset is constructed; a large number of API call sequences are collected to form a corpus, on which word2vec is pre-trained; API call sequences are selected as the learning feature and preprocessed; and a detection model is trained and evaluated to obtain a usable model, which then yields the classification result. Then, with all dynamic behaviors grouped into categories, a ransomware attack-flow visualization unit based on Graphviz and Neo4j visualizes the classification results, relying mainly on the Graphviz-based visualization and supplementing it with the Neo4j-based one, and outputs the classification result for the sample. The method shows good results on the ransomware family classification task.

Description

A ransomware variant detection method based on API call sequence

Technical Field

The invention relates to the technical field of information security, and in particular to a ransomware variant detection method based on API call sequences.

Background

In recent years, attackers have come to use ransomware as an effective means of cyber attack because of its aggressiveness. According to a survey by the consulting firm Accenture, global cyber threat activity in the first half of 2021 increased by 125% over the previous year. Within ransomware attacks, the most targeted industries were insurance, consumer goods and services, and telecommunications, accounting for 23%, 17%, and 16% respectively, or 56% in total. Compared with the second half of 2020, data breaches caused by ransomware attacks increased by 24% in the first half of 2021. Ransomware attacks not only cause costly service interruptions but also directly threaten political, economic, and technological security.

Ransomware is a special kind of malware that infects a computer and restricts user access until a ransom is paid. Attackers can rapidly produce large numbers of ransomware variants at low cost through techniques such as polymorphism. Although security mechanisms such as firewalls, antivirus programs, and automated analysis tools have been developed to counter this threat, these traditional mechanisms have had little effect and fail to protect valuable assets stored locally or in cloud storage.

Ransomware analysis is the process of determining the purpose and function of ransomware, and it falls into two main approaches: static analysis and dynamic analysis. Static analysis examines ransomware automatically without executing it, extracting static features such as MD5 hashes and opcodes and inferring behavior from them. Attackers, however, commonly use packing, code obfuscation, and similar techniques that static analysis struggles to handle. Dynamic analysis executes collected ransomware samples in a virtual environment and monitors and analyzes malicious behavior or risk at run time. Dynamic analysis can generally be characterized in two ways, by the features used and by the techniques applied. By features, it uses information such as the frequency or order of API calls, compiled hexadecimal code, and program execution paths. By techniques, it analyzes the collected feature data using sequence alignment, data mining, or machine learning. Many ransomware variants, however, detect the sandbox and delay execution, or invoke large numbers of irrelevant benign APIs during execution, which makes dynamic analysis difficult. Existing approaches therefore lack an effective way to detect ransomware variants.

With the rapid development of deep learning, researchers have begun applying it to security and have achieved good results. If the API call sequence is chosen as the learning feature and treated as a text in which each API function is a token, the ransomware classification problem becomes a text classification problem in natural language processing (NLP). Classic text classification models can therefore be applied to the ransomware family classification task.

In recent years, ransomware attacks have become stealthier and more damaging; they mutate quickly and spread easily, their attack paths and targets have diversified, and the range of affected sectors has widened. In addition, because attackers abuse variant-generation techniques, the number and quality of variants have surged, making such attacks hard for traditional antivirus technology to detect. A good ransomware detection technique is therefore urgently needed, one that quickly determines the family, or the closest family, to which a sample belongs, so that targeted analysis can follow and decryption programs or security patches can be written, minimizing losses.

Static analysis struggles with polymorphism, variants, packing, and compression, while dynamic analysis struggles with delayed execution and the insertion of large numbers of irrelevant benign API calls. Moreover, both approaches work well mainly on known ransomware samples or sample libraries and are not suited to ransomware variants.

Summary of the Invention

To this end, the invention proposes a ransomware variant detection method based on API call sequences. First, a ransomware family classification unit based on API call sequences is set up: a Cuckoo sandbox is deployed for the input ransomware samples and a ransomware dataset is constructed; a large number of API call sequences are collected to form a corpus, on which word2vec is pre-trained; API call sequences are selected as the learning feature and preprocessed; and a detection model is trained and evaluated to obtain a usable model, which then yields the classification result.

Then, with all dynamic behaviors divided into seven categories (process, system, shellcode detection, memory, file, registry, and network), a ransomware attack-flow visualization unit based on Graphviz and Neo4j is applied to the classification results. The Graphviz-based visualization is primary and the Neo4j-based visualization is supplementary: the Graphviz output is used to view the attack flow as a whole, and the Neo4j output is used to inspect specific behaviors or details. Finally, a visual view of the ransomware family classification result is output.

The ransomware dataset is constructed from two kinds of data source.

The first is existing behavior analysis reports or API call sequence datasets: the reports and datasets are collected, and a Python script then extracts the API call sequences and uses the ransomware family name as the label.

The second is to build a Cuckoo sandbox, consisting of one Cuckoo host and one analysis guest, as the ransomware analysis environment. Raw malicious samples obtained from open-source websites are submitted to this environment in batches, the resulting behavior analysis reports are collected, and the API call sequences are extracted from them.

Pre-training proceeds as follows. A corpus is built first. The API call sequences in it need no labels, so the sources are malware datasets, unlabeled ransomware datasets, and labeled ransomware datasets.

word2vec, which comprises the skip-gram model and the continuous bag-of-words (CBOW) model, is chosen as the pre-training model, and the skip-gram variant is used: the center word is used to predict the words surrounding it in the text sequence.

The model parameters are learned by maximizing the likelihood function

$$\prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\left(w^{(t+j)} \mid w^{(t)}\right),$$

where T is the length of the API call text sequence, m is the context window size, and w^(t) is the token at time step t. Subsampling is applied to reduce the weight of high-frequency words and raise that of low-frequency words, and the data is then processed in mini-batches.

The detection model is trained as follows. A detection model based on API call sequences is constructed; its input is an API call sequence and its output is a category. In the TextCNN+Attention model, the input API call sequence is converted by the embedding layer into a two-dimensional word-vector matrix. The convolutional layer uses kernels of different sizes, whose second dimension equals the word-vector length; ReLU is used as the activation function to introduce nonlinearity, and the layer produces three one-dimensional vectors of different lengths. The pooling layer applies max pooling and produces three values, each the largest element of the corresponding vector. The three values are concatenated into a 1×3 vector, which is passed through an attention layer to focus on the relevant information. Finally, a fully connected layer outputs the category.

The model is evaluated by computing the overall multi-class precision as a weighted average:

$$P = \frac{1}{Total} \sum_{i=1}^{N} Cnt_i \cdot P_i$$

Recall:

$$R = \frac{1}{Total} \sum_{i=1}^{N} Cnt_i \cdot R_i$$

F1, the harmonic mean of precision and recall:

$$F_1 = \frac{1}{Total} \sum_{i=1}^{N} Cnt_i \cdot F_{1i}$$

where N is the total number of categories, i indexes the category, Total is the total number of samples, Cnt_i is the number of samples in category i, and P_i, R_i, and F_{1i} are the precision, recall, and F1 of category i, respectively.

The Graphviz-based attack-flow visualization is implemented as follows:

Step 1: Install the Python library graphviz via pip.

Step 2: Install Graphviz and make sure the directory containing the dot executable is on the system path; PATH must be updated during installation.

Step 3: To make the display intuitive and highlight key behaviors, set the endpoint shape, endpoint color, edge color, and font color.

Step 4: Open the XML document with the minidom parser, retrieve all action_list fields, parse the necessary attributes of each, and store the results in a list of Action objects.

Step 5: Initialize the fixed endpoints, such as dynamic analysis start, dynamic analysis abnormal end, dynamic analysis end, sample file, start, and vasstarter.exe.

Step 6: Traverse the list of Action objects to obtain the endpoints, using an auxiliary array to deduplicate them.

Step 7: Initialize the fixed edges, such as dynamic analysis process, host shutdown, and process start.

Step 8: Traverse the list of Action objects to obtain the edges, and connect the endpoints obtained above.

Step 9: Call the API from Python to export the graph in a format such as png, jpg, svg, or pdf.

Step 10: If there are too many edges and endpoints, drawing efficiency suffers; multithreaded programming is used to mitigate this.
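Steps 3 to 8 can be sketched with the standard library alone. The sketch below parses an XML report with `xml.dom.minidom`, deduplicates endpoints, and emits DOT text that the `dot` executable could render. The XML layout (`action` elements under `action_list` carrying `src`, `dst`, and `api_name` attributes), the `Action` fields, and the styling values are assumptions made for illustration; the real report schema may differ, and emitting DOT text by hand is a stand-in for the graphviz Python package named in Step 1.

```python
from dataclasses import dataclass
from xml.dom import minidom


@dataclass
class Action:
    src: str   # endpoint the behavior starts from (hypothetical attribute)
    dst: str   # endpoint the behavior points to (hypothetical attribute)
    name: str  # behavior name used as the edge label (hypothetical attribute)


def parse_actions(xml_text):
    """Step 4: collect every <action> under every <action_list>."""
    doc = minidom.parseString(xml_text)
    actions = []
    for action_list in doc.getElementsByTagName("action_list"):
        for node in action_list.getElementsByTagName("action"):
            actions.append(Action(node.getAttribute("src"),
                                  node.getAttribute("dst"),
                                  node.getAttribute("api_name")))
    return actions


def quote(text):
    """Quote a DOT identifier, escaping embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def to_dot(actions):
    """Steps 3 and 5 to 8: set styles, deduplicate endpoints, connect edges."""
    seen = set()      # auxiliary structure used for deduplication
    endpoints = []    # keeps first-seen order
    for a in actions:
        for name in (a.src, a.dst):
            if name not in seen:
                seen.add(name)
                endpoints.append(name)
    lines = ["digraph attack_flow {",
             "  node [shape=box, color=steelblue, fontcolor=black];",
             "  edge [color=gray40];"]
    lines += [f"  {quote(e)};" for e in endpoints]
    lines += [f"  {quote(a.src)} -> {quote(a.dst)} [label={quote(a.name)}];"
              for a in actions]
    lines.append("}")
    return "\n".join(lines)


# Synthetic report for illustration only.
SAMPLE_XML = """<report>
  <action_list>
    <action src="sample.exe" dst="vasstarter.exe" api_name="CreateProcess"/>
    <action src="vasstarter.exe" dst="sample.exe" api_name="WriteFile"/>
  </action_list>
</report>"""

dot_source = to_dot(parse_actions(SAMPLE_XML))
```

Writing `dot_source` to a file and running `dot -Tpng` on it would correspond to the export in Step 9.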

The Neo4j-based attack-flow visualization proceeds as follows:

Step 1: Download the Java SE JDK from the official Oracle website and install it. Neo4j is a Java-based graph database and runs in a JVM process, so the Java SE JDK must be installed.

Step 2: Download Neo4j from the official Neo4j website and install it; environment variables must be configured during installation.

Step 3: Start Neo4j from the console; the Neo4j service is then deployed locally.

Step 4: Install the Python library py2neo via pip, then connect to the Neo4j graph database.

Step 5: Obtain the list of Action objects in the same way as for Graphviz, and build the endpoints and edges.

Step 6: Create relationships in batches on the existing nodes to improve drawing efficiency.

Step 7: Open a browser and visit http://localhost:7474 to see the visualization.
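Batch creation of relationships (Step 6) is commonly expressed in Cypher as a single parameterized `UNWIND` statement rather than one query per edge. The sketch below only builds the query text and its parameter rows, so it runs without a database; handing them to a driver (for example py2neo's `Graph.run`) is left out. The label `Endpoint`, the relationship type `ACTION`, the property names, and the batch size are hypothetical choices, not taken from the patent.

```python
# Cypher that creates many relationships between existing nodes in one call.
# `Endpoint`, `ACTION`, and the property names are illustrative assumptions.
BATCH_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (a:Endpoint {name: row.src}), (b:Endpoint {name: row.dst}) "
    "CREATE (a)-[:ACTION {name: row.name}]->(b)"
)


def build_rows(actions):
    """Turn (src, dst, name) tuples into parameter maps for $rows."""
    return [{"src": s, "dst": d, "name": n} for s, d, n in actions]


def chunked(rows, size):
    """Split the rows into batches so that no single call grows too large."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Synthetic behaviors for illustration only.
actions = [
    ("sample.exe", "vasstarter.exe", "CreateProcess"),
    ("vasstarter.exe", "sample.exe", "WriteFile"),
    ("sample.exe", "sample.exe", "RegSetValue"),
]
batches = chunked(build_rows(actions), size=2)
```

Each element of `batches` would be sent as the `rows` parameter of one `BATCH_QUERY` execution.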

The technical effect achieved by the invention is as follows:

The invention implements a method with the following property:

It makes full use of API call sequences to detect ransomware variants. A word2vec+TextCNN+Attention detection scheme is introduced: word2vec is pre-trained on an API corpus, and TextCNN with an attention mechanism then performs classification learning. The method shows good results on the ransomware family classification task.

Description of the Drawings

Figure 1: main architecture of Cuckoo;

Figure 2: example of the skip-gram model;

Figure 3: distribution of API call sequence lengths;

Figure 4: structure of the word2vec+TextCNN+Attention model;

Figure 5: example of Graphviz-based behavior visualization;

Figure 6: example of Neo4j-based behavior visualization.

Detailed Description

A preferred embodiment of the invention is described below with reference to the drawings to further explain the technical solution; the invention is not limited to this embodiment.

The invention proposes a ransomware variant detection method based on API call sequences.

Static analysis struggles with polymorphism, variants, packing, and compression, while dynamic analysis struggles with delayed execution and the insertion of large numbers of irrelevant benign API calls. Moreover, both approaches work well mainly on known ransomware samples or sample libraries and are not suited to ransomware variants. This patent starts from dynamic analysis: it selects the API call sequence, removes repeated subsequences, and uses the result as the learning feature; it then trains a deep learning classification model and obtains the classification result. Next, to improve the interpretability of the detection result, the attack flow of the ransomware variant is visualized. Accordingly, the patent consists of two parts: "ransomware family classification based on API call sequences" and "ransomware attack-flow visualization based on Graphviz and Neo4j".

Ransomware family classification based on API call sequences

This section addresses ransomware family classification: deploy a Cuckoo sandbox and construct a ransomware dataset; collect a large number of API call sequences to form a corpus and pre-train word2vec on it; select API call sequences as the learning feature and preprocess them; and train the detection model to obtain the classification result.

Ransomware dataset construction

A review of the literature shows that usable ransomware datasets are scarce. Datasets provided by the open-source community are old and not regularly updated, and competitions and open-source websites mostly provide general malware datasets that are not broken down into ransomware families. A ransomware dataset therefore has to be built independently.

There are two main data sources. The first is existing behavior analysis reports or API call sequence datasets, such as the ACT-KingKong dataset, the Microsoft Malware Classification Challenge (MMCC) dataset, and the Ember dataset. The second is raw malicious samples obtained from open-source websites such as MalShare, VirusShare, Exploit Database, and VirusTotal, which are run in batches in the ransomware analysis environment to obtain API call sequences.

Data source 1: collect behavior analysis reports and API call sequence datasets, then use a Python script to extract the API call sequences and labels (ransomware family names).

Data source 2: use the Cuckoo sandbox as the ransomware analysis environment. Cuckoo is an open-source automated malware analysis system; its main architecture is shown in Figure 1.

Build a Cuckoo sandbox consisting of one Cuckoo host and one analysis guest, process the samples in batches, collect the resulting behavior analysis reports, and then extract the API call sequences.
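As an illustration of the extraction step, the sketch below pulls the API call sequence out of a Cuckoo JSON report that has already been loaded into a dictionary. It assumes the layout in which calls are stored under `behavior`, then `processes`, then `calls`, each call carrying an `api` field; other Cuckoo versions or report formats may differ, and the sample report here is synthetic.

```python
def extract_api_sequence(report):
    """Return API names in the order they appear, process by process."""
    sequence = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:  # skip malformed entries that lack an API name
                sequence.append(api)
    return sequence


# Synthetic report fragment for illustration only.
report = {
    "behavior": {
        "processes": [
            {"pid": 100, "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
            {"pid": 101, "calls": [{"api": "CryptEncrypt"}, {}]},
        ]
    }
}
sequence = extract_api_sequence(report)
```

In practice the dictionary would come from `json.load` on each report file, and the family label would be attached alongside the sequence.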

Ransomware samples were collected and processed from the two data sources above, yielding a ransomware dataset of 13,887 records. The dataset contains 8 categories, benign plus 7 ransomware families; the distribution is shown in Table 1.

Table 1: Distribution of the ransomware dataset

Category        Quantity
Benign          4978
Phobos          1487
Sodinokibi      515
WannaCry        4289
Ryuk            100
Avaddon         820
Stop            1196
GlobeImposter   502

Pre-training

Vectorization methods such as one-hot encoding or TF-IDF cannot express how similar API functions are to one another, so word vectors are obtained through pre-training.
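The limitation of one-hot encoding can be seen with a small calculation: any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0, whereas dense vectors can place related API functions close together. The dense vectors below are made-up numbers chosen for illustration, not trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


vocab = ["NtCreateFile", "NtWriteFile", "RegSetValueExW"]
one_hot = {api: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, api in enumerate(vocab)}

# Hypothetical dense vectors: the two file APIs point in similar directions.
dense = {
    "NtCreateFile": [0.9, 0.1],
    "NtWriteFile": [0.8, 0.2],
    "RegSetValueExW": [0.1, 0.9],
}

one_hot_sim = cosine(one_hot["NtCreateFile"], one_hot["NtWriteFile"])
dense_sim_files = cosine(dense["NtCreateFile"], dense["NtWriteFile"])
dense_sim_mixed = cosine(dense["NtCreateFile"], dense["RegSetValueExW"])
```

With one-hot vectors every pair of distinct APIs looks equally unrelated; with dense vectors the two file-related calls score higher than the file and registry pair.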

A corpus is built first. The API call sequences in it need no labels, so the sources can be malware datasets, unlabeled ransomware datasets, labeled ransomware datasets, and so on. In this way, collected data that would otherwise be unusable is also put to good use.

word2vec, which comprises the skip-gram model and the continuous bag-of-words (CBOW) model, is chosen as the pre-training model, and the skip-gram variant is used: the center word is used to predict the words surrounding it in the text sequence.
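To make the center-word and context-word relationship concrete, the sketch below enumerates the (center, context) training pairs that a skip-gram model would see for a short API call sequence. The API names and the window size of 1 are illustrative choices.

```python
def skip_gram_pairs(tokens, window):
    """Return (center, context) pairs for every position in the sequence."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


calls = ["NtOpenFile", "NtReadFile", "CryptEncrypt", "NtWriteFile"]
pairs = skip_gram_pairs(calls, window=1)
```

Tokens at the ends of the sequence contribute fewer pairs because their window is cut off by the sequence boundary.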

The parameters of the skip-gram model are the center-word vector and the context-word vector of each token. During training, the model parameters are learned by maximizing the likelihood function (that is, by maximum likelihood estimation). This is equivalent to minimizing the loss function in Equation 1, where T is the length of the text (API call) sequence, m is the context window size, and w^(t) is the token at time step t.

Equation 1: skip-gram loss function

$$-\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\left(w^{(t+j)} \mid w^{(t)}\right)$$

To reduce computational complexity, approximate training such as negative sampling or hierarchical softmax is usually used. Text data usually contains high-frequency words that appear in many context windows but carry little useful information, so subsampling is applied to reduce the weight of high-frequency words and raise that of low-frequency words. The data is then split into mini-batches that are loaded iteratively during training.
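The patent does not state which subsampling rule is used. A minimal sketch, assuming the rule commonly used with word2vec (a token with relative frequency f is discarded with probability max(1 - sqrt(threshold / f), 0)), followed by a simple mini-batch split, could look like this. The threshold value, the seed, and the function names are illustrative.

```python
import math
import random
from collections import Counter


def keep_probability(freq, threshold):
    """Probability of keeping a token whose relative frequency is `freq`."""
    return min(1.0, math.sqrt(threshold / freq))


def subsample(sequences, threshold=1e-4, seed=0):
    """Randomly drop frequent tokens; sufficiently rare tokens are always kept."""
    rng = random.Random(seed)
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return [[tok for tok in seq
             if rng.random() < keep_probability(counts[tok] / total, threshold)]
            for seq in sequences]


def minibatches(items, batch_size):
    """Split a list into consecutive mini-batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

A token whose frequency is at or below the threshold has keep probability 1, so only the frequent calls are thinned out.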

After pre-training, each token is mapped to a vector, its word vector; these word vectors encode the similarity between tokens.

Preprocessing

The API functions in an API call sequence are logically ordered, and each API function has a specific meaning. Table 2 lists API calls commonly used by ransomware, with a description of their function.

Table 2: Functions of API calls used by ransomware

API call sequences vary widely in length, from a few dozen calls to tens of thousands. To make the sequence data usable as input to the detection model, it must be preprocessed.

On the one hand, software itself makes many repeated API calls; on the other, attackers deliberately add large numbers of useless API calls to make analysis harder. The literature has proposed, and shown experimentally, that removing repeated subsequences from API call sequences does not affect the similarity computed between sequences. Compressing the sequences also reduces computation time. Deduplicating API call sequences is therefore necessary. The method used here removes consecutive repeated API calls: for example, the original sequence "QWEERRRT" becomes "QWERT" after deduplication.

Counting sequence lengths after deduplication shows that lengths in this dataset fall mainly between 0 and 200 (see Figure 3). 200 is therefore chosen as the maximum sequence length: sequences longer than 200 are truncated, and sequences shorter than 200 are zero-padded.
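The preprocessing described above can be sketched in a few lines: collapse runs of consecutive identical calls, map API names to integer ids, then truncate or zero-pad to length 200. Padding on the right and reserving id 0 for padding are assumptions; the text only says that short sequences are zero-padded.

```python
from itertools import groupby

MAX_LEN = 200  # sequence length cap chosen in the text
PAD_ID = 0     # id reserved for padding (assumption)


def dedup_consecutive(seq):
    """Collapse each run of identical consecutive items into one item."""
    return [key for key, _ in groupby(seq)]


def encode(seq, vocab):
    """Map API names to integer ids; ids start at 1 so 0 stays free for padding."""
    return [vocab.setdefault(api, len(vocab) + 1) for api in seq]


def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Cut sequences longer than max_len; pad shorter ones on the right."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


deduped = "".join(dedup_consecutive("QWEERRRT"))
```

Here `deduped` reproduces the "QWEERRRT" example from the text, with each letter standing in for one API call.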

Detection model

A convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons respond to surrounding units within a limited receptive field; it performs very well on large-scale image processing. A typical CNN consists of three parts: convolutional layers, pooling layers, and fully connected layers. Convolutional layers extract local features from the image; pooling layers sharply reduce the number of parameters (dimensionality reduction); and fully connected layers, like those of a traditional neural network, produce the desired output. CNNs are widely used in fields such as face recognition, autonomous driving, and security. TextCNN applies the CNN to text classification. Its core idea is to capture local features; for text, a local feature is a sliding window of several words, similar to an N-gram. The advantage of a CNN is that it automatically combines and filters N-gram features to obtain semantic information at different levels of abstraction.

The attention mechanism is a resource allocation scheme that, when computing power is limited, assigns computing resources to the more important tasks and thereby addresses information overload. In neural network learning, a model with more parameters generally has greater expressive power and stores more information, but this brings information overload. Introducing attention lets the model focus on the inputs most relevant to the current task, pay less attention to the rest, and even filter out irrelevant information, which addresses information overload and improves the efficiency and accuracy of the task.

Borrowing from text classification, a detection model based on API call sequences is constructed; its input is an API call sequence and its output is a category.

The structure of the TextCNN+Attention model is shown in Figure 4. The input API call sequence is converted by the embedding layer into a two-dimensional word-vector matrix. The convolutional layer uses kernels of different sizes, for example 3, 4, and 5, whose second dimension equals the word-vector length; ReLU is used as the activation function to introduce nonlinearity, and the layer produces three one-dimensional vectors of different lengths. The pooling layer applies max pooling and produces three values, each the largest element of the corresponding vector. The three values are concatenated into a 1×3 vector, which is passed through an attention layer to focus on the relevant information. Finally, a fully connected layer outputs the category.
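The data flow through the layers can be traced with a dependency-free forward pass. This is a sketch under stated assumptions, not the patented model: weights are random rather than trained, there is one filter per kernel size (matching the three pooled values in the text), and the attention step is a simple softmax re-weighting of the 1×3 vector, since the patent does not specify its exact form. All function and parameter names are hypothetical.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def conv_feature_map(matrix, kernel, bias):
    """Slide a (k x dim) kernel down a (len x dim) matrix; apply ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for r in range(k):
            total += sum(w * x for w, x in zip(kernel[r], matrix[start + r]))
        out.append(relu(total))
    return out


def forward(token_ids, params):
    """Embedding -> conv (3 kernel sizes) -> max pool -> attention -> dense."""
    matrix = [params["embedding"][i] for i in token_ids]
    pooled = [max(conv_feature_map(matrix, kernel, bias))
              for kernel, bias in params["kernels"]]       # 1x3 vector
    weights = softmax([a * p for a, p in zip(params["attn"], pooled)])
    attended = [w * p for w, p in zip(weights, pooled)]     # re-weighted 1x3
    logits = [sum(w * x for w, x in zip(row, attended)) + b
              for row, b in zip(params["fc_w"], params["fc_b"])]
    return softmax(logits)                                  # class probabilities


def init_params(vocab_size, dim, num_classes, kernel_sizes=(3, 4, 5), seed=0):
    """Random, untrained parameters; for shape checking only."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return {
        "embedding": [rand(dim) for _ in range(vocab_size)],
        "kernels": [([rand(dim) for _ in range(k)], 0.0) for k in kernel_sizes],
        "attn": rand(len(kernel_sizes)),
        "fc_w": [rand(len(kernel_sizes)) for _ in range(num_classes)],
        "fc_b": [0.0] * num_classes,
    }


params = init_params(vocab_size=50, dim=8, num_classes=8)
token_ids = [i % 50 for i in range(200)]  # a length-200 input, as after padding
probs = forward(token_ids, params)
```

For a length-200 input, kernel sizes 3, 4, and 5 give feature maps of length 198, 197, and 196, which is why the three convolution outputs differ in length before pooling. A real implementation would use a deep learning framework and many filters per kernel size.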

Model evaluation

A model is usually evaluated with metrics such as accuracy, precision, recall, and F1.

Accuracy: the proportion of all correctly predicted samples among all samples, computed as follows:

Equation 2: accuracy

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: the ability of the model to find only relevant targets, computed as follows:

Equation 3: precision

$$Precision = \frac{TP}{TP + FP}$$

Recall: the ability of the model to find all relevant targets, that is, how many of the true targets the model's predictions cover, computed as follows:

Equation 4: recall

$$Recall = \frac{TP}{TP + FN}$$

F1: the harmonic mean of precision and recall, computed as follows:

Equation 5: F1

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

Here TP (true positive) means the actual value is positive and the predicted value is positive; FP (false positive) means the actual value is negative and the predicted value is positive; FN (false negative) means the actual value is positive and the predicted value is negative; TN (true negative) means the actual value is negative and the predicted value is negative.
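As a worked example of the four binary metrics, the sketch below computes them from confusion-matrix counts. The counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the text specifies.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 on empty denominator
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Made-up counts for illustration.
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts the accuracy is 70/100, the precision 40/50, the recall 40/60, and F1 works out to 8/11.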

The formulas above apply to binary classification. A multi-class problem needs further processing: it can be treated as N binary classification tasks (N being the number of classes in the dataset), with the metrics above computed for each task. This patent uses the weighted-average method to compute the overall multi-class Precision, Recall, and F1, as follows:

$$P = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot P_i$$

Formula 6. Multi-class precision

$$R = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot R_i$$

Formula 7. Multi-class recall

$$F_1 = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot F_{1i}$$

Formula 8. Multi-class F1

Here, N is the total number of classes, i indexes the class, Total is the total number of samples, Cnt_i is the number of samples in class i, and P_i, R_i, and F_1i are the Precision, Recall, and F1-score of class i, respectively.
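The weighted-average calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names and the toy labels are assumptions introduced here, and each class is scored as its own one-vs-rest binary task before being weighted by Cnt_i / Total.

```python
from collections import Counter


def per_class_prf(y_true, y_pred, label):
    """Precision, recall and F1 of one class treated as a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_metrics(y_true, y_pred):
    """Accuracy plus weighted-average precision, recall and F1."""
    total = len(y_true)
    counts = Counter(y_true)  # Cnt_i for every class i
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    result = {"accuracy": accuracy, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, cnt in counts.items():
        p, r, f = per_class_prf(y_true, y_pred, label)
        weight = cnt / total  # Cnt_i / Total
        result["precision"] += weight * p
        result["recall"] += weight * r
        result["f1"] += weight * f
    return result


# Toy labels (hypothetical family names, for illustration only).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
metrics = weighted_metrics(y_true, y_pred)
```

With this weighting, the weighted recall always equals the accuracy, because each class's recall is multiplied by its own sample count and the sum therefore counts every correct prediction exactly once.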

Visualization of the ransomware attack process based on Graphviz and Neo4j

The behavior section of a ransomware analysis report is merely a flat list; it is not intuitive, and key behaviors are hard to spot in a short time. The ransomware attack process therefore needs to be visualized, which also improves the interpretability of the detection model.

The following implementation uses the ACT-KingKong dataset, which consists of XML files. Each XML file corresponds to one ransomware (or malicious code) sample and records a series of dynamic behaviors, such as creating processes, deleting or modifying registry keys, and loading modules.

After comprehensive analysis, all dynamic behaviors are divided into seven categories: process, system, shellcode detection, memory, file, registry, and network. The specific classification is shown in Table 3.

Table 3. Classification of ransomware dynamic behaviors

If each process is regarded as an endpoint (node) and each dynamic behavior as an edge connecting two endpoints, then visualizing ransomware attack behavior essentially amounts to constructing a directed graph.

Graphviz-based behavior visualization

Graphviz (short for Graph Visualization Software) is an open-source toolkit initiated by AT&T Labs Research for drawing graphs specified in DOT language scripts, which use the file extension "gv"; it also provides libraries that let software applications use these tools. The Python library graphviz provides a simple pure-Python interface to Graphviz, so graphs can be drawn efficiently from Python. The module provides two classes, Graph and Digraph, which create graph descriptions in the DOT language for undirected and directed graphs respectively, and which share the same API.

The Graphviz-based visualization of the attack process is implemented in the following steps:

Step 1: Install the Python library graphviz via pip;

Step 2: Install Graphviz and make sure the directory containing the dot executable is on the system PATH (add it to PATH during installation);

Step 3: To make the display intuitive and highlight key behaviors, set the endpoint shape, endpoint color, edge color, and font color;

Step 4: Open the XML document with the minidom parser, get all action_list fields, parse the necessary attributes of each field, and store the results in a list of Action objects;

Step 5: Initialize the fixed endpoints, such as dynamic analysis start, dynamic analysis abnormal end, dynamic analysis end, sample file, start, and vasstarter.exe;

Step 6: Traverse the list of Action objects to obtain the endpoints, using an auxiliary array to deduplicate them;

Step 7: Initialize the fixed edges, such as dynamic analysis process, host shutdown, and process start;

Step 8: Traverse the list of Action objects to obtain the edges, and connect them to the endpoints obtained above;

Step 9: Call the API from Python to export the graph in a format such as png, jpg, svg, or pdf;

Step 10: If the number of edges and endpoints is too large, drawing efficiency suffers; multi-threaded programming is used to mitigate this.
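The core of Steps 3, 4, 6, and 8 can be sketched as follows. This is a minimal illustration under stated assumptions: the XML layout (an `action` element with `source`, `target`, `behavior`, and `category` attributes inside `action_list`), the `Action` fields, and the color mapping are all hypothetical, since the dataset schema is not given here. To stay dependency-free, the sketch writes the DOT text by hand instead of going through the graphviz package's Digraph class.

```python
from dataclasses import dataclass
from xml.dom import minidom

# Hypothetical sample; the real ACT-KingKong schema may differ.
SAMPLE_XML = """<report>
  <action_list>
    <action source="sample.exe" target="cmd.exe" behavior="create process" category="process"/>
    <action source="cmd.exe" target="vssadmin.exe" behavior="create process" category="process"/>
    <action source="sample.exe" target="readme.txt" behavior="write file" category="file"/>
  </action_list>
</report>"""

# Step 3: one color per behavior category (assumed mapping).
CATEGORY_COLORS = {"process": "red", "file": "blue", "registry": "orange"}


@dataclass
class Action:
    source: str
    target: str
    behavior: str
    category: str


def parse_actions(xml_text):
    """Step 4: read every action under each action_list element."""
    doc = minidom.parseString(xml_text)
    actions = []
    for action_list in doc.getElementsByTagName("action_list"):
        for node in action_list.getElementsByTagName("action"):
            actions.append(Action(
                node.getAttribute("source"),
                node.getAttribute("target"),
                node.getAttribute("behavior"),
                node.getAttribute("category"),
            ))
    return actions


def quote(text):
    """Escape double quotes so the text is a valid DOT quoted ID."""
    return '"' + text.replace('"', '\\"') + '"'


def build_dot(actions):
    """Steps 6 and 8: deduplicate endpoints, then emit one edge per action."""
    lines = ["digraph attack_flow {", "  node [shape=box, style=filled, fillcolor=lightgrey];"]
    seen = set()  # auxiliary set used for endpoint deduplication
    for action in actions:
        for name in (action.source, action.target):
            if name not in seen:
                seen.add(name)
                lines.append(f"  {quote(name)};")
    for action in actions:
        color = CATEGORY_COLORS.get(action.category, "black")
        lines.append(
            f"  {quote(action.source)} -> {quote(action.target)} "
            f"[label={quote(action.behavior)}, color={color}, fontcolor={color}];"
        )
    lines.append("}")
    return "\n".join(lines)


dot_source = build_dot(parse_actions(SAMPLE_XML))
```

The resulting text can be saved as a .gv file and rendered with the dot executable from Step 2 (for example `dot -Tsvg flow.gv -o flow.svg`), or passed to the graphviz package for export as in Step 9.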

The final result is shown in Figure 5. The example graph clearly shows the order and logical relationship of the executed behaviors, and different endpoints have different colors, which helps locate key behaviors quickly and makes the detection model more interpretable. However, when a ransomware sample contains a large number of endpoints and edges, the generated graph is large and complex, and many minor behaviors obscure the key ones.

Behavior visualization based on the graph database Neo4j

Neo4j is a graph database management system developed by Neo4j, Inc. Its developers describe it as an ACID-compliant transactional database with native graph storage and processing, implemented in Java. The Python library py2neo provides an API for working with Neo4j from Python that is simple, safe, and efficient.

The steps for visualizing the ransomware attack process with Neo4j are as follows:

Step 1: Download the Java SE JDK from the official Oracle website and install it. Neo4j is a Java-based graph database and runs in a JVM process, so the Java SE JDK must be installed;

Step 2: Download Neo4j from the official Neo4j website and install it; environment variables need to be configured during installation;

Step 3: Start Neo4j from the console; the Neo4j service is then deployed locally;

Step 4: Install the Python library py2neo via pip, then connect to the Neo4j graph database;

Step 5: Obtain the list of Action objects in the same way as for Graphviz, and build the endpoints and edges;

Step 6: Create the relationships between existing nodes in batches to improve graph-building efficiency;

Step 7: Open a browser and visit http://localhost:7474 to see the visualization.
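Steps 5 and 6 can be sketched as follows. This is a hedged illustration rather than the patent's code: the node label `Endpoint`, the relationship type `BEHAVIOR`, the dictionary keys, and the batch size are assumptions made here. The sketch only prepares the Cypher statements and their parameters, so it runs without a database; the py2neo calls that would send them to a local Neo4j instance are shown as comments.

```python
# Hypothetical actions, in the same spirit as the Graphviz Action list.
ACTIONS = [
    {"source": "sample.exe", "target": "cmd.exe", "behavior": "create process", "category": "process"},
    {"source": "cmd.exe", "target": "vssadmin.exe", "behavior": "create process", "category": "process"},
    {"source": "sample.exe", "target": "readme.txt", "behavior": "write file", "category": "file"},
]

# MERGE creates each endpoint only once, which deduplicates the nodes.
MERGE_NODES = "UNWIND $names AS name MERGE (:Endpoint {name: name})"

# One statement creates a whole batch of relationships between existing nodes.
CREATE_RELATIONSHIPS = (
    "UNWIND $rows AS row "
    "MATCH (a:Endpoint {name: row.source}) "
    "MATCH (b:Endpoint {name: row.target}) "
    "CREATE (a)-[:BEHAVIOR {name: row.behavior, category: row.category}]->(b)"
)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_statements(actions, batch_size=1000):
    """Return (cypher, parameters) pairs: nodes first, then edge batches."""
    names = sorted({name for a in actions for name in (a["source"], a["target"])})
    statements = [(MERGE_NODES, {"names": names})]
    for rows in chunked(actions, batch_size):
        statements.append((CREATE_RELATIONSHIPS, {"rows": rows}))
    return statements


statements = build_statements(ACTIONS, batch_size=2)

# Sending the statements with py2neo (requires a running Neo4j service;
# the URI and credentials below are placeholders):
# from py2neo import Graph
# graph = Graph("bolt://localhost:7687", auth=("neo4j", "<password>"))
# for cypher, parameters in statements:
#     graph.run(cypher, parameters)
```

Batching with UNWIND sends one query per group of edges instead of one query per edge, which is one way to get the efficiency gain that Step 6 aims for.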

The final result is shown in Figure 6. Because the endpoints and edges float freely in the layout, the order and logical relationship between behaviors is hard to see. However, since the graph database Neo4j supports create, delete, query, and update operations, it is flexible and easy to manipulate: secondary endpoints can be deleted to locate key endpoints and key edges quickly, and other endpoints related to a given endpoint can be queried.

In summary, Graphviz-based and Neo4j-based behavior visualization each have advantages and disadvantages, so using Graphviz as the primary tool with Neo4j as a supplement is a sound visualization approach.

The present invention is trained on a laboratory cluster running CentOS Linux release 7.6.1810 (Core), with an Intel(R) Xeon(R) Silver 4214R processor and 256 GB of memory; the detailed hardware configuration is shown in Table 4. The development language is Python and the deep learning framework is PyTorch; the detailed software configuration is shown in Table 5. The detection model uses TextCNN combined with an attention mechanism; the model parameters are shown in Table 6.

Table 4. Experimental hardware configuration

Table 5. Experimental software configuration

Table 6. Detection model parameters

The ransomware dataset used in the present invention was built independently. Its data sources include the ACT-KingKong dataset, open-source communities, competition datasets, and malicious code websites. It contains 13,887 samples, made up of samples from seven ransomware families plus benign samples, and is split into a training set and a test set at a ratio of 8:2.

To verify the advantage of the detection model of the present invention (word2vec+TextCNN+Attention, WTA for short), comparative experiments were run on the constructed ransomware dataset. Random forest (RF), multi-layer perceptron (MLP), TextCNN, long short-term memory network (LSTM), and word2vec combined with TextCNN (WT for short) were selected for comparison. The RF algorithm vectorizes the input using API numbers. The MLP uses API numbers as the learning feature, and its network consists of two fully connected layers with ReLU activation. The TextCNN model uses one-hot encoding for vectorization. The LSTM model uses word2vec for pre-training and a bidirectional LSTM network. The WT model uses word2vec for pre-training and TextCNN as the classifier; it differs from the model of the present invention only in that it does not use the attention mechanism.

Table 7. Comparative experimental results

| Classification algorithm | Accuracy | Precision | Recall | F1 | Time (minutes) |
| --- | --- | --- | --- | --- | --- |
| RF | 0.806 | 0.805 | 0.806 | 0.797 | 0.042 |
| MLP | 0.679 | 0.701 | 0.679 | 0.678 | 0.473 |
| TextCNN | 0.829 | 0.829 | 0.829 | 0.827 | 18.491 |
| LSTM | 0.852 | 0.853 | 0.852 | 0.851 | 56.378 |
| WT | 0.848 | 0.846 | 0.848 | 0.845 | 17.128 |
| WTA (the present invention) | 0.855 | 0.862 | 0.855 | 0.853 | 18.778 |

The experimental results are shown in Table 7, from which the following conclusions can be drawn. The RF algorithm performs reasonably well on all metrics and has by far the shortest running time. Compared with RF, the MLP has no advantage on any metric. Comparing TextCNN with WT shows that using the pre-trained model raises accuracy by about 2 percentage points (0.829 to 0.848). Comparing WT with WTA shows that adding the attention mechanism raises accuracy by about 1 percentage point (0.848 to 0.855). Comparing LSTM with WTA shows that the two models differ little on each evaluation metric, but the LSTM model takes much longer to train. In summary, the WTA model of the present invention achieves the highest score on all four evaluation metrics, with an acceptable training time.
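As a rough check of these comparisons, the differences can be computed directly from the accuracy and time values reported in Table 7:

$$
\begin{aligned}
\text{WT vs. TextCNN (accuracy):}\quad & 0.848 - 0.829 = 0.019 \\
\text{WTA vs. WT (accuracy):}\quad & 0.855 - 0.848 = 0.007 \\
\text{WTA vs. LSTM (accuracy):}\quad & 0.855 - 0.852 = 0.003 \\
\text{LSTM vs. WTA (time):}\quad & 56.378 / 18.778 \approx 3.0
\end{aligned}
$$

These are absolute differences in accuracy, that is, about 1.9, 0.7, and 0.3 percentage points, while the LSTM model takes roughly three times as long to train as WTA.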

Claims

1. A ransomware variant detection method based on API call sequences, characterized in that: first, a ransomware family classification technology unit based on API call sequences is set up, in which ransomware samples are input, a Cuckoo sandbox is deployed, and a ransomware dataset is constructed; a large number of API call sequences are collected to form a corpus, and word2vec is used for pre-training; API call sequences are selected as the learning feature and preprocessed; the detection model is trained and evaluated to obtain a usable model, from which the classification results are obtained;

Then, with all dynamic behaviors classified into seven categories (process, system, shellcode detection, memory, file, registry, and network), a ransomware attack process visualization technology unit based on Graphviz and Neo4j is applied to the classification results, with the Graphviz-based attack process visualization as the primary flow and the Neo4j-based attack process visualization as a supplement; the Graphviz visualization is used to view the overall attack process, and the Neo4j visualization is used to view specific behaviors or details of the attack process; finally, a visual view of the ransomware family classification results is output.

2. The ransomware variant detection method based on API call sequences according to claim 1, characterized in that the ransomware dataset is constructed from two kinds of data sources;

Specifically, the first is: using ready-made behavior analysis reports or API call sequence datasets, collecting these reports and datasets, and then using a Python script to extract the API call sequences, with the ransomware family names as labels;

The second is: building a Cuckoo host and a Cuckoo sandbox in Analysis Guest mode as the ransomware analysis environment, obtaining original malicious samples from open-source websites and submitting them to the analysis environment in batches, processing the samples in batches, collecting the resulting behavior analysis reports, and then extracting the API call sequences.

3. The ransomware variant detection method based on API call sequences according to claim 2, characterized in that the pre-training method is: first, a corpus is built; the API call sequences in it do not need specific labels, so the sources are malicious code datasets, unlabeled ransomware datasets, and labeled ransomware datasets;

word2vec, which includes the skip-gram model and the continuous bag-of-words (CBOW) model, is selected as the pre-training model, and the skip-gram model is then chosen, which uses a center word to predict the words surrounding it in the text sequence;

The model parameters are learned by maximizing the likelihood function

$$\prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P\left(w^{(t+j)} \mid w^{(t)}\right)$$

where T is the length of the API call text sequence, m is the size of the context window, and w^(t) is the token at time step t; subsampling is used to reduce the importance of high-frequency words and increase the importance of low-frequency words, and the data are then divided into minibatches.

4. The ransomware variant detection method based on API call sequences according to claim 3, characterized in that the detection model is trained as follows: a detection model based on API call sequences is built, whose input is an API call sequence and whose output is a category; in the TextCNN+Attention model, the input API call sequence is converted into a two-dimensional word-vector matrix by the embedding layer; the convolution layer uses convolution kernels of different sizes, whose second dimension equals the length of the word vector, and the ReLU activation function is used to add nonlinearity; the convolution layer produces three one-dimensional vectors of different lengths; the pooling layer uses max pooling to produce three values, each being the largest element of the corresponding vector; the three values produced by the pooling layer are concatenated into a 1×3 vector, after which Attention is used to focus on the concentrated information; finally, the corresponding category is output through the fully connected layer.

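5. The ransomware variant detection method based on API call sequences according to claim 4, characterized in that the model is evaluated using the weighted-average method to compute the overall multi-class precision:

$$P = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot P_i$$

Recall:

$$R = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot R_i$$

F1, based on the harmonic mean of precision and recall:

$$F_1 = \sum_{i=1}^{N} \frac{\mathrm{Cnt}_i}{\mathrm{Total}} \cdot F_{1i}$$

where N is the total number of classes, Total is the total number of samples, Cnt_i is the number of samples in class i, and P_i, R_i, and F_1i are the precision, recall, and F1 of class i.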

6. The ransomware variant detection method based on API call sequences according to claim 5, characterized in that the Graphviz-based attack process visualization is implemented as follows:

Step 1: Install the Python library graphviz via pip;

Step 2: Install Graphviz, and make sure that the directory containing the dot executable file is on the system path, and PATH needs to be added during the installation process;

Step 3: In order to display intuitively and highlight key behaviors, set the endpoint shape, endpoint color, edge color, and font color;

Step 4: Use the minidom parser to open the XML document, get all fields action_list, then parse the necessary attributes of the field, and finally store it in the Action object list;

Step 5: Initialize endpoints, such as dynamic analysis start, dynamic analysis abnormal end, dynamic analysis end, sample file, start, vasstarter.exe and other endpoints;

Step 6: Traverse the list of Action objects, obtain endpoints, and use auxiliary arrays to deduplicate endpoints;

Step 7: Initialize edges, such as dynamic analysis process, shutdown of the host, process start execution, etc.;

Step 8: Traverse the list of Action objects, get the edges, and then connect the endpoints obtained above;

Step 9: Then use Python to call the API to export the graph, such as png, jpg, svg, pdf and other formats;

Step 10: If the number of edges and endpoints is too large, it will affect the drawing efficiency. Multi-threaded programming is used to alleviate this problem.

7. The ransomware variant detection method based on API call sequences according to claim 6, characterized in that the Neo4j-based attack process visualization is implemented as follows:

Step 1: Download the Java SE JDK from the Oracle official website and install it. Neo4j is a Java-based graph database. Running Neo4j needs to start the JVM process, so the JDK of JAVA SE must be installed;

Step 2: Download Neo4j from the graph database Neo4j official website and install it. The installation process needs to configure environment variables;

Step 3: Start the Neo4j program through the console, and the Neo4j service has been deployed locally;

Step 4: Install the Python library py2neo through pip, and then connect to the Neo4j graph database;

Step 5: Use the same method in Graphviz to get a list of Action objects, and build endpoints and edges;

Step 6: Create relationships in batches on existing nodes to improve mapping efficiency;

Step 7: Open the browser and visit http://localhost:7474 to get the visual result.