CN102984140B - Malicious software feature fusion analytical method and system based on shared behavior segments

Info

Publication number: CN102984140B
Application number: CN201210473746.5A
Authority: CN
Inventors: 王小峰; 胡晓峰; 王勇军; 吴纯青; 陆华彪; 赵峰; 虞万荣; 孙浩; 王雯; 周寰
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2012-11-21
Filing date: 2012-11-21
Publication date: 2015-06-17
Anticipated expiration: 2032-11-21
Also published as: CN102984140A

Abstract

The invention discloses a malware feature fusion analysis method and system based on behavior fragment sharing. The method comprises the following steps: deploying nodes that collect and analyze malware and establishing a distributed hash table module in each node; collecting malware samples, dividing them into behavior fragment sets, and computing the local statistical characteristics of those sets; publishing the fragment sets to the distributed hash table, which gathers the global characteristics of each behavior fragment and returns them to the source node; having the source node compute a set of candidate neighbor nodes and perform a similarity comparison against the behavior features of the remote nodes in that set in order to construct a behavior feature adjacency graph; and generating a fusion tree from the adjacency graph, fusing along it, and outputting the root behavior feature. The system comprises multiple nodes, each of which includes a behavior feature segmentation module, a distributed hash table module, a behavior fragment collaborative sharing module, a neighbor behavior feature discovery module, and a stepwise behavior feature fusion module. The invention offers high analysis accuracy, strong analysis performance, and good scalability.

Description

Malware feature fusion analysis method and system based on behavior fragment sharing

Technical field

The invention relates to the technical field of computer network security, and in particular to a malware feature fusion analysis method and system based on behavior fragment sharing.

Background

According to the terminology defined in the Internet security threat reports of the National Internet Emergency Center, malware is a program that is installed and executed in an information system without authorization in order to achieve an improper purpose. Malware mainly includes: 1) Trojan horses, whose main goal is to steal users' personal information or even to control users' computers remotely. 2) Bots, which are used to build large-scale attack platforms. By communication protocol, bots can be further divided into IRC (Internet Relay Chat) bots, HTTP (Hypertext Transfer Protocol) bots, P2P (peer-to-peer) bots, and so on. 3) Worms, which replicate themselves and spread widely, mainly in order to consume system and network resources. 4) Viruses, which spread by infecting computer files, mainly in order to destroy or tamper with user data and disrupt the normal operation of information systems.

Detecting and analyzing malware is becoming increasingly difficult, mainly in three respects. 1) The volume of malware is huge and growing exponentially. Symantec's series of Internet Security Threat Reports documents this growth; in 2011 the company discovered 400 million new malware samples in total, an average of 1.1 million per day. Such a volume makes it very challenging for a malware detection system to identify, classify, and describe malware correctly. 2) Malware behavior is increasingly diverse. Through techniques such as message encryption, changing propagation paths, and polymorphism, different samples of the same malware exhibit different behaviors, which makes it hard to analyze observed samples correctly and effectively. 3) Malware samples are widely distributed in space and highly stealthy, so a single LAN or enterprise network observes only a very limited number of samples of any one malware. Because malware behavior is diverse, the essential characteristics of the malware cannot be obtained from a limited number of samples, and analysis accuracy cannot be guaranteed. Malware analysis systems therefore generally collect samples in a distributed fashion in order to cover enough of them. In the short paper "Benchmarking Computer Security Using WINE", presented at the top security conference SP'11 (IEEE Symposium on Security and Privacy 2011), Symantec researchers stated that the company had deployed 240,000 malware collection nodes worldwide.

The U.S. Department of Defense defines information fusion as the process of combining data and information to estimate or predict the state of an entity. The present invention studies the fusion of malware analysis results among analysis nodes distributed across the network; this is an application of information fusion to network security and a special kind of information fusion.

Malware attacks now use a variety of complex methods, such as stepping-stone attacks and distributed attacks from multiple sources, in which many samples are spread across the network. Because malware is also widely dispersed and behaves covertly, a single host or a single network cannot accurately judge whether it is malicious or fully understand its attack intent. Fusing information distributed across many locations to understand the essential characteristics of malware and to obtain a global view of attacks has therefore become a hot topic in network security research. Early research focused on fusing the alerts of commercial intrusion detection systems (IDS) to discover multi-step complex attacks, reduce the false alarm rate of intrusion detection systems, and improve detection coverage. The patent "Real-time data fusion method for large-scale distributed intrusion detection systems" (patent number 03137444.1) proposes a hierarchical alert fusion architecture that fuses alerts step by step in three stages, "clustering-merging-correlation". Clustering computes the dissimilarity between alerts output by intrusion detection components (such as IDSs, firewalls, and virus detection software) and aggregates alerts with low dissimilarity into one or more alert clusters. Merging turns an alert cluster into an intermediate alert that contains the representative information of the cluster. Correlation uses association rules to correlate multiple intermediate alerts into one high-level alert, which finally gives security administrators concise yet comprehensive alerts.

Researchers now commonly use information fusion to detect and analyze malware. Guofei Gu et al. published the paper "BotMiner: Clustering analysis of network traffic for protocol- and structure-independent botnet detection" at the top international security conference Security'08 (Proceedings of the 17th USENIX Security Symposium 2008). BotMiner fuses the malicious activities (scanning, sending spam, downloading binary code, and so on) of different hosts in multiple network regions with their communication traffic, and judges that a set of hosts that exhibit the same malicious activities and whose communication traffic has similar, synchronized content belongs to the same botnet. Roberto Perdisci et al. published the paper "Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces" at the top international networked systems conference NSDI'10 (USENIX Symposium on Networked Systems Design and Implementation 2010). That work collects HTTP-based malware samples from malware collection nodes distributed across the network and then analyzes the similarity between the HTTP behaviors of these samples on a central server. Because the generated URL signatures combine the behavior characteristics of variant samples distributed across the network, they accurately capture the essential characteristics of the HTTP behavior of that class of malware.

Existing systems that use information fusion to detect and analyze malware thus generally adopt a centralized or hierarchical structure. A centralized structure processes all malware samples on a central server. A hierarchical structure performs part of the processing in its middle layer, but the final analysis results are still sent to a central management node. Because the volume of malware is huge and growing exponentially, the central server and the central management node in these structures become computational bottlenecks.

For this reason, some studies adopt a fully distributed structure for information fusion detection and malware analysis, but most of them can handle only simple or specific malware. Min Cai et al. published the paper "WormShield: Fast Worm Signature Generation with Distributed Fingerprint Aggregation" in IEEE Transactions on Dependable and Secure Computing (TDSC) in 2007. It proposes deploying multiple detection nodes at edge networks; each detection node monitors all inbound and outbound traffic of the edge network it is responsible for, filters suspicious traffic fragments out of that traffic, and submits them for fusion with the other detection nodes. However, it can only detect and extract monomorphic worms, whose behavior does not change during propagation, because only the traffic of monomorphic worms of the same class contains identical traffic fragments that are long enough for system-wide statistics over traffic fragments to be meaningful. Malware that uses complex techniques such as message encryption, changing propagation paths, and polymorphism exhibits complex, diverse behavior: samples of the same class behave very differently, and the behaviors of different samples share no identical behavior sequence of sufficient length. Existing fully distributed analysis techniques therefore struggle to mine signatures from polymorphic malware behavior.

In summary, when the network behaviors of malware are to be integrated in order to extract signatures, the central server or central management node of existing centralized and hierarchical malware analysis systems is a computation and communication bottleneck, whereas existing fully distributed malware analysis techniques can only detect and extract monomorphic malware whose behavior does not change during propagation and cannot cope with malware that uses complex techniques such as message encryption, changing propagation paths, and polymorphism.

A distributed hash table (DHT) is commonly used for distributed data storage and retrieval and is decentralized, reliable, and scalable. In a DHT network, each node handles routing for a small range and stores part of the data objects. The basic principle of a DHT is to represent each data resource object as an index entry, a (Key, Node) pair. Key, called the keyword, is the hash of descriptive information about the resource object (such as its name, number, or content); Node is descriptive information (such as IP address or name) about the node that actually stores the resource object. All index entries, that is, all (Key, Node) pairs, form a resource object index hash table: given the Key of a target resource object, the address or other descriptive information of the node storing it can be looked up in this table. The main function of a DHT is to distribute this index hash table by splitting it into many small blocks (local hash tables) and assigning each block to a participating node according to specific rules, so that each node maintains one local hash table.
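As a rough illustration of the (Key, Node) indexing just described, the following single-process Python sketch hashes a resource description into a key, assigns each key to one participating node, and stores the index entry in that node's local hash table. The class name `ToyDHT`, the choice of SHA-1, and the ring-style assignment rule are assumptions made for the example; the text does not specify a hash function or an assignment rule, and a real DHT would route messages between machines.

```python
import hashlib
from bisect import bisect_left


def key_of(description: str) -> int:
    """Key = hash of the resource description (SHA-1 is an assumption)."""
    digest = hashlib.sha1(description.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class ToyDHT:
    """Single-process stand-in: each node owns one slice of the key space."""

    def __init__(self, node_names):
        # Place every node on a ring at the hash of its own name.
        self.ring = sorted((key_of(name), name) for name in node_names)
        # One local hash table per participating node.
        self.local_tables = {name: {} for name in node_names}

    def responsible_node(self, key: int) -> str:
        # First node at or after the key on the ring, wrapping around.
        positions = [position for position, _ in self.ring]
        index = bisect_left(positions, key) % len(self.ring)
        return self.ring[index][1]

    def put(self, description: str, storing_node: str) -> str:
        key = key_of(description)
        owner = self.responsible_node(key)
        self.local_tables[owner][key] = storing_node  # (Key, Node) entry
        return owner

    def get(self, description: str):
        key = key_of(description)
        return self.local_tables[self.responsible_node(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("sample.bin", "10.0.0.7")
print(dht.get("sample.bin"))
```

Only the node that owns a key stores its entry, which is the property the method relies on later when identical behavior fragments from different nodes must meet at one node.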

Summary of the invention

The technical problem to be solved by the present invention is to provide a malware feature fusion analysis method and system based on behavior fragment sharing that offers high analysis accuracy, strong analysis performance, and good scalability.

To solve the above technical problem, the present invention adopts the following technical solution:

A malware feature fusion analysis method based on behavior fragment sharing, implemented in the following steps:

1) Deploy geographically dispersed nodes in the network, each responsible for collecting and analyzing malware samples in one network region, and establish in each node a distributed hash table module used to build the distributed hash table;

2) Each node collects malware samples and divides them into a set of fixed-length behavior fragments, then counts, for each behavior fragment in the set, the number of local malware samples that exhibit that fragment, which gives the local statistical characteristics of the behavior fragment set;

3) Each node publishes its behavior fragment set and the local statistical characteristics to the distributed hash table. The nodes of the distributed hash table gather identical behavior fragments from different nodes and compute the global characteristics of each gathered fragment; the global characteristics of a behavior fragment comprise a set of triples of behavior fragment, source node address, and the number of local malware samples at the source node that exhibit the fragment. The behavior fragment set with its global characteristics is returned to the source node that published the behavior fragment set and its local statistical characteristics;

4) The source node computes a set of candidate neighbor nodes with similar behavior features from the behavior features of the malware, which are formed by the malware's behavior fragments and their global characteristics, and sends the source behavior feature of the malware to the remote nodes in the candidate neighbor node set. Each remote node in the set compares the received source behavior feature with its local destination behavior features to judge whether a destination behavior feature is a neighbor behavior feature of the source behavior feature, and returns the result to the source node as a behavior feature neighbor relationship. The source node constructs the behavior feature adjacency graph from the neighbor relationships returned by the remote nodes;

5) On the basis of the feature adjacency graph, a distributed minimum spanning tree algorithm that preferentially selects edges with large weights is applied to the associated-edge data structures of the behavior features to generate, on the adjacency graph, a maximum similarity tree of behavior features, which serves as the fusion tree; the fusion tree is thus generated in a distributed manner. The behavior features in the fusion tree are fused, with the fusion order determined by the maximum similarity tree so that the most similar behavior features are fused first, and the root behavior feature of the fusion tree is output as the result of the malware feature fusion analysis.

As further improvements of the malware feature fusion analysis method based on behavior fragment sharing of the present invention:

The detailed steps of step 2) are as follows:

2.1) Treat the behavior of a malware sample as a sequentially executed behavior sequence and select the contiguous subsequences of fixed length from that sequence as the set of fixed-length behavior fragments; or build a behavior dependency graph from the dependencies between the data operated on by the sample's behaviors and select the behavior dependency subgraphs with a fixed number of vertices from that graph as the set of fixed-length behavior fragments;

2.2) Count, for each behavior fragment in the set, the number of local malware samples that exhibit that fragment to obtain the local statistical characteristics of the behavior fragment set, and output them.
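Steps 2.1) and 2.2) can be sketched for the sequence-based variant as follows. This is a minimal Python illustration, assuming that each sample's behavior is available as a list of behavior names and that a fragment is a tuple of consecutive behaviors; the function names and the example behaviors are hypothetical, and the dependency-graph variant is not shown.

```python
from collections import Counter


def split_into_fragments(behavior_sequence, length):
    """Step 2.1: all contiguous subsequences of the given fixed length."""
    return {
        tuple(behavior_sequence[i:i + length])
        for i in range(len(behavior_sequence) - length + 1)
    }


def local_statistics(samples, length):
    """Step 2.2: map each fragment to the number of local samples showing it."""
    counts = Counter()
    for sequence in samples:
        # A set is used so that a sample counts at most once per fragment.
        counts.update(split_into_fragments(sequence, length))
    return counts


samples = [
    ["open", "read", "connect", "send"],
    ["open", "read", "connect", "recv"],
]
stats = local_statistics(samples, 3)
print(stats[("open", "read", "connect")])  # 2: both samples contain it
```

Counting samples rather than occurrences matches the text, which defines the local statistic as the number of local malware samples that exhibit a fragment.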

The detailed steps of step 3) are as follows:

3.1) Each node calls the distributed hash table module to obtain the key value of a behavior fragment, encapsulates the behavior fragment and its local statistical characteristics in a distributed hash table message, and sends the key value and the message to the distributed hash table module. The distributed hash table module looks up the collection and analysis node responsible for the key value and routes the message to that node for storage;

3.2) Different nodes are responsible for different behavior fragments. The distributed hash table module gathers identical behavior fragments from different nodes at the same node and records the address of the source node that published each fragment. From the behavior fragments and local statistical characteristics stored through its distributed hash table module, that node computes the global characteristics according to each behavior fragment, its source node, and the number of local malware samples at the source node that exhibit the fragment; the global characteristics of a behavior fragment comprise a set of triples of behavior fragment, source node address, and that sample count. The behavior fragment set with global characteristics is then sent to the source node addresses, and the source node at each address receives the behavior fragment set with global characteristics.
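The gathering in step 3) can be pictured with the short Python sketch below. It assumes the published messages have already been routed to the responsible nodes, so the DHT routing itself is omitted and everything runs in one process; the function names and the example data are hypothetical.

```python
from collections import defaultdict


def aggregate_global(published):
    """Group published (fragment, source_node, local_count) triples by fragment.

    The result maps each fragment to its global characteristics, that is,
    the list of triples contributed by all source nodes.
    """
    by_fragment = defaultdict(list)
    for fragment, node, count in published:
        by_fragment[fragment].append((fragment, node, count))
    return by_fragment


def replies_for_sources(by_fragment):
    """Return to each source node the full triple set of every fragment it published."""
    replies = defaultdict(list)
    for triples in by_fragment.values():
        for _, node, _ in triples:
            replies[node].extend(triples)
    return replies


published = [("f1", "A", 3), ("f1", "B", 2), ("f2", "A", 1)]
global_view = aggregate_global(published)
replies = replies_for_sources(global_view)
print(replies["B"])  # B learns that node A also observed fragment f1
```

After this exchange a source node knows, for each of its fragments, which other nodes hold the same fragment and how many samples they saw, which is the input needed for the neighbor search in step 4).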

The detailed steps of step 4) are as follows:

4.1) The source node obtains the behavior feature of each local malware from the malware's behavior fragments and their global characteristics, computes the behavior fragment sharing degree between that behavior feature and each other remote node according to formula (1), and takes all remote nodes whose sharing degree exceeds a preset sharing degree threshold as the set of candidate neighbor nodes with similar behavior features;

$\mathrm{Share}_{tj} = \sum_{\substack{(\mathrm{Frag}_s,\,\mathrm{Node}_s,\,\mathrm{Num}_s)\,\in\,\mathrm{FragStatSet} \\ \mathrm{Frag}_s\,\in\,\mathrm{FragSet}_t,\ \mathrm{Node}_s = \mathrm{Node}_j}} \log(\mathrm{Num}_t + \mathrm{Num}_s)\,\log\frac{1}{F_s} \qquad (1 \le t \le M_i,\ 1 \le j \le N,\ j \ne i) \qquad (1)$

In formula (1), FragStatSet is the set of triples that represents the global characteristics of the behavior fragments, N is the number of nodes, M_i is the number of local behavior features at node i, FragSet_t is the behavior fragment set of behavior feature t, and Num_t is the number of local samples of behavior feature t. (Frag_s, Node_s, Num_s) is a triple in FragStatSet with behavior fragment Frag_s, source node Node_s, and local sample count Num_s; F_s is the frequency with which behavior fragment Frag_s occurs in the behavior of normal programs; Node_j is a remote node; and Share_tj (1≤t≤M_i, 1≤j≤N, j≠i) is the behavior fragment sharing degree between behavior feature t and remote node Node_j;

4.2) The source node sends the behavior feature of the local malware, as the source behavior feature, to all remote nodes in the candidate neighbor node set;

4.3) Each remote node in the candidate neighbor node set forms a fused behavior feature from the behaviors common to the received source behavior feature and a local destination behavior feature, and judges whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold. If it is, the behaviors common to the source and destination behavior features are uncommon in normal programs and specific to malware, and the source and destination behavior features are judged to belong to the same malware; otherwise they are judged not to belong to the same malware;

4.4) If verification shows that the source and destination behavior features belong to the same malware, the destination behavior feature at the remote node is a neighbor behavior feature of the source behavior feature, and the remote node assembles the source behavior feature number, the destination behavior feature number, the similarity between the two features, and a has-neighbor flag into a reply message and returns it to the source node that holds the source behavior feature. Otherwise the remote node returns to that source node a negative message carrying a no-neighbor flag;

4.5) The source node creates the edges of the behavior feature adjacency graph from the returned behavior feature neighbor relationships; the weight of an edge is the similarity between the two behavior features represented by its two vertices. This finally yields the behavior feature adjacency graph. Edges are created in either one-way mode or two-way mode. In one-way mode: 1) when a remote source behavior feature and a local destination behavior feature are verified to be behavior features of the same malware, an edge is created between the local destination behavior feature and the remote source behavior feature and added to the local associated-edge data structure; 2) when a neighbor behavior feature message carrying the has-neighbor flag is received from a remote node, the local associated-edge data structure is checked for an edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the destination behavior feature number in the message; if the edge exists, nothing is done, and if it does not, the edge is created between those two behavior features and added to the local associated-edge data structure. In two-way mode, an edge between a local destination behavior feature and a remote source behavior feature is created and added to the local associated-edge data structure only when the local behavior feature neighbor relationship verification submodule has verified that the remote source behavior feature and the local destination behavior feature are behavior features of the same malware and a neighbor behavior feature message carrying the has-neighbor flag has also been received for that local destination behavior feature number and that remote source behavior feature number. In either mode, the result is the feature adjacency graph.
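Steps 4.1) to 4.5) can be sketched together in Python. Several points are assumptions made only for this illustration, because the text leaves them open: a behavior feature is modeled as a set of fragments, logarithms are natural logarithms, every F_s lies in (0, 1], the frequency of a fused feature in normal programs is estimated as the fraction of a non-empty reference corpus of normal programs that contains all common fragments, and similarity is the Jaccard ratio of the two fragment sets. All function and class names are hypothetical.

```python
import math


def share_degree(frag_set_t, num_t, remote_node, frag_stat_set, normal_freq):
    """Formula (1): sum over the remote node's triples whose fragment is in frag_set_t."""
    return sum(
        math.log(num_t + num_s) * math.log(1.0 / normal_freq[frag])
        for frag, node, num_s in frag_stat_set
        if frag in frag_set_t and node == remote_node
    )


def candidate_neighbors(frag_set_t, num_t, local_node, frag_stat_set,
                        normal_freq, threshold):
    """Step 4.1: remote nodes whose sharing degree exceeds the threshold."""
    remote_nodes = {node for _, node, _ in frag_stat_set if node != local_node}
    return {
        node for node in remote_nodes
        if share_degree(frag_set_t, num_t, node, frag_stat_set, normal_freq) > threshold
    }


def verify_neighbor(source_feature, dest_feature, normal_programs, freq_threshold):
    """Steps 4.3 and 4.4: return (has_neighbor, similarity)."""
    fused = source_feature & dest_feature  # common behaviors
    if not fused:
        return False, 0.0
    # Assumed estimate: share of normal programs containing every common fragment.
    hits = sum(1 for program in normal_programs if fused <= program)
    if hits / len(normal_programs) < freq_threshold:
        # Assumed similarity measure: Jaccard ratio.
        return True, len(fused) / len(source_feature | dest_feature)
    return False, 0.0


class AssociatedEdges:
    """Step 4.5: local associated-edge structure in one-way or two-way mode."""

    def __init__(self, two_way):
        self.two_way = two_way
        self.edges = {}    # (local_id, remote_id) -> similarity (edge weight)
        self._events = {}  # (local_id, remote_id) -> set of observed events

    def record(self, event, local_id, remote_id, similarity):
        """event is 'verified' (local check passed) or 'message' (positive reply)."""
        pair = (local_id, remote_id)
        seen = self._events.setdefault(pair, set())
        seen.add(event)
        # One-way mode needs either event; two-way mode needs both.
        needed = 2 if self.two_way else 1
        if pair not in self.edges and len(seen) >= needed:
            self.edges[pair] = similarity
```

In this sketch the two modes differ only in how many confirmations an edge needs, which reflects the trade-off in the text: one-way mode adds an edge as soon as either side confirms the relationship, while two-way mode waits for agreement from both sides.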

The detailed steps of step 5) are as follows:

5.1) On the basis of the feature adjacency graph, the source node applies a distributed minimum spanning tree algorithm that preferentially selects edges with large weights to the associated-edge data structures of the behavior features, generating on the adjacency graph a maximum similarity tree of behavior features that serves as the fusion tree; the fusion order of the behavior features in the fusion tree is determined by the maximum similarity tree so that the most similar behavior features are fused first;

5.2) The source node creates a separate process or thread as the feature fusion agent for the behavior feature of each vertex in the feature adjacency graph, or creates one feature fusion agent for the behavior features of all vertices in the graph. Following the fusion order, each feature fusion agent submits the fusion result containing its own behavior feature to the feature fusion agent in the next layer. A feature fusion agent that receives a fusion result records the adjacent edge in the feature adjacency graph that corresponds to the previous-layer agent from which the result came, and checks whether that edge is the one with the largest weight among the associated edges of its own behavior feature. If it is, the agent fuses its own behavior feature with all received fusion results and submits the result to the next-layer feature fusion agent according to the fusion order; otherwise, it fuses all received fusion results and submits that result together with its own behavior feature to the next-layer feature fusion agent according to the fusion order. This finally yields the root behavior feature of the fusion tree;

5.3) The root behavior feature of the fusion tree is output as the malware feature fusion analysis result.
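The tree construction in step 5.1 can be illustrated with a small, centralized sketch. It is not the distributed algorithm the step refers to: it substitutes ordinary Kruskal's algorithm with edges visited in descending weight order, which produces a maximum-weight spanning tree on a single machine. The function name, the feature labels and the similarity values are all hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (feature_a, feature_b, similarity)


def maximum_similarity_tree(vertices: List[Hashable], edges: List[Edge]) -> List[Edge]:
    """Kruskal's algorithm with edges visited in descending weight order."""
    parent: Dict[Hashable, Hashable] = {v: v for v in vertices}

    def find(v):
        # Find the component representative, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for a, b, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two components, so keep it
            parent[ra] = rb
            tree.append((a, b, w))
    return tree


# Hypothetical adjacency graph: four behavior features with made-up similarities.
features = ["f1", "f2", "f3", "f4"]
similarities = [("f1", "f2", 0.9), ("f2", "f3", 0.4), ("f1", "f3", 0.7), ("f3", "f4", 0.8)]
tree = maximum_similarity_tree(features, similarities)
```

Because edges are accepted in descending weight order, the order in which they enter `tree` is one plausible reading of "the most similar behavior features are fused first". The text itself does not spell out how the fusion order is derived from the tree.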

The present invention also provides a malware feature fusion analysis system based on behavior fragment sharing. The system comprises geographically dispersed nodes deployed in the network, each responsible for collecting and analyzing malware samples in one network area. Each node includes:

A behavior feature segmentation module, with which the node collects malware samples and segments them into a set of fixed-length behavior fragments, counts the local malware samples that exhibit each behavior fragment in the set to obtain the local statistical characteristics of the fragment set, and submits the result to the behavior fragment collaborative sharing module;

A distributed hash table module, established in each node and used to build the distributed hash table;

A behavior fragment collaborative sharing module, with which each node publishes its behavior fragment set and the corresponding local statistical characteristics to the distributed hash table. The nodes of the distributed hash table gather identical behavior fragments from different nodes and compute the global characteristics of the gathered fragments. The global characteristics of a behavior fragment are a set of triples of the form (behavior fragment, source node address, number of local malware samples at the source node that exhibit the fragment). The behavior fragment set, annotated with these global characteristics, is returned to the source node that shared the fragment set and its local statistical characteristics;

A neighbor behavior feature discovery module, used to compute, from the malware behavior features formed by the behavior fragments and their global characteristics, the set of candidate neighbor nodes that have similar behavior features, and to send the source behavior feature of the malware to the remote nodes in that set. When the node itself acts as a remote node in a candidate neighbor node set, the module compares each received source behavior feature with the local destination behavior feature to judge whether the destination behavior feature is a neighbor behavior feature of the source behavior feature, and returns the judgment to the source node as a behavior feature neighbor relationship. The source node constructs the behavior feature adjacency graph from the neighbor relationships returned by the remote nodes;

A distributed hierarchical fusion tree construction module, used to generate the fusion tree in a distributed manner: on the basis of the feature adjacency graph, it applies a distributed minimum-spanning-tree algorithm, modified to prefer edges with larger weights, to the associated-edge data structure of the behavior features and generates a maximum similarity tree of behavior features on the adjacency graph as the fusion tree;

A behavior feature progressive fusion module, used to fuse the behavior features in the fusion tree. The fusion order is determined from the maximum similarity tree so that the most similar behavior features are fused first, and the root behavior feature of the fusion tree is output as the malware feature fusion analysis result.

As further improvements of the malware feature fusion analysis system based on behavior fragment sharing of the present invention:

The behavior feature segmentation module includes:

A continuous behavior subsequence submodule or a behavior dependency subgraph segmentation submodule. The continuous behavior subsequence submodule treats the behavior of a malware sample as a sequentially executed behavior sequence and selects fixed-length continuous subsequences from it as the segmented set of fixed-length behavior fragments. The behavior dependency subgraph segmentation submodule builds a behavior dependency graph from the dependencies between the data operated on by the behaviors of the malware sample and selects behavior dependency subgraphs with a fixed number of vertices from it as the segmented set of fixed-length behavior fragments;

A behavior fragment set statistics submodule, used to count the local malware samples that exhibit each behavior fragment in the set, obtaining the local statistical characteristics of the fragment set and passing them to the behavior fragment collaborative sharing module.
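As an illustration of the dependency-subgraph option, the following sketch enumerates every set of k vertices whose induced subgraph is connected when edge direction is ignored. Treating a fragment as such a vertex set is an assumption, since the text only requires a fixed number of vertices. The function name and the example edges are hypothetical.

```python
from itertools import combinations


def connected_vertex_sets(edges, k):
    """Return all k-vertex sets whose induced subgraph is weakly connected."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)  # ignore direction for connectivity

    def is_connected(nodes):
        nodes = set(nodes)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            # Visit only neighbors that belong to the candidate vertex set.
            for n in neighbors[stack.pop()] & nodes:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        return seen == nodes

    return [frozenset(c) for c in combinations(sorted(neighbors), k) if is_connected(c)]


# Hypothetical dependency edges (producer, consumer) between API-level behaviors.
deps = [("CreateFile", "CopyFile"), ("ReadRegistry", "CopyFile"), ("CopyFile", "RunFile"),
        ("CreateSocket", "Connect"), ("Connect", "SendMsg")]
fragments = connected_vertex_sets(deps, 3)
```

The enumeration is exhaustive and costs time proportional to the number of k-subsets of vertices, so it is only practical for small graphs. A deployment could keep just a subset of the sets it returns.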

The distributed hash table module includes:

A behavior fragment key mapping submodule, used to compute a key value from an input behavior fragment by hashing;

A key routing submodule, used to look up the collection and analysis node responsible for a key value and to route the distributed hash table message to that node for storage;
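The two submodules above can be sketched as follows. The text names neither the hash function nor the routing protocol, so SHA-1 and a simple sorted identifier ring (a key belongs to the first node identifier at or after it) are assumptions chosen for illustration. `fragment_key`, `Ring` and the node addresses are hypothetical, and a real overlay would route hop by hop rather than consult a global node list.

```python
import hashlib
from bisect import bisect_left


def fragment_key(fragment):
    """Map a behavior fragment (a tuple of behavior names) to a 160-bit integer key."""
    encoded = "\x1f".join(fragment).encode("utf-8")  # unit separator avoids ambiguous joins
    return int.from_bytes(hashlib.sha1(encoded).digest(), "big")


class Ring:
    """Toy identifier ring: a key is owned by the first node ID at or after it."""

    def __init__(self, node_addresses):
        # Node identifiers are derived with the same hash as fragment keys.
        self.ids = sorted((fragment_key((addr,)), addr) for addr in node_addresses)

    def responsible_node(self, key):
        keys = [node_id for node_id, _ in self.ids]
        index = bisect_left(keys, key) % len(self.ids)  # wrap around past the last node
        return self.ids[index][1]


addresses = ["node-a.example", "node-b.example", "node-c.example"]
ring = Ring(addresses)
frag = ("CreateFile", "ReadRegistry", "CopyFile")
owner = ring.responsible_node(fragment_key(frag))
```

Because the key depends only on the fragment, every node that publishes the same fragment reaches the same responsible node, which is what allows identical fragments to be gathered in one place.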

The behavior fragment collaborative sharing module includes:

A behavior fragment publishing submodule, used to call the distributed hash table module to obtain the key value of a behavior fragment, encapsulate the fragment and its local statistical characteristics in a distributed hash table message, and send the key value and the message to the distributed hash table module;

A behavior fragment receiving submodule, used to receive the behavior fragments and local statistical characteristics stored by the distributed hash table module when the node acts as a node that computes global characteristics in the distributed scheme;

A behavior fragment statistics submodule, used, when the node acts as a node that computes global characteristics, to compute those characteristics from each behavior fragment, its source node, and the number of local malware samples at the source node that exhibit the fragment. The global characteristics of a behavior fragment are a set of triples of the form (behavior fragment, source node address, number of local malware samples at the source node that exhibit the fragment);

A behavior fragment global characteristic return submodule, used, when the node acts as a node that computes global characteristics, to send the behavior fragment set with global characteristics output by the behavior fragment statistics submodule to the source node address;

A behavior fragment global characteristic receiving submodule, used, when the node acts as the source node of shared behavior fragments, to receive the behavior fragment set with global characteristics sent by the behavior fragment global characteristic return submodule.
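The gathering and return steps can be sketched from the point of view of one node that stores published records. The record layout (fragment, source node, local sample count) follows the triple described above, while the function names and the sample data are hypothetical. Transport, storage and the division of fragments among several storing nodes are left out.

```python
from collections import defaultdict


def aggregate(published):
    """Group published (fragment, source_node, local_count) records by fragment."""
    by_fragment = defaultdict(list)
    for fragment, node, count in published:
        by_fragment[fragment].append((fragment, node, count))
    return dict(by_fragment)


def replies_for_sources(global_stats):
    """For each source node, collect the full triple set of every fragment it published."""
    replies = defaultdict(dict)
    for fragment, triples in global_stats.items():
        for _, node, _ in triples:
            replies[node][fragment] = triples
    return dict(replies)


# Made-up publications received by one storing node.
abc = ("CreateFile", "ReadRegistry", "CopyFile")
xyz = ("CreateSocket", "Connect", "SendMsg")
published = [(abc, "node-1", 5), (abc, "node-2", 3), (xyz, "node-2", 1)]
stats = aggregate(published)
replies = replies_for_sources(stats)
```

In this example `node-1` learns from its reply that `node-2` also holds samples exhibiting the first fragment, which is the information the neighbor discovery stage builds on.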

The neighbor behavior feature discovery module includes:

A candidate neighbor node set calculation submodule, used to obtain the behavior feature of each local malware from the behavior fragments and their global characteristics, compute the behavior fragment sharing degree between each behavior feature and every other remote node according to formula (1), and take all remote nodes whose sharing degree exceeds a preset sharing degree threshold as the set of candidate neighbor nodes with similar behavior features;

A behavior feature sending submodule, used, when the node acts as a source node, to send the behavior feature of the local malware as the source behavior feature to all remote nodes in the candidate neighbor node set;

A behavior feature receiving submodule, used, when the node acts as a remote node in a candidate neighbor node set, to receive the source behavior features sent by the source node;

A behavior feature neighbor relationship verification submodule, used to form a fused behavior feature from the behaviors common to the received source behavior feature and the local destination behavior feature, and to judge whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold. If it is, the common behaviors of the source and destination behavior features are uncommon in normal programs and specific to malware, and the two features are judged to belong to the same malware; otherwise they are judged not to belong to the same malware. If verification shows that the source and destination behavior features belong to the same malware, the destination behavior feature at the remote node is a neighbor behavior feature of the source behavior feature, and the remote node assembles the source behavior feature number, the destination behavior feature number, the similarity between the two features, and a has-neighbor flag into a return message and sends it to the source node that holds the source behavior feature. Otherwise the remote node returns to that source node a negative message carrying a no-neighbor flag;

An adjacency graph calculation submodule, used to create the edges of the behavior feature adjacency graph from the returned behavior feature neighbor relationships and thereby construct the graph. The weight of an edge is the similarity between the two behavior features represented by its two vertices. Edges are created in either one-way or two-way mode. In one-way mode: 1) when the node verifies that a remote source behavior feature and a local destination behavior feature are behavior features of the same malware, it creates an edge between the local destination behavior feature and the remote source behavior feature and adds it to the local associated-edge data structure; 2) when the node receives from a remote node a neighbor behavior feature message carrying the has-neighbor flag, it checks the local associated-edge data structure for an edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the destination behavior feature number in the message. If such an edge exists, nothing is done; if not, the edge is created and added to the local associated-edge data structure. In two-way mode, an edge between a local destination behavior feature and a remote source behavior feature is created and added to the local associated-edge data structure only when both conditions hold: the local behavior feature neighbor relationship verification submodule has verified that the two features are behavior features of the same malware, and a neighbor behavior feature message carrying the has-neighbor flag has been received for that local destination behavior feature number and that remote source behavior feature number;

In formula (1), FragStatSet is the set of triples representing the global characteristics of the behavior fragments; N is the number of nodes; M_i is the number of local behavior features of node i; FragSet_t is the behavior fragment set of behavior feature t, and Num_t is the local sample count of behavior feature t; (Frag_s, Node_s, Num_s) is the triple in FragStatSet whose behavior fragment is Frag_s, source node is Node_s, and local sample count is Num_s; F_s is the frequency with which behavior fragment Frag_s occurs in the behavior of normal programs; Node_j is a remote node; and Share_tj (1≤t≤M_i, 1≤j≤N, j≠i) is the behavior fragment sharing degree between behavior feature t and remote node Node_j.
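The neighbor verification rule described for this module can be sketched as below. Several details are assumptions, because the text does not fix them: behavior features are modelled as sets of behavior names, the frequency of a fused feature in normal programs is taken to be the fraction of benign programs that contain all of its behaviors, similarity is measured with the Jaccard index, and an empty fused feature is treated as non-neighbor. All function and field names are hypothetical.

```python
def normal_frequency(fused, normal_programs):
    """Fraction of benign programs whose behavior set contains every fused behavior."""
    if not normal_programs:
        return 0.0
    hits = sum(1 for program in normal_programs if fused <= program)
    return hits / len(normal_programs)


def verify_neighbor(source_id, source, dest_id, dest, normal_programs, threshold):
    """Build the reply a remote node returns for one source behavior feature."""
    fused = source & dest  # common behaviors form the fused behavior feature
    is_neighbor = bool(fused) and normal_frequency(fused, normal_programs) < threshold
    if not is_neighbor:
        return {"source_id": source_id, "has_neighbor": False}
    similarity = len(fused) / len(source | dest)  # Jaccard index, an assumed measure
    return {"source_id": source_id, "dest_id": dest_id,
            "similarity": similarity, "has_neighbor": True}


# Made-up features and benign behavior profiles.
src = {"CreateFile", "CopyFile", "RunFile", "Connect"}
dst = {"CreateFile", "CopyFile", "RunFile", "SendMsg"}
benign = [{"CreateFile", "CopyFile"}, {"CreateFile", "CopyFile", "RunFile"},
          {"Connect", "SendMsg"}, {"CreateFile"}]
accepted = verify_neighbor(7, src, 3, dst, benign, threshold=0.5)
rejected = verify_neighbor(7, src, 3, dst, benign, threshold=0.2)
```

Here the fused feature appears in one of the four benign profiles, so it passes a threshold of 0.5 and fails a threshold of 0.2. The similarity in the positive reply would become the weight of the corresponding edge in the adjacency graph.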

The distributed hierarchical fusion tree construction module includes:

A behavior feature maximum similarity tree construction submodule, used, on the basis of the feature adjacency graph, to apply a distributed minimum-spanning-tree algorithm, modified to prefer edges with larger weights, to the associated-edge data structure of the behavior features and to generate a maximum similarity tree of behavior features on the adjacency graph as the fusion tree;

A behavior feature fusion order construction submodule, used to determine the fusion order of the behavior features in the fusion tree from the maximum similarity tree, so that the most similar behavior features are fused first;

The behavior feature progressive fusion module includes, in the form of processes or threads, either multiple feature fusion agents, one for the behavior feature of each vertex in the feature adjacency graph, or a single feature fusion agent for the behavior features of all vertices in the feature adjacency graph;

The feature fusion agent includes:

A behavior feature submission submodule, used to submit the fusion result containing the agent's own behavior feature to the feature fusion agent of the next layer according to the fusion order;

A behavior feature receiving submodule, used to receive the fusion result submitted by a previous-layer feature fusion agent and to record the edge of the feature adjacency graph that connects the agent to that previous-layer agent;

A behavior feature fusion submodule, used to judge whether the recorded edge has the largest weight among the edges associated with the agent's own behavior feature. If it does, the submodule fuses the agent's own behavior feature with all received fusion results and submits the outcome to the next-layer feature fusion agent according to the fusion order. Otherwise, it fuses only the received fusion results with one another and submits that outcome, together with the agent's own behavior feature, to the next-layer feature fusion agent according to the fusion order. The feature fusion agent at the root of the fusion tree finally produces the root behavior feature of the fusion tree, which is output as the malware feature fusion analysis result.
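A deliberately simplified sketch of leaf-to-root fusion follows. It models a behavior feature as a set of behavior names and fusion as keeping the common behaviors (set intersection), in line with how the fused behavior feature is formed during neighbor verification. It omits the rule that lets an agent defer fusing its own feature when the incoming edge is not its heaviest, and it runs in a single process instead of one agent per vertex. The tree shape and the features are made up.

```python
def fuse_tree(root, children, features):
    """Fuse features from the leaves toward the root; fusion keeps common behaviors."""
    result = set(features[root])
    for child in children.get(root, []):
        # Each subtree is fused first, then merged into this node's result.
        result &= fuse_tree(child, children, features)
    return result


# Hypothetical fusion tree rooted at f3, with made-up behavior features.
children = {"f3": ["f1", "f4"], "f1": ["f2"]}
features = {
    "f1": {"A", "B", "C", "D"},
    "f2": {"A", "B", "C", "E"},
    "f3": {"A", "B", "C", "F"},
    "f4": {"A", "B", "G"},
}
root_feature = fuse_tree("f3", children, features)
```

The value computed at the root corresponds to the root behavior feature that the module outputs as the analysis result.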

The malware feature fusion analysis method based on behavior fragment sharing of the present invention has the following advantages. Because malware may use techniques such as message encryption and polymorphism, different samples of the same kind of malware can show varied behavior features. However, because their function and origin are similar and their communication method is consistent, the behaviors of samples of the same kind of malware resemble one another; specifically, some behavior fragments are identical across samples of the same malware. The sample behaviors of the same kind of malware distributed over the nodes of the network therefore share certain behavior fragments, and the common part of these sample behaviors is the feature of the malware. Building on this property, the present invention designs a distributed malware feature fusion method based on behavior fragment sharing. By sharing the distribution information of behavior fragments and using a neighbor node discovery mechanism, it achieves scalable distributed malware feature fusion with very little network communication, and through the malware feature adjacency graph it supports hierarchical fusion of polymorphic malware features along a family tree. In addition, the distributed hierarchical feature fusion verification technique of the present invention enables more accurate malware feature mining. In summary, the present invention offers high analysis accuracy, strong analysis performance and good scalability.

The malware feature fusion analysis system based on behavior fragment sharing of the present invention is the system corresponding to the method described above and has the same technical effects, which are not repeated here.

Description of drawings

Fig. 1 is a schematic flow chart of the method of the embodiment of the present invention.

Fig. 2 is a schematic diagram of the topology of the behavior fragment fusion overlay network formed by the nodes in the embodiment of the present invention.

Fig. 3 is a schematic diagram of the principle of segmentation into continuous behavior subsequences in the embodiment of the present invention.

Fig. 4 is a schematic diagram of the principle of segmentation into behavior dependency subgraphs in the embodiment of the present invention.

Fig. 5 is a schematic diagram of the framework of the system of the embodiment of the present invention.

Fig. 6 is a schematic diagram of the working principle of the distributed hash table module and the behavior fragment collaborative sharing module in the embodiment of the present invention.

Fig. 7 is a schematic diagram of the workflow of the neighbor behavior feature discovery module in the embodiment of the present invention.

Fig. 8 is a schematic diagram of the framework of the distributed hierarchical fusion tree construction module in the embodiment of the present invention.

Fig. 9 is a schematic diagram of the workflow of the behavior feature progressive fusion module in the embodiment of the present invention.

Fig. 10 is a schematic diagram of the multi-process representation, without a management process, of the behavior feature set in node i in the embodiment of the present invention.

Fig. 11 is a schematic diagram of the multi-process representation, with a management process, of the behavior feature set in node i in the embodiment of the present invention.

Fig. 12 is a schematic diagram of the structure of a process in the behavior feature representation in the embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, the malware feature fusion analysis method based on behavior fragment sharing of this embodiment is implemented in the following steps:

2) Each node collects malware samples and segments them into a set of fixed-length behavior fragments, then counts the local malware samples that exhibit each behavior fragment in the set to obtain the local statistical characteristics of the fragment set;

3) Each node publishes its behavior fragment set and the corresponding local statistical characteristics to the distributed hash table. The nodes of the distributed hash table gather identical behavior fragments from different nodes, compute the global characteristics of the gathered fragments, and return the behavior fragment set with global characteristics to the source node that shared the fragment set and its local statistical characteristics;

4) From the malware behavior features formed by the behavior fragments and their global characteristics, the source node computes the set of candidate neighbor nodes that have similar behavior features and sends the source behavior feature of the malware to the remote nodes in that set. Each remote node compares the received source behavior feature with its local destination behavior feature to judge whether the destination behavior feature is a neighbor behavior feature of the source behavior feature, and returns the judgment to the source node as a behavior feature neighbor relationship. The source node constructs the behavior feature adjacency graph from the neighbor relationships returned by the remote nodes;

5) A fusion tree is generated in a distributed manner on the basis of the feature adjacency graph, the behavior features in the fusion tree are fused, and the root behavior feature of the fusion tree is output as the malware feature fusion analysis result.

As shown in Fig. 2, this embodiment comprises a number of nodes. Each node locally analyzes the malware samples it has collected. The local analysis results comprise a cluster analysis, the behavior features formed from the common behaviors of the malware samples within each cluster, and the number of local samples that exhibit each behavior feature (the local sample count of the behavior feature). The nodes together form a "behavior fragment fusion overlay network", which consists of the distributed hash table modules and behavior fragment collaborative sharing modules of the nodes, connected over the network. Each node segments its behavior features into behavior fragment sets and shares its local analysis results for the same malware through the overlay network. A behavior feature fusion tree is then constructed among the nodes responsible for the same malware; ideally, each kind of malware has one corresponding fusion tree. This embodiment fuses all the local analysis results for the same malware at the different nodes step by step, in the order specified by the fusion tree. The fusion process has three stages: 1) neighbor behavior feature discovery; 2) construction of the behavior feature fusion tree; 3) fusion along the tree to obtain the global analysis result. The accurate global analysis result for the malware corresponding to the fusion tree is finally obtained at the root node of the tree.

Malware is widely dispersed and highly concealed, so a single-point malware analysis system obtains little sample information. Malware also uses encryption, multiple attack vectors and polymorphic techniques, so different samples of the same malware behave differently. A single-point analysis system therefore struggles to extract the essential characteristics of malware from a small number of samples, and its analysis accuracy is low. On the other hand, the number of malware samples is huge and grows exponentially, so a distributed-collection, centralized-analysis system that submits all collected samples to a central server for processing suffers from computing and communication bottlenecks.

This embodiment relies on the similarity in function and origin, and the consistency in communication method, of the behavior features of different samples of the same kind of malware. Nodes share malware behavior fragments with one another through the distributed hash table and use a malware similarity discovery mechanism to find other nodes that hold the same kind of malware. A load-balanced aggregation tree is then built over the set of nodes that have found the same kind of malware, which achieves global analysis of malware features and hierarchical fusion of polymorphic features. The distributed hash table module effectively overcomes the computing and communication bottlenecks in malware collection and analysis. The embodiment can detect and extract monomorphic malware whose behavior does not change during propagation, and it can also handle malware that uses complex techniques such as message encryption, changing propagation channels and polymorphism. It therefore offers high analysis accuracy, strong analysis performance and good scalability. It addresses both the problem that a single-point analysis system, having little sample information, obtains non-essential malware characteristics, and the computing and communication bottlenecks that a distributed-collection, centralized-analysis system faces because of the huge and exponentially growing number of malware samples. It balances the accuracy and scalability of malware analysis and has good application prospects.

In this embodiment, the detailed steps of step 2) are as follows: 2.1) the behavior of a malware sample is treated as a sequentially executed behavior sequence, and fixed-length contiguous behavior subsequences are selected from it as the fixed-length behavior fragment set obtained by segmentation; alternatively, a behavior dependency graph is built from the dependencies between the data operated on by the sample's behaviors, and behavior dependency subgraphs with a fixed number of vertices are selected from it as the fixed-length behavior fragment set; 2.2) for each behavior fragment in the set, the number of local malware samples exhibiting that fragment's behavior is counted, giving the local statistical characteristics of the behavior fragment set, which are output.

As shown in Figure 3, suppose a behavior feature consists of the following behaviors: 1) a CreateFile behavior that creates, in a specified directory, the destination file to which the malware will be copied; 2) a ReadRegistry behavior that reads the registry entry for a specified key to obtain the path of the malware's source file; 3) a CopyFile behavior that copies the malware file at that source path to the newly created destination file; 4) a RunFile behavior that creates a new process to run the newly copied destination file as an executable; 5) behaviors that differ from sample to sample, represented in the figure by the wildcard "*"; 6) a CreateSocket behavior that creates a local socket; 7) a Connect behavior that connects the created socket to a remote server; 8) a SendMsg behavior that communicates with the server over the connected socket. Suppose the local analysis result produced by each node is a sequential behavior feature that ignores data dependencies between behaviors, as in the upper part of Figure 3; segmenting that sequence yields three behavior subsequences (behavior fragments) of length 3.

As shown in Figure 4, starting instead from the data dependencies between behaviors, the behavior dependency graph of the malware in the upper part of Figure 3 is shown in the upper part of Figure 4: CopyFile depends on the destination file created by CreateFile and on the source file path output by ReadRegistry; RunFile depends on the content that CopyFile copied to the destination file; Connect depends on the socket created by CreateSocket; and SendMsg depends on the connection to the server established by Connect. In a behavior dependency graph, a behavior fragment is any connected subgraph with a fixed number of vertices; as shown in the lower part of Figure 4, segmentation yields three behavior dependency subgraphs (behavior fragments) of length (number of vertices) 3. The segmentation method can be chosen according to the deployment environment and the type of information available. The behavior dependency graph, however, expresses the behavior of malware more precisely: malware can arbitrarily reorder behaviors that do not depend on one another while still performing a given function, whereas the behavior dependency graph captures the dependencies between the data the behaviors operate on, and these generally do not change when the same function is performed. As long as the semantics of the function are unchanged, the behavior dependency graph generally stays the same, which improves detection accuracy.
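The sequence-based variant of step 2) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and it assumes that a fragment never spans the wildcard "*" that stands for the sample-specific part of the trace. Under that assumption the eight-step example yields the three length-3 fragments mentioned in the text.

```python
from collections import Counter

WILDCARD = "*"  # sample-specific behaviors; assumed never to be part of a fragment


def split_into_fragments(behaviors, length=3):
    """Return every contiguous run of `length` behaviors that contains no wildcard."""
    fragments = []
    for start in range(len(behaviors) - length + 1):
        window = tuple(behaviors[start:start + length])
        if WILDCARD not in window:
            fragments.append(window)
    return fragments


def local_statistics(samples, length=3):
    """For each fragment, count how many local samples exhibit it (step 2.2)."""
    counts = Counter()
    for behaviors in samples:
        # set() so that a fragment repeated inside one sample is counted once
        counts.update(set(split_into_fragments(behaviors, length)))
    return counts


trace = ["CreateFile", "ReadRegistry", "CopyFile", "RunFile", "*",
         "CreateSocket", "Connect", "SendMsg"]
fragments = split_into_fragments(trace)
for fragment in fragments:
    print(fragment)
```

The dependency-graph variant would enumerate connected subgraphs with a fixed number of vertices instead of sliding a window; it is omitted here because the exact subgraphs depend on Figure 4.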

In this embodiment, the detailed steps of step 3) are as follows: 3.1) each node calls the distributed hash table module to obtain the key value of a behavior fragment, encapsulates the behavior fragment and its local statistical characteristics into a distributed hash table message, and sends the key value and the message to the distributed hash table module; the distributed hash table module looks up the collection and analysis node responsible for that key value and routes the message to that node for storage; 3.2) different nodes are responsible for different behavior fragments: through the distributed hash table module, identical behavior fragments from different nodes are gathered at the same node, which records the address of each source node that published the fragment. From the behavior fragments and local statistical characteristics stored through the distributed hash table module, the node computes the global characteristics of each behavior fragment from the fragment, its source nodes, and the number of local malware samples at each source node that exhibit the fragment's behavior; the global characteristics of a behavior fragment are a set of triples of (behavior fragment, source node address, number of local malware samples at the source node exhibiting the fragment's behavior). The set of behavior fragments with global characteristics is then sent to the source node addresses, and the corresponding source nodes receive it.
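The aggregation in step 3.2) can be illustrated with a small in-memory sketch. It only models the node that is responsible for some fragments; message routing and sockets are left out. The class and method names, and the example addresses, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class AggregationNode:
    """Node responsible for some fragment keys (the role described in step 3.2)."""

    def __init__(self):
        # fragment -> {source node address: local sample count}
        self.store = defaultdict(dict)

    def receive(self, fragment, source_addr, local_count):
        """Store one published fragment and remember which node published it."""
        self.store[fragment][source_addr] = local_count

    def global_characteristics(self, fragment):
        """Return the (fragment, source node, local sample count) triples."""
        return [(fragment, addr, count)
                for addr, count in sorted(self.store[fragment].items())]

    def replies(self):
        """Map each source address to the triples of the fragments it published."""
        out = defaultdict(list)
        for fragment, sources in self.store.items():
            triples = self.global_characteristics(fragment)
            for addr in sources:
                out[addr].extend(triples)
        return dict(out)


node = AggregationNode()
frag = ("CreateSocket", "Connect", "SendMsg")
node.receive(frag, "10.0.0.1", 4)
node.receive(frag, "10.0.0.2", 7)
node.receive(("CreateFile", "ReadRegistry", "CopyFile"), "10.0.0.1", 2)
print(node.global_characteristics(frag))
```

Each source node gets back the triples for every fragment it published, including the entries contributed by other nodes, which is what later lets it look for candidate neighbors.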

In this embodiment, the detailed steps of step 4) are as follows:

4.1) The source node obtains the behavior features of each local malware from the malware's behavior fragments and their global characteristics, calculates the behavior fragment sharing degree between each behavior feature and every other remote node according to formula (1), and takes all remote nodes whose sharing degree exceeds a preset sharing degree threshold as the set of candidate neighbor nodes with similar behavior features;

In formula (1), FragStatSet is the set of triples representing the global characteristics of the behavior fragments, N is the number of nodes, M_i is the number of local behavior features of node i, FragSet_t is the behavior fragment set of behavior feature t, Num_t is the local sample count of behavior feature t, (Frag_s, Node_s, Num_s) is the triple in FragStatSet whose behavior fragment is Frag_s, source node is Node_s, and local sample count is Num_s, F_s is the frequency with which behavior fragment Frag_s occurs in the behavior of normal programs, Node_j is a remote node, and Share_tj (1≤t≤M_i, 1≤j≤N, j≠i) is the behavior fragment sharing degree between behavior feature t and remote node Node_j;

4.3) A remote node in the candidate neighbor node set combines the behaviors common to the received source behavior feature and its local destination behavior feature into a fused behavior feature, and checks whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold. If it is, the behaviors shared by the source and destination behavior features are uncommon in normal programs and specific to malware, and the two features are judged to belong to the same malware; otherwise they are judged not to belong to the same malware;
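The check in step 4.3) can be sketched as follows. The text does not say how the frequency in normal programs is measured, so this sketch assumes it is the fraction of benign program profiles that contain every behavior of the fused feature; that definition, the default threshold, the function names, and the rule that an empty intersection never counts as a match are all assumptions made for illustration.

```python
def fuse(source_feature, dest_feature):
    """Fused behavior feature: the behaviors common to both features."""
    return frozenset(source_feature) & frozenset(dest_feature)


def benign_frequency(fused, benign_profiles):
    """Fraction of benign profiles that contain every behavior in `fused` (assumed metric)."""
    if not benign_profiles:
        return 0.0
    hits = sum(1 for profile in benign_profiles if fused <= set(profile))
    return hits / len(benign_profiles)


def same_malware(source_feature, dest_feature, benign_profiles, threshold=0.1):
    """True when the common behaviors are rare enough in benign programs."""
    fused = fuse(source_feature, dest_feature)
    if not fused:  # assumption: nothing in common means no evidence of a shared origin
        return False
    return benign_frequency(fused, benign_profiles) < threshold


benign = [{"CreateFile", "ReadRegistry"},
          {"CreateSocket", "Connect", "SendMsg"},
          {"CreateFile"},
          {"ReadRegistry", "CreateSocket"}]
src = {"CreateFile", "ReadRegistry", "CopyFile", "RunFile"}
dst = {"CreateFile", "CopyFile", "RunFile", "Connect"}
print(same_malware(src, dst, benign))
```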

4.5) The source node creates the edges of the behavior feature adjacency graph from the returned neighbor relationships, with each edge weighted by the similarity between the two behavior features represented by its two vertices, and thereby constructs the behavior feature adjacency graph. Edges are created in either one-way or two-way mode. In one-way mode: 1) when a node verifies that a remote source behavior feature and a local destination behavior feature are behavior features of the same malware, it creates an edge between the local destination behavior feature and the remote source behavior feature and adds it to the local associated-edge data structure; 2) when a node receives from a remote node a neighbor behavior feature message carrying the neighbor behavior feature flag, it checks the local associated-edge data structure for an existing edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the destination behavior feature number in the message; if the edge exists it does nothing, and if not it creates the edge and adds it to the local associated-edge data structure. In two-way mode, an edge between a local destination behavior feature and a remote source behavior feature is created and added to the local associated-edge data structure only when both conditions hold: the local behavior feature neighbor relationship verification submodule has verified that the two are behavior features of the same malware, and a neighbor behavior feature message carrying the neighbor behavior feature flag has been received for that local destination behavior feature number and that remote source behavior feature number. In either mode the result is the feature adjacency graph.
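The difference between the two edge-creation modes in step 4.5) can be sketched with a small bookkeeping class. The class, its method names, and the idea that each event carries the similarity value are hypothetical; the sketch only shows that one-way mode creates an edge on either event (without duplicates) while two-way mode waits for both.

```python
class AssociatedEdges:
    """Local associated-edge data structure of one node (illustrative only)."""

    def __init__(self, two_way=False):
        self.two_way = two_way
        self.edges = {}         # (local feature id, remote feature id) -> similarity
        self.verified = set()   # pairs this node verified itself
        self.confirmed = set()  # pairs confirmed by a neighbor behavior feature message

    def _create(self, pair, similarity):
        # setdefault leaves an existing edge untouched, so no edge is created twice
        self.edges.setdefault(pair, similarity)

    def on_local_verification(self, local_id, remote_id, similarity):
        pair = (local_id, remote_id)
        self.verified.add(pair)
        if not self.two_way or pair in self.confirmed:
            self._create(pair, similarity)

    def on_neighbor_message(self, local_id, remote_id, similarity):
        pair = (local_id, remote_id)
        self.confirmed.add(pair)
        if not self.two_way or pair in self.verified:
            self._create(pair, similarity)


one_way = AssociatedEdges()
one_way.on_neighbor_message("a1", "b7", 0.8)
one_way.on_neighbor_message("a1", "b7", 0.8)   # duplicate message, edge already exists

two_way = AssociatedEdges(two_way=True)
two_way.on_local_verification("a1", "b7", 0.8)  # not enough on its own
pending = dict(two_way.edges)
two_way.on_neighbor_message("a1", "b7", 0.8)    # both conditions now hold
```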

In this embodiment, the detailed steps of step 5) are as follows:

5.1) On the basis of the feature adjacency graph, the source node runs a distributed minimum spanning tree algorithm, modified to prefer edges with large weights, over the associated-edge data structures of the behavior features, generating a maximum-similarity tree of behavior features on the adjacency graph as the fusion tree; the maximum-similarity tree determines the fusion order of the behavior features in the fusion tree so that the most similar behavior features are fused first;
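To make the idea of a maximum-similarity tree concrete, the sketch below builds one with a centralized Kruskal-style pass that takes the heaviest edges first. This is a stand-in, not the distributed algorithm the step describes; the function name and the example graph are illustrative.

```python
def maximum_similarity_tree(vertices, edges):
    """edges: iterable of (u, v, similarity). Returns tree edges, most similar first."""
    parent = {v: v for v in vertices}

    def find(v):
        # union-find lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:       # the edge joins two components, so it closes no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


graph_edges = [("A", "B", 0.9), ("B", "C", 0.4), ("A", "C", 0.7), ("C", "D", 0.6)]
tree = maximum_similarity_tree(["A", "B", "C", "D"], graph_edges)
print(tree)
```

Because the edges come out in descending similarity, the returned list also suggests a fusion order in which the most similar features meet first.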

5.2) The source node either spawns a separate process or thread as a feature fusion agent for the behavior feature of each vertex in the feature adjacency graph, or spawns a single feature fusion agent for the behavior features of all vertices. Following the fusion order, each feature fusion agent submits the fusion result containing its own behavior feature to the feature fusion agent at the next level. An agent that receives a fusion result records the adjacency edge, in the feature adjacency graph, to the previous-level agent that sent it, and checks whether that edge has the largest weight among all edges associated with its own behavior feature. If so, it fuses its own behavior feature with all received fusion results and submits the result to the next-level agent according to the fusion order; otherwise, it fuses all received fusion results and submits that result together with its own behavior feature to the next-level agent according to the fusion order. Fusion ends with the root behavior feature of the fusion tree;

5.3) The root behavior feature of the fusion tree is output as the result of the malware feature fusion analysis.

As shown in Figure 5, the malware feature fusion analysis system based on behavior fragment sharing in this embodiment comprises geographically dispersed nodes deployed in the network, each responsible for collecting and analyzing malware samples in one network region. Each node includes:

A behavior feature segmentation module, used by the node to collect malware samples and segment them into a set of fixed-length behavior fragments, to count for each fragment the number of local malware samples exhibiting its behavior, giving the local statistical characteristics of the fragment set, and to submit these to the behavior fragment collaborative sharing module;

A behavior fragment collaborative sharing module, used by each node to publish its behavior fragment set and local statistical characteristics to the distributed hash table, to gather identical behavior fragments from different nodes at the responsible distributed hash table node and compute the global characteristics of the gathered fragments, and to return the fragment set with its global characteristics to the source nodes that shared it;

A distributed hierarchical fusion tree construction module, used to generate the fusion tree in a distributed manner on the basis of the feature adjacency graph;

A behavior feature gradual fusion module, used to fuse the behavior features in the fusion tree and output the root behavior feature of the fusion tree as the result of the malware feature fusion analysis.

In this embodiment, the behavior feature segmentation module includes:

A contiguous behavior subsequence submodule or a behavior dependency subgraph segmentation submodule. The contiguous behavior subsequence submodule treats the behavior of a malware sample as a sequentially executed behavior sequence and selects fixed-length contiguous behavior subsequences from it as the fixed-length behavior fragment set obtained by segmentation; the behavior dependency subgraph segmentation submodule builds a behavior dependency graph from the dependencies between the data operated on by the sample's behaviors and selects behavior dependency subgraphs with a fixed number of vertices from it as the fixed-length behavior fragment set;

A behavior fragment set statistics submodule, which counts for each behavior fragment in the set the number of local malware samples exhibiting its behavior, giving the local statistical characteristics of the behavior fragment set, and outputs them to the behavior fragment collaborative sharing module.

To achieve fully distributed analysis of polymorphic malware behavior features, this embodiment builds the distributed analysis method on a scalable distributed hash table (DHT). Multiple nodes are deployed on the network, and each node contains a distributed hash table module used to build the DHT. The behavior fragment collaborative sharing module of each node calls the distributed hash table module to store the behavior fragment set of the node's local behavior features in the distributed hash structure. Each behavior fragment in the set is stored as follows: the behavior fragment is the resource name, and the local analysis result (the behavior fragment, its local sample count, and so on) is the content; a hash function is applied to the behavior fragment (the resource name) to obtain its key value, and the content is stored through the DHT at the destination node responsible for that key value, which receives and stores it. By the nature of a distributed hash table, all identical behavior fragments from different nodes are routed to and stored on the same destination node, so the destination node holds the global information about that behavior fragment in the distributed fusion system for local malware analysis results. By accumulating the addresses of the nodes that published the same behavior fragment and each node's local sample count for it, the destination node obtains the system-wide set of (source node, local sample count) pairs for the fragment. It then adds the behavior fragment itself as a third element to each pair, forming a set of (behavior fragment, source node, local sample count) triples, and returns this set to every source node in it. After this publish-and-return process, each node holds the global characteristics (the set of triples) of the behavior fragments of its own behavior features in the distributed fusion system. The distributed hash table module supports the behavior fragment collaborative sharing module: it hashes a behavior fragment to a key value and then routes the local analysis result (the fragment, its local sample count, and so on) to the node responsible for that key value. In this embodiment the distributed hash table module includes:

A key routing submodule, which looks up the collection and analysis node responsible for a key value and routes the distributed hash table message to that node for storage.
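The key mapping and key routing described above can be sketched with Python's standard library. The text does not name a particular DHT or hash function, so this sketch assumes SHA-1 and a Chord-style successor rule on a 32-bit ring; the function names and node addresses are hypothetical. The point it illustrates is that the same fragment always maps to the same responsible node, regardless of which node publishes it.

```python
import hashlib
from bisect import bisect_left


def _hash_to_ring(text, bits=32):
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def fragment_key(fragment, bits=32):
    """Hash a behavior fragment (the resource name) to an integer key value."""
    return _hash_to_ring("|".join(fragment), bits)


def node_id(address, bits=32):
    """Place a node on the same ring by hashing its address."""
    return _hash_to_ring(address, bits)


def responsible_node(key, addresses, bits=32):
    """Successor rule: the first node whose id is >= key, wrapping around the ring."""
    ring = sorted((node_id(a, bits), a) for a in addresses)
    ids = [i for i, _ in ring]
    pos = bisect_left(ids, key) % len(ring)
    return ring[pos][1]


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
frag = ("CreateSocket", "Connect", "SendMsg")
key = fragment_key(frag)
print(key, responsible_node(key, nodes))
```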

The behavior fragment collaborative sharing module gathers identical behavior fragments from different nodes in a load-balanced, decentralized, and scalable way and computes the global characteristics of the gathered fragments. The global characteristics comprise the set of nodes in the whole system that have the behavior fragment and the fragment's local sample count on each of those nodes. Because nodes and local sample counts correspond one to one, the global characteristics are a set of pairs, each consisting of a node that has the fragment and the number of local malware samples on that node exhibiting the fragment's behavior (abbreviated as the fragment's local sample count); adding the behavior fragment itself gives the set of triples of (behavior fragment, source node address, number of local malware samples at the source node exhibiting the fragment's behavior).

In this embodiment, the behavior fragment collaborative sharing module includes:

A behavior fragment publishing submodule, which calls the distributed hash table module to obtain the key value of a behavior fragment, encapsulates the behavior fragment and its local statistical characteristics into a distributed hash table message, and sends the key value and the message to the distributed hash table module;

A behavior fragment statistics submodule, used when the node acts as a node that computes global characteristics, which computes the global characteristics from each behavior fragment, its source nodes, and the number of local malware samples at each source node exhibiting the fragment's behavior; the global characteristics of a behavior fragment are a set of triples of (behavior fragment, source node address, number of local malware samples at the source node exhibiting the fragment's behavior);

A behavior fragment global characteristic return submodule, used when the node acts as a node that computes global characteristics, which sends the set of behavior fragments with global characteristics output by the behavior fragment statistics submodule to the source node addresses;

A behavior fragment global characteristic receiving submodule, used when the node acts as a source node that shared behavior fragments, which receives the set of behavior fragments with global characteristics sent by the behavior fragment global characteristic return submodule.

如图6所示，行为片段协同共享模块的工作步骤如下：①、行为片段发布子模块调用分布式哈希表模块的行为片段关键字映射子模块得到行为片段的关键字值，将行为片段及其本地统计特性封装为分布式哈希表消息，然后将关键字值和分布式哈希表消息发送给分布式哈希表模块的关键字路由子模块，通过关键字路由子模块路由发送给负责对应关键字值的节点进行存储；②、负责关键字值的节点分别作为分布式统计全局特性的节点，通过行为片段接收子模块接收分布式哈希表模块存储的各个节点发布所负责关键字值对应的行为片段及其本地统计特性；③、行为片段统计子模块根据行为片段及其源节点、源节点本地具有行为片段行为的本地恶意软件样本数进行统计全局特性，行为片段的全局特性包括行为片段、源节点地址、源节点本地具有行为片段行为的本地恶意软件样本数的三元对集合；④、行为片段全局特性返回子模块跨节点将行为片段统计子模块输出的带有全局特性的行为片段集合发送给源节点地址；行为片段全局特性接收子模块接收行为片段全局特性返回子模块发送的带有全局特性的行为片段集合，最终各个发布行为片段的节点都获取到所发布行为片段的全局属性。行为片段协同共享模块构建在分布式哈希表模块之上，完成按着传统分布式哈希表方式将本地行为特征的行为片段集合路由发送到各自的负责节点，实现来自不同节点的相同行为片段聚拢到同一节点处，不同的行为片段聚拢到不同的节点处，实现负载均衡的、分散的、可扩展的聚拢来自不同节点的相同行为片段；然后在行为片段的聚拢点对行为片段进行全局特性统计；最后将行为片段及其全局特性返回该行为片段的源节点集；最终结果是每个节点都获得了其自身行为特征的行为片段集合在全系统的全局特性，行为片段的全局特性包括全系统的行为片段、源节点和该源节点本地具有该行为片段行为的本地样本数三元对集合。行为片段协同共享模块构建在分布式哈希表模块之上，完成按着传统分布式哈希表方式将本地行为特征的行为片段集合路由发送到负责各行为片段对应关键字值的节点，实现来自不同节点的相同行为片段聚拢到同一节点处，不同的行为片段聚拢到不同的节点处，实现负载均衡的、分散的、可扩展的聚拢来自不同节点的相同行为片段；然后在行为片段的聚拢点对行为片段进行全局特性统计；最后将行为片段及其全局特性返回给行为片段的源节点集；最终结果是每个节点都获得了其自身行为特征的行为片段集合在全系统的全局特性，行为片段的全局特性包括全系统的行为片段、源节点和该源节点本地具有该行为片段行为的本地样本数三元对集合。行为片段协同共享模块包括：行为片段发布子模块、行为片段接收子模块、行为片段统计子模块、行为片段全局特性返回子模块和行为片段全局特性接收子模块。行为片段发布子模块以行为特征分割模块输出的行为片段集合为输入，对于行为片段集合中每一个行为片段做如下处理：首先以该行为片段作为输入调用分布式哈希表模块的行为片段关键字映射子模块得到该行为片段对应的关键字值，将该行为片段及本其本地样本数等本地分析结果组装为分布式哈希表消息以分布式哈希表模块的片段关键字映射子模块返回的关键字值作为路由关键字输入给分布式哈希表模块的关键字路由子模块，由分布式哈希表模块的关键字路由子模块完成将该消息传送到负责该路由关键字的节点，实现聚拢来自不同节点的相同行为片段到负责该相同行为片段对应关键字值的同一节点处。行为片段接收子模块是传统分布式哈希表路由消息接收模块的扩展，该子模块除了传统分布式哈希表路由消息接收功能外还记录发送该消息的源节点地址。行为片段统计子模块以行为片段接收子模块接收的路由消息(包括行为片段及其本地样本数等本地分析结果)和消息发送源节点地址作为输入，通过统计行为片段接收子模块接收的路由消息中行为片段相同的所有路由消息中源节点地址和本地样本数得到全系统具有该行为片段的源节点与本地样本数二元对集合，即行为片段的全局特性。行为片段全局特性返回子模块以行为片段统计子模块输出的行为片段、对应的全局特性以及发布该行为片段的源节点地址集合作为输入，将行为片段及对应的全局特性组装为返回消息，然后通过套接字接口将该返回消息发送给源节点地址集合中的每一个地址。行为片段全局特性接收子模块是一个简单的服务端套接字接口，接收其它节点发送给该服务端套接字接口的返回消息(包括行为片段和对应的全局特性)。恶意软件本地分析结果分布式融合系统中的各个节点的行为片段协同共享模块和分布式哈希表模块协同组成了一个行为片段融合覆盖网，通过源节点发送其自身行为特征的行为片段集合，负责行为片段的对应关键字值的目的节点统计来自不同源节点的相同行为片段的全局特性，然后将行为片段和全局特性返回给源节点，最后所有源节点都获得了其自身行为特征的行为片段集合在全系统中的全局特性。源节点接收到其本地不同行为片段的全局特性，即不同行为片段的源节点与本地样本数二元对集合
，为了在后面的模块中统一表述所有行为片段的全局特性，将行为片段的全局特性(源节点与本地样本数二元对集合)中的每一个二元对中加上该行为片段成为行为片段、源节点、本地样本数三元对集合，因此也可以说行为片段协同共享模块获得本地行为片段集合在全系统的行为片段、源节点、本地样本数三元对集合。As shown in Figure 6, the working steps of the behavior fragment collaborative sharing module are as follows: ①. The behavior fragment release sub-module calls the behavior fragment keyword mapping sub-module of the distributed hash table module to obtain the keyword value of the behavior fragment, and converts the behavior fragment and Its local statistical characteristics are encapsulated into a distributed hash table message, and then the keyword value and distributed hash table message are sent to the keyword routing sub-module of the distributed hash table module, and sent to the responsible The nodes corresponding to the key value are stored; ②, the nodes responsible for the key value are respectively used as the nodes of the global characteristics of distributed statistics, and each node stored in the distributed hash table module is received through the behavior segment receiving sub-module to publish the responsible key value Corresponding behavior fragments and their local statistical characteristics; ③. The behavior fragment statistics sub-module performs statistical global characteristics according to the behavior fragments and their source nodes, and the number of local malware samples with behavior fragment behaviors in the source node. 
The global characteristics of behavior fragments include behavior A set of ternary pairs of fragments, source node addresses, and the number of local malware samples with behavioral fragment behaviors in the source node; ④, global characteristics of behavioral fragments return submodules across nodes and export behaviors with global characteristics from the behavioral fragment statistics submodule The fragment set is sent to the source node address; the behavior fragment global characteristic receiving submodule receives the behavior fragment global characteristic and returns the behavior fragment collection with global characteristics sent by the submodule, and finally each node that publishes the behavior fragment obtains the global behavior fragment released Attributes. The Behavior Fragment Cooperative Sharing Module is built on top of the Distributed Hash Table module, and completes the routing of the Behavior Fragment Sets of local behavior characteristics to their respective responsible nodes according to the traditional Distributed Hash Table method, so as to realize the same behavior fragments from different nodes Gather to the same node, gather different behavior fragments to different nodes, realize load-balanced, decentralized, and scalable aggregation of the same behavior fragments from different nodes; then perform global characteristics on the behavior fragments at the aggregation point of the behavior fragments Statistics; finally, the behavior fragment and its global characteristics are returned to the source node set of the behavior fragment; the final result is that each node has obtained the global characteristics of the behavior fragment collection of its own behavior characteristics in the whole system, and the global characteristics of the behavior fragment include the global characteristics of the whole system. 
A set of ternary pairs including the behavior fragment of the system, the source node, and the local sample number of the source node locally having the behavior of the behavior fragment. The behavior segment collaborative sharing module is built on top of the distributed hash table module. According to the traditional distributed hash table method, the behavior segment set routing of the local behavior characteristics is sent to the node responsible for the corresponding key value of each behavior segment, realizing from The same behavior fragments of different nodes are gathered at the same node, and different behavior fragments are gathered at different nodes to achieve load-balanced, decentralized, and scalable gathering of the same behavior fragments from different nodes; and then at the gathering point of the behavior fragments Perform global characteristic statistics on the behavior fragments; finally, return the behavior fragments and their global characteristics to the source node set of the behavior fragments; the final result is that each node obtains the global characteristics of the behavior fragment set of its own behavior characteristics in the whole system, and the behavior The global characteristics of a segment include a triplet set of a behavior segment of the whole system, a source node, and a local sample number of the source node locally having the behavior of the behavior segment. The behavior fragment cooperative sharing module includes: a behavior fragment publishing submodule, a behavior fragment receiving submodule, a behavior fragment statistics submodule, a behavior fragment global characteristic returning submodule and a behavior fragment global characteristic receiving submodule. 
The behavior fragment release sub-module takes the behavior fragment set output by the behavior feature segmentation module as input, and does the following processing for each behavior fragment in the behavior fragment set: first, the behavior fragment keyword of the distributed hash table module is invoked with the behavior fragment as input The mapping submodule obtains the keyword value corresponding to the behavior fragment, assembles the behavior fragment and its local sample number and other local analysis results into a distributed hash table message, and returns it from the fragment keyword mapping submodule of the distributed hash table module The keyword value of is input to the keyword routing submodule of the distributed hash table module as the routing keyword, and the keyword routing submodule of the distributed hash table module completes the transmission of the message to the node responsible for the routing keyword. The implementation gathers the same behavior fragments from different nodes to the same node responsible for the corresponding key value of the same behavior fragments. The behavior segment receiving sub-module is an extension of the traditional distributed hash table routing message receiving module. In addition to the traditional distributed hash table routing message receiving function, this sub-module also records the address of the source node that sent the message. 
The behavior segment statistics sub-module takes the routing message received by the behavior segment receiving sub-module (including the local analysis results such as the behavior segment and its local sample number) and the address of the message sending source node as input, and through the statistics of the routing message received by the behavior segment receiving sub-module The source node address and local sample number in all routing messages with the same behavior segment get the source node and local sample number binary pair set with the behavior segment in the whole system, that is, the global characteristics of the behavior segment. The behavior segment global feature return sub-module takes the behavior segment output by the behavior segment statistics sub-module, the corresponding global feature, and the address set of the source node that issued the behavior segment as input, assembles the behavior segment and the corresponding global feature into a return message, and then passes The socket interface sends the return message to each address in the source node address set. Behavior Fragment Global Feature Receiving Submodule is a simple server-side socket interface, which receives return messages (including behavior fragments and corresponding global features) sent to the server-side socket interface by other nodes. 
In the distributed fusion system for local malware analysis results, the behavior fragment collaborative sharing modules and distributed hash table modules of all nodes together form a behavior fragment fusion overlay network. Each source node publishes the behavior fragment set of its own behavior features; the destination node responsible for a fragment's key aggregates the global characteristics of that fragment across the different source nodes and returns the fragment and its global characteristics to those source nodes. In the end, every source node obtains the system-wide global characteristics of the fragments in its own behavior features. What a source node receives for each local fragment is a set of (source node, local sample count) pairs. So that later modules can represent the global characteristics of all fragments uniformly, the fragment itself is added to each pair, turning the set of pairs into a set of (behavior fragment, source node, local sample count) triples. Equivalently, the behavior fragment collaborative sharing module obtains, for the local behavior fragment set, the set of (behavior fragment, source node, local sample count) triples across the whole system.
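The publish, aggregate, and return cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: the SHA-1 key mapping, the modulo-based `responsible_node` lookup (a toy stand-in for real distributed hash table routing), and all function names are hypothetical and do not come from the patent.

```python
import hashlib
from collections import defaultdict


def fragment_key(fragment):
    """Map a behavior fragment to a numeric key (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(fragment.encode("utf-8")).digest(), "big")


def responsible_node(key, node_ids):
    """Toy stand-in for DHT routing: pick a node by key modulo the node count."""
    ordered = sorted(node_ids)
    return ordered[key % len(ordered)]


def share_fragments(local_results):
    """local_results maps node -> {fragment: local sample count}.

    Returns node -> set of (fragment, source node, local sample count) triples,
    i.e. the global characteristics each source node gets back.
    """
    nodes = list(local_results)
    # Publish: the responsible node collects (source, count) pairs per fragment.
    stores = defaultdict(lambda: defaultdict(set))
    for source, fragments in local_results.items():
        for fragment, count in fragments.items():
            owner = responsible_node(fragment_key(fragment), nodes)
            stores[owner][fragment].add((source, count))
    # Return: every publisher of a fragment receives the full triple set for it.
    received = defaultdict(set)
    for owner_store in stores.values():
        for fragment, pairs in owner_store.items():
            triples = {(fragment, src, cnt) for src, cnt in pairs}
            for source, _ in pairs:
                received[source] |= triples
    return dict(received)
```

Because identical fragments always hash to the same key, the pairs for one fragment meet at one node regardless of which node published them, which is the property the overlay network relies on.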

Based on the global characteristics of the behavior fragments returned by the behavior fragment collaborative sharing module, the neighbor behavior feature discovery module computes a candidate neighbor node set for each local behavior feature and sends that feature to the nodes in the set. A candidate node that receives a behavior feature (the source behavior feature) finds the most similar behavior feature held locally (the destination behavior feature) and verifies whether the two belong to the same malware. If they do, the destination behavior feature is a neighbor behavior feature of the source behavior feature; otherwise it is not.

In this embodiment, the neighbor behavior feature discovery module includes:

a candidate neighbor node set calculation submodule, which obtains each local malware behavior feature from the malware's behavior fragments and their global characteristics, computes the behavior fragment sharing degree between the behavior feature and each remote node according to formula (1), and takes all remote nodes whose sharing degree exceeds a preset sharing degree threshold as the set of candidate neighbor nodes with similar behavior features;

a behavior feature receiving submodule, which receives the source behavior features sent by source nodes when this node is a remote node in a candidate neighbor node set;

a behavior feature neighbor relationship verification submodule, which composes a fused behavior feature from the behaviors common to the received source behavior feature and the local destination behavior feature, and determines whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold. If it is, the common behaviors are uncommon in normal programs and specific to malware, and the source and destination behavior features are judged to belong to the same malware; otherwise they are judged not to. When verification succeeds, the destination behavior feature of the remote node is a neighbor behavior feature of the source behavior feature, and the remote node assembles the source behavior feature number, the destination behavior feature number, the similarity between the two, and a has-neighbor flag into a return message for the source node that holds the source behavior feature. Otherwise the remote node returns to that source node a negative message carrying a no-neighbor flag;

an adjacency graph calculation submodule, which creates the edges of the behavior feature adjacency graph from the returned neighbor relationships, with each edge weighted by the similarity between the two behavior features represented by its endpoints, and thereby constructs the behavior feature adjacency graph. Edges are created in either one-way or two-way mode. In one-way mode: 1) when the remote source behavior feature and the local destination behavior feature are verified to be behavior features of the same malware, an edge is created between them and added to the local incident-edge data structure; 2) when a neighbor behavior feature message carrying the has-neighbor flag is received from a remote node, the local incident-edge data structure is checked for an existing edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the destination behavior feature number in the message; if such an edge exists nothing is done, and if not the edge is created and added to the local incident-edge data structure. In two-way mode, an edge between the local destination behavior feature and the remote source behavior feature is created and added to the local incident-edge data structure only when both conditions hold: the local behavior feature neighbor relationship verification submodule has verified that the two are behavior features of the same malware, and a neighbor behavior feature message carrying the has-neighbor flag has been received for that same pair of feature numbers.

In this embodiment, when the adjacency graph calculation submodule obtains the set of candidate neighbor nodes with similar behavior features, the behavior fragment sharing degree between a local behavior feature and a remote node is computed according to the following principles: 1) the more elements in the set of (behavior fragment, source node, local sample count) triples that the behavior feature shares with the remote node, the higher the sharing degree; 2) the larger the local sample counts in that shared triple set, the higher the sharing degree; 3) the less frequently the behavior fragments in that shared triple set occur in normal program behavior, the higher the sharing degree. Any other calculation that satisfies these principles may therefore be used in place of formula (1) as needed.

The candidate neighbor node set calculation submodule derives a threshold from the sharing degrees Share_tj (1≤t≤M_i, 1≤j≤N, j≠i) between all of the node's local behavior features and the remote nodes. The sharing degrees Share_tj are assumed, on empirical grounds, to follow some probability distribution such as the normal distribution, and interval estimation is used to compute a normal lower-bound interval for the sharing degree between local behavior features and remote nodes; the lower limit of this interval is called the sharing degree threshold. All remote nodes whose sharing degree with a behavior feature exceeds the threshold form the candidate neighbor node set of that feature.

Based on the global characteristics of the local behavior fragments returned by the behavior fragment collaborative sharing module, the neighbor behavior feature discovery module computes a candidate neighbor node set for each behavior feature and sends the feature to the nodes in that set. A candidate node that receives a behavior feature (the source behavior feature) finds the most similar behavior feature held locally (the destination behavior feature) and verifies whether the two belong to the same malware. If they do, the destination behavior feature is a neighbor behavior feature of the source behavior feature; otherwise it is not.
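One possible reading of the threshold step is a one-sided lower bound on a fitted normal distribution. The sketch below assumes that reading; the confidence level, the `mean - z * stdev` bound, and the function names are illustrative assumptions, since the text does not fix the distribution or the estimation procedure.

```python
from statistics import NormalDist, mean, stdev


def sharing_threshold(shares, confidence=0.95):
    """Lower limit of a one-sided normal range fitted to the sharing degrees.

    Assumes the values are roughly normally distributed (an assumption the
    text itself labels as empirical).
    """
    mu, sigma = mean(shares), stdev(shares)
    z = NormalDist().inv_cdf(confidence)  # standard normal quantile
    return mu - z * sigma


def candidate_neighbors(shares, confidence=0.95):
    """shares maps (feature id, remote node id) -> sharing degree.

    Returns feature id -> set of remote nodes whose sharing degree exceeds
    the threshold computed over all of the node's sharing degrees.
    """
    threshold = sharing_threshold(list(shares.values()), confidence)
    candidates = {}
    for (feature_id, node_id), share in shares.items():
        if share > threshold:
            candidates.setdefault(feature_id, set()).add(node_id)
    return candidates
```

A lower confidence level moves the threshold closer to the mean and so prunes more remote nodes; the appropriate level would have to be tuned on real sharing degrees.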

As shown in Figure 7, the neighbor behavior feature discovery module uses a two-stage discovery method to compute the remote neighbor behavior features of each local behavior feature. In the first stage, candidate neighbor nodes are computed by the candidate neighbor node set calculation submodule. In the second stage, neighbor behavior features are verified by the behavior feature sending submodule, the behavior feature receiving submodule, and the behavior feature neighbor relationship verification submodule. The adjacency graph calculation submodule then derives the adjacency graph among behavior features from their neighbor behavior features.

First stage: the candidate neighbor node set calculation submodule computes the behavior fragment sharing degree between each local behavior feature and each remote node from the set of (behavior fragment, source node, local sample count) triples obtained through the behavior fragment fusion overlay network. The behavior fragment set of a behavior feature is the set of fragments produced when the behavior feature segmentation module segments that feature. The triple set of a behavior feature consists of all triples obtained through the overlay network whose behavior fragment belongs to the feature's behavior fragment set. The triple set that a behavior feature shares with a remote node consists of all triples in the feature's triple set whose source node is that remote node.
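The set definitions in the first stage translate directly into a filter over the triples. The scoring function in the sketch below is not formula (1), which lies outside this passage; it is a hypothetical score chosen only because it increases with the number of shared triples, with the local sample counts, and with the rarity of the fragments in normal programs.

```python
def shared_triples(feature_fragments, triples, remote_node):
    """Triples of this behavior feature whose source node is remote_node.

    feature_fragments: the behavior fragment set of one behavior feature.
    triples: (fragment, source node, local sample count) tuples from the overlay.
    """
    return {
        t for t in triples
        if t[0] in feature_fragments and t[1] == remote_node
    }


def sharing_degree(shared, normal_frequency):
    """Hypothetical sharing score (NOT the patent's formula (1)).

    normal_frequency maps fragment -> frequency in normal programs, in [0, 1).
    Each shared triple contributes count * (1 - frequency), so the score grows
    with more triples, larger counts, and rarer fragments.
    """
    return sum(
        count * (1.0 - normal_frequency.get(fragment, 0.0))
        for fragment, _source, count in shared
    )
```

For example, a feature with fragments `f_a` and `f_b` that shares `("f_a", "n2", 5)` and `("f_b", "n2", 2)` with node `n2`, where `f_b` appears in half of normal programs, scores 5 × 1.0 + 2 × 0.5 = 6.0 under this assumed score.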

Second stage: the behavior feature sending submodule sends each local behavior feature to every node in the candidate neighbor node set computed for it by the candidate neighbor node set calculation submodule. The behavior feature receiving submodule receives the behavior features that other nodes send to this node; all features received within one period form the source behavior feature set. For each source behavior feature in that set, the behavior feature neighbor relationship verification submodule finds the most similar feature in the local behavior feature set (the destination behavior feature) and then verifies whether the two are behavior features of the same malware. Verification may proceed as follows: check whether the fused behavior feature composed of the behaviors common to the source and destination behavior features occurs in normal programs with a frequency below a set threshold. If so, the common behaviors are uncommon in normal programs and specific to malware, and the two features are judged to belong to the same malware; otherwise they are judged not to. When verification succeeds, the destination behavior feature is a neighbor behavior feature of the source behavior feature, and the source behavior feature number, the destination behavior feature number, the similarity between the two, and a has-neighbor flag are assembled into a return message for the node that holds the source behavior feature. Otherwise a negative message carrying a no-neighbor flag is returned to that node.
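The verification step can be illustrated with behavior features modeled as sets of behavior names. This modeling, the Jaccard similarity, the way normal-program frequency is measured (the fraction of normal profiles that contain every common behavior), and the default threshold are all assumptions made for the sketch; the text does not specify them.

```python
def jaccard(a, b):
    """Set similarity used here as an assumed stand-in for feature similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_neighbor(source_id, source, local_features, normal_profiles,
                    max_normal_freq=0.05):
    """Find the most similar local feature and check the fused feature.

    local_features: feature id -> set of behaviors held on this node.
    normal_profiles: non-empty list of behavior sets seen in normal programs.
    Returns the fields of the return message as a dict.
    """
    dest_id, dest = max(local_features.items(),
                        key=lambda item: jaccard(source, item[1]))
    fused = source & dest  # behaviors common to both features
    # An empty fused set is contained in every profile, so it never verifies.
    freq = sum(fused <= profile for profile in normal_profiles) / len(normal_profiles)
    return {
        "source_id": source_id,
        "dest_id": dest_id,
        "similarity": jaccard(source, dest),
        "has_neighbor": freq < max_normal_freq,
    }
```

A message with `has_neighbor` set to `False` plays the role of the negative message returned to the source node.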

The adjacency graph calculation submodule derives the adjacency graph among behavior features from their neighbor behavior features. It receives the neighbor behavior feature messages about local behavior features returned by remote nodes and outputs the incident-edge data structure of each behavior feature in the adjacency graph. The adjacency graph takes the local behavior features of the participating nodes as vertices and creates edges between vertices according to the neighbor relationships among behavior features.

Edges can be created from neighbor relationships in two ways: 1) one-way edge: if behavior feature A is a neighbor behavior feature of behavior feature B, or B is a neighbor behavior feature of A, an edge is created between vertex A (representing A) and vertex B (representing B); 2) two-way edge: an edge between vertex A and vertex B is created only if A is a neighbor behavior feature of B and B is also a neighbor behavior feature of A. The weight of an edge is the similarity between the two behavior features represented by its endpoints.

Correspondingly, the adjacency graph calculation submodule computes the edges between local behavior features and the behavior features of remote nodes in two ways. In one-way mode: 1) when the behavior feature neighbor relationship verification submodule verifies that the remote source behavior feature and the local destination behavior feature are behavior features of the same malware, an edge is created between them and added to the local incident-edge data structure; 2) when a neighbor behavior feature message carrying the has-neighbor flag is received from a remote node, the local incident-edge data structure is checked for an existing edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the destination behavior feature number in the message; if such an edge exists nothing is done, and if not the edge is created and added to the local incident-edge data structure. In two-way mode, an edge between the local destination behavior feature and the remote source behavior feature is created and added to the local incident-edge data structure only when the local verification submodule has verified that the two are behavior features of the same malware and a neighbor behavior feature message carrying the has-neighbor flag has also been received for that same pair of feature numbers.
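The difference between the two edge-creation modes comes down to whether one confirmation or both are required. The class below is a minimal sketch of that logic; the class name, method names, and the dictionary used as the incident-edge data structure are hypothetical.

```python
class EdgeTable:
    """Local incident-edge data structure with one-way or two-way edge creation."""

    def __init__(self, two_way):
        self.two_way = two_way
        self.edges = {}        # (local feature id, remote feature id) -> similarity
        self._verified = {}    # pairs verified by the local verification submodule
        self._confirmed = {}   # pairs confirmed by a has-neighbor message

    def on_local_verification(self, local_id, remote_id, similarity):
        """Local verification found the pair to belong to the same malware."""
        self._verified[(local_id, remote_id)] = similarity
        self._update((local_id, remote_id))

    def on_neighbor_message(self, local_id, remote_id, similarity):
        """A remote node returned a message carrying the has-neighbor flag."""
        self._confirmed[(local_id, remote_id)] = similarity
        self._update((local_id, remote_id))

    def _update(self, key):
        if key in self.edges:
            return  # edge already exists: do nothing
        verified, confirmed = key in self._verified, key in self._confirmed
        ready = (verified and confirmed) if self.two_way else (verified or confirmed)
        if ready:
            self.edges[key] = self._verified.get(key, self._confirmed.get(key))
```

In one-way mode the first event for a pair creates the edge and the second is ignored; in two-way mode the edge appears only after both events have occurred, which yields a sparser but more conservative graph.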

The distributed hierarchical fusion tree construction module constructs the behavior feature adjacency graph from the neighbor behavior features of the local behavior features, as output by the neighbor behavior feature discovery module. In this graph each behavior feature is a vertex, an edge joins each behavior feature to each of its neighbor behavior features, and the weight of an edge is the similarity between the features at its two ends. A number of fusion trees are generated on the graph in a distributed manner. Each fusion tree fuses the behavior features represented by its vertices as hierarchically as possible: the most similar behavior features are fused first and the least similar last.

As shown in Figure 8, the distributed hierarchical fusion tree construction module in this embodiment includes:

a behavior feature maximum similarity tree construction submodule, which, on the basis of the behavior feature adjacency graph, uses a distributed minimum spanning tree algorithm modified to prefer edges with large weights to generate, from the incident-edge data structures of the behavior features, a behavior feature maximum similarity tree on the adjacency graph as the fusion tree;

a behavior feature fusion order construction submodule, which determines the fusion order of the behavior features in the fusion tree on the basis of the maximum similarity tree so that the most similar behavior features are fused first.

The distributed hierarchical fusion tree construction module of this embodiment first builds the behavior feature maximum similarity tree with the maximum similarity tree construction submodule, and then builds the behavior feature fusion order with the fusion order construction submodule.

The distributed hierarchical fusion tree construction module takes as input the neighbor behavior features of the local behavior features output by the neighbor behavior feature discovery module. With behavior features as vertices, an edge is constructed between each local behavior feature and each of its neighbor behavior features, weighted by the similarity of the two features, so that an undirected graph is formed over all behavior features. A number of fusion trees are generated on this undirected graph in a distributed manner, and each fusion tree fuses the behavior features represented by its vertices as hierarchically as possible: the most similar behavior features are fused first and the least similar last.

More precisely, the module takes as input the incident-edge data structures of the local behavior features in the adjacency graph output by the neighbor behavior feature discovery module. It first generates the behavior feature maximum similarity tree among the behavior features and then determines the fusion order of the behavior features on the basis of that tree. Each behavior feature runs a modified version of the classic distributed minimum spanning tree (MST) algorithm, which is described in detail in the paper "A Distributed Algorithm for Minimum-Weight Spanning Trees", published in ACM Transactions on Programming Languages and Systems, volume 5, issue 1. The modification replaces the preference for edges with small weights by a preference for edges with large weights, that is, edges between the most similar behavior features. The modified algorithm generates the behavior feature maximum similarity tree on the adjacency graph from the incident-edge data structures of the behavior features. Determining the fusion order on the basis of this tree yields a hierarchical fusion scheme in which, as far as possible, the most similar behavior features are fused first.

The distributed hierarchical fusion tree construction module consists of the behavior feature maximum similarity tree construction submodule and the behavior feature fusion order construction submodule. The maximum similarity tree construction submodule takes as input the incident-edge data structures of the local behavior features in the adjacency graph output by the neighbor behavior feature discovery module and runs an agent of the modified distributed MST algorithm for each behavior feature, generating the behavior feature maximum similarity tree on the adjacency graph; every local behavior feature has a corresponding maximum similarity tree. The fusion order construction submodule takes as input the maximum similarity tree output by the maximum similarity tree construction submodule and runs a fusion order selection agent for each behavior feature.

The next-layer behavior feature of a behavior feature in the fusion order is the behavior feature in the maximum similarity tree to which it must submit its own fusion result for the next round of fusion. The fusion order selection agent selects the next-layer behavior feature according to the following principles: 1) among the edges over which no fusion result from another behavior feature has been received, the behavior feature at the other end of the edge with the largest weight is the next-layer behavior feature; 2) when condition 1) cannot be satisfied, the parent node in the behavior feature maximum similarity tree is the next-layer behavior feature. Each behavior feature selects its next-layer behavior feature independently, according to the weights of the edges between itself and other behavior features, so that the behavior features are fused gradually in a distributed manner. Finally, all behavior features in the maximum similarity tree are fused at the root node of the tree to obtain the global behavior feature.
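The effect of preferring large weights can be seen in a centralized setting. The sketch below swaps in a single-machine Kruskal-style procedure for the distributed algorithm named in the text (the cited paper is by Gallager, Humblet, and Spira), purely to show which edges a maximum similarity tree keeps; it is not the distributed protocol itself, and the function name is hypothetical.

```python
def maximum_similarity_forest(vertices, edges):
    """Centralized illustration of a maximum-weight spanning forest.

    vertices: iterable of behavior feature ids.
    edges: list of (u, v, similarity) tuples from the adjacency graph.
    Returns the chosen tree edges, heaviest first.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Visit edges from most to least similar (the reverse of Kruskal's usual order).
    for u, v, weight in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree
```

On a graph with edges A-B (0.9), B-C (0.8), A-C (0.4), and C-D (0.7), the weakest edge A-C is the one left out, since A and C are already connected through more similar features.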

According to the fusion tree output by the distributed hierarchical fusion tree construction module, the behavior feature gradual fusion module fuses the behavior features in the tree step by step and finally obtains the global behavior feature and some other global attributes at the root behavior feature (vertex) of the fusion tree. These other global attributes may include the system-wide distribution of addresses of hosts observed to be infected with the malware, which can be used to further confirm how malicious the malware is and how far it has spread. The module comprises feature fusion agents implemented as processes or threads: either multiple agents, one for the behavior feature of each vertex in the behavior feature adjacency graph, or a single agent that handles the behavior features of all vertices. The feature fusion agent may also be implemented as processes without a managing process, as threads with or without a managing thread, or as a single event-driven process; the principle is the same as in this embodiment and is not repeated here.

In this embodiment, the feature fusion agent includes:

a behavior feature submission submodule, which submits the fusion result containing its own behavior feature to the next-layer feature fusion agent according to the fusion order;

a behavior feature receiving submodule, which receives the fusion results submitted by upper-layer feature fusion agents and records, for each result, the adjacent edge in the behavior feature adjacency graph over which it arrived; and a behavior feature fusion submodule, which determines whether the recorded adjacent edges include the largest-weight edge among the incident edges of its own behavior feature. If so, it fuses its own behavior feature with all received fusion results and submits the result to the next-layer feature fusion agent according to the fusion order; otherwise it fuses all received fusion results and submits that result together with its own behavior feature to the next-layer feature fusion agent according to the fusion order. Finally, the root feature fusion agent of the fusion tree produces the root behavior feature of the fusion tree, which is output as the result of the malware feature fusion analysis.

The behavior feature gradual fusion module runs a behavior feature fusion agent for each local behavior feature. The agent submits its own behavior feature to its next-layer behavior feature, receives the behavior features submitted by the other behavior features (vertices) in its maximum-similarity tree, and fuses the received behavior features with its own. The module comprises a behavior feature submission submodule, a behavior feature receiving submodule and a behavior feature fusion submodule, and the fusion agent of every behavior feature contains an agent for each of these three submodules. The submission submodule submits the agent's fusion result (which includes its own behavior feature) to the fusion agent of the next-layer behavior feature, as determined by the distributed hierarchical fusion tree construction module. The receiving submodule receives the fusion results (themselves behavior features) that other behavior features send to its agent, and records the associated edge on which each result arrived. The fusion submodule takes as input the behavior features received by the receiving submodule and the agent's own behavior feature, and checks whether the edges recorded by the receiving submodule include the edge with the largest weight among the associated edges of its own behavior feature. If so, it fuses all received behavior features with its own and submits the fusion result to the next-layer behavior feature through the submission submodule; otherwise, it fuses only the received behavior features and submits that fusion result together with its own behavior feature to the next-layer behavior feature through the submission submodule.

As shown in Figure 9, the behavior feature gradual fusion module in this embodiment works in the following steps:

1. Behavior feature fusion tree reception. Each fusion agent receives the fusion results (themselves behavior features) sent to it by other behavior features, and records the associated edge on which each result arrived.

2. Behavior feature fusion. Taking the received behavior features and its own behavior feature as input, the agent checks whether the edges recorded in the previous step include the edge with the largest weight among the associated edges of its own behavior feature. If so, it fuses all received behavior features with its own and passes the fusion result to the next step for sending; otherwise, it fuses only the received behavior features and passes that fusion result together with its own behavior feature to the next step for sending.

3. Behavior feature fusion tree submission. The agent submits its fusion result (including its own behavior feature) to the fusion agent of its next-layer behavior feature, as determined by the distributed hierarchical fusion tree construction module.
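The receive, fuse and submit steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FusionAgent`, its fields, and the use of set union as the fusion operator are hypothetical, since this passage does not define how two behavior features are actually merged.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class FusionAgent:
    """One agent per behavior feature (vertex) of the fusion tree."""
    own_feature: frozenset        # assumption: a feature is a set of fragments
    edge_weights: dict            # neighbor id -> weight of the associated edge
    received: dict = field(default_factory=dict)  # neighbor id -> features

    def receive(self, sender, features):
        # Step 1: keep the result and remember the edge it arrived on.
        self.received[sender] = list(features)

    def fuse(self):
        # Step 2: returns the list of features to submit in step 3.
        incoming = [f for fs in self.received.values() for f in fs]
        strongest = max(self.edge_weights, key=self.edge_weights.get,
                        default=None)
        union = lambda a, b: a | b   # placeholder fusion operator (assumption)
        if strongest in self.received:
            # A result arrived over the heaviest associated edge:
            # the agent's own feature joins the fusion now.
            return [reduce(union, incoming, self.own_feature)]
        if not incoming:
            return [self.own_feature]            # leaf: nothing to fuse yet
        # Otherwise fuse only what was received and pass the own feature
        # along separately.
        return [reduce(union, incoming), self.own_feature]
```

Returning a list mirrors the text: in the second case the agent submits two items, the fusion of the received results and its own unfused behavior feature.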

The feature fusion agents of the behavior feature gradual fusion module may, as needed, be implemented as processes or threads, with one feature fusion agent for the behavior feature of each vertex in the feature adjacency graph.

Figure 10 shows the multi-process representation of the behavior features of node i without a management process, as used during distributed hierarchical fusion tree construction and behavior feature fusion. Suppose node i has M_i local behavior features. Each behavior feature creates a corresponding process and binds a socket interface. A behavior feature (a vertex in the adjacency graph) communicates with a remote behavior feature (also a vertex in the adjacency graph) through the socket interfaces of the two. With this representation, the socket interface information bound to the behavior features must be exchanged before distributed hierarchical fusion tree construction and behavior feature fusion begin.

Figure 11 shows the multi-process representation of the behavior features of node i with a management process. Again suppose node i has M_i local behavior features. Node i first creates a management process and binds a management socket interface. The management process then spawns one process for each local behavior feature and records the correspondence between processes and behavior features. A behavior feature communicates with a remote behavior feature through the corresponding management processes, and each communication message indicates which behavior feature it belongs to. On receiving a message, the management process forwards it by inter-process communication to the process of the behavior feature indicated in the message. The management socket interface bound by the management process can be fixed, so management socket interface information need not be exchanged before distributed hierarchical fusion tree construction and behavior feature fusion begin.

As shown in Figure 12, in the process structures of Figures 10 and 11, each process contains a distributed hierarchical fusion tree construction module agent and a behavior feature gradual fusion module agent, which perform the functions of the distributed hierarchical fusion tree construction module and of the behavior feature gradual fusion module respectively.
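The routing role of the management process in Figure 11 can be illustrated as follows. This is a sketch under assumptions, not the patented implementation: the JSON message layout with `feature_id` and `payload` fields is hypothetical, and `queue.Queue` stands in for whatever inter-process channel a real deployment would use.

```python
import json
import queue


class Manager:
    """Stands in for the management process of one node."""

    def __init__(self, feature_ids):
        # One inbox per local behavior feature. A real system would use an
        # inter-process channel (for example a pipe) instead of queue.Queue.
        self.inboxes = {fid: queue.Queue() for fid in feature_ids}

    def dispatch(self, raw):
        # The message itself names the behavior feature it belongs to.
        msg = json.loads(raw)
        inbox = self.inboxes.get(msg["feature_id"])
        if inbox is None:
            return False          # unknown behavior feature: drop the message
        inbox.put(msg["payload"])
        return True
```

Because every message carries its own feature identifier, remote nodes only need to know the single, fixed management endpoint, which is the reason the text gives for not exchanging socket information beforehand.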

The above describes only preferred embodiments of the present invention. The scope of protection of the invention is not limited to these embodiments; all technical solutions that follow the idea of the invention fall within its scope of protection. It should be noted that improvements and refinements made by those of ordinary skill in the art without departing from the principles of the invention are also to be regarded as within the scope of protection of the invention.

Claims

1. A malware feature fusion analysis method based on behavior fragment sharing, characterized in that it is implemented by the following steps:

1) nodes are deployed at geographically dispersed locations in the network, each node being responsible for collecting and analyzing malware samples within one network region, and a distributed hash table module for building a distributed hash table is established in each node;

2) each node collects malware samples and divides them into a set of fixed-length behavior fragments, and counts, for each behavior fragment in the set, the number of local malware samples exhibiting that fragment, thereby obtaining the local statistical characteristics of the behavior fragment set;

3) each node publishes its behavior fragment set and the local statistical characteristics thereof to the distributed hash table for sharing; the nodes of the distributed hash table gather identical behavior fragments from different nodes and aggregate the global attributes of each behavior fragment, the global attributes of a behavior fragment being a set of triples of behavior fragment, source node address, and number of local malware samples at the source node exhibiting the fragment; the behavior fragment set with global attributes is returned to the source node that shared the behavior fragment set and its local statistical characteristics;

4) the source node computes, from the malware behavior features formed by the behavior fragments of the malware and their global attributes, a candidate neighbor node set with similar behavior features, and sends the source behavior feature of the malware to the remote nodes in the candidate neighbor node set; each remote node in the set compares the received source behavior feature with its local target behavior features to judge whether a target behavior feature is a neighbor behavior feature of the source behavior feature, and returns the judgment to the source node as a behavior feature neighbor relation; the source node constructs a behavior feature adjacency graph from the behavior feature neighbor relations returned by the remote nodes;

5) on the basis of the feature adjacency graph, a distributed minimum spanning tree algorithm that preferentially selects edges with large weights is applied to the adjacency graph, using the associated-edge data structure of the behavior features, to generate a maximum-similarity tree of behavior features that serves as the fusion tree, so that the fusion tree is generated in a distributed manner; the behavior features in the fusion tree are fused, with the fusion order of the behavior features determined from the maximum-similarity tree so that the most similar behavior features are fused first; the root behavior feature of the fusion tree is output as the malware feature fusion analysis result.

2. The malware feature fusion analysis method based on behavior fragment sharing according to claim 1, characterized in that the detailed steps of step 2) are as follows:

2.1) the behavior of a malware sample is treated as a sequentially executed behavior sequence, and contiguous behavior subsequences of fixed length are selected from that sequence as the fixed-length behavior fragment set obtained by segmentation; or a behavior dependency graph is built from the dependencies among the data operated on by the behaviors of the malware sample, and behavior dependency subgraphs with a fixed number of vertices are selected from that graph as the fixed-length behavior fragment set obtained by segmentation;

2.2) for each behavior fragment in the set, the number of local malware samples exhibiting that fragment is counted, and the resulting local statistical characteristics of the behavior fragment set are output.
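For illustration only, and not as part of the claims, the sequence-based variant of step 2.1) together with the counting of step 2.2) can be sketched in Python. The function names, and the representation of a sample as a list of behavior names, are assumptions made for this example; the dependency-subgraph variant is not covered.

```python
from collections import Counter


def fragments(behaviors, length):
    """Return the set of contiguous subsequences of the given fixed length."""
    return {tuple(behaviors[i:i + length])
            for i in range(len(behaviors) - length + 1)}


def local_statistics(samples, length):
    """Count, per fragment, how many local samples exhibit it."""
    counts = Counter()
    for behaviors in samples:
        # A set is used so that each sample counts at most once per fragment.
        counts.update(fragments(behaviors, length))
    return counts
```

A sample shorter than the fragment length yields no fragments, because the range in `fragments` is then empty.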

3. The malware feature fusion analysis method based on behavior fragment sharing according to claim 1, characterized in that the detailed steps of step 3) are as follows:

3.1) each node calls the distributed hash table module to obtain the key of a behavior fragment, encapsulates the behavior fragment and its local statistical characteristics as a distributed hash table message, and sends the key and the message to the distributed hash table module; the distributed hash table module looks up the collection and analysis node responsible for the key and routes the distributed hash table message to that node for storage;

3.2) different nodes are responsible for different behavior fragments; the distributed hash table module brings identical behavior fragments from different nodes to the same node and records the source node address of each publisher; that node aggregates global attributes from the behavior fragments and local statistical characteristics stored by the distributed hash table module, based on each behavior fragment, its source node, and the number of local malware samples at the source node exhibiting the fragment; the global attributes of a behavior fragment are a set of triples of behavior fragment, source node address, and number of local malware samples at the source node exhibiting the fragment; the behavior fragment set with global attributes is then sent to the source node address, and the source node at that address receives it.
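As a non-limiting illustration of steps 3.1) and 3.2), the following Python sketch maps a fragment to a key and collects the triples at the responsible node. Everything specific here is an assumption for the example: the SHA-1 key derivation, the modulo placement rule (real distributed hash tables route by key-space distance), and the names `fragment_key`, `responsible_node` and `FragmentStore`.

```python
import hashlib
from collections import defaultdict


def fragment_key(fragment):
    # Hypothetical key derivation: SHA-1 over a canonical text encoding.
    return hashlib.sha1("\x1f".join(fragment).encode("utf-8")).hexdigest()


def responsible_node(key, node_ids):
    # Simplified placement by modulo, used only to keep the sketch short.
    return node_ids[int(key, 16) % len(node_ids)]


class FragmentStore:
    """State held by the node responsible for a set of keys."""

    def __init__(self):
        self.triples = defaultdict(list)  # key -> [(fragment, source, count)]

    def publish(self, fragment, source, count):
        # Identical fragments from different sources meet under the same key.
        self.triples[fragment_key(fragment)].append((fragment, source, count))

    def global_attributes(self, fragment):
        # Every (fragment, source address, local sample count) triple reported.
        return list(self.triples[fragment_key(fragment)])
```

The list returned by `global_attributes` corresponds to the triple set that the responsible node sends back to each source node address.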

4. The malware feature fusion analysis method based on behavior fragment sharing according to claim 1, 2 or 3, characterized in that the detailed steps of step 4) are as follows:

4.1) the source node obtains the behavior feature of each local malware from the behavior fragments of the malware and their global attributes, computes the behavior fragment sharing degree between each behavior feature and every other remote node according to formula (1), and takes all remote nodes whose fragment sharing degree exceeds a preset sharing degree threshold as the candidate neighbor node set with similar behavior features;

$$\mathit{Share}_{tj} = \sum_{\substack{(\mathit{Frag}_s,\,\mathit{Node}_s,\,\mathit{Num}_s)\,\in\,\mathit{FragStatSet} \\ \mathit{Frag}_s\,\in\,\mathit{FragSet}_t,\ \mathit{Node}_s = \mathit{Node}_j}} \log(\mathit{Num}_t + \mathit{Num}_s)\,\log\frac{1}{F_s} \qquad (1 \le t \le M_i,\ 1 \le j \le N,\ j \ne i) \tag{1}$$

In formula (1), FragStatSet is the set of triples representing the global attributes of the behavior fragments; N is the number of nodes; M_i is the number of local behavior features of node i; FragSet_t is the behavior fragment set of behavior feature t; Num_t is the local sample count of behavior feature t; (Frag_s, Node_s, Num_s) is a triple in FragStatSet whose behavior fragment is Frag_s, whose source node is Node_s and whose local sample count is Num_s; F_s is the frequency with which behavior fragment Frag_s occurs in the behavior of normal programs; Node_j is a remote node; and Share_tj (1 ≤ t ≤ M_i, 1 ≤ j ≤ N, j ≠ i) is the behavior fragment sharing degree between behavior feature t and remote node Node_j;

4.2) the source node sends the behavior features of its local malware, as source behavior features, to all remote nodes in the candidate neighbor node set;

4.3) each remote node in the candidate neighbor node set forms a fused behavior feature from the behaviors common to the received source behavior feature and a local target behavior feature, and judges whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold; if it is, the common behavior of the source behavior feature and the target behavior feature is uncommon in normal programs and specific to malware, and the source behavior feature and the target behavior feature are judged to belong to the same malware; otherwise they are judged not to belong to the same malware;

4.4) if verification shows that the source behavior feature and the target behavior feature belong to the same malware, the target behavior feature of the remote node is a neighbor behavior feature of the source behavior feature, and the remote node assembles the source behavior feature number, the target behavior feature number, the similarity between the source and target behavior features, and a has-neighbor-behavior-feature flag into a return message and returns it to the source node holding the source behavior feature; otherwise the remote node returns to that source node a negative acknowledgment message carrying a no-neighbor-behavior-feature flag;

4.5) the source node creates the edges of the behavior feature adjacency graph from the returned behavior feature neighbor relations, the weight of an edge being the similarity between the two behavior features represented by its two vertices, and finally constructs the behavior feature adjacency graph; edges of the behavior feature adjacency graph are created in either one-way edge mode or two-way edge mode. In one-way edge mode: 1) when a remote source behavior feature is verified to be a behavior feature of the same malware as a local target behavior feature, an edge is created between the local target behavior feature and the remote source behavior feature and added to the local associated-edge data structure; 2) on receiving from a remote node a neighbor behavior feature message carrying the has-neighbor-behavior-feature flag, the local associated-edge data structure is checked for an existing edge between the local behavior feature denoted by the source behavior feature number in the message and the remote behavior feature denoted by the target behavior feature number in the message; if such an edge exists, nothing is done; if not, the edge is created between those two behavior features and added to the local associated-edge data structure; the feature adjacency graph is thereby constructed. In two-way edge mode, an edge is created between a local target behavior feature and a remote source behavior feature, and added to the local associated-edge data structure, only when the local behavior feature neighbor relation verification submodule has verified that the remote source behavior feature and the local target behavior feature are behavior features of the same malware and a neighbor behavior feature message carrying the has-neighbor-behavior-feature flag has been received for that local target behavior feature number and that remote source behavior feature number; the feature adjacency graph is thereby constructed.
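As a worked illustration of formula (1) and the threshold test of step 4.1), and not as part of the claims, the computation can be sketched in Python. The function names and argument layout are assumptions; the natural logarithm is also an assumption, since the claim does not state the base, and every frequency F_s is taken to be strictly positive.

```python
import math


def share_degree(frag_set_t, num_t, frag_stat_set, node_j, normal_freq):
    """Formula (1): sharing degree between behavior feature t and node j."""
    total = 0.0
    for frag, node, num in frag_stat_set:
        # Only triples for fragments of feature t reported by node j count.
        if frag in frag_set_t and node == node_j:
            total += math.log(num_t + num) * math.log(1.0 / normal_freq[frag])
    return total


def candidate_neighbors(frag_set_t, num_t, frag_stat_set, nodes,
                        normal_freq, threshold):
    """Remote nodes whose sharing degree exceeds the preset threshold."""
    return {j for j in nodes
            if share_degree(frag_set_t, num_t, frag_stat_set, j,
                            normal_freq) > threshold}
```

The factor log(1/F_s) gives more weight to fragments that are rare in normal programs, so a node that shares one unusual fragment can outrank a node that shares several common ones.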

5. The malware feature fusion analysis method based on behavior fragment sharing according to claim 4, characterized in that the detailed steps of step 5) are as follows:

5.1) on the basis of the feature adjacency graph, the source node applies a distributed minimum spanning tree algorithm that preferentially selects edges with large weights to the adjacency graph, using the associated-edge data structure of the behavior features, to generate a maximum-similarity tree of behavior features that serves as the fusion tree, and determines the fusion order of the behavior features in the fusion tree from the maximum-similarity tree so that the most similar behavior features are fused first;

5.2) the source node generates a separate process or thread as the feature fusion agent for the behavior feature of each vertex in the feature adjacency graph, or generates a single feature fusion agent for the behavior features of all vertices in the feature adjacency graph; each feature fusion agent submits the fusion result containing its own behavior feature to the feature fusion agent of the next layer according to the fusion order; a feature fusion agent that receives fusion results records the adjacency edge in the feature adjacency graph corresponding to the previous-layer feature fusion agent from which each result was received, and checks whether that adjacency edge is the edge with the largest weight among the associated edges of its own behavior feature; if so, it fuses its own behavior feature with all received fusion results and submits the fusion result to the feature fusion agent of the next layer according to the fusion order; otherwise, it fuses all received fusion results and submits that fusion result together with its own behavior feature to the feature fusion agent of the next layer according to the fusion order; the root behavior feature of the fusion tree is finally obtained by fusion;

5.3) the root behavior feature of the fusion tree is output as the malware feature fusion analysis result.
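For illustration only, the maximum-similarity tree of step 5.1) can be computed on a single machine with Kruskal's algorithm by taking the heaviest edges first. This is a centralized stand-in chosen for brevity, not the distributed minimum spanning tree algorithm the claim refers to, and the function name and edge format are assumptions.

```python
def maximum_similarity_tree(vertices, edges):
    """Kruskal's algorithm, heaviest edges first.

    edges is an iterable of (weight, u, v); returns the chosen tree edges.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge joins two separate components
            parent[ru] = rv
            tree.append((weight, u, v))
    return tree
```

Processing edges in descending weight order means the most similar pairs of behavior features are connected first, which matches the stated goal of fusing the most similar behavior features first.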

6. A malware feature fusion analysis system based on behavior fragment sharing, characterized in that it comprises nodes deployed at geographically dispersed locations in the network, each node being responsible for collecting and analyzing malware samples within one network region, each node comprising:

A behavior feature segmentation module, used by the node to collect malware samples and divide them into a set of fixed-length behavior fragments, to count, for each behavior fragment in the set, the number of local malware samples exhibiting that fragment, and to submit the resulting local statistical characteristics of the behavior fragment set to the behavior fragment collaborative sharing module;

A distributed hash table module, used to build the distributed hash table in the node;

A behavior fragment collaborative sharing module, used by each node to publish its behavior fragment set and the local statistical characteristics thereof to the distributed hash table for sharing, wherein the nodes of the distributed hash table gather identical behavior fragments from different nodes and aggregate the global attributes of each behavior fragment, the global attributes of a behavior fragment being a set of triples of behavior fragment, source node address, and number of local malware samples at the source node exhibiting the fragment, and the behavior fragment set with global attributes is returned to the source node that shared the behavior fragment set and its local statistical characteristics;

A neighbor behavior feature discovery module, used to compute, from the malware behavior features formed by the behavior fragments of the malware and their global attributes, the candidate neighbor node set with similar behavior features, and to send the source behavior feature of the malware to the remote nodes in the candidate neighbor node set; used also, when the node itself acts as a remote node in a candidate neighbor node set, to compare the received source behavior feature with its local target behavior features to judge whether a target behavior feature is a neighbor behavior feature of the source behavior feature, and to return the judgment to the source node as a behavior feature neighbor relation; and used to construct the behavior feature adjacency graph from the behavior feature neighbor relations returned by remote nodes;

A distributed hierarchical fusion tree construction module, used, on the basis of the feature adjacency graph, to apply a distributed minimum spanning tree algorithm that preferentially selects edges with large weights to the adjacency graph, using the associated-edge data structure of the behavior features, to generate a maximum-similarity tree of behavior features that serves as the fusion tree, so that the fusion tree is generated in a distributed manner;

A behavior feature gradual fusion module, used to fuse the behavior features in the fusion tree, with the fusion order of the behavior features determined from the maximum-similarity tree so that the most similar behavior features are fused first, and to output the root behavior feature of the fusion tree as the malware feature fusion analysis result.

7. The malware feature fusion analysis system based on behavior fragment sharing according to claim 6, characterized in that the behavior feature segmentation module comprises:

A contiguous behavior subsequence submodule or a behavior dependency subgraph segmentation submodule, wherein the contiguous behavior subsequence submodule treats the behavior of a malware sample as a sequentially executed behavior sequence and selects from that sequence contiguous behavior subsequences of fixed length as the fixed-length behavior fragment set obtained by segmentation, and the behavior dependency subgraph segmentation submodule builds a behavior dependency graph from the dependencies among the data operated on by the behaviors of the malware sample and selects from that graph behavior dependency subgraphs with a fixed number of vertices as the fixed-length behavior fragment set obtained by segmentation;

A behavior fragment set statistics submodule, used to count, for each behavior fragment in the set, the number of local malware samples exhibiting that fragment, and to output the resulting local statistical characteristics of the behavior fragment set to the behavior fragment collaborative sharing module.

8. The malware feature fusion analysis system based on behavior fragment sharing according to claim 6, characterized in that the distributed hash table module comprises:

A behavior fragment key mapping submodule, used to obtain a key from an input behavior fragment by hash computation;

A key routing submodule, used to look up the collection and analysis node responsible for a key and to route the distributed hash table message to that node for storage;

The behavior fragment collaborative sharing module comprises:

A behavior fragment publishing submodule, used to call the distributed hash table module to obtain the key of a behavior fragment, to encapsulate the behavior fragment and its local statistical characteristics as a distributed hash table message, and to send the key and the message to the distributed hash table module;

A behavior fragment receiving submodule, used, when the node acts as a node that aggregates global attributes in the distributed system, to receive the behavior fragments and local statistical characteristics stored by the distributed hash table module;

A behavior fragment statistics submodule, used, when the node acts as a node that aggregates global attributes in the distributed system, to aggregate global attributes based on each behavior fragment, its source node, and the number of local malware samples at the source node exhibiting the fragment, the global attributes of a behavior fragment being a set of triples of behavior fragment, source node address, and number of local malware samples at the source node exhibiting the fragment;

A behavior fragment global attribute return submodule, used, when the node acts as a node that aggregates global attributes in the distributed system, to send the behavior fragment set with global attributes output by the behavior fragment statistics submodule to the source node address;

A behavior fragment global attribute receiving submodule, used, when the node acts as a source node sharing behavior fragments, to receive the behavior fragment set with global attributes sent by the behavior fragment global attribute return submodule.

9. The malware feature fusion analysis system based on behavior fragment sharing according to claim 6, 7 or 8, characterized in that the neighbor behavior feature discovery module comprises:

A candidate neighbor node set calculation submodule, used to obtain the behavior feature of each local malware from the behavior fragments of the malware and their global attributes, to compute the behavior fragment sharing degree between each behavior feature and every other remote node according to formula (1), and to take all remote nodes whose fragment sharing degree exceeds a preset sharing degree threshold as the candidate neighbor node set with similar behavior features;

A behavior feature sending submodule, used, when the node acts as a source node, to send the behavior features of its local malware, as source behavior features, to all remote nodes in the candidate neighbor node set;

A behavior feature receiving submodule, used, when the node acts as a remote node in a candidate neighbor node set, to receive the source behavior features sent by the source node;

A behavior feature neighbor relation verification submodule, used to form a fused behavior feature from the behaviors common to the received source behavior feature and a local target behavior feature, and to judge whether the frequency with which the fused behavior feature occurs in normal programs is below a set threshold; if it is, the common behavior of the source behavior feature and the target behavior feature is uncommon in normal programs and specific to malware, and the source behavior feature and the target behavior feature are judged to belong to the same malware; otherwise they are judged not to belong to the same malware; if verification shows that the source behavior feature and the target behavior feature belong to the same malware, the target behavior feature of the remote node is a neighbor behavior feature of the source behavior feature, and the remote node assembles the source behavior feature number, the target behavior feature number, the similarity between the source and target behavior features, and a has-neighbor-behavior-feature flag into a return message and returns it to the source node holding the source behavior feature; otherwise the remote node returns to that source node a negative acknowledgment message carrying a no-neighbor-behavior-feature flag;

An adjacency graph calculation submodule, configured to create the edges of the behavior feature adjacency graph according to the returned behavior feature neighbor relationships, the weight of an edge being the similarity between the two behavior features represented by its two vertices, and thereby to construct the behavior feature adjacency graph; the edges of said behavior feature adjacency graph are created in either a one-way edge mode or a two-way edge mode; in the one-way edge mode: 1) when a remote source behavior feature and a local target behavior feature are verified to be behavior features of the same malware, an edge is created between the local target behavior feature and the remote source behavior feature and added to the local incident edge data structure; 2) upon receiving from a remote node a neighbor behavior feature message carrying the has-neighbor-behavior-feature flag, the local incident edge data structure is checked for an edge between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the target behavior feature number in the message; if such an edge exists, no operation is performed; if not, an edge is created between the local behavior feature identified by the source behavior feature number in the message and the remote behavior feature identified by the target behavior feature number in the message, and is added to the local incident edge data structure; in the two-way edge mode, an edge is created between the local target behavior feature and the remote source behavior feature, and added to the local incident edge data structure, only when the local behavior feature neighbor relationship verification submodule has verified that the remote source behavior feature and the local target behavior feature are behavior features of the same malware and a neighbor behavior feature message carrying the has-neighbor-behavior-feature flag has been received for the number of that local target behavior feature and the number of that remote source behavior feature;

$$\mathrm{Share}_{tj} = \sum_{\substack{(\mathrm{Frag}_{s},\, \mathrm{Node}_{s},\, \mathrm{Num}_{s}) \in \mathrm{FragStatSet} \\ \mathrm{Frag}_{s} \in \mathrm{FragSet}_{t},\ \mathrm{Node}_{s} = \mathrm{Node}_{j}}} \log(\mathrm{Num}_{t} + \mathrm{Num}_{s}) \, \log\frac{1}{F_{s}} \qquad (1 \leq t \leq M_{i},\ 1 \leq j \leq N,\ j \neq i) \qquad (1)$$

In formula (1), FragStatSet is the set of triples representing the global properties of the behavior fragments; N is the number of nodes; M_i is the number of local behavior features at node i; FragSet_t is the behavior fragment set of behavior feature t; Num_t is the local sample count of behavior feature t; (Frag_s, Node_s, Num_s) is a triple in FragStatSet whose behavior fragment is Frag_s, whose source node is Node_s, and whose local sample count is Num_s; F_s is the frequency with which behavior fragment Frag_s occurs in the behavior of normal programs; Node_j is a remote node; and Share_tj (1 ≤ t ≤ M_i, 1 ≤ j ≤ N, j ≠ i) is the behavior fragment sharing degree between behavior feature t and remote node Node_j.
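As an illustration only, formula (1) and the threshold test of the candidate neighbor node set calculation submodule can be sketched in Python as follows. The function names, the dictionary `normal_freq` that maps each fragment to F_s, and the use of the natural logarithm are assumptions made for this sketch; the claim does not fix a logarithm base or a data layout. The sketch also assumes every F_s lies in (0, 1], since F_s = 0 would make log(1/F_s) undefined.

```python
import math
from collections import defaultdict


def sharing_degrees(frag_set_t, num_t, frag_stat_set, normal_freq, local_node):
    """Compute Share_tj for one local behavior feature t against every remote node j.

    frag_set_t    -- set of behavior fragments of feature t (FragSet_t)
    num_t         -- local sample count of feature t (Num_t)
    frag_stat_set -- iterable of (Frag_s, Node_s, Num_s) triples (FragStatSet)
    normal_freq   -- hypothetical mapping Frag_s -> F_s, each value in (0, 1]
    local_node    -- identifier of node i, excluded because j != i
    """
    share = defaultdict(float)
    for frag_s, node_s, num_s in frag_stat_set:
        # Only fragments of feature t that were reported by a remote node count.
        if node_s == local_node or frag_s not in frag_set_t:
            continue
        # Natural logarithm is an assumption; the source does not state the base.
        share[node_s] += math.log(num_t + num_s) * math.log(1.0 / normal_freq[frag_s])
    return dict(share)


def candidate_neighbors(share, threshold):
    """Remote nodes whose sharing degree is strictly greater than the threshold."""
    return {node for node, value in share.items() if value > threshold}


# Usage with made-up data: node "n1" holds feature t with fragments {"a", "b"}.
stats = [("a", "n2", 2), ("b", "n2", 1), ("a", "n3", 1), ("c", "n3", 5), ("a", "n1", 3)]
freqs = {"a": 0.01, "b": 0.5, "c": 0.1}
shares = sharing_degrees({"a", "b"}, 3, stats, freqs, "n1")
neighbors = candidate_neighbors(shares, 7.0)
```

The factor log(1/F_s) gives more weight to fragments that are rare in normal programs, so a remote node that shares only commonplace fragments accumulates a small sharing degree and is less likely to pass the threshold.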

10. The malware feature fusion analysis system based on behavior fragment sharing according to claim 9, characterized in that said distributed hierarchical fusion tree construction module comprises:

A behavior feature maximum similarity tree construction submodule, configured to generate, on the basis of said feature adjacency graph and according to the incident edge data structures of the behavior features, a behavior feature maximum similarity tree on the adjacency graph as the fusion tree, using a distributed minimum spanning tree algorithm that preferentially selects edges with large weights;

A behavior feature fusion order construction submodule, configured to determine, based on the behavior feature maximum similarity tree, the fusion order of the behavior features in said fusion tree such that the most similar behavior features are fused first;

Said behavior feature gradual fusion module comprises either multiple feature fusion agents, each of which, in the form of a process or thread, corresponds to the behavior feature of one vertex in the feature adjacency graph, or a single feature fusion agent corresponding to the behavior features of all vertices in the feature adjacency graph;

Said feature fusion agent comprises:

A behavior feature submission submodule, configured to submit the fusion result containing its own behavior feature to the feature fusion agent of the next layer according to said fusion order;

A behavior feature receiving submodule, configured to receive the fusion results submitted by the feature fusion agents of the previous layer and to record the adjacent edge, in the feature adjacency graph, corresponding to the previous-layer feature fusion agent from which each said fusion result was received;

A behavior feature fusion submodule, configured to judge whether said adjacent edge is the edge of maximum weight among all adjacent edges in the incident edges of its own behavior feature; if so, its own behavior feature and all received fusion results are fused, and the fusion result is submitted to the next-layer feature fusion agent according to said fusion order; otherwise, all received fusion results are fused, and the fusion result together with its own behavior feature is submitted to the next-layer feature fusion agent according to said fusion order; finally, the root feature fusion agent of the fusion tree performs fusion to obtain the root behavior feature of said fusion tree, and the root behavior feature of said fusion tree is output as the malware feature fusion analysis result.
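For illustration, the maximum similarity tree named in claim 10 can be sketched as follows. The claim calls for a distributed minimum spanning tree algorithm that prefers large-weight edges; this sketch substitutes a centralized Kruskal-style procedure over edges sorted by descending similarity, which yields a maximum-weight spanning tree (or forest, if the graph is disconnected) but is not the distributed procedure itself. The function name and the `(u, v, weight)` edge layout are assumptions.

```python
def maximum_similarity_tree(vertices, edges):
    """Return the edges of a maximum-weight spanning tree (or forest).

    vertices -- iterable of hashable behavior feature identifiers
    edges    -- iterable of (u, v, similarity) tuples from the adjacency graph

    Centralized stand-in (assumption) for the distributed algorithm in the claim.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Visit the most similar pairs first, so large-weight edges are preferred.
    for u, v, weight in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Usage with made-up similarities between four behavior features.
graph_edges = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.7),
               ("C", "D", 0.4), ("B", "D", 0.3)]
fusion_tree = maximum_similarity_tree(["A", "B", "C", "D"], graph_edges)
```

Because the edges are visited in descending order of similarity, the order in which they enter the tree is one plausible way to ensure that the most similar behavior features are fused first; the source does not spell out the exact fusion order.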