CN114637892A - Method for generating a summary graph of a system log dependency graph for attack investigation and restoration

Info

Publication number: CN114637892A
Application number: CN202210107372.9A
Authority: CN
Inventors: 孟丹; 文雨; 徐志强; 张博洋; 杨纯; 郑阳; 张东雪; 杜莹莹; 吴艳娜
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-06-17
Anticipated expiration: 2042-01-28
Also published as: CN114637892B

Abstract

The invention provides a method for generating a summary graph of a system log dependency graph for attack investigation and restoration. The method comprises: determining a system entity dependency graph of an attack event to be investigated and restored, wherein the dependency graph comprises system entity nodes related to the attack event and the calling relationships between those nodes, and the system entity nodes comprise process nodes and resource nodes; performing a hierarchical random walk on the process nodes in the dependency graph to determine a behavior representation of each process node; clustering the process nodes based on the behavior representations, and dividing the dependency graph into at least one first subgraph based on the clustering result; compressing each first subgraph to obtain at least one second subgraph; and generating a summary for each second subgraph to obtain a summary graph corresponding to the dependency graph. By dividing the dependency graph into multiple subgraphs and giving each subgraph a concise summary, the invention makes it easy to view an overview of the related system activities and the summary information of the attack-related subgraphs.

Description

Method for generating a summary graph of a system log dependency graph for attack investigation and restoration

Technical Field

The present invention relates to the technical field of network security, and in particular to a method for generating a summary graph of a system log dependency graph for attack investigation and restoration.

Background

In response to cyber attacks, causal analysis based on system monitoring has become an important method for attack investigation.

Causal analysis methods use a system entity dependency graph to represent system call events. Based on this graph, the context of an attack can be investigated by reconstructing the chain of events leading to a POI (Point of Interest) event, and such contextual information can effectively reveal attack-related events. However, because of the dependency explosion problem, it is difficult to extract the required contextual information efficiently from a huge graph, and extensive manual inspection is needed.

To address the dependency explosion problem, existing methods mainly comprise techniques that automatically filter irrelevant events out of the dependency graph and techniques that reveal attack-related events. Although these attack investigation techniques based on system entity dependency graphs have achieved good results, manual attack investigation is still required, which limits their practical applicability.

Summary of the Invention

In view of the problems in the prior art, the present invention provides a method for generating a summary graph of a system log dependency graph for attack investigation and restoration.

In a first aspect, the present invention provides a method for generating a summary graph of a system log dependency graph for attack investigation and restoration, comprising:

determining a system entity dependency graph of an attack event to be investigated and restored, the system entity dependency graph containing system entity nodes associated with the attack event and the calling relationships between the system entity nodes, wherein the system entity nodes include process nodes and resource nodes, and the calling relationships between the system entity nodes represent system activities;

performing a hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes;

clustering the process nodes based on their behavior representations, and dividing the system entity dependency graph into at least one first subgraph based on the clustering result;

compressing each first subgraph of the at least one first subgraph to obtain at least one second subgraph, the at least one second subgraph being in one-to-one correspondence with the at least one first subgraph;

generating a summary corresponding to each second subgraph of the at least one second subgraph to obtain a summary graph corresponding to the system entity dependency graph.

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, performing the hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes comprises:

taking each process node in the system entity dependency graph as a starting point and walking randomly for a preset length to generate walk paths;

obtaining the behavior representations of the process nodes from the walk paths by means of a word vector model.
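The walk-generation step can be illustrated with a minimal sketch. It uses a plain uniform random walk, not the hierarchical variant, whose transition rule is not specified in this passage. The function name `generate_walks`, the adjacency-dictionary layout, and the node labels are hypothetical. The resulting walks would then be passed as "sentences" to a word vector model, which is outside the scope of the sketch.

```python
import random

def generate_walks(adjacency, process_nodes, walk_length, seed=0):
    """Start one walk at every process node and follow random neighbors.

    adjacency: dict mapping a node to the list of its neighbors (assumed layout).
    walk_length: the preset number of nodes per walk.
    """
    rng = random.Random(seed)
    walks = []
    for start in process_nodes:
        walk = [start]
        current = start
        while len(walk) < walk_length:
            neighbors = adjacency.get(current, [])
            if not neighbors:  # dead end: stop this walk early
                break
            current = rng.choice(neighbors)
            walk.append(current)
        walks.append(walk)
    return walks

# Hypothetical toy graph: two processes, one file, one network endpoint.
adjacency = {
    "proc:bash": ["file:/tmp/a", "proc:curl"],
    "proc:curl": ["net:10.0.0.5:443", "file:/tmp/a"],
    "file:/tmp/a": ["proc:bash", "proc:curl"],
    "net:10.0.0.5:443": ["proc:curl"],
}
walks = generate_walks(adjacency, ["proc:bash", "proc:curl"], walk_length=5)
```

Each walk is a sequence of node labels, so nodes that tend to co-occur on walks end up with similar vectors once a word vector model is trained on them.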

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, compressing each first subgraph of the at least one first subgraph to obtain at least one second subgraph comprises:

determining a first pattern in a target subgraph among the at least one first subgraph, the first pattern being one in which the same process node spawns at least two identical sets of process nodes to access the same resource node, wherein each set of process nodes includes at least one child process node, and the resource node includes a file node or a network node;

merging the identical child process nodes in the first pattern, and merging the edges connected to those child process nodes, thereby completing the compression of the target subgraph and obtaining the second subgraph corresponding to the target subgraph;

determining a second pattern in the target subgraph among the at least one first subgraph, the second pattern being one in which the same process node accesses different resource nodes at least twice, wherein the resource nodes include file nodes or network nodes;

merging the different resource nodes in the second pattern, thereby completing the compression of the target subgraph and obtaining the second subgraph corresponding to the target subgraph.

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, before compressing each first subgraph of the at least one first subgraph, the method further comprises:

when at least two process nodes access the same resource node and the at least two process nodes come from different first subgraphs, creating at least one replica node of the resource node;

assigning the resource node and the at least one replica node, in one-to-one correspondence, to the first subgraphs in which the at least two process nodes are located, and creating directed edges between the resource node and the at least one replica node to connect the resource node and the replica nodes.
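The replica-node step can be sketched as follows. This is only an illustration under stated assumptions: the function name `replicate_shared_resources`, the `#replicaN` naming scheme, the choice to leave the original node in the first subgraph seen, and the edge direction (original to replica) are all hypothetical and not taken from the text.

```python
def replicate_shared_resources(accesses, subgraph_of):
    """Give each subgraph its own copy of a resource shared across subgraphs.

    accesses: list of (process, resource) pairs.
    subgraph_of: dict mapping each process to its first-subgraph id.
    """
    # For every resource, list the subgraphs that touch it, in first-seen order.
    owners = {}
    for process, resource in accesses:
        sg = subgraph_of[process]
        owners.setdefault(resource, [])
        if sg not in owners[resource]:
            owners[resource].append(sg)

    assignment = {}     # node name -> subgraph id
    replica_edges = []  # directed edges (resource, replica); direction is an assumption
    for resource, subgraphs in owners.items():
        assignment[resource] = subgraphs[0]        # original stays in the first subgraph
        for i, sg in enumerate(subgraphs[1:], start=1):
            replica = f"{resource}#replica{i}"     # hypothetical naming scheme
            assignment[replica] = sg
            replica_edges.append((resource, replica))
    return assignment, replica_edges

accesses = [("p1", "file:/etc/passwd"), ("p2", "file:/etc/passwd"), ("p3", "file:/tmp/x")]
subgraph_of = {"p1": 0, "p2": 1, "p3": 1}
assignment, replica_edges = replicate_shared_resources(accesses, subgraph_of)
```

In this toy input only `file:/etc/passwd` is accessed from two subgraphs, so it receives one replica, while `file:/tmp/x` is left untouched.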

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, before performing the hierarchical random walk on the process nodes in the system entity dependency graph, the method further comprises:

merging the parallel edges between each process node and the resource nodes in the system entity dependency graph, the parallel edges being edges that have the same read operation type or the same write operation type;

deleting from the system entity dependency graph the resource nodes that have only incoming edges and no outgoing edges.
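The two preprocessing steps can be sketched as below. The edge dictionaries, the function names, and the rule that a merged edge keeps the earliest start time and latest end time are assumptions made for illustration; the sketch also takes edge direction as stored in the graph and does not fix which way a read or a write points.

```python
def merge_parallel_edges(edges):
    """Collapse edges sharing source, destination and operation type into one edge."""
    merged = {}
    for e in edges:
        key = (e["src"], e["dst"], e["op"])
        if key not in merged:
            merged[key] = dict(e)
        else:
            # Assumption: the merged edge spans the earliest start to the latest end.
            m = merged[key]
            m["start"] = min(m["start"], e["start"])
            m["end"] = max(m["end"], e["end"])
    return list(merged.values())

def drop_sink_resources(edges, resource_nodes):
    """Remove resource nodes that have incoming edges but no outgoing edges."""
    has_out = {e["src"] for e in edges}
    has_in = {e["dst"] for e in edges}
    sinks = {n for n in resource_nodes if n in has_in and n not in has_out}
    kept = [e for e in edges if e["dst"] not in sinks]
    return kept, sinks

edges = [
    {"src": "p1", "dst": "f1", "op": "write", "start": 1, "end": 2},
    {"src": "p1", "dst": "f1", "op": "write", "start": 5, "end": 6},
    {"src": "f2", "dst": "p1", "op": "read", "start": 0, "end": 1},
    {"src": "p1", "dst": "f2", "op": "write", "start": 3, "end": 4},
]
merged_edges = merge_parallel_edges(edges)
kept_edges, removed = drop_sink_resources(merged_edges, {"f1", "f2"})
```

Here the two writes from `p1` to `f1` collapse into one edge, and `f1` is then removed because nothing ever flows out of it.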

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, the summary includes at least one of the following:

a main process;

a time span;

target information flows;

wherein the main process denotes the parent process node of the system activities included in the second subgraph;

the time span denotes the interval between the earliest start time and the latest end time of the system activities included in the second subgraph;

the target information flows denote those information flows, among the information flows corresponding to the system activities included in the second subgraph, whose priority ranks within a preset number of top positions among all the information flows.
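The three summary fields can be illustrated with a short sketch. The `Summary` dataclass, the `summarize` function, and the numeric priority scores are hypothetical; this passage does not say how priority is computed, so the sketch simply assumes that a higher score ranks first.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    main_process: str
    time_span: tuple   # (earliest start, latest end)
    top_flows: list    # the k highest-priority information flows

def summarize(main_process, events, flows, k):
    """events: list of (start, end) times; flows: list of (description, priority)."""
    earliest = min(start for start, _ in events)
    latest = max(end for _, end in events)
    # Assumption: a larger priority value means a more important flow.
    ranked = sorted(flows, key=lambda f: f[1], reverse=True)
    return Summary(main_process, (earliest, latest), [desc for desc, _ in ranked[:k]])

summary = summarize(
    "proc:bash",
    events=[(10, 12), (5, 8), (11, 20)],
    flows=[("a->b", 0.2), ("c->d", 0.9), ("e->f", 0.5)],
    k=2,
)
```

With this input the time span runs from 5 to 20 and the two retained flows are `c->d` and `e->f`.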

Optionally, in the method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention, merging the different resource nodes in the second pattern comprises:

merging the different resource nodes in the second pattern into a single node that serves as a merged resource node, the attributes of the merged resource node being the union of the attributes of the different resource nodes.
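One way to realize the attribute union is sketched below. Representing each node as a dictionary and the merged node as a mapping from attribute name to the set of observed values is an assumption, as is the function name `merge_resource_nodes`.

```python
def merge_resource_nodes(nodes):
    """Merge resource nodes into one node whose attributes are the union of theirs.

    nodes: list of dicts mapping attribute name -> value (assumed layout).
    Returns a dict mapping attribute name -> set of all values seen.
    """
    merged = {}
    for node in nodes:
        for attr, value in node.items():
            merged.setdefault(attr, set()).add(value)
    return merged

merged_node = merge_resource_nodes([
    {"Name": "a.log", "Path": "/var/log"},
    {"Name": "b.log", "Path": "/var/log"},
])
```

Using sets keeps every distinct file name while collapsing the shared path into a single value.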

In a second aspect, the present invention further provides an apparatus for generating a summary graph of a system log dependency graph for attack investigation and restoration, comprising:

a first determining module, configured to determine a system entity dependency graph of an attack event to be investigated and restored, the system entity dependency graph containing system entity nodes associated with the attack event and the calling relationships between the system entity nodes, wherein the system entity nodes include process nodes and resource nodes, and the calling relationships between the system entity nodes represent system activities;

a second determining module, configured to perform a hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes;

a subgraph dividing module, configured to cluster the process nodes based on their behavior representations, and to divide the system entity dependency graph into at least one first subgraph based on the clustering result;

a subgraph compression module, configured to compress each first subgraph of the at least one first subgraph to obtain at least one second subgraph, the at least one second subgraph being in one-to-one correspondence with the at least one first subgraph;

a summary graph generation module, configured to generate a summary corresponding to each second subgraph of the at least one second subgraph to obtain a summary graph corresponding to the system entity dependency graph.

In a third aspect, the present invention further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for generating a summary graph of a system log dependency graph for attack investigation and restoration according to the first aspect.

In a fourth aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for generating a summary graph of a system log dependency graph for attack investigation and restoration according to the first aspect.

In a fifth aspect, the present invention further provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the methods described above for generating a summary graph of a system log dependency graph for attack investigation and restoration.

The method provided by the present invention determines the system entity dependency graph of the attack event to be investigated and restored, performs a hierarchical random walk on the process nodes in that graph to determine their behavior representations, divides the graph into at least one first subgraph based on those representations, compresses each first subgraph to obtain at least one second subgraph, and finally generates a summary for each second subgraph, thereby obtaining the summary graph corresponding to the system entity dependency graph. The summary graph is generated by dividing the system entity dependency graph into multiple subgraphs and providing a concise summary for each; each subgraph contains only closely related processes that jointly accomplish a system task. The generated summary graph preserves the semantics of the system activities in the system entity dependency graph while hiding the less important details, and visualizes them in summary form. This not only reduces the size of the system entity dependency graph but also makes it easy to view an overview of the related system activities and the summary information of attack-related communities.

Brief Description of the Drawings

To explain the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is the first schematic flowchart of the summary graph generation method provided by the present invention;

FIG. 2 is a schematic diagram of a system entity dependency graph in the summary graph generation method provided by the present invention;

FIG. 3 is the first schematic diagram of overlapping nodes in the summary graph generation method provided by the present invention;

FIG. 4 is the second schematic diagram of overlapping nodes in the summary graph generation method provided by the present invention;

FIG. 5 is the third schematic diagram of overlapping nodes in the summary graph generation method provided by the present invention;

FIG. 6 is a schematic diagram of a summary graph in the summary graph generation method provided by the present invention;

FIG. 7 is a schematic diagram of the first pattern in the summary graph generation method provided by the present invention;

FIG. 8 is a schematic diagram of the second pattern in the summary graph generation method provided by the present invention;

FIG. 9 is the second schematic flowchart of the summary graph generation method provided by the present invention;

FIG. 10 is a schematic diagram of the number of subgraphs detected by the summary graph generation method provided by the present invention;

FIG. 11 is a schematic diagram of the size distribution of the subgraphs detected by the summary graph generation method provided by the present invention;

FIG. 12 is a schematic diagram of the node compression ratio distribution of the summary graph generation method provided by the present invention;

FIG. 13 is a schematic diagram of the edge compression ratio distribution of the summary graph generation method provided by the present invention;

FIG. 14 is a schematic structural diagram of the summary graph generation apparatus provided by the present invention;

FIG. 15 is a schematic diagram of the physical structure of the electronic device provided by the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first", "second" and the like in the description and claims of the present invention are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that the terms so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here. Objects distinguished by "first", "second" and the like are usually of one kind, and their number is not limited; for example, there may be one first object or several. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.

To make the embodiments of the present invention easier to understand, some related background knowledge is introduced first.

In response to cyber attacks, causal analysis based on system monitoring has become an important method for attack investigation. System monitoring observes system calls and generates kernel-level audit events as system audit logs. These logs enable causal analysis to identify the entry point of an intrusion (backward tracking) and the ramifications of an attack (forward tracking), which has proven effective in assisting attack investigation and system recovery.

Although causal analysis has achieved good results in some fields, existing methods require extensive manual inspection, which hinders their wide adoption. Causal analysis methods treat the system entities (e.g., files, processes and network connections) involved in the same system call event (e.g., a process reading a file) as causally dependent. Based on these dependencies, the methods represent system call events as a system entity dependency graph, in which nodes are system entities, edges are events, and the connections between edges denote the dependencies derived from the system events.

Using the system entity dependency graph, the context of an attack can be investigated by reconstructing the chain of events leading to a POI (Point of Interest) event (e.g., an alert reported by an intrusion detection system). Such contextual information can effectively reveal attack-related events. However, because of the dependency explosion problem, it is difficult to extract the required contextual information efficiently from a huge graph (typically containing more than 100K edges).

To overcome the dependency explosion that arises when system entity dependency graphs are used in attack investigation, existing methods mainly consist of techniques that automatically filter irrelevant events and reveal attack-related events. Although these techniques have achieved good results, manual attack investigation remains indispensable for three main reasons:

(1) There is always residual risk in a system; although small, such risk cannot be accurately revealed by these automated techniques, especially those that rely heavily on system configuration files;

(2) Threats keep evolving to evade defensive techniques, for example through emerging attack tactics and techniques recently developed by adversaries;

(3) Existing techniques rely mainly on heuristic rules, which causes information loss; some techniques also require changes inside the system to discover attack behavior, such as binary instrumentation, and their poor generality hinders practical application.

The method and apparatus provided by the present invention for generating a summary graph of a system log dependency graph for attack investigation and restoration are described below with reference to FIGS. 1 to 14.

FIG. 1 is the first schematic flowchart of the summary graph generation method provided by the present invention. As shown in FIG. 1, the method includes the following steps:

Step 100: determining a system entity dependency graph of an attack event to be investigated and restored, the system entity dependency graph containing system entity nodes associated with the attack event and the calling relationships between the system entity nodes, wherein the system entity nodes include process nodes and resource nodes, and the calling relationships between the system entity nodes represent system activities;

Step 110: performing a hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes;

Step 120: clustering the process nodes based on their behavior representations, and dividing the system entity dependency graph into at least one first subgraph based on the clustering result;

Step 130: compressing each first subgraph of the at least one first subgraph to obtain at least one second subgraph, the at least one second subgraph being in one-to-one correspondence with the at least one first subgraph;

Step 140: generating a summary corresponding to each second subgraph of the at least one second subgraph to obtain a summary graph corresponding to the system entity dependency graph.

To overcome the dependency explosion that arises when system entity dependency graphs are used in attack investigation, the present invention divides the system entity dependency graph into communities (subgraphs), each containing only closely related processes, and generates a summary for each community. The result preserves the semantics of the system activities in the system entity dependency graph while hiding the less important details, and visualizes them in summary form. This not only reduces the size of the system entity dependency graph but also makes it easy to view an overview of the related system activities and the summary information of attack-related communities.

Optionally, a system entity dependency graph of the attack event to be investigated and restored may be determined.

Optionally, the system entity dependency graph may include the system entity nodes associated with the attack event to be investigated and restored and the calling relationships between the system entity nodes.

Optionally, the system entity nodes may include process nodes and resource nodes.

Optionally, the resource nodes may include file nodes and network nodes.

Optionally, the calling relationships between the system entity nodes may represent system activities.

Optionally, a hierarchical random walk may be performed on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes.

Optionally, the process nodes may be clustered based on their behavior representations, so that process nodes with similar behavior representations are placed in the same class.

Optionally, to compute overlapping clusters of the process nodes from their behavior representations, the soft clustering method FCM (Fuzzy C-means) may be used to cluster the process nodes.

Specifically, unlike hard clustering methods (such as K-means), which assign each process node to exactly one cluster, FCM outputs the degree of membership of each process node in every cluster by minimizing an objective function.
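The difference from hard clustering can be seen in the standard FCM membership update, sketched here for a single point with fixed cluster centers. This is a textbook formula rather than code from the patent: with fuzzifier m, the membership of a point in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to center i. The function name and the default m = 2 are assumptions; a full FCM run would alternate this step with a center update until convergence.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Return the degree of membership of `point` in each cluster (sums to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        on_center = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(on_center)
        return [v / total for v in on_center]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# A behavior vector at the origin, with cluster centers at distance 1 and 3.
u = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
```

The point receives a membership of 0.9 in the nearer cluster and 0.1 in the farther one, so a process node whose behavior lies between two clusters can be assigned to both, which is what makes overlapping subgraphs possible.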

可选地，可以基于进程节点聚类的结果，将系统实体依赖关系图划分为至少一个第一子图，从而可以将联系紧密的进程组确定为以进程节点为中心的子图。Optionally, the system entity dependency graph may be divided into at least one first subgraph based on the result of process node clustering, so that closely related process groups may be determined as subgraphs centered on process nodes.

可选地，可以对至少一个第一子图中的每一个第一子图进行压缩，获取至少一个第二子图。Optionally, each first sub-picture in the at least one first sub-picture may be compressed to obtain at least one second sub-picture.

例如，可以对第一子图中存在的冗余边或冗余节点进行合并或删除。For example, redundant edges or redundant nodes existing in the first subgraph may be merged or deleted.

例如，可以对第一子图中包括的具有相同读操作或相同写操作的边进行合并。For example, edges included in the first subgraph with the same read operation or the same write operation may be merged.

例如，可以将第一子图中不包含有用的攻击相关信息的只读文件节点删除。For example, read-only file nodes that do not contain useful attack-related information in the first subgraph can be deleted.

可选地，获取的至少一个第二子图与至少一个第一子图一一对应，即压缩一个第一子图，即可获取该压缩后的第一子图，将其作为第二子图。Optionally, there is a one-to-one correspondence between the acquired at least one second sub-image and the at least one first sub-image, that is, by compressing one first sub-image, the compressed first sub-image can be obtained and used as the second sub-image. .

可选地，可以生成至少一个第二子图中的每一个第二子图对应的概要，获取系统实体依赖关系图对应的概要图。Optionally, a summary corresponding to each second subgraph in the at least one second subgraph may be generated, and a summary graph corresponding to the system entity dependency graph may be obtained.

可选地，每一个第二子图对应的概要可以用于表征与该第二子图相对应的相关系统活动的概要信息。Optionally, the summary corresponding to each second subgraph may be used to represent summary information of related system activities corresponding to the second subgraph.

可选地，在生成每一个第二子图对应的概要之后，可以获取系统实体依赖关系图对应的概要图。Optionally, after the summary corresponding to each second subgraph is generated, the summary graph corresponding to the system entity dependency graph may be obtained.

可选地，可以使用在主流操作系统(例如，Windows、Linux、Mac OS和Android)上运行的系统审计工具来收集系统审计事件，包括进程事件、文件事件和网络事件。对于收集到的每个实体和每个事件，可以记录一些对安全分析至关重要的属性(例如，实体的PID(Process Identifier，进程控制符)、文件名和IP(Internet Protocol，网络协议)；事件的开始时间、结束时间和操作)，如表1和表2所示。Optionally, system audit events, including process events, file events, and network events, can be collected using system audit tools running on mainstream operating systems (e.g., Windows, Linux, Mac OS, and Android). For each collected entity and event, attributes that are critical to security analysis can be recorded (for example, an entity's PID (Process Identifier), file name, and IP (Internet Protocol) address; an event's start time, end time, and operation), as shown in Table 1 and Table 2.

表1系统实体属性Table 1 System entity attributes

Entity | Attributes
Process | PID, Name, User, Cmd
File | Name, Path
Network | IP, Port, Protocol

表2系统事件属性Table 2 System event attributes

可选地，在给定POI事件(例如，关于文件下载的警报)的情况下，可以通过执行反向因果分析跟踪系统实体依赖关系，从而构建系统实体依赖关系图。Optionally, given a POI event (e.g., an alert about a file download), the system entity dependency graph can be built by performing backward causal analysis to trace the system entity dependencies.

例如，图2是本发明提供的概要图生成方法的系统实体依赖关系图示意图，如图2所示，可以从POI事件开始，因果分析迭代地查找沿着POI事件的某些依赖路径发生并在POI事件之前发生的事件，这些发现的事件(即边)可以形成POI事件的系统实体依赖关系图。For example, FIG. 2 is a schematic diagram of a system entity dependency graph in the summary graph generation method provided by the present invention. As shown in FIG. 2, starting from the POI event, causal analysis iteratively finds events that lie on a dependency path of the POI event and occurred before it; the events found in this way (i.e., edges) form the system entity dependency graph of the POI event.
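The backward causal analysis can be sketched as a breadth-first traversal over audit events. The sketch assumes a common backtracking rule: an event can influence a tracked event if it flows into the tracked event's source entity and starts before the tracked event ends. The `Event` tuple, the entity names, and the timestamps below are hypothetical.

```python
from collections import deque, namedtuple

# One audit event: information flows from `src` to `dst` during [start, end].
Event = namedtuple("Event", "src dst op start end")

def backward_analysis(events, poi):
    """Collect the events that can causally precede the POI event."""
    found = {poi}
    queue = deque([poi])
    while queue:
        current = queue.popleft()
        for ev in events:
            # `ev` is relevant if it writes into the source entity of `current`
            # and begins before `current` has finished.
            if ev.dst == current.src and ev.start < current.end and ev not in found:
                found.add(ev)
                queue.append(ev)
    return found

poi = Event("curl", "/tmp/payload", "write", 5, 6)
e1 = Event("bash", "curl", "fork", 1, 2)
e2 = Event("sshd", "bash", "fork", 0, 1)
e3 = Event("vim", "notes.txt", "write", 0, 1)      # unrelated activity
e4 = Event("10.0.0.5:80", "curl", "recv", 7, 8)    # happens after the POI event
graph_edges = backward_analysis([e1, e2, e3, e4], poi)
```

Here `graph_edges` contains the POI event together with `e1` and `e2`; the unrelated event and the event that starts after the POI event are excluded.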

可选地，可以将联系紧密的进程组确定为以进程节点为中心的社区(子图)。Optionally, closely related process groups can be identified as process node-centric communities (subgraphs).

可选地，以进程节点为中心的社区是一个图形，其中可以包括一个主进程节点、一组进程节点(表示由主进程生成并通过资源节点具有数据依赖关系的子进程)以及一组由主进程和子进程访问的资源节点。Optionally, a process-centric community is a graph that can include a main process node, a set of process nodes (representing child processes that are spawned by the main process and have data dependencies through resource nodes), and a set of resource nodes accessed by the main process and the child processes.

例如，图2中的leak是社区C3的主进程，其生成子进程tar、bzip2、gpg和curl来压缩和上载文件，这些子进程至少与另一个子进程具有数据依赖关系，如图2所示的依赖关系图中的：tar→./upload.tar→bzip2→../upload.tar.bz2→gpg→../upload→curl→xxx->xxx。For example, leak in FIG. 2 is the main process of community C3; it spawns the child processes tar, bzip2, gpg, and curl to compress and upload files, and each of these child processes has a data dependency with at least one other child process, as in the following path of the dependency graph shown in FIG. 2: tar→./upload.tar→bzip2→../upload.tar.bz2→gpg→../upload→curl→xxx->xxx.

可选地，以进程节点为中心的社区还可以包括属于多个社区并被称为重叠节点的进程节点或资源节点。Optionally, a process node-centric community may also include process nodes or resource nodes that belong to multiple communities and are called overlapping nodes.

例如，在图2中，leak首先与curl协作以完成C2中脚本leak.sh的执行，然后生成子进程tar、bzip2、gpg和curl以压缩和上载C3中的文件。在这种情况下，leak是C2和C3中的重叠节点。For example, in FIG. 2, leak first cooperates with curl to complete the execution of the script leak.sh in C2, and then spawns the child processes tar, bzip2, gpg, and curl to compress and upload files in C3. In this case, leak is an overlapping node of C2 and C3.

例如，图3是本发明提供的概要图生成方法的重叠节点示意图之一，图4是本发明提供的概要图生成方法的重叠节点示意图之二，图5是本发明提供的概要图生成方法的重叠节点示意图之三，如图3-5所示，重叠节点可以被分为如下三种类型：For example, FIG. 3, FIG. 4, and FIG. 5 are the first, second, and third schematic diagrams, respectively, of overlapping nodes in the summary graph generation method provided by the present invention. As shown in FIGS. 3 to 5, overlapping nodes can be divided into the following three types:

(1)主进程节点：与不同系统活动的不同子进程集合协作；(1) Main process node: cooperates with different sets of sub-processes of different system activities;

(2)子进程节点：与其同级协作以完成系统活动，同时生成子进程以完成不同的系统活动；(2) Child process node: cooperate with its peers to complete system activities, and generate child processes to complete different system activities at the same time;

(3)资源节点：是由来自不同社区的进程节点访问的资源节点。(3) Resource node: It is a resource node accessed by process nodes from different communities.

可选地，可以针对每个社区都生成一个概要，该概要可以将对应社区包括的主要系统活动进行可视化。Optionally, a summary can be generated for each community, and the summary can visualize the main system activities included in the corresponding community.

例如，图6是本发明提供的概要图生成方法的概要图示意图，如图6所示，每个社区(C1，C2，…，C10)都生成了一个简洁的概要，用于将每个社区的系统活动进行可视化。For example, FIG. 6 is a schematic diagram of a summary graph produced by the summary graph generation method provided by the present invention. As shown in FIG. 6, a concise summary is generated for each community (C1, C2, …, C10) to visualize the system activities of that community.

本发明提供的用于攻击调查和还原的系统日志依赖图的概要图生成方法，通过确定待调查和还原的攻击事件的系统实体依赖关系图，在系统实体依赖关系图中的进程节点上执行分层随机行走，确定进程节点的行为表示，并基于进程节点的行为表示将系统实体依赖关系图划分为至少一个第一子图，对每一个第一子图进行压缩获取至少一个第二子图，最后生成每一个第二子图的概要，从而获得系统实体依赖关系图对应的概要图；通过将系统实体依赖关系图划分为多个子图并为每个子图提供简洁的概要来生成概要图，每个子图只包含密切相关的进程，共同完成系统任务，生成的概要图通过隐藏较少的重要细节来保持系统实体依赖关系图中系统活动的语义，而且通过概要的形式将其进行可视化，不仅缩小了系统实体依赖关系图的大小，而且便于查看相关系统活动的概要和与攻击相关的社区的概要信息。In the method provided by the present invention for generating a summary graph of a system log dependency graph for attack investigation and restoration, the system entity dependency graph of the attack event to be investigated and restored is determined; a hierarchical random walk is performed on the process nodes in the system entity dependency graph to determine the behavior representations of the process nodes; the system entity dependency graph is divided into at least one first subgraph based on those behavior representations; each first subgraph is compressed to obtain at least one second subgraph; and finally a summary of each second subgraph is generated, yielding the summary graph corresponding to the system entity dependency graph. The summary graph is generated by dividing the system entity dependency graph into multiple subgraphs and providing a concise summary for each subgraph, where each subgraph contains only closely related processes that jointly complete a system task. The generated summary graph preserves the semantics of the system activities in the system entity dependency graph while hiding less important details, and visualizes them in summary form. This not only reduces the size of the system entity dependency graph, but also makes it easy to view an overview of the relevant system activities and the summary information of the attack-related communities.

可选地，所述在所述系统实体依赖关系图中的进程节点上执行分层随机行走，确定所述进程节点的行为表示，包括：Optionally, performing a hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representation of the process nodes includes:

可选地，可以以系统实体依赖关系图中的每一个进程节点为起点随机行走预设长度，以生成行走路线。Optionally, a preset length may be randomly walked with each process node in the system entity dependency graph as a starting point to generate a walking route.

可选地，预设长度的值可以是5或10或15，本发明对此不作具体限定。Optionally, the value of the preset length may be 5 or 10 or 15, which is not specifically limited in the present invention.

例如，以节点v₁为起点的随机行走生成特定长度的行走路线W＝{v₁,...,v_n}，其中v_i∈W是以转移概率随机选择的。从v_i到其邻居节点n的转移概率为P(n|v_i)＝w(v_i,n)/Σ_{m∈N(v_i)}w(v_i,m)，其中w(v_i,n)表示从v_i到n的行走权重，分母表示v_i的所有邻居节点之间的行走权重之和。与现有的随机行走算法以相等概率处理邻居节点不同，本发明中的行走算法为v_i的邻居节点提供更高的概率，这些邻居节点更可能是与v_i联系紧密的进程节点。For example, a random walk starting from node $v_1$ generates a walking route $W = \{v_1, \ldots, v_n\}$ of a given length, where each $v_i \in W$ is selected at random according to the transition probability. The transition probability from $v_i$ to its neighbor node $n$ is

$$P(n \mid v_i) = \frac{w(v_i, n)}{\sum_{m \in N(v_i)} w(v_i, m)}$$

where $w(v_i, n)$ is the walking weight from $v_i$ to $n$, $N(v_i)$ denotes the set of neighbor nodes of $v_i$, and the denominator is the sum of the walking weights over all neighbor nodes of $v_i$. Unlike existing random walk algorithms, which treat all neighbor nodes with equal probability, the walk algorithm of the present invention assigns a higher probability to those neighbor nodes of $v_i$ that are more likely to be process nodes closely related to $v_i$.
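The weighted walk can be sketched as follows: each step normalizes the walking weights of the current node's neighbors into transition probabilities and samples the next node accordingly. The adjacency layout and the weight values are hypothetical; how the weights are derived from neighbor nodes and the process lineage tree is not modeled here.

```python
import random

def transition_probs(weights):
    """Normalize walking weights w(v_i, n) into transition probabilities."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

def random_walk(graph, start, length, rng):
    """Generate a walking route of at most `length` nodes starting at `start`.

    `graph` maps a node to {neighbor: walking weight}.
    """
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: stop the walk early
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Hypothetical weights: tar is more closely related to leak than curl is.
graph = {
    "leak": {"tar": 3.0, "curl": 1.0},
    "tar": {"leak": 1.0},
    "curl": {"leak": 1.0},
}
probs = transition_probs(graph["leak"])
walk = random_walk(graph, "leak", 5, random.Random(0))
```

With these weights, the walk moves from leak to tar with probability 0.75 and to curl with probability 0.25, so closely related nodes tend to be sampled into the same route.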

具体地，可以同时考虑进程节点的邻居节点和全局进程沿袭树，以确保联系紧密的进程节点被采样到相同的行走路径，从而它们将具有相似的上下文。对于每一个进程节点p，检查p的单跳邻居节点，并将p与父进程节点、子进程节点和访问的资源节点相关联。Specifically, the neighbor nodes of a process node and the global process lineage tree can be considered simultaneously to ensure that closely related process nodes are sampled to the same walking path, so that they will have similar contexts. For each process node p, check the single-hop neighbor nodes of p, and associate p with the parent process node, child process node, and visited resource nodes.

可选地，可以基于生成的行走路线，采用词向量模型获取进程节点的行为表示。Optionally, based on the generated walking route, a word vector model may be used to obtain behavior representations of process nodes.

可选地，可以采用word2vec模型获取每一个进程节点基于行走路线的行为表示。Optionally, the word2vec model can be used to obtain the behavior representation of each process node based on the walking route.

例如，可以将系统实体依赖关系图中的节点视为单词，将行走路线视为单词的有序序列来进行类比。For example, an analogy can be made by thinking of nodes in a system entity dependency graph as words, and walking routes as ordered sequences of words.

可选地，可以使用一种广泛使用的单词表示学习算法SkipGram学习行走路线中包括的进程节点的行为表示。Optionally, SkipGram, a widely used word representation learning algorithm, can be used to learn the behavior representations of the process nodes included in the walking routes.

本发明以系统实体依赖关系图中的每个进程节点为起点随机行走预设长度，以生成行走路线，然后基于行走路线，采用词向量模型学习进程节点的行为表示，最后可以基于进程节点的行为表示，将具有类似行为表示的进程节点划分至同一个子图中，可以有效实现将联系紧密的进程组确定为以进程为中心的社区。The present invention takes each process node in the system entity dependency graph as a starting point and randomly walks a preset length to generate walking routes, then uses a word vector model to learn the behavior representations of the process nodes from the walking routes, and finally divides process nodes with similar behavior representations into the same subgraph. In this way, closely related process groups can be effectively identified as process-centric communities.
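To make the word analogy concrete, the sketch below lists the (center, context) training pairs that a SkipGram model would draw from one walking route. The window size and the route are illustrative assumptions; the embedding training itself is omitted and would normally be delegated to a word2vec implementation.

```python
def skipgram_pairs(walk, window=2):
    """Return (center, context) pairs for every node within `window` steps."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["tar", "upload.tar", "bzip2"], window=1)
```

Nodes that repeatedly share context nodes across routes receive similar representations, which is what allows the subsequent clustering step to group them.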

可选地，所述对所述至少一个第一子图中的每一个第一子图进行压缩，获取至少一个第二子图，包括：Optionally, the compressing of each of the at least one first subgraph to obtain at least one second subgraph includes:

可选地，可以确定系统实体依赖关系图中至少一个第一子图中的目标子图中的第一模式。Optionally, the first pattern in the target subgraph in the at least one first subgraph in the system entity dependency graph may be determined.

可选地，第一模式可以包括目标子图中同一个进程节点至少产生两个相同的进程节点集以访问同一个资源节点的模式，并且进程节点集中包括至少一个子进程节点。Optionally, the first pattern may include a pattern in which the same process node in the target subgraph spawns at least two identical sets of process nodes to access the same resource node, where each set of process nodes includes at least one child process node.

可选地，可以合并第一模式中相同的子进程节点，并合并连接子进程节点的边，完成对目标子图的压缩。Optionally, the identical child process nodes in the first pattern can be merged, together with the edges connecting those child process nodes, to complete the compression of the target subgraph.

可选地，可以识别系统实体依赖关系图中每一个第一子图中的第一模式，并将第一模式中的冗余节点和冗余边进行合并，获取压缩后的第二子图，然后基于压缩后的第二子图，生成第二子图对应的概要。Optionally, the first pattern in each first subgraph of the system entity dependency graph can be identified, and the redundant nodes and redundant edges in the first pattern can be merged to obtain a compressed second subgraph; a summary corresponding to the second subgraph is then generated based on the compressed second subgraph.

可选地，第一模式可以用于描述产生相同的进程集以处理某些资源的重复活动。Optionally, the first pattern may be used to describe repeated activities that spawn the same set of processes to handle certain resources.

例如，图7是本发明提供的概要图生成方法的第一模式示意图，如图7所示，进程节点P0重复生成名为P1和P2的子进程节点以写入文件F1，此类保持重复的活动并不能为安全分析提供额外的价值，因此可以按照图7所示的方式将多个P1和P2分别进行合并，以去除冗余节点和冗余边。For example, FIG. 7 is a schematic diagram of the first pattern in the summary graph generation method provided by the present invention. As shown in FIG. 7, process node P0 repeatedly spawns child process nodes named P1 and P2 to write to file F1. Such repeated activities provide no additional value for security analysis, so the multiple P1 nodes and the multiple P2 nodes can each be merged in the manner shown in FIG. 7 to remove redundant nodes and redundant edges.

可选地，可以通过四个步骤(构建进程沿袭树，与已访问资源的关联，挖掘基于过程的模式和基于模式的压缩)识别第一子图中的第一模式，并合并第一模式中重复的节点和边，完成对第一子图的压缩。Optionally, the first pattern in the first subgraph can be identified through four steps (building the process lineage tree, associating processes with accessed resources, mining process-based patterns, and pattern-based compression), and the repeated nodes and edges in the first pattern can be merged to complete the compression of the first subgraph.

本发明通过识别子图中的第一模式，并对第一模式中重复的边和节点进行合并，实现对子图的压缩，有效减少了冗余边和节点，缩小了系统实体依赖关系图的大小。By identifying the first pattern in a subgraph and merging the repeated edges and nodes in the first pattern, the present invention compresses the subgraph, effectively reduces redundant edges and nodes, and reduces the size of the system entity dependency graph.

可选地，可以确定系统实体依赖关系图中至少一个第一子图中的目标子图中的第二模式。Optionally, the second pattern in the target subgraph in at least one of the first subgraphs in the system entity dependency graph may be determined.

可选地，第二模式可以包括目标子图中的同一个进程节点至少两次访问不同的资源节点的模式。Optionally, the second pattern may include a pattern in which the same process node in the target subgraph accesses different resource nodes at least twice.

例如，图8是本发明提供的概要图生成方法的第二模式示意图，如图8所示，进程节点P0重复访问资源节点F1，F2，…，Fn，此类保持重复的活动并不能为安全分析提供额外的价值，因此可以将资源节点F1，F2，…，Fn按照图8所示的方式进行合并，以去除冗余节点和冗余边。For example, FIG. 8 is a schematic diagram of the second pattern in the summary graph generation method provided by the present invention. As shown in FIG. 8, process node P0 repeatedly accesses resource nodes F1, F2, …, Fn. Such repeated activities provide no additional value for security analysis, so the resource nodes F1, F2, …, Fn can be merged in the manner shown in FIG. 8 to remove redundant nodes and redundant edges.

可选地，可以合并第二模式中的不同的资源节点，完成对目标子图的压缩。Optionally, the different resource nodes in the second pattern may be merged to complete the compression of the target subgraph.

可选地，可以识别系统实体依赖关系图中每一个第一子图中的第二模式，并将第二模式中的冗余节点和冗余边进行合并，获取压缩后的第二子图，基于压缩后的第二子图，生成第二子图对应的概要。Optionally, the second pattern in each first subgraph of the system entity dependency graph can be identified, and the redundant nodes and redundant edges in the second pattern can be merged to obtain a compressed second subgraph; a summary corresponding to the second subgraph is then generated based on the compressed second subgraph.

可选地，为了识别第二模式，可以先将进程节点与其访问的资源节点相关联，然后可以搜索每一个资源节点以识别重复访问。Optionally, to identify the second pattern, each process node can first be associated with the resource nodes it accesses, and each resource node can then be searched to identify repeated accesses.

可选地，可以根据搜索到的第二模式，将第二模式中的资源节点合并为一个节点。Optionally, the resource nodes in each second pattern found by the search may be merged into one node.

可选地，合并之后得到的资源节点的属性可以是原始资源节点属性的并集。Optionally, the attributes of the resource nodes obtained after merging may be a union of the attributes of the original resource nodes.
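The second-pattern compression can be sketched as below. The sketch makes a simplifying assumption that is not stated in the source: only resource nodes touched by exactly one process with exactly one operation are merged, so shared resources are left intact. The edge layout, the `+`-joined name of the merged node, and the attribute values are hypothetical.

```python
from collections import defaultdict

def merge_repeated_accesses(edges, attrs):
    """Merge resources that one process accesses repeatedly with the same operation.

    edges: set of (process, resource, operation)
    attrs: resource -> {attribute name: set of values}
    """
    accessors = defaultdict(set)
    for proc, res, op in edges:
        accessors[res].add((proc, op))

    groups = defaultdict(list)
    for res, acc in accessors.items():
        if len(acc) == 1:  # touched by a single process with a single operation
            groups[next(iter(acc))].append(res)

    new_edges, new_attrs = set(edges), dict(attrs)
    for (proc, op), members in groups.items():
        if len(members) < 2:
            continue  # nothing repeated, nothing to merge
        members.sort()
        merged = "+".join(members)
        union = defaultdict(set)
        for res in members:
            # The merged node's attributes are the union of the originals.
            for key, values in new_attrs.pop(res).items():
                union[key] |= values
            new_edges.discard((proc, res, op))
        new_attrs[merged] = dict(union)
        new_edges.add((proc, merged, op))
    return new_edges, new_attrs

edges = {("P0", "F1", "write"), ("P0", "F2", "write"), ("P0", "F3", "write"),
         ("P0", "F4", "write"), ("P1", "F4", "read")}
attrs = {"F1": {"path": {"/tmp/a"}}, "F2": {"path": {"/tmp/b"}},
         "F3": {"path": {"/tmp/c"}}, "F4": {"path": {"/tmp/d"}}}
merged_edges, merged_attrs = merge_repeated_accesses(edges, attrs)
```

In this toy input, F1, F2, and F3 collapse into a single node whose `path` attribute is the union of the three original paths, while F4 is kept because a second process also reads it.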

本发明通过识别子图中的第二模式，并对第二模式中重复的边和节点进行合并，实现对子图的压缩，有效减少了冗余边和节点，缩小了系统实体依赖关系图的大小。By identifying the second pattern in a subgraph and merging the repeated edges and nodes in the second pattern, the present invention compresses the subgraph, effectively reduces redundant edges and nodes, and reduces the size of the system entity dependency graph.

可选地，在所述对所述至少一个第一子图中的每一个第一子图进行压缩之前，所述方法还包括：Optionally, before the compressing of each of the at least one first subgraph, the method further includes:

具体地，在对第一子图进行压缩之前，可以首先搜索第一子图中的重叠节点，该重叠节点是一个资源节点，并且该资源节点与来自不同的第一子图中的多个进程节点相连接；在搜索到重叠节点之后，可以创建该重叠节点的副本节点，并将创建的副本节点和资源节点一一对应分配给多个进程节点所在的第一子图。Specifically, before the first subgraphs are compressed, overlapping nodes in the first subgraphs may first be searched for, where an overlapping node is a resource node that is connected to multiple process nodes from different first subgraphs. After an overlapping node is found, replica nodes of the overlapping node can be created, and the resource node and the created replica nodes are assigned, in one-to-one correspondence, to the first subgraphs in which the multiple process nodes are located.

可选地，在至少两个进程节点访问同一个资源节点，且至少两个进程节点来自不同的第一子图的情况下，可以创建资源节点的至少一个副本节点。Optionally, when at least two process nodes access the same resource node, and the at least two process nodes are from different first subgraphs, at least one replica node of the resource node may be created.

可选地，可以将资源节点和至少一个副本节点一一对应分配给至少两个进程节点所在的第一子图。Optionally, the resource node and the at least one replica node may be assigned to the first subgraph where the at least two process nodes are located in a one-to-one correspondence.

可选地，可以在资源节点以及各个副本节点之间创建定向边以连接资源节点和所创建的副本节点。Optionally, directed edges may be created between the resource nodes and the respective replica nodes to connect the resource nodes and the created replica nodes.

例如，给定一个资源节点v的两个副本节点v₁和v₂，其中v₁包含于子图C_i中，v₂包含于子图C_j中，假设v₁有输入边e₁，v₂有输出边e₂，并且e₁的开始时间早于e₂的结束时间，则可以创建定向边v₁→v₂。For example, given two replica nodes v₁ and v₂ of a resource node v, where v₁ is contained in subgraph C_i and v₂ is contained in subgraph C_j, suppose that v₁ has an input edge e₁, v₂ has an output edge e₂, and the start time of e₁ is earlier than the end time of e₂; then a directed edge v₁→v₂ can be created.

例如，给定资源节点r和进程节点p，如果它们通过边缘连接，则r与p所属的社区相关联。如果一个资源节点与来自不同社区的多个进程节点连接，则该资源节点是一个重叠节点，可以创建资源节点的副本，并为每个社区分配一个副本。由于这些节点缺乏可见的信息流方向，本发明中创建了定向边缘以连接副本(例如，图2中的细虚线箭头)。For example, given a resource node r and a process node p, if they are connected by an edge, then r is associated with the community to which p belongs. If a resource node is connected to multiple process nodes from different communities, the resource node is an overlapping node; replicas of the resource node can be created and one replica assigned to each community. Because these replicas lack a visible direction of information flow, the present invention creates directed edges to connect them (e.g., the thin dashed arrows in FIG. 2).
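The replica handling for an overlapping resource node can be sketched as follows. The sketch assumes that the caller has already established that the resource is overlapping, names each replica `resource@community` (a hypothetical naming scheme), and applies the timing rule above to link replicas.

```python
from collections import namedtuple

# Information flows from `src` to `dst` during [start, end].
Edge = namedtuple("Edge", "src dst start end")

def split_overlapping_resource(resource, edges, community_of):
    """Give each community its own replica of `resource` and link the replicas.

    Returns (rewired_edges, replica_links).
    """
    def replica(process):
        return f"{resource}@{community_of[process]}"

    rewired, inputs, outputs = [], [], []
    for e in edges:
        if e.dst == resource:        # process -> resource: input edge of the replica
            e = e._replace(dst=replica(e.src))
            inputs.append(e)
        elif e.src == resource:      # resource -> process: output edge of the replica
            e = e._replace(src=replica(e.dst))
            outputs.append(e)
        rewired.append(e)

    # Link v1 -> v2 when v1's input edge starts before v2's output edge ends.
    links = {(w.dst, r.src) for w in inputs for r in outputs
             if w.dst != r.src and w.start < r.end}
    return rewired, links

edges = [Edge("curl", "leak.sh", 1, 2), Edge("leak.sh", "bash", 3, 4)]
rewired, links = split_overlapping_resource(
    "leak.sh", edges, {"curl": "C1", "bash": "C2"})
```

The resulting link from `leak.sh@C1` to `leak.sh@C2` restores the information flow direction between the two communities that would otherwise be lost when the node is split.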

可选地，可以将社区之间的依赖性分类为基于边缘的依赖性(即，社区之间的社区间边缘表示的依赖性)和基于节点的依赖性(即，重叠节点表示的依赖性)。Optionally, dependencies between communities can be classified into edge-based dependencies (i.e., dependencies represented by inter-community edges) and node-based dependencies (i.e., dependencies represented by overlapping nodes).

可选地，所述在所述系统实体依赖关系图中的进程节点上执行分层随机行走之前，所述方法还包括：Optionally, before performing the hierarchical random walk on the process nodes in the system entity dependency graph, the method further includes:

具体地，在基于分层随机行走方法对系统实体依赖关系图进行子图划分之前，可以首先对系统实体依赖关系图进行预处理，该预处理可以包括将系统实体依赖关系图中的进程节点与资源节点之间的平行边进行合并，以去除冗余的边。Specifically, before the system entity dependency graph is divided into subgraphs based on the hierarchical random walk method, the system entity dependency graph may first be preprocessed. The preprocessing may include merging the parallel edges between process nodes and resource nodes in the system entity dependency graph to remove redundant edges.

可选地，在系统实体依赖关系图中的进程节点上执行分层随机行走之前，可以将系统实体依赖关系图中的每一个进程节点分别与系统实体依赖关系图中的资源节点之间的平行边进行合并。Optionally, before the hierarchical random walk is performed on the process nodes in the system entity dependency graph, each process node in the system entity dependency graph can be paralleled with the resource nodes in the system entity dependency graph. Edges are merged.

可选地，平行边可以包括具有相同的读操作或相同的写操作类型的边。Optionally, parallel edges may include edges of the same read operation type or the same write operation type.

可以理解的是，系统实体依赖关系图在进程节点和文件节点或网络节点之间通常有许多平行边，即操作系统通常存在短时间内重复的读操作或写操作。这是因为操作系统通常通过将数据按比例分配给多个系统调用来执行读任务或写任务。这些平行边不会为攻击调查提供额外有用的信息，因此可以直接将相同操作类型的平行边合并为一条边。It can be understood that the system entity dependency graph usually has many parallel edges between process nodes and file nodes or network nodes, that is, the operating system usually has repeated read or write operations in a short period of time. This is because operating systems typically perform read or write tasks by distributing data proportionally across multiple system calls. These parallel edges do not provide additional useful information for attack investigation, so parallel edges of the same operation type can be directly merged into one edge.

本发明通过对系统实体依赖关系图中不包含有用攻击相关信息的平行边进行合并，不仅有效减少了冗余边，而且便于生成主要系统活动的语义的概要信息。By merging parallel edges that do not contain useful attack related information in the system entity dependency graph, the present invention not only effectively reduces redundant edges, but also facilitates the generation of semantic summary information of main system activities.
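Merging parallel edges can be sketched as grouping events by source, destination, and operation type. Keeping the earliest start time and the latest end time on the merged edge is an assumption made here so that time-based analysis remains possible; the event layout is hypothetical.

```python
def merge_parallel_edges(events):
    """Collapse events sharing (src, dst, operation) into one edge.

    events: iterable of (src, dst, operation, start, end)
    """
    merged = {}
    for src, dst, op, start, end in events:
        key = (src, dst, op)
        if key in merged:
            first_start, last_end = merged[key]
            merged[key] = (min(first_start, start), max(last_end, end))
        else:
            merged[key] = (start, end)
    return [(src, dst, op, start, end)
            for (src, dst, op), (start, end) in merged.items()]

events = [
    ("tar", "upload.tar", "write", 1, 2),
    ("tar", "upload.tar", "write", 2, 3),   # same operation, issued again
    ("tar", "upload.tar", "read", 4, 5),    # different operation type, kept apart
]
merged_events = merge_parallel_edges(events)
```

The two write events become one edge spanning times 1 to 3, while the read edge is left as a separate edge because its operation type differs.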

可选地，在系统实体依赖关系图中的进程节点上执行分层随机行走之前，可以将系统实体依赖关系图中仅具有输入边而没有输出边的资源节点进行删除。Optionally, before the hierarchical random walk is performed on the process nodes in the system entity dependency graph, the resource nodes having only input edges but no output edges in the system entity dependency graph may be deleted.

例如，系统实体依赖关系图有许多只读文件，这些文件是库、配置文件和用于进程初始化的资源(例如，/lib64/libdl.so.2)，不包含有用的攻击相关信息。因此，可以过滤(删除)只读文件节点，而保留进程节点，以保留主要系统活动的语义。For example, the system entity dependency graph contains many read-only files, such as libraries, configuration files, and resources used for process initialization (e.g., /lib64/libdl.so.2), that carry no useful attack-related information. Therefore, read-only file nodes can be filtered out (deleted) while process nodes are retained, so as to preserve the semantics of the main system activities.

本发明通过对系统实体依赖关系图中不包含有用攻击相关信息的资源节点进行删除，不仅有效减少了冗余节点，而且便于生成主要系统活动的语义的概要信息。By deleting the resource nodes that do not contain useful attack related information in the system entity dependency graph, the invention not only effectively reduces redundant nodes, but also facilitates the generation of semantic summary information of main system activities.
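The read-only file filter can be sketched as below. To stay independent of any edge direction convention, the sketch identifies a read-only file by its operation labels: a file node is dropped when every event touching it is a read. The `(process, resource, operation)` layout and the `"read"` label are assumptions.

```python
def drop_read_only_files(events, file_nodes):
    """Remove file nodes that are only ever read, together with their events.

    events: list of (process, resource, operation)
    file_nodes: set of resource names that are files
    """
    operations = {}
    for _process, resource, op in events:
        operations.setdefault(resource, set()).add(op)

    read_only = {f for f in file_nodes if operations.get(f) == {"read"}}
    kept = [ev for ev in events if ev[1] not in read_only]
    return kept, read_only

events = [
    ("tar", "/lib64/libdl.so.2", "read"),   # library loaded at process start
    ("tar", "./upload.tar", "write"),
    ("bzip2", "./upload.tar", "read"),
]
kept, removed = drop_read_only_files(
    events, {"/lib64/libdl.so.2", "./upload.tar"})
```

The library is removed, whereas `./upload.tar` is retained because it is both written and read and therefore carries a data dependency between two processes.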

可选地，所述概要中包括以下至少一项：Optionally, the summary includes at least one of the following:

主进程；main process;

时间跨度；time span;

目标信息流；target information flow;

可选地，生成的第二子图的概要中可以包括以下至少一项：主进程、时间跨度和目标信息流。Optionally, the generated summary of the second subgraph may include at least one of the following: a main process, a time span, and a target information flow.

可选地，主进程可以表示第二子图中包括的系统活动的父进程节点，或根进程节点。Optionally, the main process may represent the parent process node of the system activity included in the second subgraph, or the root process node.

可选地，时间跨度可以表示第二子图中包括的系统活动的最早开始时间与最晚结束时间之间的时间间隔。Optionally, the time span may represent the time interval between the earliest start time and the latest end time of the system activity included in the second subgraph.

可选地，目标信息流可以表示第二子图中包括的系统活动对应的信息流中优先级得分排名前预设个数的信息流。Optionally, the target information flows may be the information flows whose priority scores rank within a preset top number among the information flows corresponding to the system activities included in the second subgraph.

可选地，目标信息流可以为第二子图中包括的系统活动对应的信息流中优先级得分较高的信息流。Optionally, the target information flow may be an information flow with a higher priority score among the information flows corresponding to the system activities included in the second subgraph.

可选地，目标信息流可以为优先级排名位于所有信息流的排名中前预设数量位的信息流。Optionally, the target information flow may be the information flow whose priority ranking is located in the top preset number of places in the ranking of all the information flows.

例如，目标信息流可以为优先级排名前2位或前3位或前4位的信息流，具体数量位本发明对此不作具体限定。For example, the target information flows may be the information flows ranked in the top 2, top 3, or top 4 by priority; the specific number is not limited in the present invention.

可选地，在生成第二子图对应的概要之前，可以先对第二子图进行信息流提取。Optionally, before generating the summary corresponding to the second subgraph, information flow extraction may be performed on the second subgraph.

可选地，可以根据系统实体依赖关系图中的多个第二子图之间的信息流，识别每一个第二子图的输入节点和输出节点，然后通过查找每一对输入节点和输出节点的路径来生成信息流。Optionally, the input node and output node of each second subgraph can be identified according to the information flow between multiple second subgraphs in the system entity dependency graph, and then each pair of input nodes and output nodes can be found by searching for each pair of input nodes and output nodes. path to generate information flow.

例如，给定一个以进程节点为中心的社区(子图)，可以首先标识其输入节点和输出节点，其中，输入节点表示社区的传入信息流，也就是社区间边缘连接线的目标节点和没有输出边的网络节点。此外，对于没有输入边的社区，可以选择主进程节点作为输入节点，输出节点表示社区的输出信息流，也就是社区间连接边的源节点。有输入边的网络节点代表外部IP和POI。然后，对于每对输入节点和输出节点，可以使用深度优先搜索(Depth First Search，DFS)算法查找最长路径，而不使用重复节点作为信息流。这样的路径通常比较短的路径覆盖更多的活动信息。For example, given a process-centric community (subgraph), its input nodes and output nodes can first be identified. Input nodes represent the incoming information flows of the community, namely the target nodes of inter-community edges and network nodes without output edges. In addition, for a community without input edges, the main process node can be selected as the input node. Output nodes represent the outgoing information flows of the community, namely the source nodes of inter-community edges. Network nodes with input edges represent external IPs and the POI. Then, for each pair of input node and output node, a depth-first search (DFS) algorithm can be used to find the longest path that contains no repeated nodes, and this path is taken as the information flow. Such a path usually covers more activity information than a shorter path.
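The longest-path search for one pair of input and output nodes can be sketched with a backtracking DFS. Exhaustive search over simple paths is exponential in the worst case, so the sketch assumes the small community sizes that result from the division step; the adjacency list is a hypothetical community built from the names used in the FIG. 2 example.

```python
def longest_simple_path(adjacency, source, target):
    """Return the longest path from source to target that repeats no node.

    adjacency: node -> list of successor nodes. Returns [] if no path exists.
    """
    best = []

    def dfs(node, path, seen):
        nonlocal best
        if node == target:
            if len(path) > len(best):
                best = list(path)
            return
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                dfs(nxt, path, seen)
                path.pop()          # backtrack
                seen.remove(nxt)

    dfs(source, [source], {source})
    return best

adjacency = {
    "tar": ["upload.tar", "curl"],
    "upload.tar": ["bzip2"],
    "bzip2": ["upload.tar.bz2"],
    "upload.tar.bz2": ["curl"],
    "curl": [],
}
flow = longest_simple_path(adjacency, "tar", "curl")
```

The direct edge from tar to curl is a valid path, but the search prefers the five-node route through the intermediate files because it covers more of the activity.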

可选地，可以对第二子图中提取的信息流进行优先级排序。Optionally, the information flows extracted in the second subgraph may be prioritized.

例如，可以根据信息流表示主要活动(如攻击行为)的可能性，计算第二子图中所有信息流的优先级得分，并基于优先级得分对信息流进行优先级排序，最终将排名位于所有信息流的排名中前三的信息流作为第二子图对应的概要中的目标信息流。For example, a priority score can be computed for every information flow in the second subgraph according to how likely the flow is to represent a main activity (such as attack behavior). The information flows are then ranked by priority score, and the three highest-ranked flows are used as the target information flows in the summary corresponding to the second subgraph.
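Assembling a summary from the three elements named above (main process, time span, target information flows) can be sketched as follows. The priority scores are taken as given inputs, since how they are computed is not modeled here; the dictionary layout, the scores, and the timestamps are hypothetical.

```python
def build_summary(main_process, event_times, scored_flows, k=3):
    """Build a community summary.

    event_times: list of (start, end) for the events in the community
    scored_flows: list of (flow, priority_score)
    """
    earliest = min(start for start, _end in event_times)
    latest = max(end for _start, end in event_times)
    ranked = sorted(scored_flows, key=lambda item: item[1], reverse=True)
    return {
        "main_process": main_process,
        "time_span": (earliest, latest),
        "flows": [flow for flow, _score in ranked[:k]],
    }

summary = build_summary(
    "leak",
    [(10, 12), (11, 15), (14, 20)],
    [(["tar", "upload.tar"], 0.2),
     (["gpg", "upload", "curl"], 0.9),
     (["bzip2", "upload.tar.bz2"], 0.5),
     (["leak", "tar"], 0.1)],
)
```

With `k=3`, the lowest-scored flow is omitted from the summary, and the time span runs from the earliest start time to the latest end time of the community's events.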

可选地，所述合并所述第二模式中的所述不同的资源节点，包括：Optionally, the merging of the different resource nodes in the second mode includes:

可选地，可以将第二模式中的不同的资源节点合并为一个节点，作为合并资源节点。Optionally, different resource nodes in the second mode may be merged into one node as a merged resource node.

可选地，合并资源节点的属性可以是不同的资源节点的属性的并集。Optionally, the properties of the merged resource nodes may be a union of properties of different resource nodes.

例如，可以将图8中的资源节点F1，F2，…，Fn进行合并，合并后的节点的属性是F1，F2，…，Fn属性的并集。For example, the resource nodes F1, F2, …, Fn in FIG. 8 can be merged, and the attributes of the merged node are the union of the attributes of F1, F2, …, Fn.

图9是本发明提供的概要图生成方法的流程示意图之二，如图9所示，该方法主要包括五个部分：(1)依赖关系图生成；(2)依赖关系图预处理；(3)社区监测；(4)社区压缩；(5)社区概要。每个部分的具体实现与上文描述的用于攻击调查和还原的系统日志依赖图的概要图生成方法可相互对应参照，在此不再赘述。FIG. 9 is the second schematic flowchart of the summary graph generation method provided by the present invention. As shown in FIG. 9, the method mainly includes five parts: (1) dependency graph generation; (2) dependency graph preprocessing; (3) community detection; (4) community compression; (5) community summarization. For the specific implementation of each part, reference may be made to the corresponding description above of the method for generating a summary graph of a system log dependency graph for attack investigation and restoration, which is not repeated here.

下面通过实验对本发明提供的用于攻击调查和还原的系统日志依赖图的概要图生成方法进行验证。The method for generating a summary graph of a system log dependency graph for attack investigation and restoration provided by the present invention is verified by experiments below.

(1)数据集包括如下两个：(1) The dataset includes the following two:

(1.1)攻击数据集；(1.1) Attack dataset;

采用Sysdig(一种功能强大的系统工具)从6台拥有10个活动用户的Linux主机收集攻击数据集。这些主机上的常规系统任务包括web浏览、文本编辑、代码开发和一些其他服务(如数据库)。在这些主机上，根据已知的漏洞和杀伤链执行了6次多步骤攻击。收集的数据集包含3天内的1亿个事件。Attack datasets were collected from 6 Linux hosts with 10 active users using Sysdig, a powerful system tool. Common system tasks on these hosts include web browsing, text editing, code development, and some other services (such as databases). On these hosts, 6 multi-step attacks were performed based on known vulnerabilities and kill chains. The collected dataset contains 100 million events over 3 days.

(1.2)DARPA TC数据集。(1.2) DARPA TC dataset.

DARPA TC数据集致力于开发高级持久性威胁(Advanced Persistent Threat，APT)的取证和检测。此数据集记录不同操作系统(例如Linux和Windows)上各种漏洞攻击的攻击痕迹。根据攻击描述，排除了失败的攻击，并在评估中使用了8次攻击(5000万次事件)。The DARPA TC dataset is dedicated to the development of forensics and detection of Advanced Persistent Threats (APTs). This dataset records attack traces of various exploits on different operating systems such as Linux and Windows. Based on the attack description, failed attacks were excluded and 8 attacks (50 million events) were used in the evaluation.

(2)实验效果。(2) Experimental effect.

(a)在概要依赖关系图方面的总体效果。(a) Overall effect on summary dependency graphs.

应用本发明提供的用于攻击调查和还原的系统日志依赖图的概要图生成方法为表3所示的依赖关系图生成概要图，并测量监测到的社区数量及其规模。The method provided by the present invention for generating a summary graph of a system log dependency graph for attack investigation and restoration is applied to generate summary graphs for the dependency graphs shown in Table 3, and the number and size of the detected communities are measured.

表3攻击依赖关系图的统计Table 3 Statistics of attack dependency graph

图10是本发明提供的概要图生成方法监测的子图数量示意图，如图10所示，本发明提供的概要图生成方法将依赖关系图平均划分为18.4个社区(子图)。与原始依赖关系图相比，平均有1302.1个节点，其大小缩小了70.7倍。这些结果表明，使用更小的社区数量，可以将所有社区可视化，以便可以轻松查看所有相关系统活动的概要。另外，从图10中还可以看到，网络钓鱼电子邮件(C.S)的最大社区数量为48个，其中包括不同的系统任务(例如，在firefox中浏览网页、发送或接收电子邮件和日历服务)。FIG. 10 is a schematic diagram of the number of subgraphs detected by the summary graph generation method provided by the present invention. As shown in FIG. 10, the method divides each dependency graph into 18.4 communities (subgraphs) on average. Compared with the original dependency graphs, which have 1302.1 nodes on average, this is a 70.7-fold reduction in size. These results show that, because the number of communities is small, all communities can be visualized so that an overview of all relevant system activities can be viewed easily. In addition, FIG. 10 shows that the phishing email attack (C.S) has the largest number of communities, 48, covering different system tasks (e.g., browsing the web in firefox, sending or receiving emails, and calendar services).

图11是本发明提供的概要图生成方法监测的子图大小分布示意图，如图11所示，显示了14次攻击的社区大小分布，从图中可以看出，社区(子图)规模相对较小(平均15.7个节点)，这大大减少了检查每个社区的工作量。与原始依赖关系图相比，这些结果还表明社区压缩在压缩重复边方面非常有效，每个社区平均减少216.4条冗余边。此外，概要图平均只需要2.26MB来存储概要图，而原始依赖关系图平均需要344.32MB。FIG. 11 is a schematic diagram of the size distribution of the subgraphs detected by the summary graph generation method provided by the present invention. FIG. 11 shows the community size distribution for the 14 attacks. As can be seen from the figure, the communities (subgraphs) are relatively small (15.7 nodes on average), which greatly reduces the effort of inspecting each community. Compared with the original dependency graphs, these results also show that community compression is very effective at compressing repeated edges, removing 216.4 redundant edges per community on average. In addition, a summary graph requires only 2.26 MB of storage on average, whereas an original dependency graph requires 344.32 MB on average.

表4是分别使用本发明方法和Nodoze生成的边统计表，如表4所示，将所有社区的top-1、top-2和top-3信息流中的事件数量与最先进的依赖图简化方法NoDoze确定的事件进行了比较，Nodoze从良性系统行为中学习执行配置文件，并基于使用依赖关系图中每个路径的配置文件计算的异常分数来减少依赖关系图。Table 4 gives the edge statistics produced by the method of the present invention and by NoDoze. As shown in Table 4, the number of events in the top-1, top-2, and top-3 information flows of all communities is compared with the number of events identified by NoDoze, a state-of-the-art dependency graph reduction method. NoDoze learns execution profiles from benign system behavior and reduces the dependency graph based on anomaly scores computed from those profiles for each path in the dependency graph.

表4本发明方法和Nodoze生成的边统计表Table 4 The edge statistics table generated by the method of the present invention and Nodoze

As can be seen from Table 4, the top-3 information flows produced by the summary graph generation method provided by the present invention contain, on average, 21 times fewer edges than the output of NoDoze. NoDoze performs relatively poorly because its effectiveness depends heavily on whether the execution profiles cover all benign events and are representative, which is very difficult to achieve given the variety of runtime environments of most systems. Execution profiles learned from one system are therefore hard to generalize to other systems. The method provided by the present invention is not subject to this limitation, because it does not require additional execution profiles.

(b) Cooperation with the HOLMES method.

The summary graph generation method provided by the present invention is combined with HOLMES, one of the state-of-the-art investigation techniques. HOLMES builds a High-level Scenario Graph (HSG) that integrates Tactics, Techniques and Procedures (TTPs), an important means of describing the steps of an Advanced Persistent Threat (APT), and uses the HSG to map low-level event information flows to steps in the kill chain.

HSGs are first constructed for the 14 attack cases, and the HSGs are then used to map the top-ranked information flows to steps in the kill chain. The results show that HOLMES identifies 35 of the 37 attack-related communities, with a recall of 96.2%, and that the top-2 information flows are sufficient to find the kill chain. Moreover, with the summary graph generation method provided by the present invention, the attack-related communities missed by HOLMES can still be identified easily, because their information flows usually come from one attack-related community and go into another attack-related community, which makes them indispensable steps in the attack chain. These results show that the summary graph generation method provided by the present invention can readily cooperate with other automated techniques to highlight attack-related communities, and helps identify further attack-related communities that the automated techniques do not find.

(c) Comparison of community detection.

The summary graph generation method provided by the present invention is compared with other state-of-the-art community detection algorithms to verify the effectiveness of its community detection technique. Considering the overlapping nature of dependency graphs, nine typical overlapping community detection algorithms are selected as baselines: NISE (2016), EgoSpliter (2017), NMNF (2017), DANMF (2018), PMCV (2019), CGAN (2019), VGRAPH (2019), CNRL (2019) and DeepWalk (2014). The F1 score is used to evaluate the overall correspondence between the detected communities and the labeled ground-truth communities. The experimental results are shown in Table 5.

Table 5: Community detection results for the 14 attack events

As can be seen from Table 5, the F1 scores obtained by the summary graph generation method provided by the present invention are on average 2.29 times higher than those of the baselines, which indicates that the method can effectively detect communities centered on process nodes, whereas the baselines perform poorly.
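The text does not state which F1 variant is used to compare detected and ground-truth communities. As a minimal sketch, assuming the common symmetric "average best-match F1" for overlapping communities, the score could be computed as follows; the function names and the example process names are illustrative only.

```python
from typing import List, Set


def f1(detected: Set[str], truth: Set[str]) -> float:
    """F1 score between one detected community and one ground-truth community."""
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def average_f1(detected: List[Set[str]], truth: List[Set[str]]) -> float:
    """Symmetric average of best-match F1 (an assumed variant, not from the source)."""
    if not detected or not truth:
        return 0.0
    # Match every detected community to its best ground-truth community ...
    d_to_t = sum(max(f1(d, t) for t in truth) for d in detected) / len(detected)
    # ... and every ground-truth community to its best detected community.
    t_to_d = sum(max(f1(d, t) for d in detected) for t in truth) / len(truth)
    return 0.5 * (d_to_t + t_to_d)


# Hypothetical communities given as sets of process names.
detected_communities = [{"bash", "tar", "gpg"}, {"sshd", "scp"}]
truth_communities = [{"bash", "tar", "gpg", "curl"}, {"sshd", "scp"}]
score = average_f1(detected_communities, truth_communities)
print(f"average F1: {score:.3f}")
```

Because communities are plain sets, a node may belong to several of them, which suits the overlapping setting described above.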

(d) Effect of community compression.

FIG. 12 is a schematic diagram of the distribution of node compression rates achieved by the summary graph generation method provided by the present invention, and FIG. 13 is a schematic diagram of the distribution of edge compression rates. As shown in FIG. 12 and FIG. 13, the number of nodes and the number of edges in a community are reduced by 38.4% and 44.7% on average, respectively, with maximum reductions of 97.3% of nodes and 98.9% of edges. It was also verified that the information flows are unchanged after compression. The reason is that repeated activities have the same information flow: the flow usually enters the subgraph formed by the repeated activities through a single node and leaves it through another single node, so compressing the repeated activities does not change the events in the information flow. In short, compressing these repeated activities preserves the semantics of the task that the community represents.
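One kind of repeated activity that the claims describe is a single process accessing several different resource nodes, which are merged into one node whose attribute is the union of the originals. The following is a minimal sketch of that idea and of the resulting edge compression rate. The edge representation, the function name, and the extra restriction that only resources touched by exactly one (process, operation) pair are merged are assumptions made here to keep the information flow intact; they are not taken from the original.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]                    # (process, resource, operation)
MergedEdge = Tuple[str, FrozenSet[str], str]   # resource side becomes a set of names


def merge_resources(edges: List[Edge]) -> List[MergedEdge]:
    """Merge resources that one process accesses with the same operation."""
    # Record every (process, operation) pair that touches each resource.
    users: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for proc, res, op in edges:
        users[res].add((proc, op))

    groups: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    shared: List[MergedEdge] = []
    for proc, res, op in edges:
        if len(users[res]) == 1:
            # Private to this process and operation: safe to merge.
            groups[(proc, op)].add(res)
        else:
            # Shared resources carry flow between processes, so keep them as-is.
            shared.append((proc, frozenset({res}), op))

    merged = [(proc, frozenset(names), op) for (proc, op), names in groups.items()]
    return merged + shared


# Hypothetical community: tar reads three files, writes an archive that gpg reads.
edges = [
    ("tar", "/home/a.txt", "read"),
    ("tar", "/home/b.txt", "read"),
    ("tar", "/home/c.txt", "read"),
    ("tar", "/tmp/out.tar", "write"),
    ("gpg", "/tmp/out.tar", "read"),
]
compressed = merge_resources(edges)
edge_compression_rate = 1 - len(compressed) / len(edges)
print(f"edges: {len(edges)} -> {len(compressed)} ({edge_compression_rate:.0%} removed)")
```

In this example the three file reads collapse into one edge, while the archive that links tar to gpg is left untouched, so the path from the input files to gpg is still present after compression.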

(e) Effectiveness of information flow ranking.

Table 6 shows the top-3 information flows of community C3, which contains attack-related events, and of community C8, which does not. The events in C3 indicate that the attacker ran a malicious script to compress and encrypt sensitive files and upload them to a remote server. As the table shows, the top-1 information flow, with a priority of 0.8234, represents these attack behaviors effectively. Although the top-2 and top-3 flows also cover these behaviors, the input node of the top-1 flow is the malicious script process, which is more helpful for tracing back to the community that created the malicious script. The events in C8 show that a user logged in to their own host via sshd, transferred a compressed file from the server to the host, and then decompressed the file. As the table shows, the top-1 information flow, which has the highest priority (0.4914), represents all of these activities, whereas the top-2 flow lacks the sshd login event, and the top-3 flow lacks the sshd login event and contains a file event (/dev/null！bash) that appears in many communities.
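Selecting the top-ranked information flows of a community amounts to sorting the flows by priority and keeping the first few. The sketch below illustrates this. Only the priority 0.8234 is taken from the text above; the other priorities, the event strings, and all names in the code are hypothetical, and how priorities are computed is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InformationFlow:
    events: Tuple[str, ...]   # ordered events along the flow
    priority: float           # higher means more representative of the community


def top_k_flows(flows: List[InformationFlow], k: int = 3) -> List[InformationFlow]:
    """Return the k flows with the highest priority, best first."""
    return sorted(flows, key=lambda flow: flow.priority, reverse=True)[:k]


# Hypothetical flows for a community such as C3.
flows = [
    InformationFlow(("file -> tar", "tar -> archive"), 0.47),
    InformationFlow(("script -> tar", "tar -> gpg", "gpg -> curl"), 0.8234),
    InformationFlow(("file -> bash",), 0.12),
    InformationFlow(("tar -> gpg", "gpg -> curl"), 0.61),
]
top3 = top_k_flows(flows, k=3)
for rank, flow in enumerate(top3, start=1):
    print(f"top-{rank}: priority={flow.priority} events={flow.events}")
```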

Table 6: Top-3 information flows of community C3 (containing an attack) and community C8 (containing no attack)

In the summary graph generation method for a system log dependency graph for attack investigation and recovery provided by the present invention, a system entity dependency graph of the attack event to be investigated and restored is determined; a hierarchical random walk is performed on the process nodes in the system entity dependency graph to determine the behavior representation of the process nodes; the system entity dependency graph is divided into at least one first subgraph based on the behavior representation of the process nodes; each first subgraph is compressed to obtain at least one second subgraph; and finally a summary of each second subgraph is generated, thereby obtaining the summary graph corresponding to the system entity dependency graph. The summary graph is generated by dividing the system entity dependency graph into multiple subgraphs and providing a concise summary for each subgraph, where each subgraph contains only closely related processes that jointly accomplish a system task. The generated summary graph preserves the semantics of the system activities in the system entity dependency graph while hiding less important details, and presents them visually in summary form. This not only reduces the size of the dependency graph but also makes it easy to view an overview of the relevant system activities and the summary information of attack-related communities.
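The walk-generation step can be illustrated with a small example. The sketch below performs plain uniform random walks of a preset length starting from each process node; it is a simplified stand-in, not the hierarchical walk of the invention, whose transition rules are not given in this section. The adjacency structure, node names, and function name are assumptions made for illustration. The resulting walks are the kind of node sequences that a word vector model could consume to produce behavior representations.

```python
import random
from typing import Dict, List


def random_walks(
    adjacency: Dict[str, List[str]],
    start_nodes: List[str],
    length: int,
    seed: int = 0,
) -> List[List[str]]:
    """Generate one uniform random walk of at most `length` nodes per start node."""
    rng = random.Random(seed)   # seeded for reproducible walks
    walks = []
    for start in start_nodes:
        walk = [start]
        while len(walk) < length:
            neighbors = adjacency.get(walk[-1], [])
            if not neighbors:
                break           # dead end: stop this walk early
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks


# Hypothetical undirected dependency graph with process and resource nodes.
adjacency = {
    "bash": ["tar", "/etc/passwd"],
    "tar": ["bash", "/tmp/out.tar"],
    "gpg": ["/tmp/out.tar"],
    "/etc/passwd": ["bash"],
    "/tmp/out.tar": ["tar", "gpg"],
}
process_nodes = ["bash", "tar", "gpg"]
walks = random_walks(adjacency, process_nodes, length=5)
for walk in walks:
    print(" -> ".join(walk))
```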

The summary graph generation apparatus for a system log dependency graph for attack investigation and recovery provided by the present invention is described below. The apparatus described below and the summary graph generation method described above may be referred to in correspondence with each other.

FIG. 14 is a schematic structural diagram of the summary graph generation apparatus provided by the present invention. As shown in FIG. 14, the apparatus includes a first determining module 1410, a second determining module 1420, a subgraph division module 1430, a subgraph compression module 1440 and a summary graph generation module 1450, wherein:

the first determining module 1410 is configured to determine a system entity dependency graph of an attack event to be investigated and restored, the system entity dependency graph including system entity nodes associated with the attack event to be investigated and restored and calling relationships between the system entity nodes, wherein the system entity nodes include process nodes and resource nodes, and the calling relationships between the system entity nodes represent system activities;

the second determining module 1420 is configured to perform a hierarchical random walk on the process nodes in the system entity dependency graph and determine the behavior representation of the process nodes;

the subgraph division module 1430 is configured to cluster the process nodes based on the behavior representation of the process nodes and, based on the clustering result, divide the system entity dependency graph into at least one first subgraph;

the subgraph compression module 1440 is configured to compress each first subgraph of the at least one first subgraph to obtain at least one second subgraph, the at least one second subgraph being in one-to-one correspondence with the at least one first subgraph;

the summary graph generation module 1450 is configured to generate a summary corresponding to each second subgraph of the at least one second subgraph and obtain the summary graph corresponding to the system entity dependency graph.
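According to the claims, the summary of a second subgraph includes at least one of a main process, a time span, and the target information flows. A minimal sketch of a data structure holding such a summary is shown below. The `Activity` record, the heuristic of choosing as main process a process whose parent lies outside the subgraph, and all names are assumptions made for illustration rather than details taken from the original.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Activity:
    process: str            # process performing the activity
    parent: Optional[str]   # parent process, or None if unknown
    start: float            # start timestamp (seconds)
    end: float              # end timestamp (seconds)


@dataclass(frozen=True)
class Summary:
    main_process: str
    time_span: Tuple[float, float]   # (earliest start, latest end)
    target_flows: List[str]          # top-ranked information flows


def summarize(activities: List[Activity], ranked_flows: List[str], k: int = 3) -> Summary:
    """Build a summary for one compressed subgraph (illustrative heuristic)."""
    inside = {a.process for a in activities}
    # Assumed heuristic: the main process is one whose parent is outside the subgraph.
    roots = [a.process for a in activities if a.parent not in inside]
    main = roots[0] if roots else min(activities, key=lambda a: a.start).process
    span = (min(a.start for a in activities), max(a.end for a in activities))
    return Summary(main_process=main, time_span=span, target_flows=ranked_flows[:k])


# Hypothetical subgraph: bash spawns tar and gpg.
activities = [
    Activity("bash", None, 100.0, 180.0),
    Activity("tar", "bash", 110.0, 140.0),
    Activity("gpg", "bash", 145.0, 175.0),
]
summary = summarize(activities, ["script -> tar -> gpg", "tar -> gpg", "file -> tar", "file -> bash"])
print(summary)
```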

In the summary graph generation apparatus for a system log dependency graph for attack investigation and recovery provided by the present invention, a system entity dependency graph of the attack event to be investigated and restored is determined; a hierarchical random walk is performed on the process nodes in the system entity dependency graph to determine the behavior representation of the process nodes; the system entity dependency graph is divided into at least one first subgraph based on the behavior representation of the process nodes; each first subgraph is compressed to obtain at least one second subgraph; and finally a summary of each second subgraph is generated, thereby obtaining the summary graph corresponding to the system entity dependency graph. The summary graph is generated by dividing the system entity dependency graph into multiple subgraphs and providing a concise summary for each subgraph, where each subgraph contains only closely related processes that jointly accomplish a system task. The generated summary graph preserves the semantics of the system activities in the system entity dependency graph while hiding less important details, and presents them visually in summary form. This not only reduces the size of the system entity dependency graph but also makes it easy to view an overview of the relevant system activities and the summary information of attack-related communities.

FIG. 15 is a schematic diagram of the physical structure of the electronic device provided by the present invention. As shown in FIG. 15, the electronic device may include a processor 1510, a communications interface 1520, a memory 1530 and a communication bus 1540, wherein the processor 1510, the communications interface 1520 and the memory 1530 communicate with one another through the communication bus 1540. The processor 1510 may invoke logic instructions in the memory 1530 to perform the summary graph generation method for a system log dependency graph for attack investigation and recovery described above, the method including:

determining a system entity dependency graph of an attack event to be investigated and restored, the system entity dependency graph including system entity nodes associated with the attack event to be investigated and restored and calling relationships between the system entity nodes, wherein the system entity nodes include process nodes and resource nodes, and the calling relationships between the system entity nodes represent system activities;

clustering the process nodes based on the behavior representation of the process nodes and, based on the clustering result, dividing the system entity dependency graph into at least one subgraph;

In addition, when the logic instructions in the memory 1530 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer is able to perform the summary graph generation method for a system log dependency graph for attack investigation and recovery described above, the method including:

clustering the process nodes based on the behavior representation of the process nodes and, based on the clustering result, dividing the system entity dependency graph into at least one first subgraph;

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the summary graph generation method for a system log dependency graph for attack investigation and recovery described above, the method including:

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, and may of course also be implemented in hardware. Based on this understanding, the above technical solutions in essence, or the parts of them that contribute to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for generating a summary graph of a system log dependency graph for attack investigation and recovery, comprising:

determining a system entity dependency relationship graph of an attack event to be investigated and restored, wherein the system entity dependency relationship graph comprises system entity nodes associated with the attack event to be investigated and restored and calling relationships among the system entity nodes; the system entity nodes comprise process nodes and resource nodes, and the calling relationships among the system entity nodes represent system activities;

performing a hierarchical random walk on the process nodes in the system entity dependency relationship graph, and determining the behavior representation of the process nodes;

clustering the process nodes based on the behavior representation of the process nodes, and dividing the system entity dependency relationship graph into at least one first subgraph based on the clustering result;

compressing each first sub-graph of the at least one first sub-graph to obtain at least one second sub-graph, wherein the at least one second sub-graph is in one-to-one correspondence with the at least one first sub-graph;

and generating a summary corresponding to each second sub-graph in the at least one second sub-graph, and obtaining a summary graph corresponding to the system entity dependency relationship graph.

2. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 1, wherein the performing a hierarchical random walk on the process nodes in the system entity dependency graph to determine the behavior representation of the process nodes comprises:

performing a random walk of a preset length starting from each process node in the system entity dependency relationship graph to generate a walk path;

and acquiring the behavior representation of the process nodes by applying a word vector model to the walk path.

3. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 1, wherein the compressing each of the at least one first sub-graph to obtain at least one second sub-graph comprises:

determining a first pattern in a target sub-graph of the at least one first sub-graph, the first pattern comprising: a pattern in which the same process node generates at least two identical process node sets that access the same resource node, wherein each process node set comprises at least one child process node, and the resource node comprises a file node or a network node;

and merging the identical child process nodes in the first pattern, merging the edges connecting the child process nodes, completing compression of the target sub-graph, and acquiring a second sub-graph corresponding to the target sub-graph.

4. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery according to claim 1, wherein the compressing each of the at least one first sub-graph to obtain at least one second sub-graph comprises:

determining a second pattern in a target sub-graph of the at least one first sub-graph, the second pattern comprising: a pattern in which the same process node accesses different resource nodes at least twice, wherein the resource nodes comprise file nodes or network nodes;

and merging the different resource nodes in the second pattern, completing the compression of the target sub-graph, and acquiring a second sub-graph corresponding to the target sub-graph.

5. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery according to any of claims 1-4, wherein prior to the compressing each of the at least one first sub-graph, the method further comprises:

under the condition that at least two process nodes access the same resource node and come from different first subgraphs, creating at least one copy node of the resource node;

and allocating the resource node and the at least one copy node to the first subgraph in which the at least two process nodes are positioned in a one-to-one correspondence manner, and creating a directional edge between the resource node and the at least one copy node to connect the resource node and the copy node.

6. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 1, wherein prior to performing the hierarchical random walk on the process nodes in the system entity dependency graph, the method further comprises:

merging the parallel edges between each process node and the resource nodes in the system entity dependency graph, wherein the parallel edges comprise: edges having the same read operation type or the same write operation type.

7. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 1, wherein prior to performing the hierarchical random walk on the process nodes in the system entity dependency graph, the method further comprises:

and deleting the resource nodes which only have input edges but not output edges in the system entity dependency relationship graph.

8. The method of generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 1, wherein the summary includes at least one of:

a main process;

a time span;

a target information stream;

wherein the main process represents a parent process node of system activity included in the second subgraph;

the time span represents a time interval between an earliest start time and a latest end time of system activity included in the second subgraph;

and the target information flow represents the information flows whose priority ranks within a preset number of top positions among all the information flows corresponding to the system activity in the second subgraph.

9. The method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in claim 4, wherein the merging the different resource nodes in the second pattern comprises:

merging the different resource nodes in the second pattern into one node as a merged resource node, wherein the attribute of the merged resource node is a union of the attributes of the different resource nodes.

10. An apparatus for generating a summary graph of a system log dependency graph for attack investigation and recovery, comprising:

the first determining module is used for determining a system entity dependency relationship graph of an attack event to be investigated and restored, wherein the system entity dependency relationship graph comprises system entity nodes associated with the attack event to be investigated and restored and calling relationships among the system entity nodes; the system entity nodes comprise process nodes and resource nodes, and the calling relationships among the system entity nodes represent system activities;

the second determination module is used for executing hierarchical random walking on the process nodes in the system entity dependency relationship graph and determining the behavior representation of the process nodes;

the subgraph division module is used for clustering the process nodes based on the behavior representation of the process nodes and dividing the system entity dependency relationship graph into at least one first subgraph based on the clustering result;

the subgraph compression module is used for compressing each first subgraph in the at least one first subgraph to obtain at least one second subgraph, and the at least one second subgraph is in one-to-one correspondence with the at least one first subgraph;

and the summary graph generation module is used for generating a summary corresponding to each second subgraph in the at least one second subgraph and obtaining a summary graph corresponding to the system entity dependency relationship graph.

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in any one of claims 1 to 9.

12. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in any one of claims 1 to 9.

13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method for generating a summary graph of a system log dependency graph for attack investigation and recovery as claimed in any one of claims 1 to 9.
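By way of illustration only, the two preprocessing steps recited in claims 6 and 7 (merging parallel edges and deleting resource nodes that have only input edges) can be sketched as follows. The edge representation, in which an edge points in the direction of information flow and carries an operation type, and all function and node names are assumptions made for this sketch and are not part of the claims.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]   # (source, target, operation), in the direction of data flow


def merge_parallel_edges(edges: List[Edge]) -> List[Edge]:
    """Keep one edge per (source, target, operation), preserving first-seen order."""
    seen: Set[Edge] = set()
    merged: List[Edge] = []
    for edge in edges:
        if edge not in seen:
            seen.add(edge)
            merged.append(edge)
    return merged


def drop_sink_resources(edges: List[Edge], resources: Set[str]) -> List[Edge]:
    """Remove resource nodes that have input edges but no output edges."""
    has_output = {source for source, _, _ in edges}
    sinks = {node for node in resources if node not in has_output}
    return [edge for edge in edges if edge[1] not in sinks]


# Hypothetical log: bash writes the same log file three times and reads one file.
raw_edges = [
    ("bash", "/tmp/log", "write"),
    ("bash", "/tmp/log", "write"),
    ("/etc/passwd", "bash", "read"),
    ("bash", "/tmp/log", "write"),
]
resource_nodes = {"/tmp/log", "/etc/passwd"}

deduplicated = merge_parallel_edges(raw_edges)
pruned = drop_sink_resources(deduplicated, resource_nodes)
print(deduplicated)
print(pruned)
```

In this example the repeated writes collapse into a single edge, and the log file, which is only written and never read, is removed because no information flows onward from it.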