CN116232708A

CN116232708A - A text-based threat intelligence-based attack chain construction and attack source tracing method and system

Info

Publication number: CN116232708A
Application number: CN202310124597.XA
Authority: CN
Inventors: �田润; 连一峰; 彭媛媛; 张海霞; 黄克振; 张立武
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2023-02-02
Filing date: 2023-02-02
Publication date: 2023-06-06

Abstract

The invention provides a method and system for attack chain construction and attack source tracing based on text-based threat intelligence. The method comprises the following steps: first, collecting text-based threat intelligence on attack events of known attack organizations and preprocessing the collected data so that it meets the requirements of subsequent analysis; second, extracting network security entities and relations from the collected text-based threat intelligence to form network security event description triples; then, constructing a security event attack chain from these triples based on the timeline and logical order; and finally, training an attack organization feature model on the security event attack chains and using the model to predict the source of attack events of unknown origin. Starting from text-based threat intelligence, the invention extracts network security event attack chains and, combined with known attack sources, builds an attack feature model for each attack organization, which effectively supports network security personnel in tracing and attributing responsibility for network security events of unknown origin.

Description

A text-based threat intelligence-based attack chain construction and attack source tracing method and system

Technical field

The invention provides a method and system for attack chain construction and attack source tracing based on text-based threat intelligence, and belongs to the technical field of network security.

Background

With the rapid development of network information technology, network attacks are growing in frequency and intensity, and the demands on network security monitoring and prevention capabilities keep rising; tracing the source of attack events is an important part of this work. As attack techniques improve, network attacks have evolved from single attacks into organized, planned multi-step attacks such as advanced persistent threats (APT). In recent years, using big data analysis, machine learning, and related techniques to discover the malicious behaviors in a network attack chain and the relationships among them, and thereby to reconstruct and trace the attack chain, has remained a difficult problem in network security. At present, descriptions of attack events circulate within the network security industry mainly as text-based threat intelligence. Converting such unstructured data into attack chain information is usually done manually, with a low degree of automation. It is therefore necessary to use natural language processing to automatically extract network security event attack chains from unstructured text such as text-based threat intelligence, improving the efficiency of network security event analysis.

Traditional attack-chain-based source tracing analyzes the attack chain of a single security event. It lacks horizontal comparison and correlation analysis with other security events, as well as correlation analysis with the attack organization behind the event. However, the same attack goal can be achieved in many different ways, and each attack organization has its own attack chain preferences, personnel, and arsenal. This characteristic information is highly valuable for identifying attack organizations. It is therefore necessary to model known attack organizations based on attack chain information and to use these models to predict the source of future security event attack chains, providing a basis and support for network security source tracing and accountability.

Summary of the invention

The invention provides a method and system for attack chain construction and attack source tracing based on text-based threat intelligence. Based on text-based threat intelligence provided by security organizations such as network security service providers, it extracts security event attack chains and builds attack path models for attack organizations, thereby enabling source-tracing prediction for the attack chains of future security events.

To achieve the above purpose, the invention provides a method for attack chain construction and attack source tracing based on text-based threat intelligence, comprising the following steps:

collecting text-based threat intelligence on attack events of known attack organizations, and preprocessing the collected text-based threat intelligence so that it meets the requirements of subsequent analysis;

extracting network security entities and relations from the collected text-based threat intelligence to form network security event description triples;

constructing a security event attack chain from the extracted network security event description triples, based on the timeline and logical order;

training an attack organization feature model on the security event attack chains of known attack organizations, and performing source-tracing prediction for attack events of unknown origin based on the trained attack organization feature model.

Further, the data preprocessing includes operations such as completion, deduplication, and correction of the data.

Further, forming the network security event description triples includes the following steps: from the collected and preprocessed text-based threat intelligence, natural language processing techniques such as BERT and a bidirectional long short-term memory network (BiLSTM) are used to extract network security entities such as attack organizations, IP addresses, attack tools, and attack methods; distant supervision is then used to extract the relations between entities, forming network security event description triples in entity-relation-entity format.

Further, constructing the security event attack chain includes the following steps: based on manually defined attack sub-event definitions, the network security event description triples are divided into different network security sub-events; then, based on the timeline and the logic of the text description, the sub-events are arranged according to their actual order of occurrence and causal logic, forming an attack chain description of the network security event.

Further, training the attack organization feature model on the security event attack chains of known attack organizations includes the following steps: the network security sub-event categories are organized into a Bayesian conditional probability network, the security event attack chains of a known attack organization are input, and the conditional probabilities of the nodes are trained to serve as the attack feature profile of that attack organization, that is, the attack organization feature model.

Further, performing source-tracing prediction for attack events of unknown origin based on the trained attack organization feature model includes: converting the text-based threat intelligence on an attack event of unknown origin into a security event attack chain through the preceding steps, matching it against the attack feature profiles of known attack organizations, and selecting the attack organization with the highest confidence as the source-tracing prediction result.

Based on the same inventive concept, the invention also provides a system for attack chain construction and attack source tracing based on text-based threat intelligence that uses the above method, comprising:

a data collection and preprocessing module, used to collect text-based threat intelligence on attack events of known attack organizations and to preprocess the collected text-based threat intelligence so that it meets the requirements of subsequent model analysis;

an entity and relation extraction module, used to extract network security entities and relations to form network security event description triples;

an attack chain generation module, used to summarize network security sub-events and, based on the timeline and logical order, construct a security event attack chain that describes the course of the security event;

a feature model training and source-tracing prediction module, used to train the attack organization feature model on the generated security event attack chains and to perform source-tracing prediction for network security events of unknown attack source.

Starting from text-based threat intelligence, the invention extracts network security event attack chains, describes the course and internal logic of security events as a whole, and, combined with known attack sources, builds an attack feature model for each attack organization, effectively supporting network security personnel in tracing and attributing responsibility for network security events of unknown origin.

Description of drawings

FIG. 1 is a flowchart of the method of the invention for attack chain construction and attack source tracing based on text-based threat intelligence.

Detailed description

To make the technical solution of the invention easier to understand, specific embodiments are described in detail below with reference to the accompanying drawing.

The flow of the method of the invention for attack chain construction and attack source tracing based on text-based threat intelligence is shown in FIG. 1 and includes the following steps:

Step 1: Data collection and preprocessing

Text-based threat intelligence on a security event includes the sequence of actions, the impact on the attacked system, indicators of compromise (IOC), and so on. Of these, the description of the order of the attacker's actions and of the targets in the attack event is the most valuable for subsequent analysis and modeling. Data preprocessing removes irrelevant data, duplicate data, abnormal symbols, and the like from the text-based threat intelligence, completes partially missing information from existing network security databases, and resolves inconsistencies, so as to improve the quality of the collected data and ensure its completeness and accuracy. Data cleaning includes but is not limited to the following:

1) Handling of missing information in the collected data should be supported, for example information that is not given in detail in the threat intelligence, or is marked as unknown, but can be looked up in existing network security databases.

2) Handling of abnormal information in the collected data should be supported. Abnormal information includes duplicate information and erroneous information. Duplicate information may appear when text-based threat intelligence from multiple sources is merged; erroneous information, such as wrong or garbled characters in the retrieved threat intelligence, may result from an insufficiently robust data collection module.

3) Cleaning of unneeded information in the collected data should be supported. To prevent redundant information from degrading the accuracy of the algorithm, unneeded information in the collected data must be deleted.
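The three cleaning requirements above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the report layout (a dict with a `text` field and optional `cve` and `affected_product` fields), the `KNOWN_FACTS` table standing in for an existing network security database, and the function names. The source does not specify any of these.

```python
import re
import unicodedata

# Hypothetical stand-in for an "existing network security database".
KNOWN_FACTS = {"CVE-2017-0144": {"affected_product": "Microsoft SMBv1"}}


def clean_text(text: str) -> str:
    """Normalize Unicode, drop control and replacement characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    text = "".join(ch for ch in text if ch.isprintable() and ch != "\ufffd")
    return re.sub(r" {2,}", " ", text).strip()


def deduplicate(reports):
    """Keep the first report for each distinct cleaned, case-folded text."""
    seen, unique = set(), []
    for report in reports:
        key = clean_text(report["text"]).lower()
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


def complete(report):
    """Fill fields that are missing or marked 'unknown' from the lookup table."""
    filled = dict(report)
    for field, value in KNOWN_FACTS.get(report.get("cve", ""), {}).items():
        if filled.get(field) in (None, "", "unknown"):
            filled[field] = value
    return filled


def preprocess(reports):
    """Deduplicate, clean, and complete a list of threat intelligence reports."""
    return [complete({**r, "text": clean_text(r["text"])}) for r in deduplicate(reports)]
```

Exact-match deduplication on normalized text is the simplest choice; reports merged from multiple sources would likely need fuzzy matching, which this sketch does not attempt.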

Step 2: Extracting network security event description triples

This embodiment follows relevant standards and specifications in the network security field, such as the ISO/IEC 27000 family of information security management system standards, the network security threat information format specification GB/T36643-2018, the information security technology terminology standard GB/T25069-2019, the information security technology specification for the definition and description of network attacks GB/T37027-2018, and some public security industry standards. From these it determines the core ontology, concepts, and terms of the network security field, builds a network security knowledge ontology model, and defines the set of attributes.

For example:

(1) Network assets: the hardware devices, software, network environments, virtual personas, and so on in cyberspace.

(2) Vulnerability: flaw-type and weakness-type vulnerabilities, such as software vulnerabilities, system configurations, and protection software.

(3) Network attacks: attackers, attack methods, exploited tools, attack events, attack consequences, and so on. Attackers include individuals, groups, and hacker organizations. Attack methods are the means used in the attack, such as denial-of-service attacks, backdoor attacks, vulnerability exploitation, network scanning and eavesdropping, phishing, interference events, advanced threat events, and other network attack events. Exploited tools include both legitimate software and malware.

The specific method for extracting network security event description triples includes:

1) Network security entity extraction

First, the input text-based threat intelligence is split by character into a character sequence S = {s_1, s_2, ..., s_n}, and the BERT model is used to embed the characters as vectors; special proper nouns and English words are each embedded as a single character. To obtain richer features, the input E_i of the BERT model is the sum of a character's token embedding, segment embedding, and position embedding. The model is pre-trained with the masked language model method: during training, characters in the training sequence are selected for masking with a probability of 15%, and the model predicts the original word at each masked position. Of the selected positions, 80% are replaced with the "[MASK]" token and 20% are replaced with other random characters, so that BERT remains sensitive to all characters and the efficiency of the vector embedding improves.
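The masking scheme described above can be sketched as follows. This is an illustration of the 15% selection and the 80%/20% split as stated in this text, not the patent's implementation; the function name, the `vocab` argument, and the use of Python's `random` module are assumptions. The original BERT pre-training recipe differs slightly: it splits the selected positions 80% `[MASK]`, 10% random token, and 10% unchanged.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, mask_prob=0.8, rng=None):
    """Return (masked_tokens, labels) for masked-language-model training.

    labels[i] holds the original token at each selected position and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected: left unchanged, nothing to predict
        labels[i] = token
        if rng.random() < mask_prob:
            masked[i] = MASK  # 80% of selected positions become [MASK]
        else:
            # remaining 20%: replace with a different random token from the vocabulary
            candidates = [v for v in vocab if v != token]
            masked[i] = rng.choice(candidates) if candidates else token
    return masked, labels
```

Passing a seeded `random.Random` as `rng` makes the masking reproducible, which is convenient when debugging a data pipeline.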

Next, the character vector sequence X = {x_1, x_2, ..., x_n} output by BERT is used as the input of the BiLSTM neural network. The BiLSTM consists of a forward LSTM network and a backward LSTM network. During feature extraction, X is fed into both networks, the hidden-layer vectors they compute for the current time step are concatenated, and the activation o_t = tanh(W_h h_t + b_o) is applied to the concatenated vector, where o_t is the output at the current time step, h_t is the hidden-layer vector at the current time step, and W_h and b_o are the weight matrix and bias term of the output gate, respectively.
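As a small worked illustration of the output formula o_t = tanh(W_h h_t + b_o), the sketch below concatenates a forward and a backward hidden vector and applies the activation. It covers only this final step, not the LSTM recurrences; the function name and the plain-list representation are assumptions chosen to keep the example dependency-free.

```python
import math


def bilstm_output(h_forward, h_backward, W_h, b_o):
    """Concatenate the two hidden vectors into h_t and return tanh(W_h h_t + b_o)."""
    h_t = list(h_forward) + list(h_backward)
    return [
        math.tanh(sum(w * x for w, x in zip(row, h_t)) + bias)
        for row, bias in zip(W_h, b_o)
    ]
```

With `h_forward = [1.0]`, `h_backward = [-1.0]`, `W_h = [[1, 1], [1, 0]]`, and `b_o = [0, 0]`, the first output component is tanh(0) = 0 and the second is tanh(1).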

Finally, the entity information in the natural language text is manually annotated, and a conditional random field (CRF) is used to constrain the order in which annotation labels may appear. The CRF takes the output of the BiLSTM neural network as its input. For each candidate label sequence of the character sequence, the score equals the sum of the per-character prediction scores and the transition scores between consecutive labels; the candidate sequence with the highest score is selected as the model output.
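Selecting the highest-scoring label sequence under this scoring rule is commonly done with Viterbi decoding. The sketch below is a generic pure-Python version, offered under the assumption that the per-character scores arrive as an n-by-k table (`emissions`) and the label-to-label transition scores as a k-by-k table (`transitions`); the source does not name the decoding algorithm.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_label_path, best_score).

    emissions[t][j]   : score of label j for character t
    transitions[i][j] : score of moving from label i to label j
    """
    n, k = len(emissions), len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, n):
        new_score, pointers = [], []
        for j in range(k):
            # best previous label for reaching label j at position t
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

A strongly negative transition score (for example from an outside label directly to an inside-entity label) is how the CRF layer rules out invalid label orders.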

2) Network security event relation extraction

First, the relation types to be extracted and the definitions of head and tail entities are set according to actual needs. A PCNN (Piecewise Convolutional Neural Network) relation extraction model is then built on distant supervision and used to extract relations from the whole sample set. The specific method includes:

First, a small set of samples is manually labeled. Then, in the large set of unlabeled samples, every entity pair that also appears in the labeled set is assigned its manually labeled relation, and the sentences containing entity pairs with the same relation are placed in one sentence bag; subsequent model training uses the sentence bag as its basic unit. Next, each sentence in the bag is split into three segments at the positions of its two entities, and each segment is fed into the CNN as a separate sentence for training. The input of the CNN is the concatenation of each character's embedding vector with the character's relative distances to the two entities, and the output is the confidence of the sentence for each relation category. Finally, the sentence with the highest confidence is selected as the feature of the sentence bag and used as the criterion for determining the relation of the entity pair.
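Two pieces of this procedure lend themselves to a short sketch: the three-segment (piecewise) max pooling over a sentence's convolution features, and the choice of the most confident sentence in a bag. The code below is an illustration only. The feature-map layout, the convention that each entity position closes its segment, and both function names are assumptions, and the convolution and classifier layers are omitted.

```python
def piecewise_max_pool(feature_map, head_idx, tail_idx):
    """Max-pool each of three segments split at the two entity positions.

    feature_map[p][f] is the value of convolution filter f at position p.
    Returns a flat list of 3 * n_filters pooled values.
    """
    first, second = sorted((head_idx, tail_idx))
    segments = [
        feature_map[: first + 1],
        feature_map[first + 1 : second + 1],
        feature_map[second + 1 :],
    ]
    n_filters = len(feature_map[0])
    pooled = []
    for segment in segments:
        for f in range(n_filters):
            # an empty segment (entity at the sentence edge) contributes 0.0
            pooled.append(max((row[f] for row in segment), default=0.0))
    return pooled


def select_bag_relation(bag_scores):
    """Pick the most confident sentence in a bag.

    bag_scores is a list with one dict per sentence, mapping relation -> confidence.
    Returns (sentence_index, relation, confidence).
    """
    best = None
    for idx, scores in enumerate(bag_scores):
        relation, confidence = max(scores.items(), key=lambda kv: kv[1])
        if best is None or confidence > best[2]:
            best = (idx, relation, confidence)
    return best
```

Using only the most confident sentence per bag limits the effect of wrongly labeled sentences, which are common when labels are propagated automatically from a small annotated set.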

Step 3: Constructing the security event attack chain

A security event attack chain consists of a series of attack behaviors and describes the order and logic of the attacker's actions in the attack event. This embodiment restricts attack behaviors to a fixed set of network attack techniques, including illegal tampering, brute-force cracking, remote control, data theft, DoS attacks, scanning and probing, website trojan implantation, and so on (these are the network security sub-events). Attack-start and attack-end markers are added as the head and tail of the attack chain so that the characteristics of the first attack and of the final objective can be extracted. The specific method includes:

First, experts build an attack behavior determination rule base from attack behavior processes, and an attack behavior determination model is built on top of it, so that network security event description triples can be abstracted and summarized into attack behaviors. The rule base is built by manually analyzing attack behavior processes and establishing the correspondence between entity relations and attack behaviors. The determination model uses the previously built rule base to search the entity instance graph for entities and relations that match a rule and labels them with the corresponding attack behavior.
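A rule base of this kind can be pictured as a lookup from entity-relation patterns to attack behaviors. The sketch below is a deliberately simplified illustration: the three rules, the entity type names, the relation names, and the flat triple list (instead of a graph query over the entity instance graph) are all hypothetical.

```python
# Hypothetical rules: (head entity type, relation, tail entity type) -> attack behavior
RULES = {
    ("attacker", "scans", "asset"): "scanning and probing",
    ("attacker", "guesses_password_of", "asset"): "brute-force cracking",
    ("malware", "exfiltrates", "data"): "data theft",
}


def label_behaviors(triples, entity_types):
    """Label each (head, relation, tail) triple that matches a rule.

    entity_types maps an entity name to its type; triples with no
    matching rule are skipped.
    """
    labeled = []
    for head, relation, tail in triples:
        key = (entity_types.get(head), relation, entity_types.get(tail))
        behavior = RULES.get(key)
        if behavior is not None:
            labeled.append({"behavior": behavior, "triple": (head, relation, tail)})
    return labeled
```

Real rules would likely span several triples at once (for example a tool, a vulnerability, and a target), which is where a graph pattern search becomes necessary.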

The attack behaviors are then ordered based on the timeline and logical order. Specifically, for each identified attack behavior, the time information in the paragraph where it appears is retrieved and recorded as the time of the attack behavior, and the behaviors are sorted chronologically. If no explicit time information exists, the logical words in the natural language text (first, second, then, finally, and so on) are manually annotated, a logical order is set manually, and the attack behaviors are sorted on that basis.
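The ordering rule above, together with the attack-start and attack-end markers of this step, can be sketched as follows. The marker strings, the `LOGIC_RANK` table, and the behavior record layout are assumptions for illustration, as is the choice to fall back to logical words whenever any behavior lacks a timestamp.

```python
from datetime import date

START, END = "ATTACK_START", "ATTACK_END"

# Hypothetical ranking of annotated logical words.
LOGIC_RANK = {"first": 0, "second": 1, "then": 2, "finally": 3}


def build_attack_chain(behaviors):
    """Order behaviors by time when every behavior has one, else by logical word.

    Each behavior is a dict with 'behavior' and optionally 'time' (a date)
    and 'logic_word'. Unknown logical words sort last; sorting is stable.
    """
    if all(b.get("time") is not None for b in behaviors):
        ordered = sorted(behaviors, key=lambda b: b["time"])
    else:
        ordered = sorted(
            behaviors,
            key=lambda b: LOGIC_RANK.get(b.get("logic_word"), len(LOGIC_RANK)),
        )
    return [START] + [b["behavior"] for b in ordered] + [END]
```

For example, two behaviors dated 2023-01-03 ("data theft") and 2023-01-01 ("scanning") yield the chain `ATTACK_START`, scanning, data theft, `ATTACK_END`.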

Step 4: Feature model training and source-tracing prediction

The attack chains obtained in step 3 for known attack sources (attack organizations) are input into a Bayesian network for training. Specifically, an m×m matrix M is maintained, where m is the number of known attack behavior types and each element M_ij is the probability that, after attack behavior i, the next attack is attack behavior j (the formula for computing M_ij is not reproduced in this text).
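Since the formula for M_ij is missing here, the sketch below uses one natural reading as an explicit assumption: the maximum-likelihood estimate M_ij = N_ij / Σ_j' N_ij', where N_ij counts how often behavior j directly follows behavior i in the training chains of one attack organization. The function name and the nested-dict representation are also assumptions; the patent's actual estimator may differ.

```python
from collections import defaultdict


def train_transition_matrix(chains, behaviors):
    """Estimate M[i][j] = P(next behavior is j | current behavior is i).

    chains    : attack chains (lists of behavior names) of ONE organization
    behaviors : all known behavior names, including start/end markers;
                every name appearing in chains must be listed here
    Rows with no observed successor (such as the end marker) stay all zero.
    """
    counts = {a: defaultdict(int) for a in behaviors}
    for chain in chains:
        for current, following in zip(chain, chain[1:]):
            counts[current][following] += 1
    matrix = {}
    for a in behaviors:
        total = sum(counts[a].values())
        matrix[a] = {b: (counts[a][b] / total if total else 0.0) for b in behaviors}
    return matrix
```

Training one such matrix per known organization gives the per-organization attack feature profile that the summary of the method describes.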

For subsequent text-based threat intelligence on a network security event of unknown attack source, steps 1 to 3 are first applied to obtain the attack chain of the event, X = {x_1, x_2, ..., x_t}. The confidence that the event belongs to each attack organization is then calculated from the trained Bayesian network (the formula is not reproduced in this text).

In that calculation, t denotes the length of the attack chain and k denotes the iteration variable.

Finally, the attack organization with the highest confidence is selected as the source-tracing prediction result.
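Because the confidence formula is missing, the sketch below assumes the usual chain-probability form: the confidence for an organization is the product of its transition probabilities M[x_k][x_(k+1)] for k = 1 to t-1. Working in log space and flooring unseen transitions at a small constant are additions of this sketch, not something the source states, and the function names are hypothetical.

```python
import math


def chain_log_confidence(chain, matrix, floor=1e-6):
    """Sum of log transition probabilities along the chain.

    matrix[a][b] is one organization's probability of behavior b following a.
    Transitions never seen for that organization are floored rather than
    zeroed, so a single unseen step does not give negative infinity.
    """
    return sum(
        math.log(max(matrix.get(a, {}).get(b, 0.0), floor))
        for a, b in zip(chain, chain[1:])
    )


def attribute(chain, models):
    """Return (best_organization, scores) for an attack chain of unknown source.

    models maps an organization name to its transition matrix.
    """
    scores = {org: chain_log_confidence(chain, m) for org, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Comparing log scores gives the same ranking as comparing the raw products while avoiding numerical underflow on long chains.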

Another embodiment of the invention provides a system for attack chain construction and attack source tracing based on text-based threat intelligence that uses the above method. It comprises the data collection and preprocessing module, the entity and relation extraction module, the attack chain generation module, and the feature model training and source-tracing prediction module described above.

For the specific implementation of each module, refer to the description of the method of the invention above.

Another embodiment of the invention provides a computer device (a computer, server, smartphone, or the like) comprising a memory and a processor. The memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for performing the steps of the method of the invention.

Another embodiment of the invention provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) that stores a computer program which, when executed by a computer, implements the steps of the method of the invention.

The specific embodiments disclosed above are intended to help readers understand and implement the invention. Those of ordinary skill in the art will appreciate that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to the content disclosed in the embodiments of this specification; its scope of protection is defined by the claims.

Claims

1. A method for attack chain construction and attack source tracing based on text-based threat intelligence, characterized in that it comprises the following steps:

collecting text-based threat intelligence on attack events of known attack organizations, and preprocessing the collected text-based threat intelligence so that it meets the requirements of subsequent analysis;

extracting network security entities and relations from the collected text-based threat intelligence to form network security event description triples;

constructing a security event attack chain from the extracted network security event description triples, based on the timeline and logical order;

training an attack organization feature model on the security event attack chains of known attack organizations, and performing source-tracing prediction for attack events of unknown origin based on the trained attack organization feature model.

2. The method according to claim 1, wherein the text-based threat intelligence includes the sequence of actions, the impact on the attacked system, and indicators of compromise, of which the description of the order of the attacker's actions and of the targets in the attack event is the most valuable for subsequent analysis and modeling; and the data preprocessing removes irrelevant data, duplicate data, and abnormal symbols from the text-based threat intelligence, completes partially missing information from existing network security databases, and resolves inconsistencies, so as to improve the quality of the collected data and ensure its completeness and accuracy.

3. The method according to claim 1, wherein extracting network security entities and relations from the collected text-based threat intelligence to form network security event description triples comprises: from the collected text-based threat intelligence, extracting network security entities using natural language processing, and extracting the relations between entities using distant supervision, to form network security event description triples in entity-relation-entity format.

4. The method according to claim 3, wherein extracting network security entities using natural language processing uses a BERT-BiLSTM-CRF network, and the steps include:

splitting the input text-based threat intelligence into a character sequence S = {s_1, s_2, ..., s_n} and using the BERT model to embed the characters as vectors, wherein special proper nouns and English words are each embedded as a single character;

using the character vector sequence X = {x_1, x_2, ..., x_n} output by BERT as the input of the BiLSTM neural network;

manually annotating the entity information in the natural language text, using a conditional random field to constrain the order in which annotation labels may appear, and taking the output of the BiLSTM neural network as input, wherein for each candidate label sequence of the character sequence the score equals the sum of the per-character prediction scores and the transition scores between consecutive labels, and the candidate sequence with the highest score is selected as the model output.

5. The method according to claim 3, wherein extracting the relations between entities using distant supervision comprises:

first, manually labeling a small set of samples; then, in the large set of unlabeled samples, assigning every entity pair that also appears in the labeled set its manually labeled relation, and placing the sentences containing entity pairs with the same relation in one sentence bag, wherein subsequent model training uses the sentence bag as its basic unit;

then, splitting each sentence in the bag into three segments at the positions of its two entities and feeding each segment into the CNN as a separate sentence for training, wherein the input of the CNN is the concatenation of each character's embedding vector with the character's relative distances to the two entities, the output is the confidence of the sentence for each relation category, and the sentence with the highest confidence is selected as the feature of the sentence bag and used as the criterion for determining the relation of the entity pair.

6. The method according to claim 1, wherein constructing the security event attack chain comprises: first, having experts build an attack behavior determination rule base from attack behavior processes and building an attack behavior determination model on top of it, so that network security event description triples are abstracted and summarized into attack behaviors; and then ordering the attack behaviors based on the timeline and logical order.

7. The method according to claim 1, wherein training the attack organization feature model on the security event attack chains of known attack organizations comprises: organizing the network security sub-event categories into a Bayesian conditional probability network, inputting the security event attack chains of a known attack organization, and training the conditional probabilities of the nodes to serve as the attack feature profile of that attack organization, that is, the attack organization feature model; and performing source-tracing prediction for attack events of unknown origin based on the trained attack organization feature model comprises: converting the text-based threat intelligence on an attack event of unknown origin into a security event attack chain, matching it against the attack feature profiles of known attack organizations, and selecting the attack organization with the highest confidence as the source-tracing prediction result.

8. A system for attack chain construction and attack source tracing based on text-based threat intelligence, using the method of any one of claims 1 to 7, characterized in that it comprises:

a data collection and preprocessing module, used to collect text-based threat intelligence on attack events of known attack organizations and to preprocess the collected text-based threat intelligence so that it meets the requirements of subsequent model analysis;

an entity and relation extraction module, used to extract network security entities and relations to form network security event description triples;

an attack chain generation module, used to summarize network security sub-events and, based on the timeline and logical order, construct a security event attack chain that describes the course of the security event;

a feature model training and source-tracing prediction module, used to train the attack organization feature model on the generated security event attack chains and to perform source-tracing prediction for network security events of unknown attack source.

9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for performing the method of any one of claims 1 to 7.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the method according to any one of claims 1-7 is realized.