CN110321394A

CN110321394A - Network security data organization method and computer storage medium based on knowledge graph

Info

Publication number: CN110321394A
Application number: CN201910614670.5A
Authority: CN
Inventors: 张阳; 王佳贺; 魏松杰; 袁德砦
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2019-07-09
Filing date: 2019-07-09
Publication date: 2019-10-11

Abstract

The invention discloses a network security data organization method based on a knowledge graph, and a computer storage medium. The method comprises the following steps: 1) collecting massive network security data and preprocessing it by cleaning and filtering; 2) constructing a network security knowledge base; 3) formulating a feature template according to the context information of the data, and generating character vectors from the data in combination with the feature template; 4) inputting the character vectors into a BiLSTM model, and completing network security entity recognition and network security entity relation extraction through sharing of the underlying parameters; 5) combining the results of network security entity recognition and network security entity relation extraction with the network security knowledge base to construct a network security data knowledge graph. The invention addresses the weak data correlation, low organization efficiency and single form of expression that the prior art faces in organizing and managing network security data, and improves data correlation, degree of fusion and organization efficiency.

Description

Network security data organization method and computer storage medium based on knowledge graph

Technical field

The invention relates to a network security data organization method and a computer storage medium, and in particular to a network security data organization method and a computer storage medium based on a knowledge graph.

Background art

With the continuous innovation and improvement of computer network and communication technology, cyberspace faces unprecedented opportunities and challenges. Given the severity of the security situation, network security data that reflects network activity in real time has become a key link for gaining insight into network conditions, analyzing network anomalies and evaluating the network environment. Network security data specifically refers to time-sensitive, interrelated data that is generated as the ternary world of "people, machines and things" interacts and merges in cyberspace, that can be obtained on the Internet, and that reflects the state of the network. As the carrier of network security information, the organization and association of network security data is of great significance to the development and use of information resources.

However, most existing data organization methods focus on data storage and give insufficient consideration to the association rules between data. As a result, the degree of fusion between data declines, the value of the data cannot be fully realized, and problems such as "information overload, knowledge scarcity" eventually emerge. In addition, most of these methods are confined to specific scenarios in one professional field, and little research addresses data organization in the multi-source, heterogeneous, massive and complex network security environment. With the rise of emerging technologies such as artificial intelligence and machine learning, cyberspace security problems are increasingly urgent, and a comprehensive and efficient data organization method is urgently needed to provide security guarantees for data.

Summary of the invention

Purpose of the invention: The technical problem to be solved by the present invention is to provide a network security data organization method and a computer storage medium based on a knowledge graph, which solve the problems of weak data correlation, low organization efficiency and single form of expression faced by the organization and management of network security data in the prior art, effectively improve the correlation and degree of fusion of data, and improve the efficiency of data organization.

Technical solution: The knowledge-graph-based network security data organization method of the present invention comprises the following steps:

(1) collecting massive network security data and preprocessing it by cleaning and filtering;

(2) constructing a network security knowledge base;

(3) formulating a feature template according to the context information of the data, and generating character vectors from the data in combination with the feature template;

(4) inputting the character vectors into a BiLSTM model, and completing network security entity recognition and network security entity relation extraction through sharing of the underlying parameters;

(5) combining the results of network security entity recognition and network security entity relation extraction in step (4) with the network security knowledge base to construct a network security data knowledge graph.

Further, the network security data in step (1) is collected from five aspects: network asset information, network threat information, network status information, network vulnerability information and security event information.

Further, the preprocessing in step (1) is specifically:

(1) selecting raw data in a standard format as the basis for formulating filtering rules, defining corresponding regular expressions from it, screening out data with non-standard data values, data types or data formats, and correcting it; data that cannot be corrected is discarded;

(2) removing duplicate data with the Bloom-Filter algorithm;

(3) filling in missing data values by mean imputation, using the mode for categorical data and the mean for quantitative data.
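
By way of non-limiting illustration, sub-steps (1) and (3) above can be sketched in Python. The field format (a CVE-style identifier), the repair rule and the function names `normalize_cve_id` and `impute` are assumptions made for this example only; the patent does not specify concrete regular expressions or record layouts.

```python
import re
from collections import Counter
from statistics import mean

# Hypothetical filtering rule derived from well-formed raw data:
# an identifier must look like CVE-YYYY-NNNN (four or more trailing digits).
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")


def normalize_cve_id(raw):
    """Try to correct a non-standard identifier; return None if it cannot be repaired."""
    candidate = raw.strip().upper().replace("_", "-")
    return candidate if CVE_PATTERN.match(candidate) else None


def impute(values, categorical):
    """Fill None entries with the mode (categorical data) or the mean (quantitative data)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to derive a fill value from
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


raw_ids = [" cve_2019-0708 ", "not-a-cve"]
cleaned = [cid for cid in map(normalize_cve_id, raw_ids) if cid is not None]
severities = impute(["high", None, "high", "low"], categorical=True)
scores = impute([1.0, None, 3.0], categorical=False)
```

Records that fail the pattern after the attempted correction are dropped, which mirrors the "correct, otherwise discard" rule in sub-step (1).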

Further, the network security knowledge base in step (2) comprises five ontologies: physical security, host security, network structure security, application security and data security.

Further, the feature template is the set of characters formed by the character currently being recognized and a set number of characters before and after it.

Further, the method of generating character vectors from the data in step (3) is specifically: reading the data according to the feature template and generating the character vectors with a Bert model.

Further, the method of completing network security entity recognition and network security entity relation extraction in step (4) is: inputting the character vectors into the BiLSTM model for network security entity recognition, the BiLSTM model comprising an input layer, a feature template, a character embedding layer, a BiLSTM layer and a CRF layer; and then replacing the CRF layer of the BiLSTM model with an Attention layer and a Softmax layer for output, to complete network security entity relation extraction.

Further, the network security data knowledge graph in step (5) comprises two parts: a general knowledge graph, comprising previously known network vulnerability information, attack threat information and security bulletin information; and an extended knowledge graph, mainly comprising network node information, network topology information, network connectivity information and network operation and maintenance information.

Further, step (5) also comprises storing the network security data knowledge graph in an OrientDB graph database.

The computer storage medium of the present invention has a computer program stored thereon which, when executed by a computer processor, implements the above knowledge-graph-based network security data organization method.

Beneficial effects: Addressing the massive, polymorphic and heterogeneous nature of network security data, the present invention builds a knowledge graph to improve the correlation and degree of fusion between data, raise the efficiency of data organization and remove much of the burden of manual annotation. It achieves a good data management effect and brings out the full value of the data, which in turn helps turn a data advantage into a decision-making advantage by way of a knowledge advantage. The invention has the advantages of strong data correlation and a high degree of fusion, improves the efficiency of data organization and maximizes the value of the data.

Description of the drawings

Fig. 1 is an overall architecture diagram of an embodiment of the present invention;

Fig. 2 is a structure diagram of the network security knowledge base of this embodiment;

Fig. 3 is a diagram of the network security entity recognition model of this embodiment;

Fig. 4 is a diagram of the network security relation extraction model of this embodiment;

Fig. 5 is a structure diagram of the network security data knowledge graph of this embodiment.

Detailed description

The method steps of the embodiment of the present invention are shown in Fig. 1. The specific steps are:

Step 1. Collect massive network security data from different levels as data sources, then preprocess the incoming data by cleaning and filtering.

The network security data is collected from five aspects: network asset information, network threat information, network status information, network vulnerability information and security event information. Vulnerability information mainly comes from vulnerability databases such as CVE, NVD and the China National Vulnerability Database. Attack threat data and security bulletin information are collected from information security websites and security emergency response centers: the former mainly include the Kanxue forum (看雪论坛), the Toast forum (吐司论坛) and Freebuf; the latter are mainly the National Internet Emergency Center and the Tencent and Ctrip security emergency response centers. Other data is obtained by using open-source tools for asset scanning and detection, vulnerability scanning and topology discovery, which yield the raw data about the network structure.

For the acquired data sources, the specific cleaning and filtering steps are:

(1) Preliminary filtering: Raw data in a standard format is selected as the basis for formulating filtering rules, and corresponding regular expressions are defined from it. Non-standard data values, data types and data formats are screened out and corrected. Data that cannot be corrected is discarded.

(2) Redundancy removal: Duplicate data is removed with the Bloom-Filter algorithm, which is built on a bitmap. When an element is added to the set, K hash functions map it to K positions in a bit array, and those positions are set to 1. To check for redundancy, it suffices to test whether those positions are all 1. If any of them is 0, the element is definitely not in the set; if all of them are 1, the element is very likely in the set.

(3) Filling gaps: Missing data values are handled by mean imputation, using the mode for categorical data and the mean for quantitative data.
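
By way of non-limiting illustration, the Bloom-Filter de-duplication of sub-step (2) can be sketched as follows. The bit-array size, the number of hash functions and the use of salted SHA-256 digests as the K hash functions are assumptions chosen for this example; the patent does not fix any of them.

```python
import hashlib


class BloomFilter:
    """Bit-array membership filter that uses k hash functions per element."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set all k positions for this element to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1 bits mean "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def deduplicate(records):
    """Keep the first occurrence of each record and drop probable repeats."""
    seen = BloomFilter()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


unique_records = deduplicate(["a", "b", "a", "c", "b"])
```

Because a Bloom filter can report false positives, this approach may occasionally drop a record that is in fact new; the rate is controlled by the bit-array size and the number of hash functions.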

Step 2. Construct a network security knowledge base from domain knowledge and experience, take the context information of the data into account, formulate a feature template and generate character vectors. Domain knowledge and experience here means the knowledge that experts in the field have distilled from practical experience in network security; in practice it can be obtained by summarizing existing databases.

The network security knowledge base model is shown in Fig. 2. The knowledge base contains five ontologies: physical security, host security, network structure security, application security and data security. Each part works independently, and together they form a larger whole through the network architecture. The ontologies are described below:

(1) Physical security: the security of the environment in which the whole network is located and of the equipment belonging to the network;

(2) Host security: the operating system security and file security of servers and terminal devices;

(3) Network structure security: overall network topology security, access control security, intrusion prevention security and equipment protection security;

(4) Application security: network application software security and system application software security;

(5) Data security: data integrity security and data confidentiality security.
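
By way of non-limiting illustration, the five ontologies listed above can be held in a simple mapping from each ontology to its sub-aspects. The dictionary layout, the English labels and the helper `ontology_of` are assumptions for this sketch; the patent does not prescribe a storage format for the knowledge base.

```python
# Five ontologies and the sub-aspects named in the list above.
SECURITY_ONTOLOGIES = {
    "physical security": [
        "environment security",
        "equipment security",
    ],
    "host security": [
        "operating system security",
        "file security",
    ],
    "network structure security": [
        "network topology security",
        "access control security",
        "intrusion prevention security",
        "equipment protection security",
    ],
    "application security": [
        "network application software security",
        "system application software security",
    ],
    "data security": [
        "data integrity security",
        "data confidentiality security",
    ],
}


def ontology_of(aspect):
    """Return the ontology that contains the given sub-aspect, or None."""
    for ontology, aspects in SECURITY_ONTOLOGIES.items():
        if aspect in aspects:
            return ontology
    return None
```

For example, `ontology_of("file security")` returns `"host security"`.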

The network security knowledge base constructed from expert experience serves as the basis: a preliminary screening through the pre-constructed ontology relations extracts local context features, which form the feature template. The feature template is designed as follows:

The feature template is the set of characters formed by the character currently being recognized and a set number of characters before and after it; that is, it covers the current character and several positions on either side. The size of the template can be set according to the actual situation. The positions adjacent to the current character are defined as the "monitoring window". The size of the feature template, which is the size of the monitoring window, is positively correlated with the amount of context information it contains. In this embodiment the monitoring window is set to 7.

The feature template of this embodiment is:

x[0,-3], x[0,-2], x[0,-1], x[0,0], x[0,1], x[0,2], x[0,3]

The format of the feature template is x[row,col], where row and col locate the source of the feature: row is the row relative to the current character position, so the current row has a row value of 0, and col corresponds to the column of the selected feature.
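
By way of non-limiting illustration, the seven-position monitoring window can be sketched as follows. The sketch reads the second index of each template entry as an offset of -3 to +3 from the current character, which is one interpretation of the template listed above; the padding symbol `PAD` and the function names are assumptions for this example.

```python
PAD = "<PAD>"  # hypothetical symbol for positions that fall outside the sentence


def window_features(chars, index, half_width=3):
    """Return the characters at offsets -half_width..+half_width around index."""
    features = []
    for offset in range(-half_width, half_width + 1):
        pos = index + offset
        features.append(chars[pos] if 0 <= pos < len(chars) else PAD)
    return features


def sentence_features(sentence, half_width=3):
    """Apply the window to every character of a sentence."""
    chars = list(sentence)
    return [window_features(chars, i, half_width) for i in range(len(chars))]


features = sentence_features("SQL注入漏洞")
```

With `half_width=3` each character yields 2 × 3 + 1 = 7 features, matching the window size of this embodiment.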

Finally, the Bert model is used to pre-train the character vector file. The result represents each input word or character as a numerical vector of fixed length that carries latent semantic relations.

Step 3. Taking the character vectors combined with local context features as input, the BiLSTM model completes network security entity recognition and network security entity relation extraction through sharing of the underlying parameters.

The character vectors combined with local context features are used as the input of the BiLSTM neural network for entity recognition, as shown in Fig. 3. Network security entity recognition is implemented by a BiLSTM-CRF model comprising an input layer, a feature template, a character embedding layer, a BiLSTM layer and a CRF layer. The output of the network security entity recognition model is defined as follows:

For a given input sequence $X = (X_1, X_2, \dots, X_n)$, let $A$ be the $n \times k$ score matrix output by the BiLSTM network, where $k$ is the number of tag types and $A_{i,j}$ is the score of the $j$-th tag for the $i$-th character. Then for a predicted tag sequence $y = (y_1, y_2, \dots, y_n)$, the score is defined as:

$$s(X, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} A_{i, y_i}$$

where $T$ is the transition score matrix of order $k+2$, $T_{i,j}$ is the score of a transition from tag $i$ to tag $j$, and $y_0$ and $y_{n+1}$ are the tags added at the start and end of the sentence. The probability of producing the tag sequence $y$ is:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

where $Y_X$ denotes all possible tag sequences for $X$. Finally, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
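
By way of non-limiting illustration, the sequence score and log-probability can be checked numerically on a toy example. The sketch assumes the standard linear-chain form, in which the score adds the transition scores in T along the path from the start tag through y to the end tag to the emission scores in A. The tag indices, the toy values and the brute-force enumeration of all tag sequences are assumptions for this example.

```python
import itertools
import math


def sequence_score(A, T, y, start, end):
    """Transition scores along start -> y -> end plus emission scores A[i][y_i]."""
    path = [start] + list(y) + [end]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(A[i][tag] for i, tag in enumerate(y))
    return transitions + emissions


def log_probability(A, T, y, start, end):
    """Score of y minus the log of the summed exponentiated scores of all sequences."""
    n, k = len(A), len(A[0])
    scores = [sequence_score(A, T, cand, start, end)
              for cand in itertools.product(range(k), repeat=n)]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return sequence_score(A, T, y, start, end) - log_z


# Toy setup: k = 2 real tags (0, 1), start tag 2, end tag 3, so T is 4 x 4.
A = [[1.0, 0.0], [0.0, 2.0]]
T = [[0.0] * 4 for _ in range(4)]
START, END = 2, 3
best = max(itertools.product(range(2), repeat=2),
           key=lambda cand: log_probability(A, T, cand, START, END))
```

Enumerating all k^n tag sequences is only practical for toy inputs; practical CRF layers compute the normalizing sum with dynamic programming (the forward algorithm).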

The specific training process of the network security entity recognition model is as follows:

(1) enter the epoch loop, then the batch loop;

(2) initialize the parameters;

(3) extract character features according to the feature template and the Bert model;

(4) extract features with the forward pass of the BiLSTM-CRF algorithm;

(5) compute the probability of the globally correct tag sequence with the CRF algorithm;

(6) extract features with the backward pass of the BiLSTM-CRF algorithm;

(7) update the parameters;

(8) end the batch loop and the epoch loop.

On the basis of the entity recognition model, the lower layers are retained and their parameters shared, and the CRF layer is replaced with an Attention layer and a Softmax layer for output, giving the network security entity relation extraction model shown in Fig. 4. The Attention layer generates a weight vector; multiplying by this weight vector merges the word-level features from every time step into a sentence-level feature. The weights of the Attention layer are obtained as follows:

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(\omega^{T} M)$$

$$r = H \alpha^{T}$$

where $H \in \mathbb{R}^{d^{\omega} \times n}$ is the matrix of output vectors of the BiLSTM layer, $d^{\omega}$ is the dimension of the word vectors, and $\omega^{T}$ is the transpose of a parameter vector learned in training. The sentence representation used for classification is then:

$$h^{*} = \tanh(r)$$

Finally, a Softmax classifier predicts the label, taking the $h^{*}$ obtained from the previous layer as input:

$$f(y \mid S) = \mathrm{softmax}(W_s h^{*} + b_s)$$

Then $\hat{y} = \arg\max_{y} f(y \mid S)$ is the entity relation category being solved for, where $W_s$ is the learned weight and $b_s$ is the bias parameter.
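
By way of non-limiting illustration, the attention pooling and Softmax classification can be traced in plain Python. The sketch assumes H is stored as a list of d rows with one column per input position; the toy values of H, omega, W and b and the function names are assumptions for this example, since in the model these parameters are learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, omega):
    """H: d x n matrix as a list of d rows; omega: length-d vector."""
    d, n = len(H), len(H[0])
    M = [[math.tanh(v) for v in row] for row in H]                     # M = tanh(H)
    scores = [sum(omega[i] * M[i][t] for i in range(d)) for t in range(n)]
    alpha = softmax(scores)                                            # attention weights
    r = [sum(H[i][t] * alpha[t] for t in range(n)) for i in range(d)]  # weighted sum of columns
    h_star = [math.tanh(v) for v in r]                                 # sentence representation
    return alpha, h_star


def classify(h_star, W, b):
    """Return class probabilities softmax(W h* + b) and the index of the largest one."""
    logits = [sum(w * h for w, h in zip(row, h_star)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    return probs, max(range(len(probs)), key=probs.__getitem__)


H = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]   # d = 2 features, n = 3 positions
omega = [1.0, -1.0]
alpha, h_star = attention_pool(H, omega)
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # three hypothetical relation classes
b = [0.0, 0.0, 0.0]
probs, label = classify(h_star, W, b)
```

The weights in `alpha` sum to 1, so `r` is a convex combination of the columns of H before the final tanh is applied.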

Step 4. Combine the results of entity recognition and relation extraction with expert knowledge, construct the network security data knowledge graph and store it, completing the organization of the network security data.

The results of entity recognition and relation extraction are merged with expert knowledge, duplicate knowledge is filtered out, and a complete knowledge graph is built. The designed network security knowledge graph has two parts, as shown in Fig. 5. The first is the general knowledge graph, that is, previously known network vulnerability information, attack threat information and security bulletin information. The second is the extended knowledge graph covering the network structure, which mainly includes network node information, network topology information, network connectivity information and network operation and maintenance information. Finally, the knowledge graph is stored in an OrientDB graph database, which organizes and manages the network security data.
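
By way of non-limiting illustration, merging extracted results with expert knowledge and filtering duplicates can be sketched with (head, relation, tail) triples. The triple representation, the case-insensitive duplicate test, the sample triples and the in-memory adjacency map are assumptions for this example; the sketch does not use OrientDB, which the embodiment names for storage.

```python
def merge_triples(extracted, expert):
    """Union of (head, relation, tail) triples with duplicates removed, expert triples first."""
    merged, seen = [], set()
    for triple in list(expert) + list(extracted):
        # Normalize case and whitespace so trivially different copies count as duplicates.
        key = tuple(part.strip().lower() for part in triple)
        if key not in seen:
            seen.add(key)
            merged.append(triple)
    return merged


def build_graph(triples):
    """Adjacency map: each entity maps to a list of (relation, tail) edges."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])  # make sure tail entities appear as nodes
    return graph


# Illustrative sample data only.
expert = [("CVE-2019-0708", "affects", "Windows 7")]
extracted = [("cve-2019-0708", "affects", "windows 7"),
             ("host-01", "runs", "Windows 7")]
merged = merge_triples(extracted, expert)
graph = build_graph(merged)
```

Putting the expert triples first means that, when a duplicate is found, the expert-curated spelling is the one retained.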

If an embodiment of the present invention is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solution of the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc. Accordingly, embodiments of the present invention are not limited to any particular combination of hardware and software.

Correspondingly, embodiments of the present invention also provide a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the aforementioned knowledge-graph-based network security data organization method can be implemented. For example, the computer storage medium is a computer-readable storage medium.

Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims

1. A network security data organization method based on a knowledge graph, characterized by comprising the following steps:

(1) collecting massive network security data, and preprocessing the data by cleaning and filtering;

(2) constructing a network security knowledge base;

(3) formulating a feature template according to the context information of the data, and generating character vectors from the data in combination with the feature template;

(4) inputting the character vectors into a BiLSTM model, and completing network security entity recognition and network security entity relation extraction through sharing of the underlying parameters;

(5) combining the results of the network security entity recognition and network security entity relation extraction in step (4) with the network security knowledge base to construct a network security data knowledge graph.

2. The knowledge-graph-based network security data organization method according to claim 1, wherein: the network security data in the step (1) is acquired from five aspects of network asset information, network threat information, network state information, network vulnerability information and security event information.

3. The knowledge-graph-based network security data organization method according to claim 1, wherein the preprocessing in step (1) is specifically:

(1) selecting raw data in a standard format as the basis for formulating filtering rules, defining corresponding regular expressions from it, screening out data with non-standard data values, data types or data formats, and correcting the data;

(2) removing duplicate data with the Bloom-Filter algorithm;

(3) filling in missing data values by mean imputation, using the mode for categorical data and the mean for quantitative data.

4. The knowledge-graph-based network security data organization method according to claim 1, wherein: the network security knowledge base in step (2) comprises five ontologies: physical security, host security, network structure security, application security and data security.

5. The knowledge-graph-based network security data organization method according to claim 1, wherein: the feature template is the set of characters formed by the character currently being recognized and a set number of characters before and after it.

6. The knowledge-graph-based network security data organization method according to claim 1, wherein the method of generating character vectors from the data in step (3) is specifically: reading the data according to the feature template, and generating the character vectors with a Bert model.

7. The knowledge-graph-based network security data organization method according to claim 1, wherein the method of completing network security entity recognition and network security entity relation extraction in step (4) comprises: inputting the character vectors into the BiLSTM model for network security entity recognition, the BiLSTM model comprising an input layer, a feature template, a character embedding layer, a BiLSTM layer and a CRF layer, and then replacing the CRF layer of the BiLSTM model with an Attention layer and a Softmax layer for output, thereby completing network security entity relation extraction.

8. The knowledge-graph-based network security data organization method according to claim 1, wherein: the network security data knowledge graph in step (5) comprises two parts, one being a general knowledge graph, which comprises previously known network vulnerability information, attack threat information and security bulletin information; and the other being an extended knowledge graph, which mainly comprises network node information, network topology information, network connectivity information and network operation and maintenance information.

9. The knowledge-graph-based network security data organization method according to claim 1, wherein the step (5) further comprises storing the network security data knowledge graph using an OrientDB graph database.

10. A computer storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 9.