
CN111930893A - Knowledge graph construction method for scenic spot abnormal events - Google Patents

Knowledge graph construction method for scenic spot abnormal events

Info

Publication number
CN111930893A
Authority
CN
China
Prior art keywords
entity
document
sub
graph
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010806519.4A
Other languages
Chinese (zh)
Inventor
钟艳如
贺昭荣
罗笑南
汪华登
李芳�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202010806519.4A
Publication of CN111930893A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph construction method for abnormal events in scenic spots, comprising the following main steps: collecting relevant document data from the Internet and checking its quality to decide on further processing; extracting the entities and relations of abnormal events from the collected data and connecting the entities through relations using a specific matching method; clustering the extracted entities into multiple entity clusters to build the schema layer of the abnormal-event knowledge graph; dividing the entities in each entity cluster and building a separate sub-graph for each; and finally merging the existing sub-graphs into a complete knowledge graph of abnormal events. By combining the knowledge graph with scenic-spot abnormal events, the invention organizes information about such events quickly and in detail, which supports the selection of countermeasures. It also handles the complexity of entity relations in the knowledge graph and improves both the efficiency of data acquisition and the quality of construction.

Description

A knowledge graph construction method for abnormal events in scenic spots

Technical field

The invention relates to the technical field of natural language processing, and in particular to a knowledge graph construction method for abnormal events in scenic spots.

Background

As the global economy develops and the world population grows, the number of visitors to scenic spots keeps rising. Scenic spots have become densely crowded places, prone to abnormal events of many kinds that disrupt order and can cause serious loss of public property. Current research on abnormal events in scenic spots relies mainly on surveillance for prevention. However, abnormal events are complex and variable: real-time monitoring only allows a response after an event has occurred and cannot prevent it at the source, and the detection system offers no decision support during the subsequent handling of the event. How to respond quickly to abnormal events in scenic spots is therefore a pressing problem in the field.

In recent years the knowledge graph has become an active research field. A knowledge graph combines the theories and methods of applied mathematics, graphics, information visualization, information science and other disciplines with bibliometric methods such as citation analysis and co-occurrence analysis, and uses visual graphs to show the core structure, development history, frontier areas and overall knowledge architecture of a discipline, achieving multidisciplinary integration. It is a family of graphs that display how knowledge develops and how it is structured: visualization techniques describe knowledge resources and their carriers, and mine, analyze, construct, draw and display knowledge and the connections among its elements.

With its strong capacity for semantic representation, a knowledge graph is well suited to representing the complex relations involved in abnormal events. However, there is very little research on knowledge graph construction for scenic-spot abnormal events. Existing construction techniques address only general-purpose knowledge graphs; they port poorly and apply poorly to specific domains. Knowledge extraction and construction in the abnormal-event domain still rely on traditional methods, whose acquisition and parsing efficiency is low. They do not support fast querying and association of related information, and so give little help in drawing up response plans for abnormal events in scenic spots. Designing an efficient method for constructing a knowledge graph of scenic-spot abnormal events is therefore an important problem.

Summary of the invention

The purpose of the invention is to overcome the shortcomings of the prior art by providing a lightweight and efficient method for constructing a knowledge graph of specific abnormal events in scenic spots, so that abnormal-event information can be queried quickly, which helps to prevent such events and to draw up effective response plans.

The technical solution that achieves this purpose is as follows.

A knowledge graph construction method for abnormal events in scenic spots, comprising the following steps:

S1: Use a prepared web crawler to collect materials and documents related to abnormal events in scenic spots from the Internet, then use the TF-IDF text similarity calculation method to judge the relevance of each document, and retain the most suitable documents;

S2: Build a semantic word segmentation dataset from the materials related to scenic-spot abnormal events; in that dataset, tag abnormal-event entities as nouns and the relations between entities as verbs; use the tagged nouns and verbs to extract the entities and entity relations of abnormal events from the crawled documents; and associate entities with relations through the original documents;

S3: Cluster the extracted entities into multiple distinct entity clusters, each containing several entity types, and build the schema layer of the knowledge graph for the abnormal event by following the explanatory structure of professional documents, thereby constructing the ontology;

S4: Using the entity relations extracted in step S2, find the document data corresponding to each entity in the entity clusters; use cosine similarity to find, in the document corresponding to an entity, the descriptive words close to that entity; and match the entity to those words by relation to obtain the corresponding sub-graph;

S5: Merge all sub-graphs: connect the sub-graphs through relations and merge their knowledge to obtain the knowledge graph for abnormal events in scenic spots.

Further, in step S1 the relevance of a document is judged with the TF-IDF text similarity calculation method as follows:

S1-1: Establish a data quality judgment model, as shown in formula (1):

[Formula (1): image not reproduced in this text]

In the formula, S_{i,j'} is the similarity between the j'-th document of layer i and all documents crawled so far; S_{i-1,j} is the similarity between the j-th document of layer i-1 and all documents crawled so far; likewise, S_{i,j',k} is the similarity between the j'-th document of layer i and the k-th document of the same layer. The link to document j' appears in document j. W_i and T_i are the weights set for layer i;

S1-2: Set a threshold α. When the similarity S_{i,j} between the document being judged and the other documents is below α, the document is judged unqualified. Count the unqualified documents in the layer as X_i, and count the documents in the layer whose content duplicates that of documents already judged as Y_i;

S1-3: Count all documents in the layer as N_i and compute the layer's failure rate (X_i + Y_i) / N_i. Set a second threshold β to decide whether crawling should continue: if the failure rate exceeds β, stop crawling the next layer of documents.
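The layer-level check in S1-2 and S1-3 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the per-document scores S_{i,j} have already been computed by formula (1), and the function names, argument names and threshold values are hypothetical.

```python
def layer_failure_rate(similarities, duplicate_count, alpha):
    """Return (X_i, failure rate) for one crawl layer.

    similarities: one S_{i,j} score per document in the layer (assumed precomputed).
    duplicate_count: Y_i, the number of documents with duplicated content.
    alpha: documents scoring below alpha are counted as unqualified.
    """
    n_i = len(similarities)
    if n_i == 0:
        return 0, 0.0
    x_i = sum(1 for score in similarities if score < alpha)
    return x_i, (x_i + duplicate_count) / n_i


def should_crawl_next_layer(similarities, duplicate_count, alpha, beta):
    """Crawl the next layer only while (X_i + Y_i) / N_i does not exceed beta."""
    _, rate = layer_failure_rate(similarities, duplicate_count, alpha)
    return rate <= beta


# Hypothetical layer: four documents, one of them a duplicate.
scores = [0.9, 0.2, 0.6, 0.1]
print(layer_failure_rate(scores, duplicate_count=1, alpha=0.3))
print(should_crawl_next_layer(scores, duplicate_count=1, alpha=0.3, beta=0.5))
```

With these example values two documents fall below α, so the failure rate is (2 + 1) / 4 and the crawler would not descend further.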

Further, the word segmentation dataset in step S2 is used as follows: while documents are crawled, the words in a document that belong to the segmentation lexicon are tagged with their part of speech, and the stop words that appear in the document are deleted.
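As a rough illustration of this tagging step, the sketch below tags already-segmented tokens against a lexicon and drops stop words. The lexicon contents, the tag letters "n" and "v", and the function name are assumptions made for the example; the source does not specify a data format or a segmentation tool.

```python
def tag_and_filter(tokens, lexicon, stop_words):
    """Tag tokens found in the lexicon and delete stop words.

    tokens: the segmented words of one document (segmentation is assumed done).
    lexicon: dict mapping word -> part of speech, e.g. "n" (entity) or "v" (relation).
    stop_words: set of words to delete.
    """
    tagged = []
    for token in tokens:
        if token in stop_words:
            continue  # stop words are deleted outright
        tagged.append((token, lexicon.get(token)))  # tag is None if not in the lexicon
    return tagged


# Hypothetical lexicon for the fire-event domain.
lexicon = {"fire": "n", "pavilion": "n", "destroyed": "v"}
stop_words = {"the", "a"}
tokens = ["the", "fire", "destroyed", "a", "pavilion", "yesterday"]

tagged = tag_and_filter(tokens, lexicon, stop_words)
entities = [word for word, tag in tagged if tag == "n"]
relations = [word for word, tag in tagged if tag == "v"]
print(tagged, entities, relations)
```

Words outside the lexicon are kept with a `None` tag here so that later steps can still see them; dropping them instead would be an equally plausible reading.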

Further, the clustering in step S3 is performed as follows:

S3-1: Train word vectors for the extracted text entities with a composite neural network model to obtain distributed representations of words that carry syntactic and semantic features;

S3-2: Once the distributed word representations are available, cluster the entities with the unsupervised KMeans algorithm.
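Step S3-2 can be illustrated with a small pure-Python k-means. The sketch assumes the entity vectors from S3-1 already exist; the composite neural network that produces them is not described in enough detail to reproduce, so the two-dimensional vectors and entity names below are made up for the example.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors into k groups and return one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical entity vectors: two fire-related and two crowd-related entities.
names = ["fire", "smoke", "queue", "stampede"]
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
for name, label in zip(names, kmeans(vectors, k=2)):
    print(name, label)
```

In practice a library implementation would normally be used; the hand-written loop only shows the assignment and update steps that group entities into clusters.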

Further, the sub-graph construction in step S4 also uses entity mapping: first find the document corresponding to an entity, then follow the document's links to find related paper documents similar to it, and establish a mapping between the entity and those documents.

Further, the sub-graphs are merged in step S5 as follows: if two sub-graphs both contain the same entity a, the entity b connected to a in one sub-graph is connected to the entity a in the other sub-graph, which completes the merge; if a sub-graph shares no entity with any other sub-graph, the entity similarity between sub-graphs is calculated and the merge is made on that basis.
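The shared-entity case can be sketched by storing each sub-graph as a set of (head, relation, tail) triples. That representation, the function names and the example triples are assumptions for illustration; when entities are identified by name, taking the union of the triples attaches every neighbor of a shared entity to the same node.

```python
def entities(triples):
    """Collect every entity that appears as a head or tail in the triples."""
    return {entity for head, _, tail in triples for entity in (head, tail)}


def merge_if_shared(graph_a, graph_b):
    """Return the merged sub-graph if the two share an entity, otherwise None."""
    if entities(graph_a) & entities(graph_b):
        return graph_a | graph_b
    return None


# Hypothetical sub-graphs: the first two share the entity "fire".
g1 = {("fire", "damages", "pavilion")}
g2 = {("fire", "caused_by", "smoking")}
g3 = {("flood", "blocks", "road")}

print(merge_if_shared(g1, g2))
print(merge_if_shared(g1, g3))
```

The second call returns `None`, which corresponds to the case where the method falls back to similarity-based merging.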

Furthermore, the similarity between sub-graphs is calculated as follows:

S5-1: Match the crawled document data to the links in the upper-layer documents, dividing the documents into those linked within the same layer and those linked across layers;

S5-2: Suppose one sub-graph corresponds to document A in the overall knowledge graph, another corresponds to document B, and the two sub-graphs share no entity. Compute the similarity between the entities in document A and the entities in document B and in the other documents of layer i, and select the candidate sub-graph most similar to the sub-graph to be merged, using the following formulas:

[Formula (2): image not reproduced in this text]

where G_k is the candidate sub-graph most similar to the sub-graph to be merged; |G_1| and |G_t| are the numbers of entities in the 1st and t-th candidate sub-graphs; a is an entity in the sub-graph to be merged; n is the number of entities in the sub-graph to be merged corresponding to document G; b is an entity in a candidate sub-graph; and s(a, b) is the similarity between entities a and b;

[Formula (3): image not reproduced in this text]

where E is the entity in the sub-graph to be merged (corresponding to document A) that has the greatest similarity to all entities in the candidate sub-graph, and |G_k| is the number of entities in the candidate sub-graph most similar to the sub-graph to be merged;

E' = arg max { s(b, E) }  (b = 1, 2, ..., |G_k|)    (4)

where E' is the entity in the candidate sub-graph whose similarity to entity E is greatest. Connect entity E in the sub-graph to be merged (corresponding to document A) to entity E' in the candidate sub-graph G_k, and mark the link as computed by the model;

S5-3: Merge the candidate sub-graph corresponding to the most similar document with the sub-graph to be merged corresponding to document A.
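The selection in S5-2 can be sketched as below. Formulas (2) and (3) survive only as images, so the sketch makes explicit assumptions: the best candidate G_k is taken to be the one with the highest mean pairwise similarity to the sub-graph being merged, and E is taken to be the entity with the largest summed similarity to the entities of G_k. Only the last step, E' = arg max s(b, E), follows formula (4) as printed. The similarity function `s`, the function name and the example data are hypothetical.

```python
def pick_link(to_merge, candidates, s):
    """Return (k, E, E_prime) for linking a sub-graph to its best candidate.

    to_merge: non-empty list of entities in the sub-graph to be merged.
    candidates: list of non-empty entity lists, one per candidate sub-graph.
    s: similarity function s(a, b) -> float (assumed to be supplied).
    """
    def mean_similarity(candidate):
        total = sum(s(a, b) for a in to_merge for b in candidate)
        return total / (len(to_merge) * len(candidate))

    # Assumed reading of formula (2): the most similar candidate sub-graph.
    k = max(range(len(candidates)), key=lambda t: mean_similarity(candidates[t]))
    g_k = candidates[k]
    # Assumed reading of formula (3): the entity closest to G_k as a whole.
    e = max(to_merge, key=lambda a: sum(s(a, b) for b in g_k))
    # Formula (4): the entity of G_k most similar to E.
    e_prime = max(g_k, key=lambda b: s(b, e))
    return k, e, e_prime


# Hypothetical symmetric similarity table.
table = {
    ("blaze", "fire"): 0.9, ("blaze", "smoke"): 0.6,
    ("alarm", "fire"): 0.3, ("alarm", "smoke"): 0.4,
    ("blaze", "ticket"): 0.1, ("alarm", "ticket"): 0.2,
    ("alarm", "queue"): 0.1,
}


def s(a, b):
    return table.get((a, b), table.get((b, a), 0.0))


print(pick_link(["blaze", "alarm"], [["ticket", "queue"], ["fire", "smoke"]], s))
```

The returned pair (E, E') is the edge that would be added and marked as computed by the model before the two sub-graphs are merged.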

The invention provides a method for extracting abnormal events in scenic spots based on a composite neural network. The method reduces the dependence of event extraction on trigger words, avoids the effect of trigger-word ambiguity on category judgment, and extracts events effectively even from irregular sentence patterns.

Description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a flow chart of crawling document data and judging data quality;

Fig. 3 is a schematic diagram of the schema layer of the knowledge graph for abnormal events in scenic spots.

Detailed description

The invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the invention is not limited to the embodiment described below.

Embodiment:

As shown in the flow chart of Fig. 1, this embodiment takes a fire event as an example and provides a knowledge graph construction method for abnormal events in scenic spots, comprising the following steps:

Step 1: Use a prepared web crawler to collect materials and documents related to abnormal events in scenic spots from the Internet, then use the TF-IDF text similarity calculation method to judge the relevance of each document, and retain the most suitable documents; the specific flow is shown in Fig. 2;

Step 1-1: Establish a data quality judgment model, as shown in formula (1):

[Formula (1): image not reproduced in this text]

In the formula, S_{i,j'} is the similarity between the j'-th document of layer i and all documents crawled so far; S_{i-1,j} is the similarity between the j-th document of layer i-1 and all documents crawled so far; likewise, S_{i,j',k} is the similarity between the j'-th document of layer i and the k-th document of the same layer. The link to document j' appears in document j. W_i and T_i are the weights set for layer i;

Step 1-2: Set a threshold α. When the similarity S_{i,j} between the document being judged and the other documents is below α, the document is judged unqualified. Count the unqualified documents in the layer as X_i, and count the documents in the layer whose content duplicates that of documents already judged as Y_i;

Step 1-3: Count all documents in the layer as N_i and compute the layer's failure rate (X_i + Y_i) / N_i. Set a second threshold β to decide whether crawling should continue: if the failure rate exceeds β, stop crawling the next layer of documents.

Because data quality is judged while the documents are being crawled, and that judgment decides whether to crawl deeper, both the quality and the efficiency of data acquisition improve, and later text cleaning and analysis become easier. The data mining method of the invention is equally applicable to building knowledge graphs in other domains;
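The TF-IDF similarity that drives this crawl-time judgment can be illustrated with the standard library alone. The sketch below is one common formulation, not the patent's formula (1), which is not reproduced in the text: it uses a smoothed inverse document frequency (an assumption) and cosine similarity between the resulting sparse vectors. Documents are assumed to be tokenized already, and the sample tokens are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into sparse {term: weight} TF-IDF vectors."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tokenized documents: two about fires, one unrelated.
docs = [
    ["fire", "scenic", "spot"],
    ["fire", "alarm", "scenic"],
    ["ticket", "price"],
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

The two fire-related documents score well above zero against each other, while the unrelated document shares no terms and scores 0.0, which is the kind of signal the threshold α would act on.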

Step 2: Build a semantic word segmentation dataset from the materials related to scenic-spot abnormal events; in that dataset, tag abnormal-event entities as nouns and the relations between entities as verbs; use the tagged nouns and verbs to extract the entities and entity relations of abnormal events from the crawled documents; and associate entities with relations through the original documents. While document content is crawled, the segmentation lexicon is used to tag the part of speech of the words in a document that belong to the lexicon, and words that belong to the stop-word lexicon are deleted outright.

Building a text lexicon specific to the fire-event domain makes the extraction results more precise, and part-of-speech tagging of the results improves the precision and recall of event-specific information extraction. Part-of-speech-based extraction is applied to each sub-document individually, which keeps entities aligned and reduces complexity.

Step 3: Cluster the extracted entities into multiple distinct entity clusters, each containing several entity types, and build the schema layer of the knowledge graph for the abnormal event by following the explanatory structure of professional documents, as shown in Fig. 3;

The clustering is performed as follows:

Step 3-1: Train word vectors for the extracted text entities with a composite neural network model to obtain distributed representations of words that carry syntactic and semantic features;

Step 3-2: Once the distributed word representations are available, cluster the entities with the unsupervised KMeans algorithm.

Step 4: Using the entity relations extracted in Step 2, find the document data corresponding to each entity in the entity clusters; use cosine similarity to find, in the document corresponding to an entity, the descriptive words close to that entity; and match the entity to those words by relation to obtain the corresponding sub-graph;
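A minimal sketch of this step follows, assuming that word vectors are already available and that an entity is linked to every word in its document whose cosine similarity reaches a cutoff. The 0.8 cutoff, the fixed relation label "described_by" and the vectors are illustrative assumptions; in the method itself the relation would come from the relations extracted in Step 2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_subgraph(entity, doc_words, vectors, threshold=0.8, relation="described_by"):
    """Link an entity to the words of its document that lie close to it."""
    triples = set()
    for word in doc_words:
        if word == entity or word not in vectors:
            continue
        if cosine(vectors[entity], vectors[word]) >= threshold:
            triples.add((entity, relation, word))
    return triples


# Hypothetical two-dimensional word vectors.
vectors = {"fire": [1.0, 0.0], "blaze": [0.9, 0.1], "ticket": [0.0, 1.0]}
print(build_subgraph("fire", ["blaze", "ticket", "fire"], vectors))
```

Each returned triple is one edge of the entity's sub-graph; words without a vector are skipped rather than guessed at.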

Sub-graph construction also uses entity mapping: first find the document corresponding to an entity, then follow the document's links to find related paper documents similar to it, and establish a mapping between the entity and those documents;
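The entity mapping can be pictured as a lookup followed by a filtered walk over document links. In the sketch below the dictionaries, the similarity function and the 0.5 cutoff are all hypothetical; the source states only that linked documents similar to the entity's own document are mapped to the entity.

```python
def map_entity_to_documents(entity, entity_doc, links, similarity, threshold=0.5):
    """Return the documents mapped to an entity.

    entity_doc: dict mapping entity -> id of the document it was extracted from.
    links: dict mapping document id -> list of document ids it links to.
    similarity: function (doc_id, doc_id) -> float, assumed to be supplied.
    """
    source = entity_doc[entity]
    related = [
        doc for doc in links.get(source, [])
        if similarity(source, doc) >= threshold
    ]
    return [source] + related


# Hypothetical crawl data: d1 links to d2 and d3, but only d2 is similar.
entity_doc = {"fire": "d1"}
links = {"d1": ["d2", "d3"]}
scores = {("d1", "d2"): 0.7, ("d1", "d3"): 0.2}

mapping = map_entity_to_documents(
    "fire", entity_doc, links, lambda a, b: scores.get((a, b), 0.0)
)
print(mapping)
```

The document similarity could reuse the TF-IDF score from Step 1, although the source does not say which measure is intended here.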

Step 5: Merge all sub-graphs: connect the sub-graphs through relations and merge their knowledge to obtain the knowledge graph for abnormal events in scenic spots. The similarity between a candidate sub-graph and the sub-graph to be merged is calculated as follows:

Step 5-1: Match the crawled document data to the links in the upper-layer documents, dividing the documents into those linked within the same layer and those linked across layers;

Step 5-2: Suppose one sub-graph corresponds to document A in the overall knowledge graph, another corresponds to document B, and the two sub-graphs share no entity. Compute the similarity between the entities in document A and the entities in document B and in the other documents of layer i, and select the candidate sub-graph most similar to the sub-graph to be merged, using the following formulas:

[Formula (2): image not reproduced in this text]

where G_k is the candidate sub-graph most similar to the sub-graph to be merged; |G_1| and |G_t| are the numbers of entities in the 1st and t-th candidate sub-graphs; a is an entity in the sub-graph to be merged; n is the number of entities in the sub-graph to be merged corresponding to document G; b is an entity in a candidate sub-graph; and s(a, b) is the similarity between entities a and b;

[Formula (3): image not reproduced in this text]

where E is the entity in the sub-graph to be merged (corresponding to document A) that has the greatest similarity to all entities in the candidate sub-graph, and |G_k| is the number of entities in the candidate sub-graph most similar to the sub-graph to be merged;

E' = arg max { s(b, E) }  (b = 1, 2, ..., |G_k|)    (4)

where E' is the entity in the candidate sub-graph whose similarity to entity E is greatest. Connect entity E in the sub-graph to be merged (corresponding to document A) to entity E' in the candidate sub-graph G_k, and mark the link as computed by the model;

Step 5-3: Merge the candidate sub-graph corresponding to the most similar document with the sub-graph to be merged corresponding to document A.

The invention first builds sub-graphs for the relevant domain and then merges them into the knowledge graph, which improves both the efficiency of construction and the quality of the domain knowledge graph.

The above is only a preferred embodiment of the invention. A person of ordinary skill in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications also fall within the scope of protection of the invention.

Claims (7)

1.一种面向景区异常事件的知识图谱构建方法,其特征在于:包括以下步骤:1. a knowledge map construction method for abnormal events in scenic spots, is characterized in that: comprise the following steps: S1:使用构建好的网络爬虫从互联网上收集景区异常事件相关的资料与文档,然后通过TF-IDF文本相似度计算方法判断文档的资料相关度,筛选出最合适的数据文档保留;S1: Use the constructed web crawler to collect data and documents related to abnormal events in scenic spots from the Internet, and then use the TF-IDF text similarity calculation method to judge the data relevance of the documents, and filter out the most suitable data documents to keep; S2:在景区异常事件的相关资料内建立语义分词数据集,标记分词数据集中的异常事件实体和实体间关系的词性标记为名词和动词,根据标记的名词和动词在爬取的数据文档中抽取景区异常事件的实体和实体关系,并通过原始的数据文档关联实体和关系;S2: Establish a semantic word segmentation dataset in the relevant information of the abnormal events in the scenic spot, mark the abnormal event entities and the part of speech of the relationship between the entities in the word segmentation dataset as nouns and verbs, and extract them from the crawled data documents according to the marked nouns and verbs Entity and entity relationship of abnormal events in scenic spots, and related entities and relationships through original data documents; S3:将抽取出的实体进行聚类,形成多个不同的实体簇,每个实体簇包含若干个实体类型,参考专业文档的说明结构来构建出对应异常事件的知识图谱模式层,以此对本体进行构建;S3: Cluster the extracted entities to form multiple different entity clusters, each entity cluster contains several entity types, refer to the description structure of the professional document to construct the knowledge graph mode layer corresponding to the abnormal event, so as to Ontology is constructed; S4:根据步骤2中抽取的异常事件的实体关系找到实体簇中的实体所相对应的文档数据,通过余弦相似度计算找到与相应实体对应的文档中与该实体相近的描述词语,将实体与相近的描述词语的关系进行匹配,得到相应的子图谱;S4: Find the document data corresponding to the entity in the entity cluster according to the entity relationship of the abnormal event extracted in step 2, find the description words that are similar to the entity in the document corresponding to the corresponding entity through cosine similarity calculation, and associate the entity with the entity. 
The relationship of similar description words is matched to obtain the corresponding sub-graph; S5:合并所有的子图谱:将子图谱进行关系连接,对知识进行合并获得面向景区异常事件的知识图谱。S5: Merge all sub-graphs: The sub-graphs are connected by relationship, and knowledge is merged to obtain a knowledge graph for abnormal events in scenic spots. 2.根据权利要求1所述的知识图谱构建方法,其特征是:所述步骤S1中通过TF-IDF文本相似度计算方法判断文档的资料相关度的步骤是:2. knowledge graph construction method according to claim 1 is characterized in that: in described step S1, the step of judging the data correlation degree of document by TF-IDF text similarity calculation method is: S1-1:建立相应的数据质量判断模型,如公式(1)所示:S1-1: Establish a corresponding data quality judgment model, as shown in formula (1):
Figure FDA0002629327610000011
where S_{i,j'} denotes the similarity between the j'-th document of layer i and all documents crawled so far; S_{i-1,j} denotes the similarity between the j-th document of layer i-1 and all documents crawled so far; likewise, S_{i,j',k} denotes the similarity between the j'-th document of layer i and the k-th document of the same layer, the link to the j'-th document being contained in the j-th document; and W_i and T_i denote the weight values set for layer i;
S1-2: setting a threshold α; when the similarity S_{i,j} between the document under judgment and the other documents is less than α, the current document is judged unqualified; the number of unqualified documents in the layer is counted as X_i, and the number of documents in the layer whose content duplicates that of an already judged document is counted as Y_i;
S1-3: counting the number N_i of all documents in the layer, calculating the failure rate of the layer as (X_i + Y_i)/N_i, and setting a second threshold β to decide whether crawling continues; if the failure rate is greater than β, crawling of the next layer of documents is stopped.
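The stopping rule of steps S1-2 and S1-3 can be illustrated with a minimal Python sketch. It assumes the per-document similarity S_{i,j} has already been computed (formula (1) is only available as an image, so it is not reproduced here). The names `CrawledDoc`, `layer_failure_rate`, and `should_crawl_next_layer` are hypothetical, and counting duplicates only among documents that passed the α check is an assumption made so that X_i and Y_i do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrawledDoc:
    """One document of layer i (hypothetical container for this sketch)."""
    similarity: float    # S_{i,j}: similarity to the other crawled documents
    is_duplicate: bool   # True if the content repeats an already judged document


def layer_failure_rate(docs: List[CrawledDoc], alpha: float) -> float:
    """Return (X_i + Y_i) / N_i for one layer of crawled documents."""
    n = len(docs)
    if n == 0:
        return 0.0
    # X_i: documents whose similarity falls below the threshold alpha.
    x = sum(1 for d in docs if d.similarity < alpha)
    # Y_i: duplicated documents; assumption: counted only among those that
    # passed the alpha check, so no document is counted twice.
    y = sum(1 for d in docs if d.similarity >= alpha and d.is_duplicate)
    return (x + y) / n


def should_crawl_next_layer(docs: List[CrawledDoc], alpha: float, beta: float) -> bool:
    """Stop descending when the failure rate of this layer exceeds beta."""
    return layer_failure_rate(docs, alpha) <= beta


layer = [
    CrawledDoc(0.9, False),
    CrawledDoc(0.2, False),   # below alpha, counts towards X_i
    CrawledDoc(0.8, True),    # duplicate, counts towards Y_i
    CrawledDoc(0.7, False),
]
rate = layer_failure_rate(layer, alpha=0.5)
```

With the sample layer above, one document fails the α check and one is a duplicate, so two of the four documents are rejected; whether crawling continues then depends only on how β compares with that rate.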
3. The knowledge graph construction method according to claim 1, characterized in that the word segmentation dataset in step S2 is obtained by, while crawling the documents, tagging the part of speech of the words in each document that belong to the word segmentation lexicon, and deleting the stop words that appear in the document.
4. The knowledge graph construction method according to claim 1, characterized in that the clustering in step S3 is performed as follows:
S3-1: training word vectors for the extracted text entities with a composite neural network model to obtain a distributed representation of words carrying grammatical and semantic features;
S3-2: after the distributed word representation is obtained, clustering the entities with the unsupervised KMeans algorithm.
5. The knowledge graph construction method according to claim 1, characterized in that the construction of sub-graphs in step S4 further employs entity mapping, performed as follows: first finding the document corresponding to a given entity, then finding, through document links, related paper documents similar to that document, and establishing a mapping between the entity and those documents.
6. The knowledge graph construction method according to claim 1, characterized in that the graphs are merged in step S5 as follows: if two sub-graphs both contain the same entity a, the entity b connected to entity a in one sub-graph is connected to entity a in the other sub-graph, completing the merge; if a sub-graph has no entity in common with the other sub-graphs, the entity similarity between the sub-graphs is calculated and the sub-graphs are then merged.
7. The knowledge graph construction method according to claim 6, characterized in that the similarity between sub-graphs is calculated as follows:
S5-1: matching the crawled document data to the links of the upper-layer documents, and dividing the documents into same-layer linked documents and cross-layer linked documents;
S5-2: where one sub-graph corresponds to document A in the overall knowledge graph, another sub-graph corresponds to document B in the knowledge graph, and the two sub-graphs have no entity in common, calculating the similarity between the entities in document A and the entities in document B as well as all entities in the other documents of layer i, and selecting the candidate sub-graph with the highest similarity to the sub-graph to be merged, according to the following formula:
Figure FDA0002629327610000021
where G_k denotes the candidate sub-graph with the highest similarity to the sub-graph to be merged; |G_1| and |G_t| denote the number of entities in the 1st and the t-th candidate sub-graph, respectively; a denotes an entity in the sub-graph to be merged; n denotes the number of entities in the sub-graph to be merged corresponding to document G; b denotes an entity in a candidate sub-graph; and s(a, b) denotes the similarity between entities a and b;
Figure FDA0002629327610000022
where E denotes the entity, in the sub-graph to be merged corresponding to document A, that has the greatest similarity to all entities in the candidate sub-graph, and |G_k| denotes the number of entities in the candidate sub-graph most similar to the sub-graph to be merged;
E' = arg max{s(b, E)} (b = 1, 2, ..., |G_k|)    (4)
where E' denotes the entity in the candidate sub-graph whose similarity to entity E is the greatest; entity E in the sub-graph to be merged corresponding to document A is connected to entity E' in the candidate sub-graph G_k, and the connection is marked as having been derived by model calculation;
S5-3: merging the candidate sub-graph corresponding to the document with the highest similarity with the sub-graph to be merged corresponding to document A.
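The two merge cases of claims 6 and 7 can be sketched in Python under stated assumptions. Sub-graphs are modelled as adjacency maps from entity name to neighbour set, which is an illustrative choice rather than the patent's data model. Entity E is taken as an input because formulas (2) and (3) are only available as images. `char_jaccard` is a hypothetical stand-in for the similarity s(a, b), and every function name here is invented for the example.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]  # entity -> set of directly connected entities


def char_jaccard(a: str, b: str) -> float:
    """Stand-in for s(a, b): Jaccard overlap of character sets (assumption)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def merge_on_shared_entities(g1: Graph, g2: Graph) -> Graph:
    """Claim 6, first case: identical entity names collapse into one node."""
    merged: Graph = {}
    for graph in (g1, g2):
        for entity, neighbours in graph.items():
            merged.setdefault(entity, set()).update(neighbours)
    return merged


def most_similar_entity(entity_e: str, candidates: Iterable[str],
                        sim: Callable[[str, str], float]) -> str:
    """Formula (4): E' = arg max over candidate entities b of s(b, E)."""
    return max(candidates, key=lambda b: sim(b, entity_e))


def link_disjoint(g_a: Graph, g_k: Graph, entity_e: str,
                  sim: Callable[[str, str], float]) -> Tuple[Graph, Set[Tuple[str, str]]]:
    """Join two sub-graphs with no shared entity by an edge E -- E'.

    Returns the merged graph and the set of edges that were derived by
    model calculation, so they can be marked as such.
    """
    e_prime = most_similar_entity(entity_e, g_k.keys(), sim)
    merged = merge_on_shared_entities(g_a, g_k)
    merged.setdefault(entity_e, set()).add(e_prime)
    merged.setdefault(e_prime, set()).add(entity_e)
    return merged, {(entity_e, e_prime)}


# Case 1: both sub-graphs contain the entity "landslide".
g1 = {"landslide": {"road closure"}, "road closure": {"landslide"}}
g2 = {"landslide": {"heavy rain"}, "heavy rain": {"landslide"}}
shared = merge_on_shared_entities(g1, g2)

# Case 2: no shared entity; E = "rockfall" is assumed to be already selected.
g_a = {"rockfall": {"trail"}, "trail": {"rockfall"}}
g_k = {"rock slide": {"rain"}, "rain": {"rock slide"}}
joined, inferred_edges = link_disjoint(g_a, g_k, "rockfall", char_jaccard)
```

Returning the inferred edges separately mirrors the claim's requirement that a connection obtained from the similarity model be marked as such, which keeps it distinguishable from relations extracted directly from documents.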
CN202010806519.4A 2020-08-12 2020-08-12 Knowledge graph construction method for scenic spot abnormal events Pending CN111930893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806519.4A CN111930893A (en) 2020-08-12 2020-08-12 Knowledge graph construction method for scenic spot abnormal events

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806519.4A CN111930893A (en) 2020-08-12 2020-08-12 Knowledge graph construction method for scenic spot abnormal events

Publications (1)

Publication Number Publication Date
CN111930893A true CN111930893A (en) 2020-11-13

Family

ID=73311261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806519.4A Pending CN111930893A (en) 2020-08-12 2020-08-12 Knowledge graph construction method for scenic spot abnormal events

Country Status (1)

Country Link
CN (1) CN111930893A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868508A (en) * 2021-09-23 2021-12-31 北京百度网讯科技有限公司 Writing material query method and device, electronic equipment and storage medium
CN114780745A (en) * 2022-04-20 2022-07-22 北京明略昭辉科技有限公司 Method and device for constructing knowledge system, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874378A (en) * 2017-01-05 2017-06-20 北京工商大学 The entity of rule-based model extracts the method that knowledge mapping is built with relation excavation
CN107704637A (en) * 2017-11-20 2018-02-16 中国人民解放军国防科技大学 A knowledge map construction method for emergencies

Similar Documents

Publication Publication Date Title
CN107704637B (en) knowledge graph construction method for emergency
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN112256939B (en) Text entity relation extraction method for chemical field
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
Li et al. Competitive analysis for points of interest
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN103605665B (en) Keyword based evaluation expert intelligent search and recommendation method
CN110516067A (en) Public opinion monitoring method, system and storage medium based on topic detection
CN101661513B (en) Detection method of network focus and public sentiment
CN105975984B (en) Network quality evaluation method based on evidence theory
CN110825877A (en) A Semantic Similarity Analysis Method Based on Text Clustering
CN106951438A (en) A kind of event extraction system and method towards open field
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN106682172A (en) Keyword-based document research hotspot recommending method
CN101582080A (en) Web image clustering method based on image and text relevant mining
CN106156286A (en) Type extraction system and method towards technical literature knowledge entity
CN107391565B (en) Matching method of cross-language hierarchical classification system based on topic model
CN105550216A (en) Searching method and device of academic research information and excavating method and device of academic research information
CN114860882A (en) Fair competition review auxiliary method based on text classification model
CN105868347A (en) Tautonym disambiguation method based on multistep clustering
CN105654144A (en) Social network body constructing method based on machine learning
CN116244446B (en) Social media cognitive threat detection method and system
CN111930893A (en) Knowledge graph construction method for scenic spot abnormal events
Yang et al. News topic detection based on capsule semantic graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201113