
CN118673158A - Knowledge graph construction method based on named entity recognition and relation extraction - Google Patents


Info

Publication number
CN118673158A
Authority
CN
China
Prior art keywords
entity recognition
knowledge graph
vector
named entity
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410803255.5A
Other languages
Chinese (zh)
Inventor
吴茵
孙艳
罗翠云
梁凯
危秋珍
张力欣
顾广锋
熊莉
梁振成
赵梓淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Power Grid Co Ltd
Original Assignee
Guangxi Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Power Grid Co Ltd filed Critical Guangxi Power Grid Co Ltd
Priority to CN202410803255.5A priority Critical patent/CN118673158A/en
Publication of CN118673158A publication Critical patent/CN118673158A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract


The present invention discloses a knowledge graph construction method based on named entity recognition and relation extraction. The method collects corpora in the electric power field, segments sentences from the corpora, and preprocesses the sentences; constructs an electric power equipment ontology system comprising classes, attributes, and instances; collects an entity recognition training set and inputs it into a bidirectional long short-term memory (Bi-LSTM) network for training, obtaining a named entity recognition model; obtains the word segmentation results and syntactic dependency tree of each input sentence, builds a graph structure on the dependency tree, feeds it into a GCN model suited to topological structures, and predicts relations from the learned weight matrix, forming a device association extraction model; and finally constructs a knowledge graph. The method requires little human involvement, and its corpus-processing rate far exceeds manual processing, so more electric-power corpora can be recognized within a given period, making the final knowledge graph more complete.

Description

Knowledge graph construction method based on named entity recognition and relation extraction
Technical Field
The invention relates to the technical field of deep learning algorithm application, in particular to a knowledge graph construction method based on named entity recognition and relation extraction.
Background
A knowledge graph uses visualization techniques to mine and analyze knowledge and its carriers, and to display the internal relations among them. The features of the target field should therefore be understood in detail during construction. When building the ontology of a power equipment knowledge graph, domain characteristics such as the type, function, and performance indices of power equipment are considered, so that the scope of the ontology is well defined and its semantic rigor and extensibility are ensured. In the electric power system, the knowledge graph is an important reference for drafting start-up schemes, composing overhaul lists, and the like. With the advent of the digital age, information in the form of social media, articles, and news has exploded. Most of this data is unstructured, and managing and using it manually is cumbersome, tedious, and time-consuming.
At present, power system knowledge graphs are usually built from manually collected information such as power entities and the relations among them. Manual collection, however, is cumbersome, slow, and error-prone, yields poor results, and wastes time.
In view of this, a knowledge graph construction method based on named entity recognition and relationship extraction is needed.
Disclosure of Invention
In the prior art, building a knowledge graph from manually collected information such as power entities and inter-entity relations is cumbersome, slow, error-prone, and wasteful of time, and the collection results are poor. To address these problems, the invention provides a knowledge graph construction method based on named entity recognition and relation extraction. The specific technical scheme is as follows:
a knowledge graph construction method based on named entity recognition and relation extraction comprises the following steps:
collecting the corpus in the electric power field, segmenting sentences from the corpus, and preprocessing the sentences;
constructing a power equipment ontology system comprising classes, attributes, and instances;
collecting an entity recognition training set and inputting it into a bidirectional long short-term memory (Bi-LSTM) network for training, obtaining a named entity recognition model for entity recognition;
obtaining the word segmentation results and syntactic dependency tree of each input sentence, constructing a graph structure on the dependency tree, feeding it into a GCN model suited to topological structures, and predicting relations from the learned weight matrix, thereby forming a device association extraction model;
and combining the entities recognized by the named entity recognition model with the inter-entity relations output by the device association extraction model to construct a knowledge graph.
The named entity recognition model is constructed as follows:
collating a domain dictionary of power equipment from unstructured data about power equipment, and forming new sentences by substituting nouns of the same equipment type according to the dictionary, thereby expanding the training set for data augmentation;
segmenting each sentence into words and classifying the segmented words according to the domain dictionary, with words absent from the dictionary classified by part-of-speech tagging;
obtaining a character vector and a word vector, forming a word-class vector by random initialization, and concatenating the character vector, word vector, and word-class vector as the embedding vector;
inputting the embedding into a bidirectional long short-term memory network to encode the context of each Chinese character, and decoding the tags of the whole sentence with a CRF.
The named entity recognition model is divided into a joint embedding layer, a Bi-LSTM layer and a CRF layer, wherein:
The first layer is a joint embedding layer of character vectors, word vectors, and word-class vectors. Characters and words in the dataset are replaced with pre-trained character and word vectors respectively, which are concatenated with the initialized word-class vector to form the final input vector representing each character;
the second layer is the Bi-LSTM layer, which aims at automatically extracting semantic and temporal features from the context;
The third layer is a CRF layer, which resolves the dependencies among output labels and obtains the globally optimal label sequence for the text.
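The joint embedding described above can be sketched as a simple concatenation; the vector dimensions (50, 100, 8) and the random initialization of the word-class vector are illustrative assumptions, not values from the patent:

```python
import numpy as np

def joint_embedding(char_vec, word_vec, rng=None):
    """Concatenate a character vector, its word vector, and a randomly
    initialized word-class vector into one input embedding."""
    rng = rng or np.random.default_rng(0)
    class_vec = rng.standard_normal(8)  # word-class vector, randomly initialized
    return np.concatenate([char_vec, word_vec, class_vec])

# Toy dimensions: 50-d character vector, 100-d word vector
emb = joint_embedding(np.zeros(50), np.zeros(100))
```

The concatenated vector is then what the Bi-LSTM layer consumes, one embedding per character position.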
Preferably, the process of dividing sentences from the corpus is as follows: punctuation in the corpus is collected and the corpus is segmented based on the punctuation including periods.
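Under the assumption that common Chinese end-of-sentence punctuation (full stop, question mark, exclamation mark, semicolon) marks the boundaries, the segmentation step can be sketched as:

```python
import re

def split_sentences(corpus: str) -> list:
    """Split a Chinese corpus into sentences on end-of-sentence punctuation,
    keeping the punctuation attached to each sentence."""
    parts = re.split(r"(?<=[。？！；])", corpus)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("变压器运行正常。开关已断开！")
```

The exact punctuation set would be tuned to the corpus; this is only a minimal illustration of punctuation-based segmentation.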
The equipment association extraction model forming process specifically comprises the following steps:
obtaining the word segmentation results and syntactic dependency tree of an input sentence from an existing natural language processing toolkit;
Vectorizing the text language by using a BERT pre-training model;
constructing a graph structure on the syntactic dependency tree, representing it with an adjacency matrix, and assigning different weights to the different dependency relations between any two words, the weights being computed from the connections and their dependency types;
and feeding the constructed attention-weighted graph structure into a GCN model suited to topological structures, predicting relations from the learned weight matrix.
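A minimal sketch of the weighted adjacency matrix built over the dependency tree; the dependency type names, weight values, and default weight here are illustrative assumptions:

```python
import numpy as np

def weighted_adjacency(n_words, edges, type_weights):
    """Build a dependency-tree adjacency matrix where each edge carries a
    weight derived from its dependency type."""
    A = np.eye(n_words)                      # self-loops keep each node's own features
    for head, dep, dep_type in edges:
        w = type_weights.get(dep_type, 0.5)  # assumed default for unseen types
        A[head, dep] = A[dep, head] = w
    return A

# Hypothetical 3-word sentence with two dependency edges
edges = [(0, 1, "nsubj"), (0, 2, "dobj")]
A = weighted_adjacency(3, edges, {"nsubj": 1.0, "dobj": 0.8})
```

In the patent's scheme these weights would come from the attention mechanism over connections and dependency types rather than a fixed lookup table.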
The device association extraction model is divided into a vector embedding module, an attention matrix transformation module, and a GCN module. The workflow is as follows:
Processing the corpus to obtain a text sequence after word segmentation and a syntax dependency tree;
in the vector embedding module, vectorizing the text sequence with the BERT pre-training model to obtain word vectors;
in the attention matrix transformation module, integrating the text sequence features, syntactic dependency matrix, and dependency type matrix to form new weights that replace those of the standard GCN;
Performing feature learning on the graph structure by using a GCN module;
Based on the output of the GCN module, a classifier is used to predict the relation label between the two entities.
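The GCN propagation and relation classification steps can be sketched roughly as follows, with mean-aggregation normalization and a softmax classifier as simplifying assumptions (in the patent, the attention-derived weights would replace the plain adjacency matrix):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: row-normalize A, aggregate neighbors, ReLU."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return np.maximum(D_inv @ A @ H @ W, 0.0)

def predict_relation(h_e1, h_e2, W_cls):
    """Score relation labels from two entity representations via softmax."""
    logits = np.concatenate([h_e1, h_e2]) @ W_cls
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Toy run: 2 fully connected nodes, identity features and weights
A = np.array([[1.0, 1.0], [1.0, 1.0]])
h = gcn_layer(A, np.eye(2), np.eye(2))
p = predict_relation(np.ones(2), np.ones(2), np.zeros((4, 3)))
```

With zero classifier weights the distribution is uniform over the three hypothetical relation labels; a trained `W_cls` would concentrate the probability mass.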
Preferably, the method further comprises visualizing the constructed knowledge graph, specifically: building the knowledge graph as <entity1, relation, entity2> triples from the recognized entities and the relations between entity pairs, then connecting each pair of entities and labeling the relation between them for visual display.
A computer readable storage medium, the computer readable storage medium comprising a stored program, wherein the program when run controls a device in which the computer readable storage medium resides to perform a knowledge graph construction method based on named entity recognition and relationship extraction as described above.
A processor for running a program, wherein the program, when run, performs a knowledge graph construction method based on named entity recognition and relation extraction as described above.
Compared with the prior art, the invention has the beneficial effects that:
The method collects corpora in the electric power field, segments sentences from the corpora, and preprocesses them; constructs a power equipment ontology system comprising classes, attributes, and instances; collects an entity recognition training set and inputs it into a bidirectional long short-term memory (Bi-LSTM) network for training, obtaining a named entity recognition model; obtains the word segmentation results and syntactic dependency tree of each input sentence, builds a graph structure on the dependency tree, feeds it into a GCN model suited to topological structures, and predicts relations from the learned weight matrix, forming a device association extraction model; and combines the entities recognized by the named entity recognition model with the inter-entity relations output by the device association extraction model to construct a knowledge graph. The whole process requires little human involvement, and corpus processing with the entity recognition and device association extraction models is far faster than manual processing, so more power-domain corpora can be recognized within a given period, making the final knowledge graph more complete.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a named entity recognition model;
FIG. 3 is a block diagram of a device association extraction model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Embodiments of the present invention are further described below with reference to fig. 1-3.
In one embodiment of the present invention, a knowledge graph construction method based on named entity recognition and relationship extraction is provided, comprising the steps of:
S1: collecting the corpus in the electric power field, segmenting sentences from the corpus, and preprocessing the sentences;
S2: constructing a power equipment ontology system comprising classes, attributes, and instances;
S3: collecting an entity recognition training set and inputting it into a bidirectional long short-term memory (Bi-LSTM) network for training, obtaining a named entity recognition model for entity recognition;
S4: obtaining the word segmentation results and syntactic dependency tree of each input sentence, constructing a graph structure on the dependency tree, feeding it into a GCN model suited to topological structures, and predicting relations from the learned weight matrix, thereby forming a device association extraction model;
S5: combining the entities recognized by the named entity recognition model with the inter-entity relations output by the device association extraction model to construct a knowledge graph.
The method comprises the key steps of named entity recognition, relation extraction, and knowledge graph construction. These are explained further below.
Named entity recognition:
This step identifies the boundaries and types of specific power equipment entities in unstructured text, that is, it recognizes and classifies the power equipment mentioned in the power industry corpus. Recognizing Chinese power equipment entities is not easy and still faces significant challenges, mainly for the following reasons. First, as a professional field, power equipment names contain many complex, field-specific words, and some device names are long and rare, for example "electromagnetic all-insulated voltage transformer". Second, there is a lack of a common Chinese power text dataset, since power-domain text is difficult to acquire and annotate. Finally, Chinese itself is complex: on the one hand, Chinese lacks the spaces that serve as natural word boundaries in English text; on the other hand, Chinese structure is complex, with many nested and elliptical sentences.
A domain dictionary of power equipment is collated from unstructured data about power equipment, and new sentences are formed by substituting nouns of the same equipment type according to the dictionary, augmenting the training set. Then each sentence is segmented, the segmented words are classified according to the domain dictionary, and words absent from the dictionary are classified by part-of-speech tagging. A character vector and a word vector are obtained, a word-class vector is formed by random initialization, and the three are concatenated as the embedding vector. Finally, the embedding is input into a bidirectional long short-term memory network (Bi-LSTM) to encode the context of each Chinese character, and the tags of the whole sentence are decoded with a CRF.
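The dictionary-based augmentation can be sketched as below; the dictionary contents and entity-type keys are hypothetical examples:

```python
def augment(sentence, entity, domain_dict):
    """Generate new training sentences by swapping an equipment noun for
    other nouns of the same type from the domain dictionary."""
    etype = next(t for t, nouns in domain_dict.items() if entity in nouns)
    return [sentence.replace(entity, n)
            for n in domain_dict[etype] if n != entity]

# Hypothetical domain dictionary grouping equipment nouns by type
domain_dict = {"transformer": ["主变压器", "升压变压器"], "switch": ["隔离开关"]}
new_sentences = augment("主变压器温度偏高", "主变压器", domain_dict)
```

Each generated sentence inherits the entity annotation of the original, which is what makes the substitution a cheap form of data augmentation.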
As shown in fig. 2, the model is divided into three layers: a joint embedding layer, a Bi-LSTM layer, and a CRF layer. The first layer is a joint embedding layer of character vectors, word vectors, and word-class vectors. Characters and words in the dataset are replaced with pre-trained character and word vectors respectively, which are concatenated with the initialized word-class vector to form the final input vector representing each character. The second layer is the Bi-LSTM layer, which automatically extracts semantic and temporal features from the context. The third layer is a CRF layer, which resolves the dependencies among output labels and obtains the globally optimal label sequence for the text.
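The CRF decoding step, which selects the globally optimal label sequence, is in essence Viterbi decoding over the Bi-LSTM emission scores; a toy sketch (score matrices and dimensions are illustrative, in log space):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence given per-step emission
    scores (n_steps x n_tags) and tag-to-tag transition scores."""
    n_steps, n_tags = emissions.shape
    score = emissions[0].copy()
    back = []
    for t in range(1, n_steps):
        # total[i, j]: best score ending in tag i at t-1, moving to tag j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        back.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(back):       # backtrack through the stored argmaxes
        best.append(int(bp[best[-1]]))
    return best[::-1]
```

A trained CRF would supply learned transition scores; here any matrix of the right shape demonstrates the decoding.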
Relation extraction:
An attention-guided graph convolutional network (ATT-GCN) is used to distinguish the context information important to the task and to exploit the dependency type information among words of the syntactic dependency tree, which previous work ignored; this further enriches the semantic information fed to the relation extraction model and helps it understand the text. Specifically, the word segmentation results and syntactic dependency tree of an input sentence are first obtained from an off-the-shelf natural language processing toolkit. Then a BERT pre-training model is used to vectorize the text.
Next, a graph structure (represented by an adjacency matrix) is built on the syntactic dependency tree, and different weights are assigned to the different dependency relations between any two words, the weights being computed from the connections and their dependency types. Finally, the constructed attention-weighted graph structure is fed into a GCN model suited to topological structures, and relations are predicted from the learned weight matrix.
The ATT-GCN not only distinguishes important context information in the syntactic dependency tree and uses it accordingly, so that no fixed pruning strategy is needed, but also exploits the dependency type information that most previous work ignored. Feature learning is performed by the GCN model using three kinds of feature information: text sequence information, syntactic dependency information, and syntactic dependency type information. The overall structure of the model is shown in figure 3.
The model is divided into three modules: a vector embedding module, an attention matrix transformation module, and a GCN module. First, the corpus is processed to obtain the segmented text sequence and the syntactic dependency tree. Then, in the vector embedding module, the BERT pre-training model vectorizes the text sequence into word vectors. Next, in the attention matrix transformation module, the text sequence features, syntactic dependency matrix, and dependency type matrix are integrated to form new weights that replace those of the standard GCN. The GCN module then performs feature learning on the graph structure. Finally, based on the output of the GCN module, a classifier predicts the relation label between the two entities.
Knowledge graph construction:
A knowledge graph uses visualization techniques to mine and analyze knowledge and its carriers, and to display the internal relations among them. The features of the target field should therefore be understood in detail during construction. When building the ontology of a power equipment knowledge graph, domain characteristics such as the type, function, and performance indices of power equipment are considered, so that the scope of the ontology is well defined and its semantic rigor and extensibility are ensured. For example, basic concepts of electrical equipment such as generators, transformers, and switches can be defined, together with the relations between them, such as which electrical system a device belongs to or its electrical connections.
When constructing an ontology for the electric power field, the ontologies it contains should be analyzed and selected with reference to existing third-party knowledge systems or related domain data. In the ontology of a power equipment knowledge graph, important terms include the names, functions, and features of power equipment. These terms are further clarified and defined in the subsequent ontology definition.
In the power equipment knowledge graph, the largest class can be defined top-down as "power equipment", with subclasses such as "transformer", "switchgear", and "cable", finally forming a complete inheritance structure. A bottom-up approach may also be adopted: start with the lowest, most concrete class and progressively look up its parent class until the top-most abstract concept is reached. For example, the lowest class may be defined as "transformer", its parent found as "power equipment", and the search continued upward until the most abstract concept is reached. A combination of the two methods may also be used.
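The two directions of class definition can be illustrated with a toy class tree; the dictionary-based representation and the class names are assumptions for illustration:

```python
# Top-down: start from the broadest class and attach subclasses.
ontology = {
    "power equipment": ["transformer", "switchgear", "cable"],
    "transformer": ["step-up transformer", "voltage transformer"],
}

def ancestors(cls, tree):
    """Bottom-up: walk from a concrete class up to the root concept."""
    for parent, children in tree.items():
        if cls in children:
            return [parent] + ancestors(parent, tree)
    return []

chain = ancestors("step-up transformer", ontology)
```

Combining the two directions amounts to building `ontology` top-down while validating each concrete class with a bottom-up `ancestors` walk.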
The concepts, attributes, and relations of the power equipment ontology are comprehensively summarized under the guidance of power system experts. For example, transformers, inverters, circuit breakers, protection devices, and voltage transformers are listed and classified as the main equipment elements of the electric power field. After the equipment categories are determined, attributes and relations are defined for each category: attributes describe the inherent characteristics of a concept, such as a transformer's rated capacity, rated voltage, rated current, capacity ratio, and voltage ratio; relations represent the connections between different concepts, such as that between a voltage transformer and a relay. Defining constraints ensures the consistency of the ontology, attributes, and relations, and improves data quality.
In constructing the power equipment knowledge graph, the power equipment ontology system provides the conceptual relations for the graph. Constraints on entities, relations, and attributes improve data quality and thus the quality of the constructed graph. After the classes and attributes are designed, instances of the various classes are added. Creating an instance of a class is similar to entering data into a database table, where the attribute names and their value ranges have already been given in the attribute map. A complete ontology consists of classes, attributes, and instances. For example, power transformation equipment includes step-up transformers, inverters, voltage transformers, and the like.
After the above ontology construction and knowledge extraction, a set of electrical equipment entities and their interrelations is obtained. These are formed into triples of the form <entity1, relation, entity2> and <entity, attribute, value>, for example <capacitive divider, co-operating device, reactor>. After the triples are formed, they must be stored as knowledge and presented visually.
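Assembling triples from the two models' outputs can be sketched as follows; the data structures and the example relation dictionary are illustrative assumptions:

```python
def build_triples(entities, relations):
    """Assemble <entity1, relation, entity2> triples from the recognized
    entity set and the predicted relation for each entity pair."""
    return [(e1, rel, e2) for (e1, e2), rel in relations.items()
            if e1 in entities and e2 in entities]

# Hypothetical outputs of the NER model and the relation extraction model
entities = {"capacitive divider", "reactor"}
relations = {("capacitive divider", "reactor"): "co-operating device"}
triples = build_triples(entities, relations)
```

The resulting triples are what a graph store or visualization layer would consume, one labeled edge per triple.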
In summary, the invention collects corpora in the electric power field, segments sentences from the corpora, and preprocesses them; constructs a power equipment ontology system comprising classes, attributes, and instances; collects an entity recognition training set and inputs it into a bidirectional long short-term memory (Bi-LSTM) network for training, obtaining a named entity recognition model; obtains the word segmentation results and syntactic dependency tree of each input sentence, builds a graph structure on the dependency tree, feeds it into a GCN model suited to topological structures, and predicts relations from the learned weight matrix, forming a device association extraction model; and combines the entities recognized by the named entity recognition model with the inter-entity relations output by the device association extraction model to construct a knowledge graph. The whole process requires little human involvement, and corpus processing with the entity recognition and device association extraction models is far faster than manual processing, so more power-domain corpora can be recognized within a given period, making the final knowledge graph more complete.
Those of ordinary skill in the art will appreciate that the elements (or steps, infra) of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and the components of the examples have been generally described in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the division of the units is merely a logic function division, and there may be other division manners in actual implementation, for example, multiple units may be combined into one unit, one unit may be split into multiple units, or some features may be omitted.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the appended claims and description.

Claims (9)

1. A knowledge graph construction method based on named entity recognition and relation extraction, characterized by comprising the following steps:
collecting corpora in the electric power field, segmenting sentences from the corpora, and preprocessing the sentences;
constructing a power equipment ontology system comprising classes, attributes and instances;
collecting an entity recognition training set and inputting it into a bidirectional long short-term memory network for training, to obtain a named entity recognition model for entity recognition;
obtaining the word segmentation result and the syntactic dependency tree of an input sentence, constructing a graph structure on the dependency tree, feeding it into a GCN model suited to processing topological structures, and forming an equipment association extraction model that predicts relations according to the learned weight matrix;
combining the entities identified by the named entity recognition model with the relation between the two entities output by the equipment association extraction model to construct a knowledge graph.
2. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 1, wherein the construction process of the named entity recognition model is as follows:
sorting out a domain dictionary related to power equipment from unstructured data related to power equipment, and forming new sentences by substituting nouns of the same power equipment type according to the domain dictionary, thereby expanding the training set for data augmentation;
segmenting the sentence into words, classifying the segmented words according to the domain dictionary, and classifying words that do not appear in the domain dictionary by part-of-speech tagging;
obtaining character vectors and word vectors, forming word-class vectors by random initialization, and concatenating the character vector, the word vector and the word-class vector as the embedding vector;
inputting the embedding vector into a bidirectional long short-term memory network to encode the context information of each Chinese character, and decoding the tags of the entire sentence using a CRF.
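The concatenation of character, word, and word-class vectors described in claim 2 can be sketched as below. This is an illustrative assumption, not the patent's code: the vector dimensions (100, 100, 20) are made up, and real character/word vectors would come from pre-trained embeddings rather than a random generator.

```python
import numpy as np


def joint_embedding(char_vec, word_vec, class_vec):
    # Concatenate character, word, and word-class vectors into one input vector
    return np.concatenate([char_vec, word_vec, class_vec])


rng = np.random.default_rng(0)
char_vec = rng.standard_normal(100)   # stand-in for a pre-trained character vector
word_vec = rng.standard_normal(100)   # stand-in for a pre-trained word vector
class_vec = rng.standard_normal(20)   # randomly initialised word-class vector
emb = joint_embedding(char_vec, word_vec, class_vec)
print(emb.shape)  # (220,)
```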
3. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 2, wherein the named entity recognition model is divided into a joint embedding layer, a Bi-LSTM layer and a CRF layer, wherein:
the first layer is a joint embedding layer of character vectors, word vectors and word-class vectors, in which characters and words in the data set are replaced by pre-trained character vectors and pre-trained word vectors respectively, and the character vector, the word vector and the initialized word-class vector are concatenated as the character representation to form the final input vector;
the second layer is the Bi-LSTM layer, which aims at automatically extracting semantic and temporal features from the context;
The third layer is a CRF layer, and aims to solve the dependency relationship between output labels and obtain a global optimal labeling sequence of the text.
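The CRF layer's goal of finding a globally optimal labeling sequence is typically realised with Viterbi decoding; a minimal NumPy sketch follows. The tag set (O=0, B=1, I=2) and all scores are toy values chosen so the transition matrix forbids O → I, and the emission scores stand in for Bi-LSTM outputs.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Find the globally optimal tag sequence, as the CRF layer does.

    emissions:   (T, K) per-token tag scores (here, stand-ins for Bi-LSTM output)
    transitions: (K, K) tag-to-tag transition scores
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag j at step t via tag i at step t-1
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy example with tags O=0, B=1, I=2; transitions forbid O -> I
emissions = np.array([[0.0, 2.0, 0.0],
                      [0.0, 0.0, 1.5],
                      [2.0, 0.0, 0.0]])
transitions = np.array([[0.0, 0.0, -5.0],
                        [0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0]])
print(viterbi_decode(emissions, transitions))  # [1, 2, 0] -> B, I, O
```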
4. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 1, wherein the process of segmenting sentences from the corpus is as follows: punctuation marks in the corpus are collected, and the corpus is segmented into sentences based on the punctuation marks, including periods.
5. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 1, wherein the equipment association extraction model forming process specifically comprises the following steps:
obtaining the word segmentation result and the syntactic dependency tree of an input sentence from an existing natural language processing toolkit;
vectorizing the text using a BERT pre-trained model;
constructing a graph structure on the syntactic dependency tree, representing it through an adjacency matrix, and assigning different weights to different dependency relations between any two words, wherein the calculation of each weight is based on the connection and its dependency type;
feeding the constructed graph structure, which incorporates the attention effect, into a GCN model suited to processing topological structures, and predicting the relation according to the learned weight matrix.
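Building a weighted adjacency matrix from a dependency tree can be sketched as below. The per-dependency-type weights in `DEP_WEIGHTS` are hypothetical hand-set values for illustration; in the scheme described above such weights would be computed from the connection and its dependency type rather than fixed in a table.

```python
import numpy as np

# Hypothetical per-dependency-type weights; real values would be learned/computed
DEP_WEIGHTS = {"nsubj": 1.0, "dobj": 1.0, "amod": 0.5, "self": 1.0}


def build_adjacency(n_tokens, dep_edges):
    """Build a weighted adjacency matrix from a syntactic dependency tree.

    dep_edges: list of (head, dependent, dep_type) index triples.
    Edges are made symmetric and every node gets a self-loop, so a GCN
    can propagate information both along and against dependency arcs.
    """
    A = np.zeros((n_tokens, n_tokens))
    for head, dep, typ in dep_edges:
        w = DEP_WEIGHTS.get(typ, 0.1)  # small default for unlisted types
        A[head, dep] = A[dep, head] = w
    np.fill_diagonal(A, DEP_WEIGHTS["self"])
    return A


# "transformer winding overheats": amod(winding, transformer), nsubj(overheats, winding)
A = build_adjacency(3, [(1, 0, "amod"), (2, 1, "nsubj")])
print(A)
```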
6. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 5, wherein the equipment association extraction model is divided into a vector embedding module, an attention matrix conversion module and a GCN module, and the working process is as follows:
processing the corpus to obtain a word-segmented text sequence and a syntactic dependency tree;
in the vector embedding module, vectorizing the text sequence using the BERT pre-trained model to obtain word vectors;
in the attention matrix conversion module, integrating the text sequence features, the syntactic dependency matrix and the dependency type matrix to form new weights that replace the weights of the original standard GCN;
performing feature learning on the graph structure using the GCN module;
based on the output of the GCN module, using a classifier to predict the relation label between the two entities.
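A single graph-convolution step over a weighted adjacency matrix can be sketched as follows. This is a generic GCN layer under simplifying assumptions, not the module claimed above: the feature size 8, output size 4, and the concrete matrices are toy values, and `H` merely stands in for BERT token features.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One graph-convolution step: aggregate neighbour features along the
    (attention-weighted) adjacency matrix, then apply a ReLU nonlinearity.
    """
    D_inv = np.diag(1.0 / A.sum(axis=1))  # row-normalise by weighted degree
    return np.maximum(D_inv @ A @ H @ W, 0.0)


rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 1.0],
              [0.0, 1.0, 1.0]])            # toy weighted adjacency with self-loops
H = rng.standard_normal((3, 8))            # stand-in for BERT token features
W = rng.standard_normal((8, 4))            # learned weight matrix (toy size)
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 4)
```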
7. The knowledge graph construction method based on named entity recognition and relation extraction according to claim 1, further comprising performing visualization processing on the constructed knowledge graph, specifically: constructing the knowledge graph in the form of <entity 1, relation, entity 2> triples based on the identified entities and the relation between the two entities, then connecting the two entities and labelling the relation between them for visual display.
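The visual display step of connecting two entities and labelling their relation can be approximated with a simple text rendering, sketched below. The triples and the arrow format are illustrative only; a real visualization would draw labelled edges in a graph tool instead.

```python
def render_graph(triples):
    """Render <entity 1, relation, entity 2> triples as labelled edges,
    a text stand-in for drawing the knowledge graph.
    """
    return ["{} --[{}]--> {}".format(h, r, t) for h, r, t in triples]


# Hypothetical triples for a small power-equipment graph
triples = [("transformer", "has_part", "bushing"),
           ("bushing", "made_of", "porcelain")]
for line in render_graph(triples):
    print(line)
```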
8. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored program, wherein, when the program runs, a device on which the computer-readable storage medium resides is controlled to execute the knowledge graph construction method based on named entity recognition and relation extraction according to any one of claims 1 to 7.
9. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the knowledge graph construction method based on named entity recognition and relation extraction according to any one of claims 1 to 7.
CN202410803255.5A 2024-06-20 2024-06-20 Knowledge graph construction method based on named entity recognition and relation extraction Pending CN118673158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410803255.5A CN118673158A (en) 2024-06-20 2024-06-20 Knowledge graph construction method based on named entity recognition and relation extraction

Publications (1)

Publication Number Publication Date
CN118673158A true CN118673158A (en) 2024-09-20

Family

ID=92730342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410803255.5A Pending CN118673158A (en) 2024-06-20 2024-06-20 Knowledge graph construction method based on named entity recognition and relation extraction

Country Status (1)

Country Link
CN (1) CN118673158A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119474401A (en) * 2024-10-30 2025-02-18 大连理工大学 A knowledge graph construction method and system for cable structured process library

Similar Documents

Publication Publication Date Title
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN117290489A (en) A method and system for rapid construction of industry question and answer knowledge base
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN112463926A (en) Data retrieval/intelligent question answering method, device and storage medium
WO2023108991A1 (en) Model training method and apparatus, knowledge classification method and apparatus, and device and medium
CN112559684A (en) Keyword extraction and information retrieval method
CN105205699A (en) User label and hotel label matching method and device based on hotel comments
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN114840685A (en) Emergency plan knowledge graph construction method
CN116975313B (en) Semantic tag generation method and device based on electric power material corpus
CN119396997B (en) Real-time data analysis and visualization method and system in big data environment
CN114970553B (en) Information analysis method and device based on large-scale unmarked corpus and electronic equipment
CN112148886A (en) Method and system for constructing content knowledge graph
CN113821590B (en) Text category determining method, related device and equipment
CN101079024B (en) Special word list dynamic generation system and method
CN117235228A (en) Customer service question-answer interaction method, device, equipment and storage medium
CN111651569A (en) A Knowledge Base Question Answering Method and System in Electric Power Field
CN120216703A (en) A method and system for constructing electric power engineering knowledge base based on large model
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN113157860A (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
CN118964533A (en) Retrieval enhancement generation method and system supporting multi-language knowledge base
CN111949781B (en) Intelligent interaction method and device based on natural sentence syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination