Disclosure of Invention
Aiming at the problems in the prior art that, when a knowledge graph is established based on manually collected information such as electric power entities and the relationships among them, the collection process is cumbersome, the collection rate is low, the collection effect is poor, time is wasted and errors are prone to occur, the invention provides a knowledge graph construction method based on named entity recognition and relation extraction. The specific technical scheme is as follows:
a knowledge graph construction method based on named entity recognition and relation extraction comprises the following steps:
collecting a corpus in the electric power field, segmenting the corpus into sentences, and preprocessing the sentences;
constructing a power equipment ontology system comprising classes, attributes and instances;
collecting an entity recognition training set and inputting it into a bidirectional long short-term memory network for training, so as to obtain a named entity recognition model for entity recognition;
obtaining the word segmentation result and syntax dependency tree of an input sentence, constructing a graph structure on the syntax dependency tree, putting the graph structure into a GCN model, which is good at processing topological structures, and predicting the relation according to the learned weight matrix, thereby forming an equipment association extraction model;
combining the entities identified by the named entity recognition model with the relationship between two entities output by the equipment association extraction model to construct a knowledge graph.
The named entity recognition model is constructed as follows:
sorting out a domain dictionary related to the power equipment from unstructured data about the power equipment, forming new sentences by replacing power equipment nouns with nouns of the same type according to the domain dictionary, and expanding the training set for data augmentation (illustrated in the sketch following these steps);
segmenting the sentences into words and classifying the segmented words according to the domain dictionary, and classifying the words that do not appear in the domain dictionary by part-of-speech tagging;
obtaining a character vector and a word vector, forming a word-class vector through random initialization, and concatenating the character vector, the word vector and the word-class vector as an embedding vector;
inputting the embedding vector into a bidirectional long short-term memory network to encode the context information of each Chinese character, and decoding the tags of the entire sentence using a CRF.
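The data augmentation step can be sketched minimally as follows; the dictionary contents, function names and replacement count are illustrative assumptions, not details from the invention:

```python
import random

# Hypothetical domain dictionary: device type -> power equipment nouns of that type.
DOMAIN_DICT = {
    "voltage transformer": ["electromagnetic all-insulated voltage transformer",
                            "capacitive voltage transformer"],
    "switchgear": ["circuit breaker", "disconnector", "load switch"],
}

def augment_sentence(sentence, domain_dict, n_new=2):
    """Form new sentences by replacing a device noun with other nouns of the same type."""
    new_sentences = []
    for nouns in domain_dict.values():
        for noun in nouns:
            if noun in sentence:
                candidates = [n for n in nouns if n != noun]
                for repl in random.sample(candidates, min(n_new, len(candidates))):
                    new_sentences.append(sentence.replace(noun, repl))
    return new_sentences

print(augment_sentence("The circuit breaker tripped during the storm.", DOMAIN_DICT))
```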
The named entity recognition model is divided into a joint embedding layer, a Bi-LSTM layer and a CRF layer, wherein:
The first layer is a joint embedding layer of character vectors, word vectors and word-class vectors: the characters and words in the data set are replaced with pre-trained character vectors and word vectors, respectively, which are concatenated with the initialized word-class vectors to form the final input vector as the word representation;
the second layer is the Bi-LSTM layer, which aims to automatically extract semantic and temporal features from the context;
the third layer is a CRF layer, which aims to resolve the dependencies between output labels and obtain the globally optimal labeling sequence of the text.
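Assuming the joint embedding of layer one has already been formed by concatenating character, word and word-class vectors, a minimal PyTorch sketch of the Bi-LSTM and CRF layers could look as follows; the pytorch-crf package and all dimensions are illustrative choices rather than details prescribed by the invention:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package; an assumed choice, not named in the source

class BiLstmCrfNer(nn.Module):
    def __init__(self, joint_embed_dim, hidden_dim, num_tags):
        super().__init__()
        # Layer 2: Bi-LSTM extracts semantic and temporal features from the context.
        self.bilstm = nn.LSTM(joint_embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.emission = nn.Linear(hidden_dim, num_tags)
        # Layer 3: CRF models the dependencies between output labels.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, joint_embeddings, tags=None, mask=None):
        # Layer 1 (joint embedding) is assumed to be done outside: the input is the
        # concatenation of character, word and word-class vectors per position.
        feats, _ = self.bilstm(joint_embeddings)
        emissions = self.emission(feats)
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)   # training loss
        return self.crf.decode(emissions, mask=mask)        # globally optimal tag sequence
```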
Preferably, the process of segmenting the corpus into sentences is as follows: the punctuation in the corpus is collected, and the corpus is segmented into sentences based on the punctuation, including periods.
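As an illustration only, the punctuation-based splitting could be implemented as below; the exact punctuation set beyond periods is an assumption:

```python
import re

# Split a corpus into sentences after Chinese/Western sentence-ending punctuation
# (requires Python 3.7+ for splitting on a zero-width lookbehind).
SENTENCE_END = re.compile(r"(?<=[。！？.!?])")

def split_sentences(corpus_text):
    return [s.strip() for s in SENTENCE_END.split(corpus_text) if s.strip()]

print(split_sentences("主变压器运行正常。断路器已分闸！检修计划待定。"))
```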
The forming process of the equipment association extraction model specifically comprises the following steps:
obtaining the word segmentation result and the syntax dependency tree of an input sentence from an existing natural language processing toolkit;
vectorizing the text using a pre-trained BERT model;
constructing a graph structure on the syntax dependency tree, representing the graph structure by an adjacency matrix, and assigning different weights to the different dependency relations between any two words, wherein the weights are computed based on the connections and their dependency types (see the sketch following these steps);
putting the constructed graph structure, which incorporates the attention effect, into a GCN model that is good at processing topological structures, and predicting the relation according to the learned weight matrix.
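A minimal sketch of turning dependency triples into the weighted adjacency matrix and the dependency type matrix described above; the per-type weight values are illustrative placeholders, not values given by the invention:

```python
import numpy as np

# Dependency triples (head index, dependent index, dependency type) for one sentence,
# as produced by an off-the-shelf parser; the indices and types here are made up.
dep_triples = [(1, 0, "nsubj"), (1, 3, "obj"), (3, 2, "amod")]
num_tokens = 4

# Illustrative weights per dependency type; in the model these come from learning/attention.
TYPE_WEIGHT = {"nsubj": 0.9, "obj": 0.8, "amod": 0.5}

adj = np.eye(num_tokens)                  # self-loops keep each word's own features
dep_type = np.full((num_tokens, num_tokens), "", dtype=object)
for head, dep, rel in dep_triples:
    w = TYPE_WEIGHT.get(rel, 0.3)
    adj[head, dep] = adj[dep, head] = w   # weighted, symmetric connection
    dep_type[head, dep] = dep_type[dep, head] = rel

print(adj)
print(dep_type)
```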
The equipment association extraction model is divided into a vector embedding module, an attention matrix transformation module and a GCN module, and the working process is as follows:
processing the corpus to obtain the segmented text sequence and its syntax dependency tree;
in the vector embedding module, vectorizing the text sequence using the pre-trained BERT model to obtain word vectors;
in the attention matrix transformation module, integrating the text sequence features, the syntax dependency matrix and the dependency type matrix to form new weights that replace the weights of the original standard GCN;
performing feature learning on the graph structure using the GCN module (see the sketch following this list);
based on the output of the GCN module, using a classifier to predict the relationship label between the two entities.
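A sketch of the GCN feature-learning step and the relation classifier over the two entity representations; the single-layer GCN, mean pooling and dimensions are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class GcnRelationClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_relations):
        super().__init__()
        self.gcn_weight = nn.Linear(in_dim, hidden_dim)
        # Classifier over the concatenated representations of the two entities.
        self.classifier = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, word_vectors, attention_adj, ent1_idx, ent2_idx):
        # One GCN layer: propagate word features along the attention-weighted graph;
        # attention_adj replaces the 0/1 adjacency of a standard GCN.
        h = torch.relu(attention_adj @ self.gcn_weight(word_vectors))
        ent1 = h[ent1_idx].mean(dim=0)   # pool the tokens of entity 1
        ent2 = h[ent2_idx].mean(dim=0)   # pool the tokens of entity 2
        return self.classifier(torch.cat([ent1, ent2], dim=-1))
```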
Preferably, the method further comprises performing visualization processing on the constructed knowledge graph, specifically: constructing the knowledge graph in the form of <entity 1 - relation - entity 2> triples based on the identified entities and the relationships between pairs of entities, and then connecting the two entities and marking the relationship between them for visual display.
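A minimal visualization sketch, assuming the triples have already been extracted; networkx and matplotlib are an assumed tooling choice, not one specified by the invention:

```python
import networkx as nx
import matplotlib.pyplot as plt

triples = [("capacitive voltage divider", "co-operating device", "reactor"),
           ("transformer", "belongs to", "power transformation equipment")]

g = nx.DiGraph()
for ent1, relation, ent2 in triples:
    g.add_edge(ent1, ent2, label=relation)   # connect the two entities, mark the relation

pos = nx.spring_layout(g, seed=0)
nx.draw(g, pos, with_labels=True, node_color="lightblue", node_size=2500, font_size=8)
nx.draw_networkx_edge_labels(g, pos, edge_labels=nx.get_edge_attributes(g, "label"))
plt.show()
```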
A computer readable storage medium comprising a stored program, wherein, when running, the program controls a device in which the computer readable storage medium resides to perform the knowledge graph construction method based on named entity recognition and relation extraction as described above.
A processor for running a program, wherein, when running, the program performs the knowledge graph construction method based on named entity recognition and relation extraction as described above.
Compared with the prior art, the invention has the beneficial effects that:
The method collects a corpus in the electric power field, segments the corpus into sentences and preprocesses the sentences; constructs a power equipment ontology system comprising classes, attributes and instances; collects an entity recognition training set and inputs it into a bidirectional long short-term memory network for training to obtain a named entity recognition model for entity recognition; obtains the word segmentation result and syntax dependency tree of an input sentence, constructs a graph structure on the syntax dependency tree, puts the graph structure into a GCN model that is good at processing topological structures, and predicts the relation according to the learned weight matrix, thereby forming an equipment association extraction model; and combines the entities identified by the named entity recognition model with the relationship between two entities output by the equipment association extraction model to construct a knowledge graph. The whole process requires little manual participation, and corpus processing based on the entity recognition model and the equipment association extraction model is far faster than manual processing, so that more electric power domain corpora can be recognized within a given period and the finally constructed knowledge graph is more complete.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Embodiments of the present invention are further described below with reference to fig. 1-3.
In one embodiment of the present invention, a knowledge graph construction method based on named entity recognition and relationship extraction is provided, comprising the steps of:
S1: collecting the corpus in the electric power field, segmenting sentences from the corpus, and preprocessing the sentences;
s2: constructing a power equipment body system comprising classes, attributes and instances;
s3: collecting an entity recognition training set, inputting the entity recognition training set into a two-way long-short-term memory network for training, and obtaining a named entity recognition model for entity recognition;
s4: obtaining word segmentation results and syntax dependency trees of input sentences, constructing a graph structure on the syntax dependency trees, putting the graph structure into a GCN model which is good at processing topological structures, and forming a device association extraction model according to the learned weight matrix prediction relation;
S5: and combining the relationship between the entity identified by the named entity identification model and the two entities output by the equipment association extraction model to construct a knowledge graph.
The method comprises key steps such as named entity recognition, relation extraction and knowledge graph construction. Further explanation follows based on these key steps:
Named entity recognition:
The boundaries and types of specific power equipment entities are identified from unstructured text, i.e., the power equipment mentioned in the power industry corpus is identified and classified. However, recognizing Chinese power equipment entities is not easy. At present, Chinese power equipment entity recognition still faces significant challenges, mainly for the following reasons. First, as a professional field, the names of electric power devices include many complex, field-specific words, and some device names are long and rare, for example "electromagnetic all-insulated voltage transformer". Second, there is a lack of a common Chinese power text dataset, because power domain text is difficult to acquire and annotate. Finally, Chinese itself is complex: on the one hand, Chinese lacks the spaces that serve as natural word boundaries in English text; on the other hand, Chinese sentence structure is complex, with many nested and elliptical sentences.
A domain dictionary related to the power equipment is collated from unstructured data about the power equipment, and new sentences are formed according to the domain dictionary by replacing power equipment nouns with nouns of the same type, so as to augment the training set. Then, the sentences are segmented into words, the segmented words are classified according to the domain dictionary, and the words that do not appear in the domain dictionary are classified through part-of-speech tagging. A character vector and a word vector are obtained, a word-class vector is formed through random initialization, and the character vector, word vector and word-class vector are concatenated as the embedding vector. Finally, the embedding vector is input into a bidirectional long short-term memory network (Bi-LSTM) to encode the context information of each Chinese character, and a CRF decodes the tags of the entire sentence.
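A sketch of the dictionary/part-of-speech classification step, using the jieba toolkit as one possible (assumed) choice of Chinese segmenter and part-of-speech tagger:

```python
import jieba.posseg as pseg  # jieba: a common Chinese segmentation/POS toolkit (assumed choice)

# Hypothetical domain dictionary: noun -> device category.
DOMAIN_DICT = {"电压互感器": "互感器", "断路器": "开关设备"}

def classify_words(sentence):
    """Segment the sentence; classify each word by the domain dictionary,
    falling back to the part-of-speech tag for words not in the dictionary."""
    classified = []
    for word, pos in pseg.cut(sentence):
        word_class = DOMAIN_DICT.get(word, pos)
        classified.append((word, word_class))
    return classified

print(classify_words("电压互感器与断路器相连"))
```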
As shown in fig. 2, the model is divided into three layers, i.e., a joint embedding layer, a Bi-LSTM layer and a CRF layer. The first layer is a joint embedding layer of character vectors, word vectors and word-class vectors: the characters and words in the data set are replaced with pre-trained character vectors and word vectors, respectively, which are concatenated with the initialized word-class vectors to form the final input vector as the word representation. The second layer is the Bi-LSTM layer, which aims to automatically extract semantic and temporal features from the context. The third layer is a CRF layer, which aims to resolve the dependencies between output labels and obtain the globally optimal labeling sequence of the text.
Relation extraction:
Important context information of the task is distinguished by taking advantage of an attention graph convolutional network (ATT-GCN), and the dependency type information between words in the syntax dependency tree, which has been ignored in the past, is utilized to further enrich the semantic information input into the relation extraction model and help the model understand the information contained in the text. Specifically, the word segmentation result and the syntax dependency tree of an input sentence are first obtained from an off-the-shelf natural language processing toolkit. Then, a pre-trained BERT model is used to vectorize the text.
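As an illustration of obtaining the word segmentation result, the syntax dependency tree and the BERT word vectors, the sketch below uses spaCy's Chinese pipeline and the bert-base-chinese checkpoint; both are assumed, publicly available choices rather than tools named by the invention:

```python
import spacy
import torch
from transformers import BertTokenizer, BertModel

# zh_core_web_sm must be installed separately: python -m spacy download zh_core_web_sm
nlp = spacy.load("zh_core_web_sm")
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "电容式电压互感器与电抗器配合使用"
doc = nlp(sentence)
tokens = [t.text for t in doc]                        # word segmentation result
dep_triples = [(t.head.i, t.i, t.dep_) for t in doc]  # dependency tree as (head, dependent, type)

with torch.no_grad():
    encoded = tokenizer(sentence, return_tensors="pt")
    word_vectors = bert(**encoded).last_hidden_state  # BERT vectorization of the text

print(tokens, dep_triples, word_vectors.shape)
```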
Next, a graph structure (represented by an adjacency matrix) is built on the syntax dependency tree, different weights are assigned to the different dependency relationships between any two words, and the weights are computed based on the connections and their dependency types. Finally, the constructed graph structure, which incorporates the attention effect, is put into a GCN model that is good at processing topological structures, and the relation is predicted according to the learned weight matrix.
The ATT-GCN not only distinguishes important context information in the syntax dependency tree and utilizes it accordingly, so that no fixed pruning strategy needs to be relied upon, but also learns from the dependency type information that has largely been ignored previously. Feature learning is performed by the GCN model using three kinds of feature information, namely text sequence information, syntax dependency information and syntax dependency type information; the overall structure of the model is shown in figure 3.
The model is divided into three modules, namely a vector embedding module, an attention matrix transformation module and a GCN module. First, the corpus is processed to obtain the segmented text sequence and the syntax dependency tree. Then, in the vector embedding module, the pre-trained BERT model is used to vectorize the text sequence and obtain word vectors. Next, the text sequence features, the syntax dependency matrix and the dependency type matrix are integrated in the attention matrix transformation module to form new weights that replace the weights of the original standard GCN. Thereafter, feature learning is performed on the graph structure using the GCN module. Finally, based on the output of the GCN module, a classifier is used to predict the relationship label between the two entities.
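A sketch of the attention matrix transformation module: dependency type embeddings and word features are combined into new edge weights that replace the 0/1 adjacency of a standard GCN. The concrete scoring function below is a simple illustrative choice and not the exact formulation of the invention:

```python
import torch
import torch.nn as nn

class AttentionMatrixTransform(nn.Module):
    def __init__(self, feat_dim, num_dep_types, type_dim=16):
        super().__init__()
        self.type_embed = nn.Embedding(num_dep_types, type_dim)
        self.score = nn.Linear(2 * feat_dim + type_dim, 1)

    def forward(self, word_vectors, adj, dep_type_ids):
        # word_vectors: (n, d); adj: (n, n) 0/1 syntax dependency matrix;
        # dep_type_ids: (n, n) integer ids of dependency types (0 where no edge).
        n = word_vectors.size(0)
        hi = word_vectors.unsqueeze(1).expand(n, n, -1)
        hj = word_vectors.unsqueeze(0).expand(n, n, -1)
        types = self.type_embed(dep_type_ids)
        scores = self.score(torch.cat([hi, hj, types], dim=-1)).squeeze(-1)
        # Keep only connected pairs and normalize to attention weights per row.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        att = torch.softmax(scores, dim=-1)
        return torch.nan_to_num(att)  # rows without edges become all zeros
```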
Knowledge graph construction:
A knowledge graph mines and analyzes knowledge and its corresponding carriers using visualization technology and displays the internal relations among them. The features of the field should therefore be understood in detail during construction. When constructing the ontology of the power equipment knowledge graph, characteristics of the field such as the types, functions and performance indexes of the power equipment are considered, so as to delimit the construction scope of the ontology and ensure its semantic rigor and extensibility. For example, the basic concepts of electrical equipment, such as generators, transformers and switches, and the relationships between them, such as which power system they belong to and their electrical connections, may be defined.
When constructing an ontology for the electric power field, the ontology elements it contains should be analyzed and selected with reference to existing third-party knowledge systems or related data of the field. In the ontology construction of the power equipment knowledge graph, important terms that may be involved include the names, functions and features of the power equipment. These terms are further clarified and defined in the subsequent ontology definition.
In the power equipment knowledge graph, the largest class can be defined as "power equipment", and then its subclasses, such as "transformer", "switchgear" and "cable", are defined, finally forming a complete inheritance structure. A bottom-up approach may also be employed, i.e., starting with the lowest, most concrete class and progressively looking up for its parent class until the top-most abstract concept is reached. For example, in the power equipment knowledge graph, the lowest class may be defined as "transformer", then its parent class "power equipment" is found, and the search continues upwards until the highest abstract concept is reached. A combination of these two methods may also be used.
The related concepts, attributes and relationships of the power equipment ontology are comprehensively summarized under the guidance of power system experts. For example, transformers, inverters, circuit breakers, protection devices, voltage transformers and the like are listed and classified as the main equipment elements in the electric power field. After the classes of power equipment are determined, attributes and relationships are defined for each class determined in the previous step: the attributes describe the inherent characteristics of a concept, such as the rated capacity, rated voltage, rated current, capacity ratio and voltage ratio of a transformer, while the relationships represent the connections between different concepts, such as the relationship between a voltage transformer and a relay. Defining constraints ensures the consistency of the power equipment ontology, attributes and relationships and improves data quality.
In the construction of the power equipment knowledge graph, the power equipment ontology system provides the conceptual relationships for building the graph. Constraints on entities, relationships and attributes improve data quality and thus the quality of the constructed power equipment knowledge graph. After the classes and attributes are designed, instances of the various classes need to be added. Creating an instance of a class is similar to entering data into a table in a database, where the attribute names and their value ranges have already been given in the attribute definitions. A complete ontology consists of classes, attributes and instances. For example, power transformation equipment includes step-up transformers, inverters, voltage transformers and the like.
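A toolkit-free sketch of the class/attribute/instance structure described above; the specific classes, attributes and values are illustrative examples only:

```python
from dataclasses import dataclass, field

@dataclass
class OntologyClass:  # Python 3.9+ for the list[...] annotations
    name: str
    parent: "OntologyClass | None" = None
    attributes: list[str] = field(default_factory=list)
    instances: list[dict] = field(default_factory=list)

power_equipment = OntologyClass("power equipment")
transformer = OntologyClass("transformer", parent=power_equipment,
                            attributes=["rated capacity", "rated voltage", "rated current"])

# Adding an instance is similar to entering a row whose columns are the class attributes.
transformer.instances.append({"name": "step-up transformer #1",
                              "rated capacity": "240 MVA",
                              "rated voltage": "220 kV",
                              "rated current": "630 A"})
```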
After the above ontology construction and knowledge extraction, a set of electrical equipment entities and their interrelationships is obtained. In this way, triple representations are formed, namely <entity 1 - relationship - entity 2> and <entity - attribute - value>, for example organizing entities and relationships into the form <capacitive voltage divider, co-operating device, reactor>. After the triples are formed, they need to be stored and presented visually.
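For knowledge storage, one common (assumed, not source-prescribed) option is a graph database; a minimal sketch with the Neo4j Python driver, where the URI and credentials are placeholders:

```python
from neo4j import GraphDatabase

triples = [("capacitive voltage divider", "co-operating device", "reactor")]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for ent1, relation, ent2 in triples:
        # MERGE avoids duplicate entity nodes; the relation name is stored as a property.
        session.run(
            "MERGE (a:Entity {name: $e1}) "
            "MERGE (b:Entity {name: $e2}) "
            "MERGE (a)-[r:RELATED {type: $rel}]->(b)",
            e1=ent1, e2=ent2, rel=relation,
        )
driver.close()
```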
In summary, the invention collects a corpus in the electric power field, segments the corpus into sentences and preprocesses the sentences; constructs a power equipment ontology system comprising classes, attributes and instances; collects an entity recognition training set and inputs it into a bidirectional long short-term memory network for training to obtain a named entity recognition model for entity recognition; obtains the word segmentation result and syntax dependency tree of an input sentence, constructs a graph structure on the syntax dependency tree, puts the graph structure into a GCN model that is good at processing topological structures, and predicts the relation according to the learned weight matrix, thereby forming an equipment association extraction model; and combines the entities identified by the named entity recognition model with the relationship between two entities output by the equipment association extraction model to construct a knowledge graph. The whole process requires little manual participation, and corpus processing based on the entity recognition model and the equipment association extraction model is far faster than manual processing, so that more electric power domain corpora can be recognized within a given period and the finally constructed knowledge graph is more complete.
Those of ordinary skill in the art will appreciate that the elements and steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both, and the components of the examples have been described above generally in terms of functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the division of the units is merely a logic function division, and there may be other division manners in actual implementation, for example, multiple units may be combined into one unit, one unit may be split into multiple units, or some features may be omitted.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.