CN117408255A - A dual-graph neural network medical named entity recognition method based on multi-feature fusion - Google Patents
- Publication number: CN117408255A
- Application number: CN202311317519.8A
- Authority: CN (China)
- Prior art keywords: representing, word, graph, vector, feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention provides a dual-graph neural network medical named entity recognition method based on multi-feature fusion. The method preprocesses the text sequence; on one hand, the sequence is input into a bidirectional long short-term memory network to obtain contextual features; on the other hand, a word co-occurrence graph and a dependency syntax graph are constructed respectively, and a graph convolutional neural network learns the dependencies and interactions between nodes from the graph structure to obtain global feature information. The contextual feature information and the global feature information are fused, and a multi-head self-attention mechanism computes the dependencies inside the features to obtain comprehensive feature information; finally, a CRF model scores the comprehensive feature information and performs sequence labeling to obtain the optimal label sequence. Compared with the prior art, the invention extracts text features from a dual-graph perspective by constructing the word co-occurrence graph and the dependency syntax graph, and introduces a multi-head self-attention mechanism to compute the dependencies inside the features, thereby effectively improving the accuracy of medical named entity recognition.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a dual-graph neural network medical named entity recognition method based on multi-feature fusion.
Background
Named entity recognition (NER) in the medical field is an important natural language processing task that aims to identify specific entities, such as disease names, drug names, surgical procedures and diagnoses, from text. With the continuous development of medical research and practice, a great deal of Chinese electronic medical record text is generated during clinical treatment, and this data contains rich medical information. Automatically extracting key entity information from these massive texts through medical named entity recognition can support the construction of personalized medical service systems and clinical auxiliary decision support, and is of great significance for professional research in the medical field.
Compared with named entity recognition in the general domain, entities in the medical field are diverse and complex: the same drug may have multiple aliases and variant names, the same disease may follow different naming schemes, and disease names are often more complicated. Meanwhile, complex relationships exist between entities in electronic medical records, such as those between doctors and patients or between diagnoses and treatments, which makes the medical named entity recognition task more difficult.
Disclosure of Invention
The invention aims to: addressing the technical problems described in the background art, the invention provides a dual-graph neural network medical named entity recognition method based on multi-feature fusion. Contextual feature information of the sequence is extracted with a bidirectional long short-term memory network, global feature information of the sequence is obtained with a graph convolutional neural network, the contextual and global information are fused, and a self-attention mechanism computes the dependencies inside the features, thereby improving the accuracy of medical named entity recognition.
The technical scheme is as follows: the invention provides a dual-graph neural network medical named entity recognition method based on multi-feature fusion, comprising the following steps:
step 1: preprocessing data of the text sequence, and dividing a training set and a testing set;
step 2: inputting the text sequence into a bidirectional long short-term memory network module, enhancing contextual semantic associations, and obtaining a hidden-layer feature vector H_t containing context information;
Step 3: respectively constructing a word co-occurrence diagram and a dependency syntax diagram according to the text sequence;
step 4: the graph convolutional neural network learns the dependencies and interactions between nodes from the node relations of the graph structure, obtaining global feature information H_C that fuses the word co-occurrence node features and the dependency syntax features;
Step 5: carrying out feature fusion on the context feature information obtained in the step 2 and the global feature information obtained in the step 4, and calculating the dependency relationship inside the feature by adopting a multi-head self-attention mechanism to obtain comprehensive feature information;
step 6: scoring the comprehensive feature information obtained in step 5 with a CRF model and performing sequence labeling to obtain the optimal label sequence.
Further, the specific method of the step 1 is as follows:
step 1.1: performing text word segmentation, text normalization, text cleaning and removal of stop words and low-frequency words on the text sequence;
and step 1.2, performing a shuffle operation on the data set, and dividing the data set into a training set and a testing set according to the proportion of 7:3.
Further, the specific method of the step 2 is as follows:
step 2.1, enhancing contextual semantic associations by using a bidirectional long short-term memory network, and acquiring a hidden-layer feature vector containing context information; the text sequence after the preprocessing of step 1 is denoted l = (l_1, l_2, …, l_i, …, l_n), and each word l_i is represented with the pre-trained word vector model GloVe to obtain its embedding, where l_i ∈ R^d, n denotes the length of the sentence, and d denotes the word vector dimension;
step 2.2, the bidirectional long short-term memory network obtains, through a forward and a backward LSTM, the context-dependent information of the text sequence in both directions; the computed forward and backward vectors are concatenated as the output of the hidden layer, expressed as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
C~_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t ∗ C_{t-1} + i_t ∗ C~_t
h_t = o_t ∗ tanh(C_t)

where f_t, i_t and o_t denote the forget gate, input gate and output gate at time t respectively, C~_t denotes the candidate memory cell vector at time t, C_t denotes the memory cell vector at time t, h_t denotes the hidden-layer output vector, ∗ denotes element-wise multiplication, W denotes a weight matrix, b denotes a bias vector, and σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions respectively;
the forward LSTM output h_t^f and the backward LSTM output h_t^b of the bidirectional network are concatenated to obtain the final hidden-layer feature H_t = [h_t^f ; h_t^b].
Further, the specific process of the step 3 is as follows:
step 3.1, constructing a word co-occurrence diagram;
step 3.1.1, the text sequence of length n is denoted W = {w_1, w_2, …, w_i, …, w_n}, where w_i denotes the i-th word in the text sequence, and a sliding window of a specified size is set;
step 3.1.2, the window slides along the text sequence from left to right with w_i as the center word; if w_j falls within the same window as w_i, a connecting edge is built between the word nodes w_i and w_j;
step 3.1.3, the word co-occurrence graph adjacency matrix A_m is expressed as: A_m(i, j) = c_ij if w_i and w_j co-occur within a window, and 0 otherwise, where c_ij denotes the number of times the two words co-occur within the windows;
step 3.2, constructing a dependency syntax diagram;
step 3.2.1, the text sequence is parsed with the Stanford NLP dependency parser to obtain the syntactic dependency tree of the sentence;
step 3.2.2, a graph structure G = (V, E) is constructed with the words of the dependency tree as nodes and the dependency relations as directed edges, where each edge e(i, j) ∈ E represents the dependency from head node x_i to dependent node x_j, V denotes the set of nodes, and E denotes the set of edges;
step 3.2.3, the adjacency matrix A_s is constructed from the dependency syntax graph, expressed as: A_s(i, j) = 1 if there is a dependency edge from x_i to x_j, and 0 otherwise.
further, the specific process of the step 4 is as follows:
step 4.1, the graph convolution operation aggregates and updates each node in the word co-occurrence graph and the dependency syntax graph, taking the features of its neighboring nodes into account and updating the node representation with the neighbors' information; for the word nodes H_w^l and syntax nodes H_s^l at layer l, the word nodes H_w^{l+1} and syntax nodes H_s^{l+1} at layer l+1 are updated as:

H_w^{l+1} = ReLU(Â_m H_w^l W^l)
H_s^{l+1} = ReLU(Â_s H_s^l W^l)

where W^l denotes the trainable weight parameters of the model, ReLU(·) denotes the activation function, and Â denotes the normalized symmetric adjacency matrix, expressed as Â = D^{-1/2}(A + I)D^{-1/2}, where D denotes the degree matrix of A + I;
step 4.2, the updated word node vector representation and dependency syntax node vector representation are concatenated to obtain the global feature information H_C, expressed as H_C = [H_w ; H_s].
further, the specific process of the step 5 is as follows:
step 5.1, the contextual feature information obtained in step 2 and the global feature information obtained in step 4 are fused to obtain a fusion vector, denoted R = [H_t, H_C];
step 5.2, the fusion vector R is linearly transformed to obtain three different matrices Q (Query), K (Key) and V (Value), with Q = K = V = R; Q is dot-multiplied with each key matrix K, the result is scaled, and a softmax function normalizes it into attention weights, with the calculation formula expressed as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k denotes the dimension of the key and softmax(·) maps the input elements to probabilities;
step 5.3, multiple parallel attention computations are performed on the high-dimensional feature vectors through a multi-head self-attention mechanism with e heads, each attention head having its own feature parameters; after the attention information of each head is computed, the e heads are concatenated and linearly transformed to finally obtain the comprehensive feature information fused with attention information, expressed as:

M_i = SelfAttention(Q W^Q, K W^K, V W^V)
MulAtt(Q, K, V) = Concat(M_1, M_2, …, M_e) W

where W^Q, W^K, W^V denote the weight matrices learned by the model during training, Concat(·) denotes concatenation of the feature vectors extracted by the e attention heads, and W denotes the output weight matrix generated in the implementation.
Further, the specific process of the step 6 is as follows:
for a given input sequence x = {x_1, x_2, …, x_n} and predicted label sequence y = {y_1, y_2, …, y_n}, the score is expressed as:

s(x, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of transferring from label y_i to label y_{i+1}, P_{i, y_i} denotes the score of the i-th word being labeled y_i, and s(x, y) denotes the score of the input sentence sequence x being labeled with the label sequence y.
The beneficial effects are that:
the invention acquires topological structure information and syntax dependency information among words by constructing the word co-occurrence graph and the dependency syntax graph, and solves the problem that the word ambiguity of a named entity and the dependency information among words cannot be fully utilized. The context characteristic information is captured by utilizing the two-way long-short-term memory network, the double-graph updating learning is performed through the graph convolution neural network, and a multi-head self-attention mechanism is introduced to perform characteristic fusion, so that the identification accuracy of the medical named entity is effectively improved.
Drawings
FIG. 1 is a flowchart of a dual-graph neural network medical named entity recognition method based on multi-feature fusion in an embodiment of the invention;
FIG. 2 is a schematic diagram of an overall architecture in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a word co-occurrence diagram structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a dependency syntax graph structure in an embodiment of the present invention.
Detailed Description
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are merely illustrative and do not limit the scope of the invention; various equivalent modifications made by those skilled in the art after reading the invention fall within the scope defined by the appended claims.
The invention discloses a double-graph neural network medical entity identification method based on multi-feature fusion, which comprises the following steps:
Step 1, preprocessing the text sequence; the preprocessing comprises word segmentation, case normalization, removal of non-text content, and removal of stop words and low-frequency words; the data set is then shuffled and divided into a training set and a testing set at a ratio of 7:3.
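The shuffle-and-split of step 1 can be sketched in a few lines of Python; the 7:3 ratio comes from the method, while the fixed seed and the toy corpus are illustrative choices:

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=42):
    """Shuffle the corpus reproducibly, then split it into train/test sets."""
    data = list(samples)
    random.Random(seed).shuffle(data)      # the "shuffle operation" of step 1
    cut = round(len(data) * train_ratio)   # 7:3 split point
    return data[:cut], data[cut:]

sentences = [f"record_{i}" for i in range(10)]   # hypothetical preprocessed records
train, test = split_dataset(sentences)
print(len(train), len(test))   # 7 3
```

A fixed seed keeps the partition reproducible across training runs.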
Step 2, inputting the text sequence into a bidirectional long short-term memory network module, enhancing contextual semantic associations, and acquiring a hidden-layer feature vector containing context information.
Step 2.1, the text sequence after the preprocessing of step 1 is denoted l = (l_1, l_2, …, l_i, …, l_n), and each word l_i is represented with the pre-trained word vector model GloVe to obtain its embedding, where l_i ∈ R^d, n denotes the length of the sentence, and d denotes the word vector dimension.
Step 2.2, the bidirectional long short-term memory network obtains, through a forward and a backward LSTM, the context-dependent information of the text sequence in both directions, ensuring that each word acquires rich semantic information; the computed forward and backward vectors are concatenated as the output of the hidden layer, which can be expressed as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
C~_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t ∗ C_{t-1} + i_t ∗ C~_t
h_t = o_t ∗ tanh(C_t)

where f_t, i_t and o_t denote the forget gate, input gate and output gate at time t respectively, C~_t denotes the candidate memory cell vector at time t, C_t denotes the memory cell vector at time t, h_t denotes the hidden-layer output vector, ∗ denotes element-wise multiplication, W denotes a weight matrix, b denotes a bias vector, and σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions respectively.
The forward LSTM output h_t^f and the backward LSTM output h_t^b of the bidirectional network are concatenated to obtain the final hidden-layer feature H_t = [h_t^f ; h_t^b].
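As an illustration of the gate equations above, the following is a minimal pure-Python sketch of a single LSTM time step on scalar features; the weights and inputs are toy values, and a real implementation would operate on vectors through a deep-learning framework:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: compute the forget, input and output gates and
    the candidate cell, then the new cell state and hidden output."""
    concat = [h_prev, x_t]                  # [h_{t-1}, x_t]
    def gate(name, act):
        return act(sum(w * v for w, v in zip(W[name], concat)) + b[name])
    f = gate("f", sigmoid)                  # forget gate f_t
    i = gate("i", sigmoid)                  # input gate i_t
    o = gate("o", sigmoid)                  # output gate o_t
    c_tilde = gate("c", math.tanh)          # candidate cell C~_t
    c = f * c_prev + i * c_tilde            # C_t = f_t * C_{t-1} + i_t * C~_t
    h = o * math.tanh(c)                    # h_t = o_t * tanh(C_t)
    return h, c

# toy parameters: every gate shares the weights [0.5, 0.5] and zero bias
W = {k: [0.5, 0.5] for k in "fioc"}
b = {k: 0.0 for k in "fioc"}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
print(round(h, 3), round(c, 3))
```

A bidirectional network runs this recurrence once left-to-right and once right-to-left and concatenates the two hidden outputs per position.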
And 3, respectively constructing a word co-occurrence diagram and a dependency syntax diagram according to the text sequence.
Step 3.1, constructing a word co-occurrence diagram;
Step 3.1.1, the text sequence of length n is denoted W = {w_1, w_2, …, w_i, …, w_n}, where w_i denotes the i-th word in the text sequence, and a sliding window of a specified size is set;
step 3.1.2, the window slides along the text sequence from left to right with w_i as the center word; if w_j falls within the same window as w_i, a connecting edge is built between the word nodes w_i and w_j, as shown in FIG. 3;
step 3.1.3, the word co-occurrence graph adjacency matrix A_m can be expressed as: A_m(i, j) = c_ij if w_i and w_j co-occur within a window, and 0 otherwise, where c_ij denotes the number of times the two words co-occur within the windows.
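The window-sliding co-occurrence count of steps 3.1.1 to 3.1.3 can be sketched as follows; the token list and window size are illustrative:

```python
from collections import defaultdict

def cooccurrence_matrix(words, window=2):
    """Slide a window over the token sequence and count, for every pair of
    distinct words, how often they appear inside the same window (c_ij)."""
    counts = defaultdict(int)
    for start in range(len(words) - window + 1):
        span = words[start:start + window]
        for i in range(len(span)):
            for j in range(i + 1, len(span)):
                if span[i] != span[j]:
                    counts[(span[i], span[j])] += 1
                    counts[(span[j], span[i])] += 1   # undirected edge
    return counts

tokens = ["patient", "chronic", "gastritis", "chronic", "pain"]   # toy sentence
A = cooccurrence_matrix(tokens, window=2)
print(A[("chronic", "gastritis")])   # 2
```

Pairs that never share a window keep a count of 0, matching the zero entries of A_m.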
And 3.2, constructing a dependency syntax graph.
And 3.2.1, analyzing the text sequence by using a dependency syntax analyzer Stanford NLP to obtain a syntax dependency tree of the sentence.
Step 3.2.2, a graph structure G = (V, E) is constructed with the words of the dependency tree as nodes and the dependency relations as directed edges, where each edge e(i, j) ∈ E represents the dependency from head node x_i to dependent node x_j, as shown in FIG. 4.
Here V denotes the set of nodes and E denotes the set of edges.
Step 3.2.3, the adjacency matrix A_s constructed from the dependency syntax graph can be expressed as: A_s(i, j) = 1 if there is a dependency edge from x_i to x_j, and 0 otherwise.
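A minimal sketch of building the adjacency matrix A_s from (head, dependent) pairs, assuming the parser has already produced the edge list (the word indices below are hypothetical):

```python
def dependency_adjacency(n_words, edges):
    """Build A_s from (head, dependent) index pairs: A_s[i][j] = 1 when a
    dependency edge runs from word i to word j, and 0 otherwise."""
    A = [[0] * n_words for _ in range(n_words)]
    for head, dep in edges:
        A[head][dep] = 1
    return A

# hypothetical parse of a 4-word sentence: word 1 is the root verb
edges = [(1, 0), (1, 3), (3, 2)]
A_s = dependency_adjacency(4, edges)
print(A_s[1][0], A_s[0][1])   # 1 0  (the edges are directed)
```

Keeping the matrix directed preserves the head-to-dependent orientation of the parse.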
and 4, learning the dependency relationship and interaction between nodes according to the node relationship of the graph structure by the graph convolution neural network to obtain the global feature information of the fusion word co-occurrence node feature information and the dependency syntax feature information.
Step 4.1, the graph convolution operation aggregates and updates each node in the word co-occurrence graph and the dependency syntax graph, taking the features of its neighboring nodes into account and updating the node representation with the neighbors' information; for the word nodes H_w^l and syntax nodes H_s^l at layer l, the word nodes H_w^{l+1} and syntax nodes H_s^{l+1} at layer l+1 are updated as:

H_w^{l+1} = ReLU(Â_m H_w^l W^l)
H_s^{l+1} = ReLU(Â_s H_s^l W^l)

where W^l denotes the trainable weight parameters of the model, ReLU(·) denotes the activation function, and Â denotes the normalized symmetric adjacency matrix, which can be expressed as Â = D^{-1/2}(A + I)D^{-1/2}, where D denotes the degree matrix of A + I.
Step 4.2, the updated word node vector representation and dependency syntax node vector representation are concatenated to obtain the global feature information H_C, which can be expressed as H_C = [H_w ; H_s].
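A small sketch of the symmetric normalization and one graph-convolution layer on toy matrices (pure Python standing in for a tensor library; the two-node graph and weights are illustrative):

```python
import math

def normalize_adjacency(A):
    """Â = D^{-1/2} (A + I) D^{-1/2}: add self-loops, then scale each entry
    by the square roots of the two endpoint degrees."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]           # diagonal of the degree matrix D
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(A_norm, H, W):
    """One graph-convolution layer: ReLU(Â · H · W)."""
    n, d_in = len(H), len(H[0])
    AH = [[sum(A_norm[i][k] * H[k][j] for k in range(n)) for j in range(d_in)]
          for i in range(n)]
    return [[max(0.0, sum(AH[i][k] * W[k][j] for k in range(d_in)))
             for j in range(len(W[0]))] for i in range(n)]

A = [[0, 1], [1, 0]]                # two word nodes joined by one edge
A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, H=[[1.0], [2.0]], W=[[1.0]])
print(A_norm[0][0], H1)             # 0.5 [[1.5], [1.5]]
```

The same layer is applied to both Â_m and Â_s, and the resulting node features are concatenated into H_C.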
and 5, carrying out feature fusion on the context feature information obtained in the step 2 and the global feature information obtained in the step 4, and calculating the dependency relationship inside the feature by adopting a multi-head self-attention mechanism to obtain comprehensive feature information.
Step 5.1, to strengthen the degree of association and the dependencies between the words of a sentence, the contextual feature information obtained in step 2 and the global feature information obtained in step 4 are fused to obtain a fusion vector, denoted R = [H_t, H_C].
Step 5.2, the fusion vector R is linearly transformed to obtain three different matrices Q (Query), K (Key) and V (Value), with Q = K = V = R; Q is dot-multiplied with each key matrix K, the result is scaled, and a softmax function normalizes it into attention weights, with the calculation formula expressed as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where d_k denotes the dimension of the key and softmax(·) maps the input elements to probabilities.
Step 5.3, multiple parallel attention computations are performed on the high-dimensional feature vectors through a multi-head self-attention mechanism with e heads, each attention head having its own feature parameters; after the attention information of each head is computed, the e heads are concatenated and linearly transformed to finally obtain the comprehensive feature information fused with attention information, which can be expressed as:

M_i = SelfAttention(Q W^Q, K W^K, V W^V)
MulAtt(Q, K, V) = Concat(M_1, M_2, …, M_e) W

where W^Q, W^K, W^V denote the weight matrices learned by the model during training, Concat(·) denotes concatenation of the feature vectors extracted by the e attention heads, and W denotes the output weight matrix generated in the implementation.
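The scaled dot-product attention that each head computes can be sketched as follows on a toy fusion matrix R, with Q = K = V = R as in step 5.2 (the 2×2 matrix is illustrative):

```python
import math

def softmax(row):
    m = max(row)                               # stabilize the exponentials
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q · K^T / sqrt(d_k)) · V on list-of-list matrices."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]  # attention weights per query
    return [[sum(w * V[j][c] for j, w in enumerate(row))
             for c in range(len(V[0]))] for row in weights]

R = [[1.0, 0.0], [0.0, 1.0]]                    # toy fused features, Q = K = V = R
out = attention(R, R, R)
print([round(v, 3) for v in out[0]])
```

A multi-head layer would run this with e different learned projections of R and concatenate the e outputs before the final linear map.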
And 6, calculating the comprehensive characteristic information obtained in the step 5 by using a CRF model, and performing sequence labeling to obtain an optimal labeling sequence.
Step 6.1, the CRF is a discriminative probabilistic model that can jointly exploit the input features and the correlations between labels to predict the globally optimal label sequence, greatly improving the performance of the named entity recognition task. For a given sequence x = {x_1, x_2, …, x_n} and predicted label sequence y = {y_1, y_2, …, y_n}, the score can be expressed as:

s(x, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of transferring from label y_i to label y_{i+1}, P_{i, y_i} denotes the score of the i-th word being labeled y_i, and s(x, y) denotes the score of the input sentence sequence x being labeled with the label sequence y.
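The path score s(x, y) can be sketched as below; the emission matrix P and transition matrix A hold hypothetical values, and the special start/end transitions that a full CRF layer adds are omitted for brevity:

```python
def path_score(emissions, transitions, tags):
    """s(x, y): the sum of emission scores P[i][y_i] plus transition scores
    A[y_i][y_{i+1}] along the labeled path (start/end transitions omitted)."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return score

# 3 words, 2 tags (0 = O, 1 = B-DISEASE); all scores are hypothetical
P = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]   # emission scores per word
A = [[0.5, 0.4], [0.3, 0.6]]               # tag-to-tag transition scores
score = path_score(P, A, tags=[0, 1, 0])
print(round(score, 4))   # 3.1
```

Decoding selects the tag sequence maximizing this score, typically with the Viterbi algorithm.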
The foregoing embodiments are merely illustrative of the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the present invention and to implement the same, not to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be included in the scope of the present invention.
Claims (7)
1. A dual-graph neural network medical named entity recognition method based on multi-feature fusion, characterized by comprising the following steps:
step 1: preprocessing data of the text sequence, and dividing a training set and a testing set;
step 2: inputting the text sequence into a bidirectional long short-term memory network module, enhancing contextual semantic associations, and obtaining a hidden-layer feature vector H_t containing context information;
Step 3: respectively constructing a word co-occurrence diagram and a dependency syntax diagram according to the text sequence;
step 4: the graph convolutional neural network learns the dependencies and interactions between nodes from the node relations of the graph structure, obtaining global feature information H_C that fuses the word co-occurrence node features and the dependency syntax features;
Step 5: carrying out feature fusion on the context feature information obtained in the step 2 and the global feature information obtained in the step 4, and calculating the dependency relationship inside the feature by adopting a multi-head self-attention mechanism to obtain comprehensive feature information;
step 6: scoring the comprehensive feature information obtained in step 5 with a CRF model and performing sequence labeling to obtain the optimal label sequence.
2. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 1, characterized in that the specific method of step 1 is as follows:
step 1.1: performing text word segmentation, text normalization, text cleaning and removal of stop words and low-frequency words on the text sequence;
and 1.2, performing a shuffle operation on the data set, and dividing the data set into a training set and a testing set according to the ratio of 7:3.
3. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 1, characterized in that the specific method of step 2 is as follows:
step 2.1, enhancing contextual semantic associations by using a bidirectional long short-term memory network, and acquiring a hidden-layer feature vector containing context information; the text sequence after the preprocessing of step 1 is denoted l = (l_1, l_2, …, l_i, …, l_n), and each word l_i is represented with the pre-trained word vector model GloVe to obtain its embedding, where l_i ∈ R^d, n denotes the length of the sentence, and d denotes the word vector dimension;
step 2.2, the bidirectional long short-term memory network obtains, through a forward and a backward LSTM, the context-dependent information of the text sequence in both directions; the computed forward and backward vectors are concatenated as the output of the hidden layer, expressed as:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
C~_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t ∗ C_{t-1} + i_t ∗ C~_t
h_t = o_t ∗ tanh(C_t)

where f_t, i_t and o_t denote the forget gate, input gate and output gate at time t respectively, C~_t denotes the candidate memory cell vector at time t, C_t denotes the memory cell vector at time t, h_t denotes the hidden-layer output vector, ∗ denotes element-wise multiplication, W denotes a weight matrix, b denotes a bias vector, and σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions respectively;
the forward LSTM output h_t^f and the backward LSTM output h_t^b of the bidirectional network are concatenated to obtain the final hidden-layer feature H_t = [h_t^f ; h_t^b].
4. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 1, characterized in that the specific process of step 3 is as follows:
step 3.1, constructing a word co-occurrence diagram;
step 3.1.1, the text sequence of length n is denoted W = {w_1, w_2, …, w_i, …, w_n}, where w_i denotes the i-th word in the text sequence, and a sliding window of a specified size is set;
step 3.1.2, the window slides along the text sequence from left to right with w_i as the center word; if w_j falls within the same window as w_i, a connecting edge is built between the word nodes w_i and w_j;
step 3.1.3, the word co-occurrence graph adjacency matrix A_m is expressed as: A_m(i, j) = c_ij if w_i and w_j co-occur within a window, and 0 otherwise, where c_ij denotes the number of times the two words co-occur within the windows;
step 3.2, constructing a dependency syntax diagram;
step 3.2.1, the text sequence is parsed with the Stanford NLP dependency parser to obtain the syntactic dependency tree of the sentence;
step 3.2.2, a graph structure G = (V, E) is constructed with the words of the dependency tree as nodes and the dependency relations as directed edges, where each edge e(i, j) ∈ E represents the dependency from head node x_i to dependent node x_j, V denotes the set of nodes, and E denotes the set of edges;
step 3.2.3, the adjacency matrix A_s is constructed from the dependency syntax graph, expressed as: A_s(i, j) = 1 if there is a dependency edge from x_i to x_j, and 0 otherwise.
5. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 1, characterized in that the specific process of step 4 is as follows:
step 4.1, aggregate and update the nodes in the word co-occurrence graph and the dependency syntax graph; the graph convolution operation considers the neighbor-node characteristics of each node and updates a node's feature representation using the information of its neighbor nodes; for the layer-l word nodes H_w^l and syntax nodes H_s^l, the layer-(l+1) word nodes H_w^(l+1) and syntax nodes H_s^(l+1) are updated as follows:

H_w^(l+1) = ReLU(Â H_w^l W^l)
H_s^(l+1) = ReLU(Â H_s^l W^l)

wherein W^l represents the model's trainable weight parameters, ReLU(·) represents the activation function, and Â represents the normalized symmetric adjacency matrix of the corresponding graph (A_m or A_s), expressed as:

Â = D^(-1/2) A D^(-1/2)

wherein D represents the degree matrix;
step 4.2, concatenate the updated word-node vector representation H_w and the dependency-syntax-node vector representation H_s to obtain the global feature information H_C, expressed as:

H_C = [H_w, H_s]
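Steps 4.1 and 4.2 can be sketched with NumPy as below. This is a minimal illustration under one stated assumption: self-loops (A + I) are added before normalization, a common graph-convolution convention that the claim's formula Â = D^(-1/2) A D^(-1/2) does not spell out; the function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization of step 4.1: D^(-1/2) (A + I) D^(-1/2).

    Self-loops are added so each node's own features take part in the
    aggregation; D is the degree matrix of the self-looped graph.
    """
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph-convolution update: H_next = ReLU(A_norm @ H @ W)."""
    return np.maximum(0.0, A_norm @ H @ W)

def fuse_global_features(H_w, H_s):
    """Step 4.2: concatenate word-graph and syntax-graph representations."""
    return np.concatenate([H_w, H_s], axis=-1)
```

The same `gcn_layer` is applied to the co-occurrence graph (with Â from A_m) and the dependency graph (with Â from A_s) before fusing.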
6. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 1, wherein the specific process of step 5 is as follows:
step 5.1, perform feature fusion on the context feature information H_t obtained in step 2 and the global feature information H_C obtained in step 4 to obtain a fusion vector, recorded as R = [H_t, H_C];
step 5.2, perform a linear transformation on the fusion vector R to obtain 3 different vector matrices Q (Query), K (Key) and V (Value), wherein Q = K = V = R; perform a dot-product operation between Q and each key matrix K, scale the result, and normalize with the softmax function to obtain the attention weights, with the calculation formula expressed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

wherein d_k represents the dimension of the key and softmax(·) maps the input elements to probabilities;
step 5.3, perform e parallel attention computations on the high-dimensional feature vectors through a multi-head self-attention mechanism, each attention head having different feature parameters; after the attention information of each head is computed, the e heads are concatenated and linearly transformed, finally yielding the comprehensive feature information fused with the attention information, expressed as:
M_i = SelfAttention(QW_Q, KW_K, VW_V)
MulAtt(Q, K, V) = Concat(M_1, M_2, ..., M_e) W
wherein W_Q, W_K, W_V represent weight matrices learned by the model during training, Concat(·) represents concatenation of the feature vectors extracted by the e attention heads, and W represents a weight matrix generated in the implementation process.
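Steps 5.2 and 5.3 can be sketched in NumPy as follows, a minimal illustration rather than the claimed implementation: it assumes each head carries its own (W_Q, W_K, W_V) triple, and the function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention of step 5.2: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(R, head_weights, W_out):
    """Steps 5.2-5.3 with Q = K = V = R.

    `head_weights` is a list of e (W_Q, W_K, W_V) triples; each head
    projects R, the e outputs are concatenated and mapped by W_out.
    """
    M = [self_attention(R @ WQ, R @ WK, R @ WV) for WQ, WK, WV in head_weights]
    return np.concatenate(M, axis=-1) @ W_out
```

With e heads of dimension d_k, the concatenated output has width e·d_k before the final linear map W_out restores the model dimension.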
7. The dual-graph neural network medical named entity recognition method based on multi-feature fusion according to claim 6, wherein the specific process of step 6 is as follows:
for a given sequence x = {x_1, x_2, ..., x_n} and predicted sequence y = {y_1, y_2, ..., y_n}, the calculated score is expressed as:

s(x, y) = Σ_i T_(y_i, y_(i+1)) + Σ_i P_(i, y_i)

wherein T_(y_i, y_(i+1)) represents the probability of transferring from y_i to y_(i+1), P_(i, y_i) represents the probability that the i-th word is labeled y_i, and s(x, y) represents the probability that the input sentence sequence x is labeled with the label sequence y.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311317519.8A CN117408255A (en) | 2023-10-11 | 2023-10-11 | A dual-graph neural network medical named entity recognition method based on multi-feature fusion |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117408255A true CN117408255A (en) | 2024-01-16 |
Family
ID=89488146
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311317519.8A Pending CN117408255A (en) | 2023-10-11 | 2023-10-11 | A dual-graph neural network medical named entity recognition method based on multi-feature fusion |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117408255A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119673429A (en) * | 2024-12-05 | 2025-03-21 | 济南凯信医疗科技有限公司 | An Internet-based online gynecological disease intelligent consultation system |
Similar Documents
| Publication | Title |
|---|---|
| CN112818676B (en) | A joint extraction method of medical entity relationships |
| CN108984724B (en) | Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation |
| CN110765775B (en) | A Domain Adaptation Method for Named Entity Recognition Fusing Semantics and Label Differences |
| CN112364174A (en) | Patient medical record similarity evaluation method and system based on knowledge graph |
| CN111914556B (en) | Emotion guiding method and system based on emotion semantic transfer pattern |
| CN106980608A (en) | A Chinese electronic medical record word segmentation and named entity recognition method and system |
| CN111554360A (en) | Drug relocation prediction method based on biomedical literature and domain knowledge data |
| CN109871538A (en) | A Chinese electronic health record named entity recognition method |
| CN111950283B (en) | Chinese word segmentation and named entity recognition system for large-scale medical text mining |
| CN114021584B (en) | Knowledge representation learning method based on graph convolution network and translation model |
| CN114781382B (en) | Medical named entity recognition system and method based on RWLSTM model fusion |
| Ke et al. | Medical entity recognition and knowledge map relationship analysis of Chinese EMRs based on improved BiLSTM-CRF |
| CN108959566B (en) | A medical text de-privacy method and system based on Stacking ensemble learning |
| CN111078875A (en) | Method for extracting question-answer pairs from semi-structured document based on machine learning |
| CN111274790A (en) | Text-level event embedding method and device based on syntactic dependency graph |
| CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network |
| CN114077673A (en) | Knowledge graph construction method based on BTBC model |
| Ren et al. | Detecting the scope of negation and speculation in biomedical texts by using recursive neural network |
| Zhang et al. | Using a pre-trained language model for medical named entity extraction in Chinese clinic text |
| CN116108840A (en) | A text fine-grained sentiment analysis method, system, medium and computing device |
| CN117891958B (en) | A standard data processing method based on knowledge graph |
| CN111125378A (en) | Closed-loop entity extraction method based on automatic sample labeling |
| CN116680407B (en) | A method and apparatus for constructing a knowledge graph |
| CN114373512B (en) | Protein interaction relationship extraction method based on Gaussian enhancement and auxiliary tasks |
| CN117408255A (en) | A dual-graph neural network medical named entity recognition method based on multi-feature fusion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |