Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The inventor finds that the schemes of the prior art have high time complexity or space complexity and that the overall accuracy and precision of graph extraction are low, so that they are difficult to apply to practical, industrial-grade use scenarios involving large-scale long texts. The inventor therefore proposes a solution that is both high-performing and efficient, so as to accommodate existing industrial-grade application scenarios.
As shown in fig. 1, an embodiment of the present invention provides a training method for an alternating sequence generation model, including:
S11, acquiring a training sample pair from a sample library, wherein the training sample pair comprises a paired training text and training information graph, and the training information graph comprises a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes.
Illustratively, the training information graph may be regarded as a heterogeneous multi-graph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (typically representing spans (t_s, t_e) in an input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. Node types and edge types are assumed to be drawn from a finite vocabulary. Node types may be used to represent entity types (PER, ORG, etc.), while edge types may represent relations between nodes (PHYS, ORG-AFF, etc.).
S12, generating a training alternating sequence containing node information and edge information according to the training information graph.
In this embodiment, the space of the heterogeneous multi-graph G is not modeled directly; instead, a mapping f_s from G to a sequence space S^π, S^π = f_s(G, π), is constructed. This mapping depends on a (given) ordering π of the nodes and their edges in G and is built from a graph traversal algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS), together with an internal ordering of node and edge types.
In some embodiments, the node information includes node type information, and the edge information includes actual edge type information and virtual edge type information.
The present application assumes that each element s_i^π of the sequence s^π is drawn from a finite set comprising the node representations V (defined below), the node type set Q, the edge type set R, and the "virtual" edge type set U, i.e., s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control sequence generation, indicating the start/end of a sequence and the separation of levels in the graph.
Illustratively, the training alternating sequence includes node information and edge information arranged alternately. For example, assume the sequence s^π = s_0^π, ..., s_n^π has an alternating structure, where the elements at even positions s_0^π, s_2^π, s_4^π, ... are nodes and the elements at odd positions are types. In the case of BFS, the present application exploits the fact that BFS visits nodes level by level (i.e., in the order p_i, c_i1, ..., c_ik, p_j, where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be identical to one of the children of p_i), and turns the traversal into the sequence
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
Here, the present application uses a special edge type [SEP] to delimit the levels of the graph. The name of this special edge type may be arbitrary, including but not limited to the [SEP] used herein. In the case of DFS, the [SEP] type occurs immediately after a leaf node. This representation allows the present application to recover the original information graph exactly, provided it is known which type of graph traversal algorithm (BFS or DFS) the alternating sequence is based on.
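As a minimal illustration of this linearization (a sketch under the assumptions stated in the comments, not the exact algorithm of the present application), the following Python snippet converts a small information graph into a node/edge-type alternating sequence using a BFS-style, level-by-level traversal; the function name and the adjacency-list encoding of the graph are illustrative.

```python
from collections import deque

def graph_to_alternating_sequence(node_order, edges):
    """Linearize an information graph into an alternating sequence of nodes
    and edge types, level by level (BFS-style). `edges` maps a node to its
    outgoing (edge_type, child) pairs; "[SEP]" closes each level.
    Simplified sketch; the exact algorithm of the application is not shown."""
    seq, expanded = [], set()
    queue = deque(node_order)                 # seed with the node ordering pi
    while queue:
        parent = queue.popleft()
        out = edges.get(parent, [])
        if parent in expanded or not out:     # leaves appear only as children
            continue
        expanded.add(parent)
        seq.append(parent)                    # node position
        for edge_type, child in out:
            seq.extend([edge_type, child])    # edge-type / node positions
            queue.append(child)
        seq.append("[SEP]")                   # virtual edge type: end of level
    return seq

# Toy graph for "He was captured in Baghdad ...": Baghdad --PHYS--> He,
# and each mention is linked to its node type by a special "type" edge.
edges = {"He": [("type", "PER")],
         "Baghdad": [("type", "GPE"), ("PHYS", "He")]}
print(graph_to_alternating_sequence(["He", "Baghdad"], edges))
# ['He', 'type', 'PER', '[SEP]', 'Baghdad', 'type', 'GPE', 'PHYS', 'He', '[SEP]']
```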
S13, training the alternating sequence generation model according to the training text and the training alternating sequence.
In this embodiment, when the information graph is extracted from the text through the model, the graph is not directly modeled, but the problem of extracting the graph from the text is converted into the problem of extracting the alternating sequence from the text, so that the alternating sequence generating model obtained by the method of this embodiment has only linear time and space complexity when being used for graph extraction, and the time and space efficiency is significantly improved.
In some embodiments, generating the training alternating sequence including the node information and the edge information according to the training information graph in step S12 includes traversing the training information graph using a preset traversal algorithm to generate the training alternating sequence including the node information and the edge information. The preset traversal algorithm may be a Breadth-First Search (BFS) algorithm or a Depth-First Search (DFS) algorithm, which is not limited in the present application.
In some embodiments, the training information graph contains only two kinds of objects: spans, which serve as addresses of input text fragments, and types, which are representations of abstract concepts, wherein a type may be treated as a span of length 1 over the vocabulary formed by the node type information, the actual edge type information, and the virtual edge type information.
The representation of the nodes and edges of the alternating sequence relies on the observation that there are only two kinds of objects in the information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since the present application can treat the types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), the present application defines the ordered set of these text spans and length-1 type spans as mixed spans. An index into this ordered set can be mapped back reversibly to a type or a text span according to its magnitude. The task of generating an information graph is thus converted into generating an alternating sequence of mixed spans through a joint indexing of spans and types.
In some embodiments, training the alternating sequence generation model according to the training text and the training alternating sequence comprises processing the output distribution of the alternating sequence generation model with an alternation mask, so as to obtain alternating sequences composed of node information and edge information arranged alternately.
Illustratively, the alternating sequence generation model is a neural decoder that is forced to generate alternating sequences by decoding spans and types only in an alternating, hybrid manner. For each decoding step, the decoder has only linear space and time complexity with respect to the length of the input sequence, and because it is by nature a sequential decision process, it can capture the interdependencies between spans and types.
Fig. 2 is a flowchart of an embodiment of a method for extracting a graph from text according to the present invention; the method includes:
S21, inputting the text to be extracted into an alternating sequence generation model trained by the alternating sequence generation model training method to obtain a target alternating sequence;
S22, generating a target information graph according to the target alternating sequence.
In this embodiment, the graph is not directly modeled when the information graph is extracted from the text through the model; instead, the problem of extracting a graph from text is converted into the problem of extracting an alternating sequence from text, so that the graph is extracted from the text with only linear time and space complexity, and the time and space efficiency is remarkably improved.
In some embodiments, the generating a target information graph from the target alternating sequence comprises:
processing the target alternating sequence according to the preset traversal algorithm adopted when training the alternating sequence generation model, to generate the target information graph.
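For the reverse direction, a minimal sketch is given below, assuming the BFS-based linearization sketched earlier ([SEP] closing each level, the first element of a level being the parent); the helper name and the (nodes, edges) output encoding are illustrative rather than the exact procedure of this application.

```python
def alternating_sequence_to_graph(seq, sep="[SEP]"):
    """Rebuild (nodes, edges) from a BFS-style alternating sequence in which
    each level looks like: parent, edge_type, child, edge_type, child, ..., [SEP].
    Illustrative sketch, not the exact procedure of the application."""
    nodes, edges = set(), []
    level = []
    for item in seq:
        if item != sep:
            level.append(item)
            continue
        if level:                       # close the current BFS level
            parent = level[0]
            nodes.add(parent)
            # remaining items alternate: edge_type, child, edge_type, child, ...
            for edge_type, child in zip(level[1::2], level[2::2]):
                nodes.add(child)
                edges.append((parent, edge_type, child))
        level = []
    return nodes, edges

seq = ['He', 'type', 'PER', '[SEP]', 'Baghdad', 'type', 'GPE', 'PHYS', 'He', '[SEP]']
print(alternating_sequence_to_graph(seq))
# nodes: {'He', 'PER', 'Baghdad', 'GPE'} (set order may vary)
# edges: [('He', 'type', 'PER'), ('Baghdad', 'type', 'GPE'), ('Baghdad', 'PHYS', 'He')]
```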
In order to describe the technical solution of the present invention more clearly, and to demonstrate more directly its performance and advantages over the prior art, the technical background, the technical solution, the experiments performed, and the like are described in more detail below.
Abstract
Text-to-Graph extraction aims at automatically extracting an information graph consisting of mentions (or entities) and types from natural language text. Existing approaches, such as table filling and pairwise scoring, exhibit impressive performance on various information extraction tasks, but they are difficult to scale to datasets with longer input texts because of their second-order space/time complexity with respect to the input length. In this work, the present application proposes the Hybrid Span Generator (HySPA), which maps the information graph to an alternating sequence of nodes and edge types and directly generates such a sequence through a hybrid span decoder that can recurrently decode spans and types with linear time and space complexity. Extensive experiments on the ACE05 dataset show that the method of the present application is also significantly superior to existing methods on the joint entity and relation extraction task.
1. Introduction
Information Extraction (IE) can be regarded as a Text-to-Graph extraction task, aimed at extracting from unstructured text an information graph (Li et al., 2014; Shi et al., 2017) consisting of mentions (or entities) and types, where the nodes of the graph are the mentions or entity types and the edges are relation types representing the relations between the nodes. Typical graph extraction methods decompose the extraction process into subtasks, such as Named Entity Recognition (NER) (Florian et al., 2006, 2010) and Relation Extraction (RE) (Sun et al., 2011; Jiang and Zhai, 2007), and execute them separately (Chan and Roth, 2011) or jointly (Li and Ji, 2014; Eberts and Ulges, 2019).
Recent joint IE models (Wadden et al., 2019; Wang and Lu, 2020; Lin et al., 2020) exhibit impressive performance on various IE tasks, as they can mitigate error propagation and exploit the interdependencies between tasks. Previous work often used pairwise scoring techniques to identify the type of relation between entities. However, this approach is computationally inefficient because it requires enumerating all possible pairs of entities in the document, and, because of the sparsity of relations between entities, the relation type is null in most cases. Furthermore, the pairwise scoring technique evaluates each relation type independently and thus cannot capture the correlations between the relation types of different mention pairs.
Another approach is to consider the joint information extraction task as a table filling problem (Zhang et al., 2017; Wang and Lu, 2020) and use a multi-dimensional recurrent neural network (Graves et al., 2007) to generate a two-dimensional table. This can capture the interrelations between entities and relations, but the space complexity grows quadratically with the length of the input text, making the approach impractical for long sequences.
Some attempts, for example Seq2RDF (Liu et al., 2018) and IMoJIE (Kolluru et al., 2020), leverage the power of the Seq2Seq model (Cho et al., 2014) to capture correlations between mentions and types with first-order complexity, but they all use predefined vocabularies for mention prediction, which depend heavily on the distribution of target words and cannot handle unseen out-of-vocabulary words.
To solve these problems, the present application proposes a first-order method that reversibly maps the target graph to an alternating sequence of nodes and edges and applies a hybrid span generator that directly learns to generate such an alternating sequence. The main contributions of the present application are threefold:
The present application proposes a general technique for constructing a reversible mapping between an information graph and an alternating sequence (assuming a given graph traversal algorithm). Generating the alternating sequence is then equivalent to generating the original information graph.
The present application proposes a new neural decoder that is forced to generate alternating sequences by decoding spans and types only in an alternating, hybrid way. For each decoding step, the decoder has only linear space and time complexity with respect to the length of the input sequence, and because it is by nature a sequential decision process, it can capture the interdependencies between spans and types.
The present application conducted extensive experiments on the Automatic Content Extraction (ACE) dataset, showing that the model of the present application achieves state-of-the-art performance on the joint entity and relation extraction task, which aims to extract a knowledge graph from a piece of unstructured text.
2. Modeling information graphs as alternating sequences
The information graph can be considered as a heterogeneous multi-graph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (typically representing spans (t_s, t_e) in an input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. The node and edge types are assumed to be drawn from a finite vocabulary. Node types may be used, for example, to represent entity types (PER, ORG, etc.), while edge types may represent relations between nodes (PHYS, ORG-AFF, etc.). In this work, the present application represents a node type as an individual node that is connected to its node v by a special edge type.
Representing the information graph as a sequence. The present application does not model the space of the heterogeneous multi-graph G directly; instead, it builds a mapping f_s from G to a sequence space S^π, S^π = f_s(G, π), which depends on a (given) ordering π of the nodes and their edges in G and is constructed from a graph traversal algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS), together with an internal ordering of node and edge types. The present application assumes that each element s_i^π of the resulting sequence s^π is drawn from a finite set comprising the node representations V, the node types Q, the (actual) edge types R, and the "virtual" edge types U: s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control sequence generation, indicating the start/end of a sequence and the division of levels in the graph.
The present application further assumes that the representation s^π = s_0^π, ..., s_n^π has an alternating structure, where the elements at even positions s_0^π, s_2^π, s_4^π, ... are nodes and the elements at odd positions are types. In the case of BFS, the present application exploits the fact that BFS visits nodes level by level, i.e., in the order p_i, c_i1, ..., c_ik, p_j (where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be equal to one of the children of p_i), and turns this traversal into the sequence
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
Here, the present application uses a special edge type [SEP] to delimit the levels of the graph. This representation allows the present application to recover the original graph exactly if it is known which type of graph traversal (BFS or DFS) is assumed. Algorithm 1 (used by the present application to convert the graphs in the training data into sequences) shows how an alternating sequence can be constructed from a given graph using a BFS traversal. Fig. 3 shows the alternating sequence of an information multi-graph. The length |s^π| is bounded linearly by the size of the graph, O(|s^π|) = O(|V| + |E|), which is also the complexity of typical graph traversal algorithms such as BFS/DFS.
FIG. 3: the present application shows a directed multi-graph as an alternating sequence of nodes (A, B, C, D, E) and edge types (1, 2, 3, 4, [S]). Here, the graph is traversed by Breadth-First Search (BFS) in ascending order of node and edge types. "[S]", i.e., [SEP], is a virtual edge type marking the end of each BFS level.
Node and edge representations. The node and edge representations of the present application (explained below) rely on the observation that there are only two kinds of objects in the information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since the present application can treat the types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), it needs only O(nm + |Q ∪ R ∪ U|) indexes to represent the concatenation of the type vocabulary and the spans of the input text, where n is the maximum input length, m is the maximum span length, and m << n. The present application defines the ordered set of these text spans and length-1 type spans as mixed spans. An index into this ordered set can be mapped back reversibly to a type or a text span according to its magnitude (the details of this mapping are explained in Section 3.2). The task of generating an information graph is thus transformed into generating an alternating sequence of mixed spans through a joint indexing of spans and types.
Generating the sequence. The present application models the distribution p(s^π) with a sequence generator h with parameters θ, where |s^π| is the length of s^π:
p(s^π) = ∏_{i=1}^{|s^π|} h_θ(s_i^π | s_0^π, ..., s_{i−1}^π).
The following section discusses how the sequence generator h is forced to generate sequences only within the space S^π, since the present application does not want h to assign non-zero probability to any sequence that has no corresponding graph.
3. HySPA: hybrid span generation of alternating sequences
In order to directly generate a target sequence (which alternates between nodes, representing spans in the input, and the node/edge type set that the extraction task of the present application relies on), the present application first constructs a hybrid representation H, which is a concatenation of hidden representations of the edge types, node types, and input text. This representation serves as both the context space and the output space of the decoder. The present application then reversibly maps both the spans of the input text and the indexes of the types to mixed spans based on the representation H. Finally, the alternating sequence y^π ∈ S^π is formed autoregressively from the mixed spans generated by the hybrid span decoder. By converting the graph extraction task into a sequence generation task, the present application can easily use beam search decoding to reduce the exposure bias (Wiseman and Rush, 2016) that may occur in the sequential decision process and thus find a globally better graph representation.
High-level overview of HySPA. The HySPA model takes as input a piece of text (e.g., a sentence or paragraph) together with predefined node and edge types, and outputs an alternating sequence representation of the information graph. The present application forces the alternating generation of this sequence by applying an alternation mask to the output probabilities. The detailed architecture is described in the following subsections.
3.1 Text and type encoder
Fig. 4 shows the encoder architecture of the proposed model of the present application, where ⊕ is the concatenation operator, k is the index of a word vector in H_0, and l_e = |R| + |U|. The color table on the right indicates the meta-type assignment of the different blocks of concatenated vectors in H_0. For the node type set Q, edge type set R, and virtual edge type set U, the present application arranges the type list v as the concatenation of the label names of the edge types, virtual edge types, and node types, i.e.,
v = v_R ⊕ v_U ⊕ v_Q,
where ⊕ denotes the concatenation operator between two lists and v_R, v_U, v_Q are the lists of type names in the sets R, U, Q, respectively. Note that the order of concatenation of the type name lists may be arbitrary as long as it remains consistent throughout the model. Then, as in the embedding part of the table-sequence encoder (Wang and Lu, 2020), for each type v_i the present application embeds the type label tokens using contextual word embeddings from a pre-trained language model, GloVe embeddings (Pennington et al., 2014), and feature embeddings, where GloVe, short for Global Vectors for Word Representation, is a word representation tool based on global word-frequency statistics (count-based & overall statistics).
where l_p = |R| + |U| + |Q| is the total number of types, W_0 ∈ R^{d_e × d_m} is the weight matrix of a linear projection layer, d_e = d_c + d_g + d_k is the total embedding dimension, and d_m is the hidden size of the model of the present application. After the present application obtains the contextual embeddings of the tokens of each type v_i ∈ v, it takes the average of these token vectors as the representation of v_i and freezes its update during training. For more details, refer to Appendix A.
The same embedding approach is used to embed the words of the input text x. Unlike type embedding, the present application represents a word by the contextual embedding of its first sub-token from a pre-trained language model (LM, e.g., BERT (Devlin et al., 2018)) and fine-tunes the language model in an end-to-end fashion.
After obtaining the type embeddings E_v and the text embeddings E_x, the present application concatenates them along the sequence-length dimension to form the hybrid representation H_0. Since H_0 is a concatenation of word vectors from four different kinds of labels (i.e., edge types, virtual edge types, node types, and text), meta-type embeddings are applied to indicate this type difference between the vector blocks of H_0 (as shown in FIG. 4). The final context representation H is obtained by element-wise addition of the meta-type embeddings to H_0,
where l_h = l_p + |x| is the height of the mixed representation matrix H of the present application.
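A rough sketch of how the mixed representation H could be assembled is shown below: already-computed type embeddings and text embeddings are concatenated along the sequence dimension, and a learned meta-type embedding is added per block. The tensor shapes, the type counts, and the use of a plain nn.Embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_m = 256                                # hidden size of the model
n_edge, n_virtual, n_node = 9, 3, 7      # illustrative |R|, |U|, |Q| (l_p = 19)
l_p = n_edge + n_virtual + n_node        # number of type positions
seq_len = 40                             # |x|, length of the input text

# Assume these were produced by the type encoder and the text encoder.
E_v = torch.randn(l_p, d_m)              # projected type embeddings
E_x = torch.randn(seq_len, d_m)          # contextual text embeddings

# Meta-types: 0=edge type, 1=virtual edge type, 2=node type, 3=text token.
meta_ids = torch.tensor([0] * n_edge + [1] * n_virtual
                        + [2] * n_node + [3] * seq_len)
meta_embedding = nn.Embedding(4, d_m)

H0 = torch.cat([E_v, E_x], dim=0)        # (l_p + |x|, d_m), the mixed blocks
H = H0 + meta_embedding(meta_ids)        # final context representation H
print(H.shape)                           # torch.Size([59, 256]), l_h = l_p + |x|
```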
3.2 Reversible mapping between spans & types and mixed spans
Given a span in the text, t = (t_s, t_e) ∈ N^2 with t_s < t_e, the present application converts the span t into an index k in the representation H, with k ≥ l_p,
k = g_k(t_s, t_e) = t_s · m + t_e − t_s − 1 + l_p ∈ N,
where m is the maximum span length and l_p = |R| + |U| + |Q|. The present application keeps the type indexes in the graph unchanged, because they are smaller than l_p while k ≥ l_p. Since the maximum span length m of mentions in an information graph is usually much smaller than the length of the text, i.e., m << n, the present application can maintain the linear space complexity of its decoder with respect to the input text length n by considering only spans of length at most m, which reduces the maximum magnitude of k from O(n^2) to O(nm). Fig. 5 shows a specific example of the alternating sequence (middle) of a knowledge graph (bottom) from the ACE05 training set, in this case for the input text (top) "He was captured in Baghdad late Monday night". Here A1 denotes Algorithm 1, and the present application takes m = 16 and l_p = 19 in this example. In the alternating sequence, "19" is the index of the span (0, 1) of "He", "83" is the index of the span (4, 5) of "Baghdad", and "10" is the virtual edge type [SEP].
Since t_s, t_e, and k are natural numbers, the present application can construct an inverse mapping g_t that converts an index k in H back to t = (t_s, t_e),
t_e = g_t(k) = g_ts(k) + max(0, k − l_p) mod m,
where g_ts(k) recovers the start position t_s using the integer floor function ⌊·⌋, and mod is the modulo operator. Note that g_t(k) can be applied directly to indexes in the type segment of H while keeping their values unchanged, i.e., a type index k < l_p is mapped to itself. With this property, the present application can easily incorporate the mapping g_t into its decoder to map the alternating sequence y^π back to spans in the mixed representation H.
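The forward mapping g_k and an inverse can be written down directly from the formula above. The sketch below works in plain text coordinates (t_s, t_e) rather than positions inside the mixed representation H, and the helper names are assumptions; the values are checked against the Fig. 5 example (m = 16, l_p = 19).

```python
def span_to_index(t_s, t_e, m, l_p):
    """g_k: map a text span (t_s, t_e), t_s < t_e, t_e - t_s <= m,
    to a joint index k >= l_p; type indices (< l_p) are left untouched."""
    return t_s * m + t_e - t_s - 1 + l_p

def index_to_span(k, m, l_p):
    """Inverse mapping in text coordinates (a sketch, not the application's
    exact g_t over H): a type index k < l_p maps to itself, an index
    k >= l_p maps back to the text span (t_s, t_e)."""
    if k < l_p:
        return k                      # type indices keep their values
    t_s = (k - l_p) // m
    t_e = t_s + (k - l_p) % m + 1
    return (t_s, t_e)

# Values from the Fig. 5 example: m = 16, l_p = 19.
print(span_to_index(0, 1, 16, 19))    # 19  -> "He"
print(span_to_index(4, 5, 16, 19))    # 83  -> "Baghdad"
print(index_to_span(83, 16, 19))      # (4, 5)
print(index_to_span(10, 16, 19))      # 10  -> the virtual edge type [SEP]
```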
3.3 Hybrid span decoder
Fig. 6 shows the overall model architecture of the hybrid span decoder of the present application. The decoder takes the context representation H as input and, given the start-of-sequence marker, recurrently decodes the alternating sequence y^π. N is the number of decoder layers, ⊕ before the softmax function denotes the concatenation operator, and H_y^N is the hidden representation of the sequence y^π from the last decoder layer. The hybrid span decoder of the present application can be understood as an autoregressive model that operates in a closed context space and output space defined by H.
Attention-based hybrid span encoding. Given the alternating sequence y^π and the mapping g_t (Section 3.2), the decoder of the present application first maps each index in y^π to a span over the representation H, (t_si, t_ei) = g_t(y_i^π), and then converts the spans into an attention mask M_0, allowing the model to learn a weighted sum of the contextual word representations inside each span as the span (mention) representation,
where the query is the hidden representation of the start-of-sequence marker [CLS] repeated |y^π| times, the keys and values come from the text segment of H, H_y is the final mixed representation of the span sequence y^π, W_1, W_2, b_1, b_2 are learnable parameters of the present application, and t_si, t_ei are the start and end positions of the span being encoded. Note that for a type span of length 1, the softmax always evaluates to 1, so that its span representation is exactly the desired embedding vector.
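One way to realize such span-masked attention is sketched below: a query derived from the [CLS] representation attends only to positions inside the span, and the weighted sum of those token vectors becomes the span representation. The exact scoring function (a single scaled dot-product here) and all parameter names are assumptions rather than the precise equations of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_m = 256
W1 = nn.Linear(d_m, d_m)          # projects the [CLS] query (assumed form)
W2 = nn.Linear(d_m, d_m)          # projects the token keys (assumed form)

def encode_span(H, cls_vec, t_s, t_e):
    """Attention-based encoding of the span [t_s, t_e) over the mixed
    representation H (l_h x d_m). Positions outside the span are masked
    with -inf, so a length-1 span reduces to its own embedding vector."""
    scores = W1(cls_vec) @ W2(H).T / d_m ** 0.5       # (l_h,)
    mask = torch.full_like(scores, float("-inf"))
    mask[t_s:t_e] = 0.0                               # keep only the span
    weights = F.softmax(scores + mask, dim=-1)        # sums to 1 over the span
    return weights @ H                                # (d_m,) span representation

H = torch.randn(59, d_m)            # l_h = l_p + |x| = 19 + 40
cls_vec = torch.randn(d_m)          # hidden state of the [CLS]/[SOS] marker
span_vec = encode_span(H, cls_vec, 23, 26)   # a 3-token text span
type_vec = encode_span(H, cls_vec, 10, 11)   # a length-1 "type span"
print(torch.allclose(type_vec, H[10]))       # True: softmax over one position
```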
Traversal embedding. To distinguish the mixed spans at different positions in y^π, a simple approach is to add sinusoidal position embeddings to H_y (Vaswani et al., 2017). However, this approach treats the alternating sequence as an ordinary sequence and ignores the underlying graph structure it encodes. To alleviate this problem, the present application proposes a novel traversal embedding that captures traversal-level information, parent-child information, and intra-level connection information as an alternative to the original position embedding. The traversal embedding of the present application can encode either BFS or DFS traversal patterns; as an example, BFS traversal is assumed here.
FIG. 7: example of the BFS traversal embedding of the alternating sequence ["He", type, PER, [SEP], "Baghdad", type, GPE, PHYS, "He"]. The BFS traversal embedding of the present application is the point-wise sum of the layer embedding L, the parent-child embedding P, and the tree embedding T, given the alternating sequence y,
TravEmbed(y) = L(y) + P(y) + T(y) ∈ R^{|y| × d_m},
where the layer embedding assigns the same embedding vector L_i to every position of BFS traversal level i and fills the values of the embedding vectors with non-parametric sinusoidal position embeddings, since the present application expects its embeddings to extrapolate to sequences longer than any sequence in the training set. The parent-child embedding assigns different randomly initialized embedding vectors to the positions of parent and child nodes within a BFS traversal level, to help the model distinguish the two kinds of nodes. To encode intra-level connection information, the insight of the present application is that the connections within a BFS level can be seen as a tree of depth 3, where the first depth holds the parent node, the second depth holds the edge types, and the third depth consists of the child nodes corresponding to each edge type. The tree embedding of the present application is then formed by encoding the position information of this depth-3 tree with tree position embeddings (Shiv and Quirk, 2019) for each BFS level. Fig. 7 shows a specific example of how these embeddings work for a given alternating sequence. The resulting traversal embedding is then added point-wise to the hidden representation H_y of the alternating sequence to inject the traversal information of the graph structure.
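A simplified sketch of the BFS traversal embedding follows: it derives, for every position of an alternating sequence, a BFS-level index (incremented after each [SEP]) and a parent/child/edge role (the first element of a level is the parent, later even positions are children, odd positions are edge types), then sums a sinusoidal layer embedding with a learned role embedding. The depth-3 tree embedding of Shiv and Quirk (2019) is omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

d_m = 256

def sinusoidal(position, dim=d_m):
    """Non-parametric sinusoidal embedding for one position (Vaswani et al.)."""
    i = torch.arange(dim // 2, dtype=torch.float)
    angles = position / torch.pow(10000.0, 2 * i / dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)])

role_embedding = nn.Embedding(3, d_m)      # 0=parent, 1=child, 2=edge type

def bfs_traversal_embedding(seq, sep="[SEP]"):
    """Layer embedding (sinusoidal over the BFS level) + parent/child
    embedding for an alternating sequence; the tree embedding is omitted."""
    out, level, pos_in_level = [], 0, 0
    for item in seq:
        if pos_in_level == 0:
            role = 0                       # first element of a level: parent
        elif pos_in_level % 2 == 1:
            role = 2                       # odd local positions: edge types
        else:
            role = 1                       # later even positions: child nodes
        out.append(sinusoidal(level) + role_embedding(torch.tensor(role)))
        if item == sep:                    # [SEP] closes the BFS level
            level, pos_in_level = level + 1, 0
        else:
            pos_in_level += 1
    return torch.stack(out)                # (|y|, d_m), added to H_y point-wise

seq = ["He", "type", "PER", "[SEP]", "Baghdad", "type", "GPE", "PHYS", "He"]
print(bfs_traversal_embedding(seq).shape)  # torch.Size([9, 256])
```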
Internal blocks. After slicing the input text representation H_text from the mixed representation H, the present application feeds it, together with the target sequence representation H_y, into an N-layer Transformer structure with mixed attention (He et al., 2018), which allows the model to exploit edge or node features from different attention layers when decoding the alternating sequence. Note that the actual choice of neural architecture for the internal blocks is orthogonal to the hybrid span decoder of the present application; the mixed-attention Transformer design (He et al., 2018) is chosen because its layer-wise coordination property is empirically better suited to the heterogeneous decoding of the two different kinds of sequence elements. The detailed structure of the internal block is explained in Appendix E.
Hybrid span decoding. For the hybrid span decoding module, the present application first slices the hidden representation of the alternating sequence y^π out of the output of the N-layer internal blocks and denotes it H_y^N. Then, for each hidden representation h_yi^N ∈ H_y^N, 0 ≤ i ≤ |y^π|, the present application applies two different linear layers to obtain a start-position representation s_yi and an end-position representation e_yi,
s_yi = W_5 h_yi^N + b_5, e_yi = W_6 h_yi^N + b_6,
where W_5, W_6 ∈ R^{d_m × d_m} and b_5, b_6 ∈ R^{d_m} are learnable parameters. The present application then computes the scores for the type segment and the text segment of H separately and concatenates them before a final softmax operator to jointly estimate the probabilities of text spans and type spans,
where h_i is the score vector over the possible spans in the type segment of H, and t_i is the score vector over the possible spans in the text segment of H. Since the span length of a type span is always 1, the present application only needs an element-wise addition of the start-position scores h_si and end-position scores h_ei to compute h_i. The entries of t_i contain the scores of the text spans, t_si,j + t_ei,k with k − j < m, computed with the help of an unfold function that converts the vector t_ei ∈ R^n into a stack of n sliding windows of size m (the maximum span length) with stride 1. The alternation masks m_a ∈ R^{l_p} and m_a′ ∈ R^n are defined so that, at each decoding step, only the scores of the elements that may legally follow the previous element (edge types after a node, nodes after an edge type) are left unmasked, where l_e = |R| + |U| is the total number of edge types. Thus, although the present application models node and edge types jointly, the output distribution is forced by the alternation mask to produce an alternating decoding of nodes and edge types, which is the main reason the present application calls this decoder a hybrid span decoder.
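The effect of the alternation mask can be sketched as follows: before the softmax, the concatenated score vector over [edge types | node types | text spans] receives -inf either on the edge-type block (when a node must be produced) or on the node-type and text-span blocks (when an edge type must be produced). The block layout, sizes, and parity convention used here are illustrative assumptions, not the exact m_a and m_a′.

```python
import torch
import torch.nn.functional as F

l_e, l_q, n_span = 12, 7, 640     # edge+virtual types, node types, n*m span scores

def alternating_softmax(scores, expect_edge_type):
    """Apply an alternation mask to the concatenated scores
    [edge types | node types | text spans] and normalize.
    Illustrative sketch of the masking idea, not the exact m_a / m_a'."""
    mask = torch.zeros_like(scores)
    if expect_edge_type:
        mask[l_e:] = float("-inf")     # only edge / virtual edge types allowed
    else:
        mask[:l_e] = float("-inf")     # only node types or text spans allowed
    return F.softmax(scores + mask, dim=-1)

scores = torch.randn(l_e + l_q + n_span)
p_node = alternating_softmax(scores, expect_edge_type=False)
p_edge = alternating_softmax(scores, expect_edge_type=True)
print(p_node[:l_e].sum().item(), p_edge[l_e:].sum().item())   # 0.0 and 0.0
```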
4. Experiment
4.1 Experimental settings
The present application tests its model on the ACE 2005 dataset distributed by the LDC, which comprises 14.5k sentences, 38k entities (of 7 types) and 7,100 relations (of 6 types); these data come from the general news domain, see Appendix C for details.
Following previous work, the present application uses F1 as the evaluation metric for NER and RE. For the NER task, a prediction is marked as correct when both its type and its boundary span match a gold entity. For the RE task, a prediction is correct when both the relation type and the boundaries of the two entities are correct.
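The scoring convention above can be summarized with a small hypothetical helper, with entities represented as (start, end, type) tuples and relations as (head boundaries, tail boundaries, relation type) tuples:

```python
def f1(gold, pred):
    """Micro F1 over sets: a prediction counts as correct only if it matches
    a gold item exactly (type and boundaries for entities; relation type and
    both entity boundaries for relations)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_entities = {(0, 1, "PER"), (4, 5, "GPE")}
pred_entities = {(0, 1, "PER"), (4, 5, "LOC")}          # wrong type on one span
gold_relations = {((0, 1), (4, 5), "PHYS")}
pred_relations = {((0, 1), (4, 5), "PHYS")}
print(f1(gold_entities, pred_entities))                 # 0.5
print(f1(gold_relations, pred_relations))               # 1.0
```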
4.2 Implementation details
To train its model, the present application applies a cross-entropy loss with a label-smoothing factor of 0.1. The model is trained with 2048 tokens per batch (approximately 28 samples) for 25,000 steps using the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 2e-4, a weight decay of 0.01, and 2000 warmup steps under an inverse square root scheduler. Following the TabSeq model (Wang and Lu, 2020), the present application uses RoBERTa-large (Liu et al., 2019) or ALBERT-xxlarge-v1 (Lan et al., 2020) as the pre-trained language model during training and slows its learning rate by a factor of 0.1. The hidden dropout rate is 0.2 for RoBERTa-large and 0.1 for ALBERT-xxlarge-v1, and the dropout rate of the hybrid span decoder during training is also 0.1. The present application sets the maximum span length m = 16, the hidden size of the model d_m = 256, and the number of decoder blocks N = 12. Although beam search should in theory help reduce exposure bias, the present application did not observe any performance gain during a grid search over beam sizes and length penalties on the validation set (the detailed grid search is set forth in Appendix A). Therefore, the present application sets the beam size to 1 and the length penalty to 1, and leaves this discrepancy between theory and experiment for future study. The model of the present application was built with the FAIRSEQ toolkit (Ott et al., 2019) for efficient distributed training, and all experiments were performed on two NVIDIA TITAN X GPUs.
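For reference, the hyperparameters stated above can be collected in a plain dictionary (a hypothetical configuration object; the keys are not FAIRSEQ's actual argument names):

```python
hyperparameters = {
    "optimizer": "AdamW",
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "lr_scheduler": "inverse_sqrt",
    "warmup_steps": 2000,
    "train_steps": 25_000,
    "tokens_per_batch": 2048,
    "label_smoothing": 0.1,
    "pretrained_lm": ["RoBERTa-large", "ALBERT-xxlarge-v1"],
    "lm_learning_rate_factor": 0.1,     # LM learning rate slowed by 0.1x
    "lm_hidden_dropout": {"RoBERTa-large": 0.2, "ALBERT-xxlarge-v1": 0.1},
    "decoder_dropout": 0.1,
    "max_span_length_m": 16,
    "hidden_size_d_m": 256,
    "decoder_blocks_N": 12,
    "beam_size": 1,
    "length_penalty": 1.0,
}
```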
Table 1. Joint NER and RE F1 scores of IE models on the ACE05 test set. The complexity of the entity and relation decoding part of each model is also reported (n is the length of the input text). The performance of the TabSeq model reported here is based on the same ALBERT-xxlarge (Lan et al., 2020) pre-trained language model as the present application.
Table 2. Ablation study on the ACE05 test set. "-Traversal-embedding": the present application removes the traversal embedding and uses sinusoidal position embedding instead; the ablations below are based on the model after this ablation. "-Masking": the present application removes the alternation mask from the hybrid span decoder. "-BFS": the present application uses DFS instead of BFS as the traversal. "-Mixed-attention": the present application removes the mixed-attention layer and uses a standard Transformer encoder-decoder structure. "-Span-attention": the present application removes the span attention in the span encoding module and instead averages the word representations in the span.
4.3 Results
Table 1 compares the model of the present application with the previous state-of-the-art results on the ACE05 test set. Compared with the previous SOTA, TabSeq (Wang and Lu, 2020), which also uses the ALBERT pre-trained language model, the model of the present application with ALBERT performs significantly better on both the NER and RE scores while maintaining a linear space complexity that is an order of magnitude smaller than that of TabSeq. Compared with all previous joint IE models, the model of the present application is the first joint model with both linear space and linear time complexity, and therefore has the best scalability for large-scale real-world applications.
4.4 Ablation study
To demonstrate the effectiveness of its method, the present application conducted ablation experiments on the ACE05 dataset. As shown in Table 2, after removing the traversal embedding, the RE F1 score drops significantly, indicating that the traversal embedding helps encode the graph structure and improves relation prediction. Furthermore, if the alternation mask is abandoned, both the NER F1 and RE F1 scores drop significantly, which proves the importance of enforcing the alternating pattern. It can also be observed that the mixed-attention layer contributes significantly to relation extraction; this is because layer-wise coordination helps the decoder disentangle the source features and exploit features from different layers for entity and relation predictions. The present application further observes that the DFS traversal performs worse than BFS, presumably because, due to the nature of knowledge graphs, the alternating sequence resulting from DFS is generally longer than that from BFS, which increases the difficulty of learning.
4.5 Error analysis
After analyzing 80 remaining errors, the present application classifies and discusses the following common cases (fig. 8 is a schematic diagram of the distribution of remaining errors over the ACE05 test set). These may require additional functionality and strategies to address.
Context insufficiency: in many examples, the answer entity is a pronoun that cannot be typed accurately given the limited context. For example, in a sentence where the speakers say they do not want to use the word "destroyed" and that this is in fact what they say was done by others, it is difficult to correctly classify the pronoun as an organization. This can be mitigated by using the entire document as input to provide cross-sentence context.
Rare words: the problem with rare words is that some words in the test set rarely appear in the training set and generally do not appear in the vocabulary. In a sentence mentioning that the base also hosts Navy FA-18s and Harriers, the term "Harriers" (a vehicle that the model misclassifies as a person) neither appears in the training set nor is well understood by the pre-trained language model; in such cases the model can only rely on subword-level representations.
Background knowledge required: some entities mentioned in a sentence are difficult to infer from the context alone but are easily identified by looking up a knowledge base. For example, in a sentence stating that Airbus should issue a stronger warning, the model of the present application erroneously predicts Airbus as a vehicle, whereas Airbus here refers to the European aerospace company. The system of the present application also splits the United Nations and its regulations into two entities, generating a non-existent relation triple (regulations, part of the United Nations). Such errors can be avoided by consulting a knowledge base (e.g., DBpedia (Bizer et al., 2009)) or by performing entity linking.
Inherent ambiguity: many examples are inherently ambiguous; e.g., the European Union can be categorized as an organization or as a geopolitical entity, while some entities (e.g., military bases) can be both locations and organizations or facilities.
5. Related work
NER is typically performed jointly with RE to reduce error propagation and to learn the interrelations between the tasks. One approach is to consider the joint task as a table filling problem (Miwa and Sasaki, 2014; Gupta et al., 2016; Wang and Lu, 2020), where the i-th column or row represents the i-th token; the diagonal of the table carries the entity tags and the other entries carry the relations between pairs of tokens. Another line of work performs RE after NER. In Miwa and Bansal (2016), the authors used a BiLSTM (Graves et al., 2013) for NER and then a Tree-LSTM (Tai et al., 2015) over dependency graphs for RE. Wadden et al. (2019) and Luan et al. (2019), on the other hand, adopted a method of constructing dynamic text span graphs to detect entities and relations. Extending this line of work, Lin et al. (2020) presented ONEIE, which further incorporates global features based on cross-subtask and cross-instance constraints and aims to extract the IE results as a graph. Note that the model of the present application differs from ONEIE (Lin et al., 2020) in that it automatically captures global relations through autoregressive generation, while ONEIE uses feature-engineered templates; moreover, ONEIE requires pairwise classification for relation extraction, while the method of the present application efficiently generates the existing relations and entities.
While several Seq2Seq-based models have been proposed (Zhang et al., 2020; Zeng et al., 2018, 2020; Wei et al., 2019; Zhang et al., 2019) to generate triples (i.e., node-edge-node), the model of the present application differs fundamentally from them in that (1) it generates a BFS/DFS traversal of the target graph, which captures the dependencies between nodes and edges and yields a shorter target sequence, and (2) the present application models nodes as spans in the text, which are independent of the vocabulary, so that it can still generate spans for nodes based on context information even if their labels are rare or unseen words.
6. Conclusion
In this work, the present application proposes the hybrid span generation (HySPA) model, the first end-to-end text-to-graph extraction model with linear space and time complexity in the graph encoding phase. Beyond its scalability, the model also achieves the current state-of-the-art performance on the ACE05 joint entity and relation extraction task. Given the flexibility of the structure of the hybrid span generator, rich directions remain for future research, such as combining hybrid span generation with external knowledge, applying more efficient sparse self-attention, and developing better search methods to find more globally plausible graphs represented by alternating sequences.
In some embodiments, another method is provided in which the mixed-attention layer is removed and a standard Transformer encoder-decoder structure is used. This version has a simpler structure but performs worse than the version using the mixed-attention layer.
In some embodiments, another approach is also provided in which a DFS traversal is used instead of a BFS traversal to construct the alternating sequence representation of the graph; such a version also uses a DFS traversal embedding (see Appendix D for details) instead of the BFS traversal embedding. The graph extraction accuracy of this version is inferior to that of the BFS traversal.
In some embodiments, another method is also provided in which the words in a span are averaged to encode the span instead of performing attention-based span encoding. This version has a simpler model structure and fewer model parameters, but its graph extraction accuracy is inferior to that of attention-based span encoding.
Appendix A Hyperparameters
The present application uses 100-dimensional GloVe word embeddings trained on 6B tokens as initialization and freezes their updates during training. The feature embedding is encoded with a 30-dimensional LSTM, and the GloVe embeddings of out-of-vocabulary tokens are replaced with randomly initialized vectors, following Wang and Lu (2020). The present application uses gradient clipping of 0.25 during training. The number of mixed-attention heads is set to 8. The beam size and length penalty are determined by a grid search on the validation set of the ACE05 dataset, with the beam size ranging from 1 to 7 with a step of 1 and the length penalty ranging from 0.7 to 1.2 with a step of 0.1. The present application selects the best beam size and length penalty based on the relation extraction F1 score.
Appendix B training details
The model of the present application uses the ALBERT-xxlarge pre-trained language model with 236 million parameters. On average, the best model with ALBERT-xxlarge is trained for 20 hours on two NVIDIA TITAN X GPUs.
Appendix C data
The Automatic Content Extraction (ACE) 2005 dataset contains English, Arabic, and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation, providing entity, relation, and event annotations. The present application follows Wadden et al. (2019) for preprocessing and data splitting. The pre-processed data contain 7,100 relations, 38k entities, and 14.5k sentences. The split contains 10,051 training samples, 2,424 development samples, and 2,050 test samples.
Appendix D DFS traversal embedding
Since the parent-child information is already contained in the intra-level connections of a DFS traversal, the DFS traversal embedding of the present application is only the sum of the layer embedding and the connection embedding. Similar to the BFS embedding, the DFS layer embedding assigns the same embedding vector L_i to every position at DFS traversal level i, but the values of the embedding vectors are randomly initialized rather than filled with non-parametric sinusoidal position embeddings, because there is no proximity information between adjacent DFS traversal levels. For the elements within a DFS level, however, the present application does have explicit distance information: for the DFS level d = [A, B, C, ..., [SEP]], the distances from A to the elements [A, B, C, ..., [SEP]] are [0, 1, 2, 3, ..., |d| − 1]. The present application encodes this distance information with sinusoidal position embeddings, which become its connection embedding, capturing the intra-level connection information.
Appendix E Transformer with mixed-attention layers
The present application first slices the hidden representation of the input text out of the mixed representation H and denotes it H_text, and then feeds the input text representation H_text and the mixed span encoding H_y into a stack of N mixed-attention/feed-forward blocks with the structure shown in fig. 9.
Since generating node and edge types may require features from different layers, the present application uses mixed attention (He et al., 2018), which allows the model to exploit features from different attention layers when computing H_y. Here n = |x| is the length of the input text and l_m = |x| + |y^π| is the total length of the source and target features. The concatenation of the source features H_text and the target features H_y is denoted H^0, and source/target embeddings (He et al., 2018) are added to H^0 before the first mixed-attention layer to allow the model to distinguish features from the source and target sequences. A mixed-attention layer and a feed-forward layer combine to form a decoder block, where W_{q,k,v}, b_{q,k,v}, W_3 ∈ R^{d_m × 4d_m}, W_4 ∈ R^{4d_m × d_m}, b_3, b_4 are learnable parameters and LayerNorm denotes layer normalization (Ba et al., 2016). The decoder block is stacked N times to obtain the final hidden representation H^N, from which the final representation of the target sequence, H_y^N, is output. The time complexity of mixed attention when encoding the source features is O(n^2), but thanks to the causal masking of the target features, the present application can cache the hidden representation of this part when generating target tokens, thus keeping the time complexity at O(n) per decoding step.
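The overall shape of such a block can be sketched with a standard multi-head attention layer over the concatenated source and target features, with causal masking applied only within the target block; the layer-wise coordination specific to mixed attention (He et al., 2018) is not reproduced here, and all module and variable names are assumptions.

```python
import torch
import torch.nn as nn

d_m, n_heads = 256, 8

class DecoderBlockSketch(nn.Module):
    """One attention + feed-forward block over [H_text ; H_y].
    Source positions attend only to the source (so they can be cached);
    target positions see the source and earlier targets (causal masking)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_m, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_m, 4 * d_m), nn.ReLU(),
                                 nn.Linear(4 * d_m, d_m))
        self.norm1, self.norm2 = nn.LayerNorm(d_m), nn.LayerNorm(d_m)

    def forward(self, H_text, H_y):
        n, t = H_text.size(1), H_y.size(1)
        H0 = torch.cat([H_text, H_y], dim=1)          # (1, n + t, d_m)
        mask = torch.zeros(n + t, n + t, dtype=torch.bool)
        mask[:n, n:] = True                            # source ignores targets
        # target rows may not attend to future target columns
        mask[n:, n:] = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(H0, H0, H0, attn_mask=mask)
        H1 = self.norm1(H0 + attn_out)
        H2 = self.norm2(H1 + self.ffn(H1))
        return H2[:, :n], H2[:, n:]                   # updated H_text, H_y

block = DecoderBlockSketch()
H_text, H_y = torch.randn(1, 40, d_m), torch.randn(1, 9, d_m)
new_text, new_y = block(H_text, H_y)
print(new_y.shape)                                    # torch.Size([1, 9, 256])
```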
It should be noted that, for simplicity of description, the foregoing method embodiments are all illustrated as a series of acts combined, but it should be understood and appreciated by those skilled in the art that the present invention is not limited by the order of acts, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention. In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In some embodiments, embodiments of the present invention provide a non-transitory computer-readable storage medium having stored therein one or more programs comprising execution instructions that can be read and executed by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) to perform any of the above-described methods of extracting a graph from text of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method of extracting a graph from text of any of the above.
In some embodiments, the present invention further provides an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of extracting a graph from text.
In some embodiments, embodiments of the present invention also provide a storage medium having a computer program stored thereon, wherein the program when executed by a processor implements a method of extracting a graph from text.
Fig. 10 is a schematic hardware structure of an electronic device for executing a method for extracting a graph from text according to another embodiment of the present application, as shown in fig. 10, where the device includes:
one or more processors 1010, and a memory 1020, one processor 1010 being illustrated in fig. 10.
The apparatus for performing the method of extracting a graph from text may further include an input device 1030 and an output device 1040.
The processor 1010, memory 1020, input device 1030, and output device 1040 may be connected by a bus or other means, for example in fig. 10.
The memory 1020 serves as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs and modules, such as program instructions/modules corresponding to the method for extracting a graph from text in an embodiment of the present application. The processor 1010 executes various functional applications of the server and data processing, i.e., implements the method of extracting a graph from text of the above-described method embodiments, by running non-volatile software programs, instructions, and modules stored in the memory 1020.
The memory 1020 may include a storage program area that may store an operating system, an application program required for at least one function, a storage data area that may store data created according to the use of a device that extracts a drawing from text, and the like. In addition, memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 1020 may optionally include memory located remotely from processor 1010, which may be connected via a network to a device that extracts the graph from the text. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1030 may receive input numeric or character information and generate signals related to user settings and function control of the device that extracts the drawing from the text. The output 1040 may include a display device such as a display screen.
The one or more modules are stored in the memory 1020 that, when executed by the one or more processors 1010, perform the method of extracting a graph from text in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID and UMPC devices, etc., such as iPad.
(3) Portable entertainment devices such devices can display and play multimedia content. Such devices include audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.