Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The inventor finds that the schemes of the prior art have high time complexity or space complexity and that the overall accuracy and precision of graph extraction are low, so that they are difficult to apply to practical, industrial-grade use scenarios involving large-scale long texts. The inventor therefore proposes a solution that is both high-performing and efficient, so as to accommodate existing industrial-grade application scenarios.
As shown in fig. 1, an embodiment of the present invention provides a training method for an alternating sequence generation model, including:
S11, acquiring a training sample pair from a sample library, wherein the training sample pair comprises a paired training text and training information graph, and the training information graph comprises a plurality of nodes and at least one edge connecting two nodes among the plurality of nodes.
Illustratively, the training information graph may be regarded as a heterogeneous multi-graph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (typically representing spans (t_s, t_e) in an input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. Node types and edge types are assumed to be drawn from a finite vocabulary. Node types may be used to represent entity types (PER, ORG, etc.), while edge types may represent relations between nodes (PHYS, ORG-AFF, etc.).
S12, generating a training alternating sequence containing node information and edge information according to the training information graph.
In this embodiment, the space of the heterogeneous multi-graph G is not modeled directly; instead, a mapping f_s from G to a sequence space S^π, S^π = f_s(G, π), is constructed. This mapping depends on a (given) ordering π of the nodes and their edges in G and is built from a graph traversal algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS), together with an internal ordering of node and edge types.
In some embodiments, the node information includes node type information, and the edge information includes actual edge type information and virtual edge type information.
The present application assumes that each element s_i^π of the sequence s^π is drawn from a finite set comprising the node representations V (defined below), the node type set Q, the edge type set R, and the "virtual" edge type set U, i.e., s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control sequence generation, indicating the start/end of a sequence and the separation of levels in the graph.
Illustratively, the training alternating sequence includes node information and edge information arranged alternately. For example, assume the sequence s^π = s_0^π, ..., s_n^π has an alternating structure, where the elements at even positions s_0^π, s_2^π, s_4^π, ... are nodes and the elements at odd positions are types. In the case of BFS, the present application exploits the fact that BFS visits nodes level by level (i.e., in the order p_i, c_i1, ..., c_ik, p_j, where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be identical to one of the children of p_i), and turns the traversal into the sequence
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
Here, the present application uses a special edge type [SEP] to delimit the levels of the graph. The name of this special edge type may be arbitrary, including but not limited to the [SEP] used herein. In the case of DFS, the [SEP] type occurs immediately after a leaf node. This representation allows the present application to recover the original information graph exactly, provided it is known which type of graph traversal algorithm (BFS or DFS) the alternating sequence is based on.
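As a minimal illustration of this linearization (a sketch under the assumptions stated in the comments, not the exact algorithm of the present application), the following Python snippet converts a small information graph into a node/edge-type alternating sequence using a BFS-style, level-by-level traversal; the function name and the adjacency-list encoding of the graph are illustrative.

```python
from collections import deque

def graph_to_alternating_sequence(node_order, edges):
    """Linearize an information graph into an alternating sequence of nodes
    and edge types, level by level (BFS-style). `edges` maps a node to its
    outgoing (edge_type, child) pairs; "[SEP]" closes each level.
    Simplified sketch; the exact algorithm of the application is not shown."""
    seq, expanded = [], set()
    queue = deque(node_order)                 # seed with the node ordering pi
    while queue:
        parent = queue.popleft()
        out = edges.get(parent, [])
        if parent in expanded or not out:     # leaves appear only as children
            continue
        expanded.add(parent)
        seq.append(parent)                    # node position
        for edge_type, child in out:
            seq.extend([edge_type, child])    # edge-type / node positions
            queue.append(child)
        seq.append("[SEP]")                   # virtual edge type: end of level
    return seq

# Toy graph for "He was captured in Baghdad ...": Baghdad --PHYS--> He,
# and each mention is linked to its node type by a special "type" edge.
edges = {"He": [("type", "PER")],
         "Baghdad": [("type", "GPE"), ("PHYS", "He")]}
print(graph_to_alternating_sequence(["He", "Baghdad"], edges))
# ['He', 'type', 'PER', '[SEP]', 'Baghdad', 'type', 'GPE', 'PHYS', 'He', '[SEP]']
```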
S13, training the alternating sequence generation model according to the training text and the training alternating sequence.
In this embodiment, when the information graph is extracted from the text through the model, the graph is not directly modeled, but the problem of extracting the graph from the text is converted into the problem of extracting the alternating sequence from the text, so that the alternating sequence generating model obtained by the method of this embodiment has only linear time and space complexity when being used for graph extraction, and the time and space efficiency is significantly improved.
In some embodiments, generating the training alternating sequence including the node information and the edge information according to the training information graph in step S12 includes traversing the training information graph using a preset traversal algorithm to generate the training alternating sequence including the node information and the edge information. The preset traversal algorithm may be a Breadth-First Search (BFS) algorithm or a Depth-First Search (DFS) algorithm, which is not limited in the present application.
In some embodiments, the training information graph contains only two kinds of objects: spans, which serve as addresses of input text fragments, and types, which are representations of abstract concepts, wherein a type may be treated as a span of length 1 over the vocabulary formed by the node type information, the actual edge type information, and the virtual edge type information.
The representation of the nodes and edges of the alternating sequence relies on the observation that there are only two kinds of objects in the information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since the present application can treat the types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), the present application defines the ordered set of these text spans and length-1 type spans as mixed spans. An index into this ordered set can be mapped back reversibly to a type or a text span according to its magnitude. The task of generating an information graph is thus converted into generating an alternating sequence of mixed spans through a joint indexing of spans and types.
In some embodiments, training the alternating sequence generation model according to the training text and the training alternating sequence comprises processing the output distribution of the alternating sequence generation model with an alternation mask, so as to obtain alternating sequences composed of node information and edge information arranged alternately.
Illustratively, the alternating sequence generation model is a neural decoder that is forced to generate alternating sequences by decoding spans and types only in an alternating, hybrid manner. For each decoding step, the decoder has only linear space and time complexity with respect to the length of the input sequence, and because it is by nature a sequential decision process, it can capture the interdependencies between spans and types.
Fig. 2 is a flowchart of an embodiment of a method for extracting a graph from text according to the present invention; the method includes:
S21, inputting the text to be extracted into an alternating sequence generation model trained by the alternating sequence generation model training method to obtain a target alternating sequence;
S22, generating a target information graph according to the target alternating sequence.
In this embodiment, the graph is not directly modeled when the information graph is extracted from the text through the model; instead, the problem of extracting a graph from text is converted into the problem of extracting an alternating sequence from text, so that the graph is extracted from the text with only linear time and space complexity, and the time and space efficiency is remarkably improved.
In some embodiments, the generating a target information graph from the target alternating sequence comprises:
processing the target alternating sequence according to the preset traversal algorithm adopted when training the alternating sequence generation model, to generate the target information graph.
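For the reverse direction, a minimal sketch is given below, assuming the BFS-based linearization sketched earlier ([SEP] closing each level, the first element of a level being the parent); the helper name and the (nodes, edges) output encoding are illustrative rather than the exact procedure of this application.

```python
def alternating_sequence_to_graph(seq, sep="[SEP]"):
    """Rebuild (nodes, edges) from a BFS-style alternating sequence in which
    each level looks like: parent, edge_type, child, edge_type, child, ..., [SEP].
    Illustrative sketch, not the exact procedure of the application."""
    nodes, edges = set(), []
    level = []
    for item in seq:
        if item != sep:
            level.append(item)
            continue
        if level:                       # close the current BFS level
            parent = level[0]
            nodes.add(parent)
            # remaining items alternate: edge_type, child, edge_type, child, ...
            for edge_type, child in zip(level[1::2], level[2::2]):
                nodes.add(child)
                edges.append((parent, edge_type, child))
        level = []
    return nodes, edges

seq = ['He', 'type', 'PER', '[SEP]', 'Baghdad', 'type', 'GPE', 'PHYS', 'He', '[SEP]']
print(alternating_sequence_to_graph(seq))
# nodes: {'He', 'PER', 'Baghdad', 'GPE'} (set order may vary)
# edges: [('He', 'type', 'PER'), ('Baghdad', 'type', 'GPE'), ('Baghdad', 'PHYS', 'He')]
```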
In order to describe the technical solution of the present invention more clearly, and to demonstrate more directly its performance and advantages over the prior art, the technical background, the technical solution, the experiments performed, and the like are described in more detail below.
Abstract
Text-to-Graph extraction aims at automatically extracting an information graph consisting of mentions (or entities) and types from natural language text. Existing approaches, such as table filling and pairwise scoring, exhibit impressive performance on various information extraction tasks, but they are difficult to scale to datasets with longer input texts because of their second-order space/time complexity with respect to the input length. In this work, the present application proposes the Hybrid Span Generator (HySPA), which maps the information graph to an alternating sequence of nodes and edge types and directly generates such a sequence through a hybrid span decoder that can recurrently decode spans and types with linear time and space complexity. Extensive experiments on the ACE05 dataset show that the method of the present application is also significantly superior to existing methods on the joint entity and relation extraction task.
1. Introduction
Information Extraction (IE) can be regarded as a Text-to-Graph extraction task, aimed at extracting from unstructured text an information graph (Li et al., 2014; Shi et al., 2017) consisting of mentions (or entities) and types, where the nodes of the graph are the mentions or entity types and the edges are relation types representing the relations between the nodes. Typical graph extraction methods decompose the extraction process into subtasks, such as Named Entity Recognition (NER) (Florian et al., 2006, 2010) and Relation Extraction (RE) (Sun et al., 2011; Jiang and Zhai, 2007), and execute them separately (Chan and Roth, 2011) or jointly (Li and Ji, 2014; Eberts and Ulges, 2019).
Recent joint IE models (Wadden et al., 2019; Wang and Lu, 2020; Lin et al., 2020) exhibit impressive performance on various IE tasks, as they can mitigate error propagation and exploit the interdependencies between tasks. Previous work often used pairwise scoring techniques to identify the type of relation between entities. However, this approach is computationally inefficient because it requires enumerating all possible pairs of entities in the document, and, because of the sparsity of relations between entities, the relation type is null in most cases. Furthermore, the pairwise scoring technique evaluates each relation type independently and thus cannot capture the correlations between the relation types of different mention pairs.
Another approach is to consider the joint information extraction task as a table filling problem (Zhang et al., 2017; Wang and Lu, 2020) and use a multi-dimensional recurrent neural network (Graves et al., 2007) to generate a two-dimensional table. This can capture the interrelations between entities and relations, but the space complexity grows quadratically with the length of the input text, making the approach impractical for long sequences.
Some attempts, for example Seq2RDF (Liu et al., 2018) and IMoJIE (Kolluru et al., 2020), leverage the power of the Seq2Seq model (Cho et al., 2014) to capture correlations between mentions and types with first-order complexity, but they all use predefined vocabularies for mention prediction, which depend heavily on the distribution of target words and cannot handle unseen out-of-vocabulary words.
To solve these problems, the present application proposes a first-order method that reversibly maps the target graph to an alternating sequence of nodes and edges and applies a hybrid span generator that directly learns to generate such an alternating sequence. The main contributions of the present application are threefold:
The present application proposes a general technique for constructing a reversible mapping between an information graph and an alternating sequence (assuming a given graph traversal algorithm). Generating the alternating sequence is then equivalent to generating the original information graph.
The present application proposes a new neural decoder that is forced to generate alternating sequences by decoding spans and types only in an alternating, hybrid way. For each decoding step, the decoder has only linear space and time complexity with respect to the length of the input sequence, and because it is by nature a sequential decision process, it can capture the interdependencies between spans and types.
The present application conducted extensive experiments on the Automatic Content Extraction (ACE) dataset, showing that the model of the present application achieves state-of-the-art performance on the joint entity and relation extraction task, which aims to extract a knowledge graph from a piece of unstructured text.
2. Modeling information graphs as alternating sequences
The information graph can be considered as a heterogeneous multi-graph (Li et al., 2014; Shi et al., 2017) G = (V, E), where V is a set of nodes (typically representing spans (t_s, t_e) in an input document) and E is a multiset of edges, with a node type mapping function φ: V → Q and an edge type mapping function ψ: E → R. The node and edge types are assumed to be drawn from a finite vocabulary. Node types may be used, for example, to represent entity types (PER, ORG, etc.), while edge types may represent relations between nodes (PHYS, ORG-AFF, etc.). In this work, the present application represents a node type as an individual node that is connected to its node v by a special edge type.
Representing the information graph as a sequence. The present application does not model the space of the heterogeneous multi-graph G directly; instead, it builds a mapping f_s from G to a sequence space S^π, S^π = f_s(G, π), which depends on a (given) ordering π of the nodes and their edges in G and is constructed from a graph traversal algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS), together with an internal ordering of node and edge types. The present application assumes that each element s_i^π of the resulting sequence s^π is drawn from a finite set comprising the node representations V, the node types Q, the (actual) edge types R, and the "virtual" edge types U: s_i^π ∈ V ∪ Q ∪ R ∪ U. The virtual edge types U = {[SOS], [EOS], [SEP]} do not represent edges in G but are used to control sequence generation, indicating the start/end of a sequence and the division of levels in the graph.
The present application further assumes that the representation s^π = s_0^π, ..., s_n^π has an alternating structure, where the elements at even positions s_0^π, s_2^π, s_4^π, ... are nodes and the elements at odd positions are types. In the case of BFS, the present application exploits the fact that BFS visits nodes level by level, i.e., in the order p_i, c_i1, ..., c_ik, p_j (where c_ik is the k-th child of parent node p_i, connected by edge e_ik, and p_j may or may not be equal to one of the children of p_i), and turns this traversal into the sequence
s^π = p_i, ψ(e_i1), c_i1, ..., ψ(e_ik), c_ik, [SEP], p_j, ...
Here, the present application uses a special edge type [SEP] to delimit the levels of the graph. This representation allows the present application to recover the original graph exactly if it is known which type of graph traversal (BFS or DFS) is assumed. Algorithm 1 (used by the present application to convert the graphs in the training data into sequences) shows how an alternating sequence can be constructed from a given graph using a BFS traversal. Fig. 3 shows the alternating sequence of an information multi-graph. The length |s^π| is bounded linearly by the size of the graph, O(|s^π|) = O(|V| + |E|), which is also the complexity of typical graph traversal algorithms such as BFS/DFS.
FIG. 3: the present application shows a directed multi-graph as an alternating sequence of nodes (A, B, C, D, E) and edge types (1, 2, 3, 4, [S]). Here, the graph is traversed by Breadth-First Search (BFS) in ascending order of node and edge types. "[S]", i.e., [SEP], is a virtual edge type marking the end of each BFS level.
Node and edge representations. The node and edge representations of the present application (explained below) rely on the observation that there are only two kinds of objects in the information graph: spans (as addresses of input text segments) and types (as representations of abstract concepts). Since the present application can treat the types as special spans of length 1 over the vocabulary of all types (Q ∪ R ∪ U), it needs only O(nm + |Q ∪ R ∪ U|) indexes to represent the concatenation of the type vocabulary and the spans of the input text, where n is the maximum input length, m is the maximum span length, and m << n. The present application defines the ordered set of these text spans and length-1 type spans as mixed spans. An index into this ordered set can be mapped back reversibly to a type or a text span according to its magnitude (the details of this mapping are explained in Section 3.2). The task of generating an information graph is thus transformed into generating an alternating sequence of mixed spans through a joint indexing of spans and types.
Generating the sequence. The present application models the distribution p(s^π) with a sequence generator h with parameters θ, where |s^π| is the length of s^π:
p(s^π) = ∏_{i=1}^{|s^π|} h_θ(s_i^π | s_0^π, ..., s_{i−1}^π).
The following section discusses how the sequence generator h is forced to generate sequences only within the space S^π, since the present application does not want h to assign non-zero probability to any sequence that has no corresponding graph.
3. HySPA: hybrid span generation of alternating sequences
In order to directly generate a target sequence (which alternates between nodes, representing spans in the input, and the node/edge type set that the extraction task of the present application relies on), the present application first constructs a hybrid representation H, which is a concatenation of hidden representations of the edge types, node types, and input text. This representation serves as both the context space and the output space of the decoder. The present application then reversibly maps both the spans of the input text and the indexes of the types to mixed spans based on the representation H. Finally, the alternating sequence y^π ∈ S^π is formed autoregressively from the mixed spans generated by the hybrid span decoder. By converting the graph extraction task into a sequence generation task, the present application can easily use beam search decoding to reduce the exposure bias (Wiseman and Rush, 2016) that may occur in the sequential decision process and thus find a globally better graph representation.
High-level overview of HySPA. The HySPA model takes as input a piece of text (e.g., a sentence or paragraph) together with predefined node and edge types, and outputs an alternating sequence representation of the information graph. The present application forces the alternating generation of this sequence by applying an alternation mask to the output probabilities. The detailed architecture is described in the following subsections.
3.1 Text and type encoder
Fig. 4 shows the encoder architecture of the proposed model of the present application, where ⊕ is the concatenation operator, k is the index of a word vector in H_0, and l_e = |R| + |U|. The color table on the right indicates the meta-type assignment of the different blocks of concatenated vectors in H_0. For the node type set Q, edge type set R, and virtual edge type set U, the present application arranges the type list v as the concatenation of the label names of the edge types, virtual edge types, and node types, i.e.,
v = v_R ⊕ v_U ⊕ v_Q,
where ⊕ denotes the concatenation operator between two lists and v_R, v_U, v_Q are the lists of type names in the sets R, U, Q, respectively. Note that the order of concatenation of the type name lists may be arbitrary as long as it remains consistent throughout the model. Then, as in the embedding part of the table-sequence encoder (Wang and Lu, 2020), for each type v_i the present application embeds the type label tokens using contextual word embeddings from a pre-trained language model, GloVe embeddings (Pennington et al., 2014), and feature embeddings, where GloVe, short for Global Vectors for Word Representation, is a word representation tool based on global word-frequency statistics (count-based & overall statistics).
where l_p = |R| + |U| + |Q| is the total number of types, W_0 ∈ R^{d_e × d_m} is the weight matrix of a linear projection layer, d_e = d_c + d_g + d_k is the total embedding dimension, and d_m is the hidden size of the model of the present application. After the present application obtains the contextual embeddings of the tokens of each type v_i ∈ v, it takes the average of these token vectors as the representation of v_i and freezes its update during training. For more details, refer to Appendix A.
The same embedding approach is used to embed the words of the input text x. Unlike type embedding, the present application represents a word by the contextual embedding of its first sub-token from a pre-trained language model (LM, e.g., BERT (Devlin et al., 2018)) and fine-tunes the language model in an end-to-end fashion.
After obtaining the type embeddings E_v and the text embeddings E_x, the present application concatenates them along the sequence-length dimension to form the hybrid representation H_0. Since H_0 is a concatenation of word vectors from four different kinds of labels (i.e., edge types, virtual edge types, node types, and text), meta-type embeddings are applied to indicate this type difference between the vector blocks of H_0 (as shown in FIG. 4). The final context representation H is obtained by element-wise addition of the meta-type embeddings to H_0,
where l_h = l_p + |x| is the height of the mixed representation matrix H of the present application.
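A rough sketch of how the mixed representation H could be assembled is shown below: already-computed type embeddings and text embeddings are concatenated along the sequence dimension, and a learned meta-type embedding is added per block. The tensor shapes, the type counts, and the use of a plain nn.Embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_m = 256                                # hidden size of the model
n_edge, n_virtual, n_node = 9, 3, 7      # illustrative |R|, |U|, |Q| (l_p = 19)
l_p = n_edge + n_virtual + n_node        # number of type positions
seq_len = 40                             # |x|, length of the input text

# Assume these were produced by the type encoder and the text encoder.
E_v = torch.randn(l_p, d_m)              # projected type embeddings
E_x = torch.randn(seq_len, d_m)          # contextual text embeddings

# Meta-types: 0=edge type, 1=virtual edge type, 2=node type, 3=text token.
meta_ids = torch.tensor([0] * n_edge + [1] * n_virtual
                        + [2] * n_node + [3] * seq_len)
meta_embedding = nn.Embedding(4, d_m)

H0 = torch.cat([E_v, E_x], dim=0)        # (l_p + |x|, d_m), the mixed blocks
H = H0 + meta_embedding(meta_ids)        # final context representation H
print(H.shape)                           # torch.Size([59, 256]), l_h = l_p + |x|
```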
3.2 Reversible mapping between spans & types and mixed spans
Given a span in the text, t = (t_s, t_e) ∈ N^2 with t_s < t_e, the present application converts the span t into an index k in the representation H, with k ≥ l_p,
k = g_k(t_s, t_e) = t_s · m + t_e − t_s − 1 + l_p ∈ N,
where m is the maximum span length and l_p = |R| + |U| + |Q|. The present application keeps the type indexes in the graph unchanged, because they are smaller than l_p while k ≥ l_p. Since the maximum span length m of mentions in an information graph is usually much smaller than the length of the text, i.e., m << n, the present application can maintain the linear space complexity of its decoder with respect to the input text length n by considering only spans of length at most m, which reduces the maximum magnitude of k from O(n^2) to O(nm). Fig. 5 shows a specific example of the alternating sequence (middle) of a knowledge graph (bottom) from the ACE05 training set, in this case for the input text (top) "He was captured in Baghdad late Monday night". Here A1 denotes Algorithm 1, and the present application takes m = 16 and l_p = 19 in this example. In the alternating sequence, "19" is the index of the span (0, 1) of "He", "83" is the index of the span (4, 5) of "Baghdad", and "10" is the virtual edge type [SEP].
Since t_s, t_e, and k are natural numbers, the present application can construct an inverse mapping g_t that converts an index k in H back to t = (t_s, t_e),
t_e = g_t(k) = g_ts(k) + max(0, k − l_p) mod m,
where g_ts(k) recovers the start position t_s using the integer floor function ⌊·⌋, and mod is the modulo operator. Note that g_t(k) can be applied directly to indexes in the type segment of H while keeping their values unchanged, i.e., a type index k < l_p is mapped to itself. With this property, the present application can easily incorporate the mapping g_t into its decoder to map the alternating sequence y^π back to spans in the mixed representation H.
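The forward mapping g_k and an inverse can be written down directly from the formula above. The sketch below works in plain text coordinates (t_s, t_e) rather than positions inside the mixed representation H, and the helper names are assumptions; the values are checked against the Fig. 5 example (m = 16, l_p = 19).

```python
def span_to_index(t_s, t_e, m, l_p):
    """g_k: map a text span (t_s, t_e), t_s < t_e, t_e - t_s <= m,
    to a joint index k >= l_p; type indices (< l_p) are left untouched."""
    return t_s * m + t_e - t_s - 1 + l_p

def index_to_span(k, m, l_p):
    """Inverse mapping in text coordinates (a sketch, not the application's
    exact g_t over H): a type index k < l_p maps to itself, an index
    k >= l_p maps back to the text span (t_s, t_e)."""
    if k < l_p:
        return k                      # type indices keep their values
    t_s = (k - l_p) // m
    t_e = t_s + (k - l_p) % m + 1
    return (t_s, t_e)

# Values from the Fig. 5 example: m = 16, l_p = 19.
print(span_to_index(0, 1, 16, 19))    # 19  -> "He"
print(span_to_index(4, 5, 16, 19))    # 83  -> "Baghdad"
print(index_to_span(83, 16, 19))      # (4, 5)
print(index_to_span(10, 16, 19))      # 10  -> the virtual edge type [SEP]
```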
3.3 Hybrid span decoder
Fig. 6 shows the overall model architecture of the hybrid span decoder of the present application. The decoder takes the context representation H as input and, given the start-of-sequence marker, recurrently decodes the alternating sequence y^π. N is the number of decoder layers, ⊕ before the softmax function denotes the concatenation operator, and H_y^N is the hidden representation of the sequence y^π from the last decoder layer. The hybrid span decoder of the present application can be understood as an autoregressive model that operates in a closed context space and output space defined by H.
Attention-based hybrid span encoding. Given the alternating sequence y^π and the mapping g_t (Section 3.2), the decoder of the present application first maps each index in y^π to a span over the representation H, (t_si, t_ei) = g_t(y_i^π), and then converts the spans into an attention mask M_0, allowing the model to learn a weighted sum of the contextual word representations inside each span as the span (mention) representation,
where the query is the hidden representation of the start-of-sequence marker [CLS] repeated |y^π| times, the keys and values come from the text segment of H, H_y is the final mixed representation of the span sequence y^π, W_1, W_2, b_1, b_2 are learnable parameters of the present application, and t_si, t_ei are the start and end positions of the span being encoded. Note that for a type span of length 1, the softmax always evaluates to 1, so that its span representation is exactly the desired embedding vector.
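One way to realize such span-masked attention is sketched below: a query derived from the [CLS] representation attends only to positions inside the span, and the weighted sum of those token vectors becomes the span representation. The exact scoring function (a single scaled dot-product here) and all parameter names are assumptions rather than the precise equations of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_m = 256
W1 = nn.Linear(d_m, d_m)          # projects the [CLS] query (assumed form)
W2 = nn.Linear(d_m, d_m)          # projects the token keys (assumed form)

def encode_span(H, cls_vec, t_s, t_e):
    """Attention-based encoding of the span [t_s, t_e) over the mixed
    representation H (l_h x d_m). Positions outside the span are masked
    with -inf, so a length-1 span reduces to its own embedding vector."""
    scores = W1(cls_vec) @ W2(H).T / d_m ** 0.5       # (l_h,)
    mask = torch.full_like(scores, float("-inf"))
    mask[t_s:t_e] = 0.0                               # keep only the span
    weights = F.softmax(scores + mask, dim=-1)        # sums to 1 over the span
    return weights @ H                                # (d_m,) span representation

H = torch.randn(59, d_m)            # l_h = l_p + |x| = 19 + 40
cls_vec = torch.randn(d_m)          # hidden state of the [CLS]/[SOS] marker
span_vec = encode_span(H, cls_vec, 23, 26)   # a 3-token text span
type_vec = encode_span(H, cls_vec, 10, 11)   # a length-1 "type span"
print(torch.allclose(type_vec, H[10]))       # True: softmax over one position
```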
Traversal embedding. To distinguish the mixed spans at different positions in y^π, a simple approach is to add sinusoidal position embeddings to H_y (Vaswani et al., 2017). However, this approach treats the alternating sequence as an ordinary sequence and ignores the underlying graph structure it encodes. To alleviate this problem, the present application proposes a novel traversal embedding that captures traversal-level information, parent-child information, and intra-level connection information as an alternative to the original position embedding. The traversal embedding of the present application can encode either BFS or DFS traversal patterns; as an example, BFS traversal is assumed here.
FIG. 7: example of the BFS traversal embedding of the alternating sequence ["He", type, PER, [SEP], "Baghdad", type, GPE, PHYS, "He"]. The BFS traversal embedding of the present application is the point-wise sum of the layer embedding L, the parent-child embedding P, and the tree embedding T, given the alternating sequence y,
TravEmbed(y) = L(y) + P(y) + T(y) ∈ R^{|y| × d_m},
where the layer embedding assigns the same embedding vector L_i to every position of BFS traversal level i and fills the values of the embedding vectors with non-parametric sinusoidal position embeddings, since the present application expects its embeddings to extrapolate to sequences longer than any sequence in the training set. The parent-child embedding assigns different randomly initialized embedding vectors to the positions of parent and child nodes within a BFS traversal level, to help the model distinguish the two kinds of nodes. To encode intra-level connection information, the insight of the present application is that the connections within a BFS level can be seen as a tree of depth 3, where the first depth holds the parent node, the second depth holds the edge types, and the third depth consists of the child nodes corresponding to each edge type. The tree embedding of the present application is then formed by encoding the position information of this depth-3 tree with tree position embeddings (Shiv and Quirk, 2019) for each BFS level. Fig. 7 shows a specific example of how these embeddings work for a given alternating sequence. The resulting traversal embedding is then added point-wise to the hidden representation H_y of the alternating sequence to inject the traversal information of the graph structure.
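A simplified sketch of the BFS traversal embedding follows: it derives, for every position of an alternating sequence, a BFS-level index (incremented after each [SEP]) and a parent/child/edge role (the first element of a level is the parent, later even positions are children, odd positions are edge types), then sums a sinusoidal layer embedding with a learned role embedding. The depth-3 tree embedding of Shiv and Quirk (2019) is omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

d_m = 256

def sinusoidal(position, dim=d_m):
    """Non-parametric sinusoidal embedding for one position (Vaswani et al.)."""
    i = torch.arange(dim // 2, dtype=torch.float)
    angles = position / torch.pow(10000.0, 2 * i / dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)])

role_embedding = nn.Embedding(3, d_m)      # 0=parent, 1=child, 2=edge type

def bfs_traversal_embedding(seq, sep="[SEP]"):
    """Layer embedding (sinusoidal over the BFS level) + parent/child
    embedding for an alternating sequence; the tree embedding is omitted."""
    out, level, pos_in_level = [], 0, 0
    for item in seq:
        if pos_in_level == 0:
            role = 0                       # first element of a level: parent
        elif pos_in_level % 2 == 1:
            role = 2                       # odd local positions: edge types
        else:
            role = 1                       # later even positions: child nodes
        out.append(sinusoidal(level) + role_embedding(torch.tensor(role)))
        if item == sep:                    # [SEP] closes the BFS level
            level, pos_in_level = level + 1, 0
        else:
            pos_in_level += 1
    return torch.stack(out)                # (|y|, d_m), added to H_y point-wise

seq = ["He", "type", "PER", "[SEP]", "Baghdad", "type", "GPE", "PHYS", "He"]
print(bfs_traversal_embedding(seq).shape)  # torch.Size([9, 256])
```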
Internal blocks. After slicing the input text representation H_text from the mixed representation H, the present application feeds it, together with the target sequence representation H_y, into an N-layer Transformer structure with mixed attention (He et al., 2018), which allows the model to exploit edge or node features from different attention layers when decoding the alternating sequence. Note that the actual choice of neural architecture for the internal blocks is orthogonal to the hybrid span decoder of the present application; the mixed-attention Transformer design (He et al., 2018) is chosen because its layer-wise coordination property is empirically better suited to the heterogeneous decoding of the two different kinds of sequence elements. The detailed structure of the internal block is explained in Appendix E.
Hybrid span decoding. For the hybrid span decoding module, the present application first slices the hidden representation of the alternating sequence y^π out of the output of the N-layer internal blocks and denotes it H_y^N. Then, for each hidden representation h_yi^N ∈ H_y^N, 0 ≤ i ≤ |y^π|, the present application applies two different linear layers to obtain a start-position representation s_yi and an end-position representation e_yi,
s_yi = W_5 h_yi^N + b_5, e_yi = W_6 h_yi^N + b_6,
where W_5, W_6 ∈ R^{d_m × d_m} and b_5, b_6 ∈ R^{d_m} are learnable parameters. The present application then computes the scores for the type segment and the text segment of H separately and concatenates them before a final softmax operator to jointly estimate the probabilities of text spans and type spans,
where h_i is the score vector over the possible spans in the type segment of H, and t_i is the score vector over the possible spans in the text segment of H. Since the span length of a type span is always 1, the present application only needs an element-wise addition of the start-position scores h_si and end-position scores h_ei to compute h_i. The entries of t_i contain the scores of the text spans, t_si,j + t_ei,k with k − j < m, computed with the help of an unfold function that converts the vector t_ei ∈ R^n into a stack of n sliding windows of size m (the maximum span length) with stride 1. The alternation masks m_a ∈ R^{l_p} and m_a′ ∈ R^n are defined so that, at each decoding step, only the scores of the elements that may legally follow the previous element (edge types after a node, nodes after an edge type) are left unmasked, where l_e = |R| + |U| is the total number of edge types. Thus, although the present application models node and edge types jointly, the output distribution is forced by the alternation mask to produce an alternating decoding of nodes and edge types, which is the main reason the present application calls this decoder a hybrid span decoder.
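The effect of the alternation mask can be sketched as follows: before the softmax, the concatenated score vector over [edge types | node types | text spans] receives -inf either on the edge-type block (when a node must be produced) or on the node-type and text-span blocks (when an edge type must be produced). The block layout, sizes, and parity convention used here are illustrative assumptions, not the exact m_a and m_a′.

```python
import torch
import torch.nn.functional as F

l_e, l_q, n_span = 12, 7, 640     # edge+virtual types, node types, n*m span scores

def alternating_softmax(scores, expect_edge_type):
    """Apply an alternation mask to the concatenated scores
    [edge types | node types | text spans] and normalize.
    Illustrative sketch of the masking idea, not the exact m_a / m_a'."""
    mask = torch.zeros_like(scores)
    if expect_edge_type:
        mask[l_e:] = float("-inf")     # only edge / virtual edge types allowed
    else:
        mask[:l_e] = float("-inf")     # only node types or text spans allowed
    return F.softmax(scores + mask, dim=-1)

scores = torch.randn(l_e + l_q + n_span)
p_node = alternating_softmax(scores, expect_edge_type=False)
p_edge = alternating_softmax(scores, expect_edge_type=True)
print(p_node[:l_e].sum().item(), p_edge[l_e:].sum().item())   # 0.0 and 0.0
```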
4. Experiment
4.1 Experimental settings
The present application tests its model on the ACE 2005 dataset distributed by the LDC, which comprises 14.5k sentences, 38k entities (of 7 types) and 7,100 relations (of 6 types); these data come from the general news domain, see Appendix C for details.
Following previous work, the present application uses F1 as the evaluation metric for NER and RE. For the NER task, a prediction is marked as correct when both its type and its boundary span match a gold entity. For the RE task, a prediction is correct when both the relation type and the boundaries of the two entities are correct.
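The scoring convention above can be summarized with a small hypothetical helper, with entities represented as (start, end, type) tuples and relations as (head boundaries, tail boundaries, relation type) tuples:

```python
def f1(gold, pred):
    """Micro F1 over sets: a prediction counts as correct only if it matches
    a gold item exactly (type and boundaries for entities; relation type and
    both entity boundaries for relations)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_entities = {(0, 1, "PER"), (4, 5, "GPE")}
pred_entities = {(0, 1, "PER"), (4, 5, "LOC")}          # wrong type on one span
gold_relations = {((0, 1), (4, 5), "PHYS")}
pred_relations = {((0, 1), (4, 5), "PHYS")}
print(f1(gold_entities, pred_entities))                 # 0.5
print(f1(gold_relations, pred_relations))               # 1.0
```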
4.2 Implementation details
To train its model, the present application applies a cross-entropy loss with a label-smoothing factor of 0.1. The model is trained with 2048 tokens per batch (approximately 28 samples) for 25,000 steps using the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 2e-4, a weight decay of 0.01, and 2000 warmup steps under an inverse square root scheduler. Following the TabSeq model (Wang and Lu, 2020), the present application uses RoBERTa-large (Liu et al., 2019) or ALBERT-xxlarge-v1 (Lan et al., 2020) as the pre-trained language model during training and slows its learning rate by a factor of 0.1. The hidden dropout rate is 0.2 for RoBERTa-large and 0.1 for ALBERT-xxlarge-v1, and the dropout rate of the hybrid span decoder during training is also 0.1. The present application sets the maximum span length m = 16, the hidden size of the model d_m = 256, and the number of decoder blocks N = 12. Although beam search should in theory help reduce exposure bias, the present application did not observe any performance gain during a grid search over beam sizes and length penalties on the validation set (the detailed grid search is set forth in Appendix A). Therefore, the present application sets the beam size to 1 and the length penalty to 1, and leaves this discrepancy between theory and experiment for future study. The model of the present application was built with the FAIRSEQ toolkit (Ott et al., 2019) for efficient distributed training, and all experiments were performed on two NVIDIA TITAN X GPUs.
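For reference, the hyperparameters stated above can be collected in a plain dictionary (a hypothetical configuration object; the keys are not FAIRSEQ's actual argument names):

```python
hyperparameters = {
    "optimizer": "AdamW",
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "lr_scheduler": "inverse_sqrt",
    "warmup_steps": 2000,
    "train_steps": 25_000,
    "tokens_per_batch": 2048,
    "label_smoothing": 0.1,
    "pretrained_lm": ["RoBERTa-large", "ALBERT-xxlarge-v1"],
    "lm_learning_rate_factor": 0.1,     # LM learning rate slowed by 0.1x
    "lm_hidden_dropout": {"RoBERTa-large": 0.2, "ALBERT-xxlarge-v1": 0.1},
    "decoder_dropout": 0.1,
    "max_span_length_m": 16,
    "hidden_size_d_m": 256,
    "decoder_blocks_N": 12,
    "beam_size": 1,
    "length_penalty": 1.0,
}
```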
Table 1. Joint NER and RE F1 scores of IE models on the ACE05 test set. The complexity of the entity and relation decoding part of each model is also reported (n is the length of the input text). The performance of the TabSeq model reported here is based on the same ALBERT-xxlarge (Lan et al., 2020) pre-trained language model as the present application.
Table 2. Ablation study on the ACE05 test set. "-Traversal-embedding": the present application removes the traversal embedding and uses sinusoidal position embedding instead; the ablations below are based on the model after this ablation. "-Masking": the present application removes the alternation mask from the hybrid span decoder. "-BFS": the present application uses DFS instead of BFS as the traversal. "-Mixed-attention": the present application removes the mixed-attention layer and uses a standard Transformer encoder-decoder structure. "-Span-attention": the present application removes the span attention in the span encoding module and instead averages the word representations in the span.
4.3 Results
Table 1 compares the model of the present application with the previous state-of-the-art results on the ACE05 test set. Compared with the previous SOTA, TabSeq (Wang and Lu, 2020), which also uses the ALBERT pre-trained language model, the model of the present application with ALBERT performs significantly better on both the NER and RE scores while maintaining a linear space complexity that is an order of magnitude smaller than that of TabSeq. Compared with all previous joint IE models, the model of the present application is the first joint model with both linear space and linear time complexity, and therefore has the best scalability for large-scale real-world applications.
4.4 Ablation study
To demonstrate the effectiveness of its method, the present application conducted ablation experiments on the ACE05 dataset. As shown in Table 2, after removing the traversal embedding, the RE F1 score drops significantly, indicating that the traversal embedding helps encode the graph structure and improves relation prediction. Furthermore, if the alternation mask is abandoned, both the NER F1 and RE F1 scores drop significantly, which proves the importance of enforcing the alternating pattern. It can also be observed that the mixed-attention layer contributes significantly to relation extraction; this is because layer-wise coordination helps the decoder disentangle the source features and exploit features from different layers for entity and relation predictions. The present application further observes that the DFS traversal performs worse than BFS, presumably because, due to the nature of knowledge graphs, the alternating sequence resulting from DFS is generally longer than that from BFS, which increases the difficulty of learning.
4.5 Error analysis
After analyzing 80 remaining errors, the present application classifies and discusses the following common cases (fig. 8 is a schematic diagram of the distribution of remaining errors over the ACE05 test set). These may require additional functionality and strategies to address.
Context insufficiency: in many examples, the answer entity is a pronoun that cannot be typed accurately given the limited context. For example, in a sentence where the speakers say they do not want to use the word "destroyed" and that this is in fact what they say was done by others, it is difficult to correctly classify the pronoun as an organization. This can be mitigated by using the entire document as input to provide cross-sentence context.
Rare words: the problem with rare words is that some words in the test set rarely appear in the training set and generally do not appear in the vocabulary. In a sentence mentioning that the base also hosts Navy FA-18s and Harriers, the term "Harriers" (a vehicle that the model misclassifies as a person) neither appears in the training set nor is well understood by the pre-trained language model; in such cases the model can only rely on subword-level representations.
Background knowledge required: some entities mentioned in a sentence are difficult to infer from the context alone but are easily identified by looking up a knowledge base. For example, in a sentence stating that Airbus should issue a stronger warning, the model of the present application erroneously predicts Airbus as a vehicle, whereas Airbus here refers to the European aerospace company. The system of the present application also splits the United Nations and its regulations into two entities, generating a non-existent relation triple (regulations, part of the United Nations). Such errors can be avoided by consulting a knowledge base (e.g., DBpedia (Bizer et al., 2009)) or by performing entity linking.
Inherent ambiguity: many examples are inherently ambiguous; e.g., the European Union can be categorized as an organization or as a geopolitical entity, while some entities (e.g., military bases) can be both locations and organizations or facilities.
5. Related work
NER is typically performed jointly with RE to reduce error propagation and to learn the interrelations between the tasks. One approach is to consider the joint task as a table filling problem (Miwa and Sasaki, 2014; Gupta et al., 2016; Wang and Lu, 2020), where the i-th column or row represents the i-th token; the diagonal of the table carries the entity tags and the other entries carry the relations between pairs of tokens. Another line of work performs RE after NER. In Miwa and Bansal (2016), the authors used a BiLSTM (Graves et al., 2013) for NER and then a Tree-LSTM (Tai et al., 2015) over dependency graphs for RE. Wadden et al. (2019) and Luan et al. (2019), on the other hand, adopted a method of constructing dynamic text span graphs to detect entities and relations. Extending this line of work, Lin et al. (2020) presented ONEIE, which further incorporates global features based on cross-subtask and cross-instance constraints and aims to extract the IE results as a graph. Note that the model of the present application differs from ONEIE (Lin et al., 2020) in that it automatically captures global relations through autoregressive generation, while ONEIE uses feature-engineered templates; moreover, ONEIE requires pairwise classification for relation extraction, while the method of the present application efficiently generates the existing relations and entities.
While several Seq2Seq-based models have been proposed (Zhang et al., 2020; Zeng et al., 2018, 2020; Wei et al., 2019; Zhang et al., 2019) to generate triples (i.e., node-edge-node), the model of the present application differs fundamentally from them in that (1) it generates a BFS/DFS traversal of the target graph, which captures the dependencies between nodes and edges and yields a shorter target sequence, and (2) the present application models nodes as spans in the text, which are independent of the vocabulary, so that it can still generate spans for nodes based on context information even if their labels are rare or unseen words.
6. Conclusion
In this work, the present application proposes the hybrid span generation (HySPA) model, the first end-to-end text-to-graph extraction model with linear space and time complexity in the graph encoding phase. Beyond its scalability, the model also achieves the current state-of-the-art performance on the ACE05 joint entity and relation extraction task. Given the flexibility of the structure of the hybrid span generator, rich directions remain for future research, such as combining hybrid span generation with external knowledge, applying more efficient sparse self-attention, and developing better search methods to find more globally plausible graphs represented by alternating sequences.
In some embodiments, another method is provided in which the mixed-attention layer is removed and a standard Transformer encoder-decoder structure is used. This version has a simpler structure but performs worse than the version using the mixed-attention layer.
In some embodiments, another approach is also provided in which a DFS traversal is used instead of a BFS traversal to construct the alternating sequence representation of the graph; such a version also uses a DFS traversal embedding (see Appendix D for details) instead of the BFS traversal embedding. The graph extraction accuracy of this version is inferior to that of the BFS traversal.
In some embodiments, another method is also provided in which the words in a span are averaged to encode the span instead of performing attention-based span encoding. This version has a simpler model structure and fewer model parameters, but its graph extraction accuracy is inferior to that of attention-based span encoding.
Appendix A Hyperparameters
The present application uses 100-dimensional GloVe word embeddings trained on 6B tokens as initialization and freezes their updates during training. The feature embedding is encoded with a 30-dimensional LSTM, and the GloVe embeddings of out-of-vocabulary tokens are replaced with randomly initialized vectors, following Wang and Lu (2020). The present application uses gradient clipping of 0.25 during training. The number of mixed-attention heads is set to 8. The beam size and length penalty are determined by a grid search on the validation set of the ACE05 dataset, with the beam size ranging from 1 to 7 with a step of 1 and the length penalty ranging from 0.7 to 1.2 with a step of 0.1. The present application selects the best beam size and length penalty based on the relation extraction F1 score.
Appendix B training details
The model of the present application uses the ALBERT-xxlarge pre-trained language model with 236 million parameters. On average, the best model with ALBERT-xxlarge is trained for 20 hours on two NVIDIA TITAN X GPUs.
Appendix C data
The Automatic Content Extraction (ACE) 2005 dataset contains English, Arabic, and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation, providing entity, relation, and event annotations. The present application follows Wadden et al. (2019) for preprocessing and data splitting. The pre-processed data contain 7,100 relations, 38k entities, and 14.5k sentences. The split contains 10,051 training samples, 2,424 development samples, and 2,050 test samples.
Appendix D DFS traversal embedding
Since the parent-child information is already contained in the intra-level connections of a DFS traversal, the DFS traversal embedding of the present application is only the sum of the layer embedding and the connection embedding. Similar to the BFS embedding, the DFS layer embedding assigns the same embedding vector L_i to every position at DFS traversal level i, but the values of the embedding vectors are randomly initialized rather than filled with non-parametric sinusoidal position embeddings, because there is no proximity information between adjacent DFS traversal levels. For the elements within a DFS level, however, the present application does have explicit distance information: for the DFS level d = [A, B, C, ..., [SEP]], the distances from A to the elements [A, B, C, ..., [SEP]] are [0, 1, 2, 3, ..., |d| − 1]. The present application encodes this distance information with sinusoidal position embeddings, which become its connection embedding, capturing the intra-level connection information.
Appendix E Transformer with mixed-attention layers
The present application first slices the hidden representation of the input text out of the mixed representation H and denotes it H_text, and then feeds the input text representation H_text and the mixed span encoding H_y into a stack of N mixed-attention/feed-forward blocks with the structure shown in fig. 9.
Since generating node and edge types may require features from different layers, the present application uses mixed attention (He et al., 2018), which allows the model to exploit features from different attention layers when computing H_y. Here n = |x| is the length of the input text and l_m = |x| + |y^π| is the total length of the source and target features. The concatenation of the source features H_text and the target features H_y is denoted H^0, and source/target embeddings (He et al., 2018) are added to H^0 before the first mixed-attention layer to allow the model to distinguish features from the source and target sequences. A mixed-attention layer and a feed-forward layer combine to form a decoder block, where W_{q,k,v}, b_{q,k,v}, W_3 ∈ R^{d_m × 4d_m}, W_4 ∈ R^{4d_m × d_m}, b_3, b_4 are learnable parameters and LayerNorm denotes layer normalization (Ba et al., 2016). The decoder block is stacked N times to obtain the final hidden representation H^N, from which the final representation of the target sequence, H_y^N, is output. The time complexity of mixed attention when encoding the source features is O(n^2), but thanks to the causal masking of the target features, the present application can cache the hidden representation of this part when generating target tokens, thus keeping the time complexity at O(n) per decoding step.
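The overall shape of such a block can be sketched with a standard multi-head attention layer over the concatenated source and target features, with causal masking applied only within the target block; the layer-wise coordination specific to mixed attention (He et al., 2018) is not reproduced here, and all module and variable names are assumptions.

```python
import torch
import torch.nn as nn

d_m, n_heads = 256, 8

class DecoderBlockSketch(nn.Module):
    """One attention + feed-forward block over [H_text ; H_y].
    Source positions attend only to the source (so they can be cached);
    target positions see the source and earlier targets (causal masking)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_m, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_m, 4 * d_m), nn.ReLU(),
                                 nn.Linear(4 * d_m, d_m))
        self.norm1, self.norm2 = nn.LayerNorm(d_m), nn.LayerNorm(d_m)

    def forward(self, H_text, H_y):
        n, t = H_text.size(1), H_y.size(1)
        H0 = torch.cat([H_text, H_y], dim=1)          # (1, n + t, d_m)
        mask = torch.zeros(n + t, n + t, dtype=torch.bool)
        mask[:n, n:] = True                            # source ignores targets
        # target rows may not attend to future target columns
        mask[n:, n:] = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(H0, H0, H0, attn_mask=mask)
        H1 = self.norm1(H0 + attn_out)
        H2 = self.norm2(H1 + self.ffn(H1))
        return H2[:, :n], H2[:, n:]                   # updated H_text, H_y

block = DecoderBlockSketch()
H_text, H_y = torch.randn(1, 40, d_m), torch.randn(1, 9, d_m)
new_text, new_y = block(H_text, H_y)
print(new_y.shape)                                    # torch.Size([1, 9, 256])
```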
It should be noted that, for simplicity of description, the foregoing method embodiments are all illustrated as a series of acts combined, but it should be understood and appreciated by those skilled in the art that the present invention is not limited by the order of acts, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention. In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In some embodiments, embodiments of the present invention provide a non-transitory computer-readable storage medium having stored therein one or more programs comprising execution instructions that can be read and executed by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) to perform any of the above-described methods of extracting a graph from text of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method of extracting a graph from text of any of the above.
In some embodiments, the present invention further provides an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of extracting a graph from text.
In some embodiments, embodiments of the present invention also provide a storage medium having a computer program stored thereon, wherein the program when executed by a processor implements a method of extracting a graph from text.
Fig. 10 is a schematic hardware structure of an electronic device for executing a method for extracting a graph from text according to another embodiment of the present application, as shown in fig. 10, where the device includes:
one or more processors 1010, and a memory 1020, one processor 1010 being illustrated in fig. 10.
The apparatus for performing the method of extracting a graph from text may further include an input device 1030 and an output device 1040.
The processor 1010, memory 1020, input device 1030, and output device 1040 may be connected by a bus or other means, for example in fig. 10.
The memory 1020 serves as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs and modules, such as program instructions/modules corresponding to the method for extracting a graph from text in an embodiment of the present application. The processor 1010 executes various functional applications of the server and data processing, i.e., implements the method of extracting a graph from text of the above-described method embodiments, by running non-volatile software programs, instructions, and modules stored in the memory 1020.
The memory 1020 may include a storage program area that may store an operating system, an application program required for at least one function, a storage data area that may store data created according to the use of a device that extracts a drawing from text, and the like. In addition, memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 1020 may optionally include memory located remotely from processor 1010, which may be connected via a network to a device that extracts the graph from the text. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1030 may receive input numeric or character information and generate signals related to user settings and function control of the device that extracts the drawing from the text. The output 1040 may include a display device such as a display screen.
The one or more modules are stored in the memory 1020 that, when executed by the one or more processors 1010, perform the method of extracting a graph from text in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID and UMPC devices, etc., such as iPad.
(3) Portable entertainment devices such devices can display and play multimedia content. Such devices include audio, video players (e.g., iPod), palm game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.