
CN113127632B - Heterogeneous graph-based text summarization method and device, storage medium and terminal - Google Patents


Info

Publication number
CN113127632B
Authority
CN
China
Prior art keywords
sentence
text
word
vectors
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110533278.5A
Other languages
Chinese (zh)
Other versions
CN113127632A (en)
Inventor
蒋昌俊
闫春钢
丁志军
王俊丽
张亚英
张超波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202110533278.5A
Priority to PCT/CN2021/103504 (WO2022241913A1)
Publication of CN113127632A
Application granted
Publication of CN113127632B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a heterogeneous graph-based text summarization method and device, a storage medium and a terminal. The method comprises the following steps: performing knowledge fusion between a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and sentence features; updating the text heterogeneous graph through a graph attention network based on edge weights and attention weights, to obtain an updated text heterogeneous graph; calculating multiple summary metrics for the sentence vectors in the updated text heterogeneous graph, and calculating the classification weight of each sentence vector according to its corresponding summary metrics; weighting the sentence features in the updated text heterogeneous graph by the classification weights of the sentence vectors, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text summary according to the acquired sentence labels. The invention adopts a more direct approach in which sentences and words serve as two types of nodes of a heterogeneous graph, with word nodes acting as intermediaries between sentences, thereby enriching the associations among sentences and enabling indirect information transfer.

Description

Heterogeneous graph-based text summarization method and device, storage medium and terminal

Technical Field

The present invention relates to the technical field of text generation, and in particular to a heterogeneous graph-based text summarization method and device, a storage medium, and a terminal.

Background

Automatic text summarization is an important class of tasks in natural language processing; it aims to compress an original text into a short description that captures its main content. Research falls into two main branches: abstractive and extractive. Abstractive methods generate a summary word by word after encoding the entire document, whereas extractive methods select sentences directly from the document and combine them into a summary. Compared with abstractive methods, extractive summarization is more efficient and produces more readable summaries.

The key step in extractive text summarization is to establish the relationships among sentences and between each sentence and the document. Most existing methods obtain sentence associations with recurrent neural networks (RNNs), but such methods usually fail to capture long-range dependencies between sentences. Representing text as a graph is an effective way to address this problem, yet how to model text as a graph in a principled way remains an open question. Recently, graph neural networks (GNNs) have demonstrated powerful feature-extraction capability on graph data, and GNN-based text summarization methods have been proposed accordingly. Some works use Rhetorical Structure Theory (RST) to decompose sentences into basic semantic units (EDUs), build an RST structure tree, and then apply a graph convolutional network (GCN) to aggregate and update graph information. Although EDU-based approaches achieve good results, the process of generating EDUs is complicated, and only a single type of node is used to construct the graph. Moreover, the strength of inter-sentence association matters greatly for extractive summarization, but existing heterogeneous-graph work only adds edges between nodes of different types, leaving sentences with no direct connections to one another.

Summary of the Invention

The technical problem addressed by the present invention is that, in existing GNN-based summary generation methods, the process of generating basic semantic units is complicated, only one type of node is used to construct the graph, and the associations between sentences are weak, which is unfavorable for extractive summary generation.

To solve the above technical problem, the present invention provides a heterogeneous graph-based text summarization method, comprising:

performing knowledge fusion between a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and sentence features;

updating the text heterogeneous graph through a graph attention network based on edge weights and attention weights, to obtain an updated text heterogeneous graph;

calculating multiple summary metrics for the sentence vectors in the updated text heterogeneous graph, and calculating the classification weight of each sentence vector according to its corresponding summary metrics;

weighting the sentence features in the updated text heterogeneous graph by the classification weights of the corresponding sentence vectors, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text summary according to the acquired sentence labels.

Preferably, performing knowledge fusion between the preset knowledge base and the target text, and acquiring word features and sentence features of the target text, comprises:

encoding and vectorizing the knowledge in the preset knowledge base and the content of the target text, respectively, to obtain the knowledge vectors of the preset knowledge base and the word vectors of the target text;

calculating the attention weights between each word vector in the target text and the knowledge vectors in the preset knowledge base, to obtain the attention weight of each word vector in the target text;

using the attention weight of each word vector in the target text in turn, computing a weighted combination of the knowledge vectors in the preset knowledge base, to obtain the knowledge weight of each word vector in the target text;

acquiring the word feature of each word vector based on its knowledge weight;

performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector of the target text, to obtain the local features and global features of each sentence vector, and then deriving the sentence feature of each sentence vector from its local and global features.
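
As a rough sketch of this local-plus-global feature step, the snippet below uses plain-Python stand-ins: a max-pooled sliding-window mean in place of the CNN (local features) and a whole-sentence mean in place of the BiLSTM (global features), concatenated into one sentence feature. The function name, window size, and pooling choices are illustrative assumptions, not taken from the patent.

```python
def sentence_feature(word_feats, window=2):
    """Local + global feature capture for one sentence (toy stand-ins).

    A max-pooled sliding-window mean replaces the CNN (local features)
    and a whole-sentence mean replaces the BiLSTM (global features);
    the two are concatenated, as described in the text.
    """
    dim = len(word_feats[0])
    n = len(word_feats)
    # all length-`window` spans of word features within the sentence
    wins = [word_feats[i:i + window] for i in range(n - window + 1)]
    # local: per-dimension max over the window means
    local = [max(sum(w[d] for w in win) / window for win in wins)
             for d in range(dim)]
    # global: per-dimension mean over all word features
    glob = [sum(wf[d] for wf in word_feats) / n for d in range(dim)]
    return local + glob

# three toy 2-dimensional word features for one sentence
feats = sentence_feature([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The returned vector has twice the word-feature dimensionality, since local and global parts are concatenated.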

Preferably, constructing the text heterogeneous graph of the target text based on the word features and sentence features comprises:

based on the sentence features of the sentence vectors in the target text, calculating the homogeneous edge weight between every pair of sentence vectors by cosine similarity;

based on the word features of the word vectors in the target text and the sentence features of the sentence vectors they belong to, calculating the heterogeneous edge weights between all word vectors and their containing sentence vectors by the TF-IDF algorithm;

taking the word vectors of the target text as word nodes and the sentence vectors as sentence nodes, and constructing the text heterogeneous graph of the target text from the word features of the word vectors, the sentence features of the sentence vectors, the homogeneous edge weights between sentence vectors, and the heterogeneous edge weights between word vectors and their containing sentence vectors.
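
The edge construction above can be sketched as follows. This is a minimal illustration assuming toy sentence vectors and pre-tokenized sentences, with sentences playing the role of documents in the TF-IDF computation; the exact TF-IDF variant and all names are assumptions, not fixed by the patent.

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def homogeneous_edges(sent_vecs):
    # homogeneous edge weight e'_ij for every pair of sentence nodes
    m = len(sent_vecs)
    return {(i, j): cosine(sent_vecs[i], sent_vecs[j])
            for i in range(m) for j in range(i + 1, m)}

def tfidf_edges(sentences):
    # heterogeneous edge weight e_ij between a word node and the
    # sentence it occurs in, with sentences treated as documents
    m = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(sent))
    edges = {}
    for i, sent in enumerate(sentences):
        tf = Counter(sent)
        for w, c in tf.items():
            idf = math.log(m / df[w])
            edges[(w, i)] = (c / len(sent)) * idf
    return edges

sents = [["graph", "attention", "network"],
         ["text", "summary", "graph"]]
e_heter = tfidf_edges(sents)
e_homo = homogeneous_edges([[1.0, 0.0], [0.8, 0.6]])
```

Note that a word occurring in every sentence (here "graph") receives a zero TF-IDF weight, which downplays uninformative words in the graph.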

Preferably, updating the text heterogeneous graph through the graph attention network based on edge weights and attention weights, to obtain the updated text heterogeneous graph, comprises:

calculating the attention weights between every pair of sentence vectors in the target text, calculating the attention weights between each word vector and its containing sentence vector, and acquiring all homogeneous edge weights and all heterogeneous edge weights of the target text;

based on the pairwise attention weights between sentence vectors, the attention weights between word vectors and their containing sentence vectors, all homogeneous edge weights, and all heterogeneous edge weights, updating all word nodes and all sentence nodes in the text heterogeneous graph through the graph attention network, to obtain the updated text heterogeneous graph.

Preferably, updating all word nodes and all sentence nodes in the text heterogeneous graph through the graph attention network, based on the pairwise attention weights between sentence vectors, the attention weights between word vectors and their containing sentence vectors, all homogeneous edge weights, and all heterogeneous edge weights, comprises:

taking a word node as the central node, and performing a weighted aggregation of the sentence features of the sentence nodes connected to the central node, where each weight is the product of the attention weight and the heterogeneous edge weight between the central node and the connected sentence node, thereby updating the word node;

taking a sentence node as the central node, and performing a weighted aggregation of the word features of the word nodes connected to the central node, where each weight is the product of the attention weight and the heterogeneous edge weight between the central node and the connected word node, thereby updating the sentence node;

taking a sentence node as the central node, and performing a weighted aggregation of the sentence features of the sentence nodes connected to the central node, where each weight is the product of the attention weight and the homogeneous edge weight between the central node and the connected sentence node, thereby updating the sentence node.
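
A minimal sketch of one such aggregation step follows. The attention logits are plain dot products standing in for the learned graph-attention scoring, and renormalizing after multiplying attention by edge weight is an illustrative choice; neither detail is specified by the patent.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def update_node(center, neighbors, edge_weights):
    # Attention logits here are dot products (a stand-in for the learned
    # graph-attention layer); each neighbor's contribution is scaled by
    # attention weight * edge weight, as in the update rule above, then
    # renormalized so the weights sum to one.
    logits = [sum(c * n for c, n in zip(center, nb)) for nb in neighbors]
    att = softmax(logits)
    weights = [a * e for a, e in zip(att, edge_weights)]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(center)
    return [sum(w * nb[d] for w, nb in zip(weights, neighbors))
            for d in range(dim)]

# a word node aggregating the features of its two connected sentence nodes
word = [1.0, 0.0]
sent_feats = [[0.5, 0.5], [1.0, 0.0]]
new_word = update_node(word, sent_feats, edge_weights=[0.2, 0.8])
```

The same routine applies unchanged for sentence-centered updates: only the neighbor set and the edge-weight type (homogeneous vs. heterogeneous) differ.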

Preferably, calculating the multiple summary metrics of the sentence vectors in the updated text heterogeneous graph, and calculating the classification weight of each sentence vector according to its corresponding summary metrics, comprises:

calculating a relevance score, a redundancy score, a new-information score, and a recall-oriented evaluation metric score for each sentence vector in the updated text heterogeneous graph;

calculating, via a Sigmoid function, the classification weight of each sentence vector based on its relevance score, redundancy score, new-information score, and recall-oriented evaluation metric score.

Preferably, calculating the relevance score, redundancy score, new-information score, and recall-oriented evaluation metric score of a single sentence vector in the updated text heterogeneous graph comprises:

calculating the relevance score of the sentence vector through a bilinear function, based on the text feature of the updated text heterogeneous graph and the sentence feature of the sentence vector;

calculating the redundancy score of the sentence vector through a bilinear function, based on the sentence feature of the sentence vector in the updated text heterogeneous graph;

calculating the new-information score of the sentence vector through a bilinear function, based on the sentence feature of the sentence vector in the updated text heterogeneous graph and the knowledge vectors in the preset knowledge base;

calculating the recall-oriented evaluation metric score of the sentence vector through a recall-oriented evaluation metric function, based on the unencoded target text and the textual content of the sentence vector within it.
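
The bilinear scoring and Sigmoid weighting described above can be sketched as follows. Combining the four metric scores by a simple sum inside the Sigmoid is an assumption made for illustration, and the parameter matrix `W` would be a trained parameter in the actual model; the toy feature values are invented.

```python
import math

def bilinear(u, W, v):
    # u^T W v: the bilinear form used for the relevance, redundancy,
    # and new-information scores (W is learned in the real model)
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classification_weight(metric_scores):
    # gate the sentence by its summary metrics; summing the scores
    # before the Sigmoid is an illustrative assumption
    return sigmoid(sum(metric_scores))

sentence = [0.2, 0.4]          # sentence feature (toy values)
document = [0.5, 0.5]          # text feature of the whole graph
W = [[1.0, 0.0], [0.0, 1.0]]   # stand-in parameter matrix (identity)
relevance = bilinear(sentence, W, document)
# redundancy, new-information, and recall-oriented scores as toy constants
w_cls = classification_weight([relevance, -0.1, 0.3, 0.2])
```

The resulting weight lies in (0, 1) and is used to scale the sentence feature before classification.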

To solve the above technical problem, the present invention further provides a heterogeneous graph-based text summarization device, comprising:

a text heterogeneous graph construction module, configured to perform knowledge fusion between a preset knowledge base and a target text, acquire word features and sentence features of the target text, and construct a text heterogeneous graph of the target text based on the word features and sentence features;

an update module, configured to update the text heterogeneous graph through a graph attention network based on edge weights and attention weights, to obtain an updated text heterogeneous graph;

a classification weight acquisition module, configured to calculate multiple summary metrics for the sentence vectors in the updated text heterogeneous graph, and calculate the classification weight of each sentence vector according to its corresponding summary metrics;

a summary generation module, configured to weight the sentence features in the updated text heterogeneous graph by the classification weights of the sentence vectors, acquire corresponding sentence labels based on the weighted sentence features, and generate a text summary according to the acquired sentence labels.

To solve the above technical problem, the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the heterogeneous graph-based text summarization method when executed by a processor.

To solve the above technical problem, the present invention further provides a terminal, comprising a processor and a memory;

the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the terminal performs the heterogeneous graph-based text summarization method.

Compared with the prior art, one or more embodiments of the above solutions may have the following advantages or beneficial effects:

Applying the heterogeneous graph-based text summarization method provided by the embodiments of the present invention, the text is connected into a text heterogeneous graph according to semantic and syntactic relations, the features of the two node types (words and sentences) are updated with a graph attention network, and multiple summary-related metrics are designed to weight and evaluate sentences for final summary extraction. This accounts not only for information transfer between words and sentences but also for the mutual influence among sentences. The additional external knowledge base further helps the model understand the text, and the multi-angle metrics designed for the summarization task add corresponding weights to sentences before classification, effectively improving the model's ability to exploit text features and thus producing more accurate and readable summaries.

Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the description, claims, and drawings.

Brief Description of the Drawings

The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the specification; together with the embodiments of the present invention, they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a flowchart of the heterogeneous graph-based text summarization method according to Embodiment 1 of the present invention;

Fig. 2 is a schematic process diagram of the heterogeneous graph-based text summarization method according to Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the text heterogeneous graph construction process in the method according to Embodiment 1 of the present invention;

Fig. 4 is a schematic diagram of the single-layer update process of the text heterogeneous graph in the method according to Embodiment 1 of the present invention;

Fig. 5 shows the results of the ablation experiments for the method according to Embodiment 1 of the present invention;

Fig. 6 shows the experimental results on the CNN & DailyMail dataset and comparative results against other summarization methods in Embodiment 1 of the present invention;

Fig. 7 illustrates the influence of the multi-angle metrics on the summary in the method according to Embodiment 1 of the present invention;

Fig. 8 shows a quantification example of the multi-angle metrics according to Embodiment 1 of the present invention;

Fig. 9 is a schematic structural diagram of the heterogeneous graph-based text summarization device according to Embodiment 2 of the present invention;

Fig. 10 is a schematic structural diagram of the terminal according to Embodiment 4 of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that the process by which the present invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly. It should be noted that, as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with one another, and all resulting technical solutions fall within the protection scope of the present invention.

Automatic text summarization is an important class of tasks in natural language processing. There are two main approaches to summary generation: abstractive and extractive. Abstractive methods generate a summary word by word after encoding the entire document, whereas extractive methods select sentences directly from the document and combine them into a summary. Compared with abstractive methods, extractive summarization has the advantages of higher efficiency and better readability. The key step in extractive summarization is to establish the relationships among sentences and between each sentence and the document, and existing methods usually fail to capture long-range dependencies between sentences. Recently, graph neural networks (GNNs) have demonstrated powerful feature-extraction capability on graph data, and GNN-based text summarization methods have been proposed accordingly. Although such methods achieve good results, the process of generating basic semantic units is complicated, only one type of node is used to construct the graph, and edges are added only between nodes of different types, leaving the associations between sentences weak.

Embodiment 1

To solve the above technical problems in the prior art, an embodiment of the present invention provides a heterogeneous graph-based text summarization method.

Fig. 1 is a flowchart of the heterogeneous graph-based text summarization method according to Embodiment 1 of the present invention, and Fig. 2 is a schematic process diagram of the method. Referring to Figs. 1 and 2, the extractive text summarization method based on a heterogeneous graph according to this embodiment comprises the following steps.

Further, to describe the specific implementation of the heterogeneous graph-based extractive text summarization method of the present invention more clearly, the following definitions are made in advance:

Definition of sentence set and word set: given a target text d containing m sentences and n words, S = {s1, s2, ..., sm} is the sentence set of d, and Wi = {wi1, wi2, ...} is the word set of sentence si.

Definition of the text graph: G = {V, E} denotes a graph, where V is the node set and E is the edge set. Since the heterogeneous graph used in the present invention contains two types of nodes, V can be divided into a word-node set and a sentence-node set. Accordingly, the text graph TG = {V_TG, E_TG} is designed as a heterogeneous graph, where:

(1) V_TG = W ∪ S contains the two types of nodes: W = {W1, W2, ..., Wm} is the collection of word sets, and S = {s1, s2, ..., sm} is the sentence set.

(2) E_TG = E_heter ∪ E_homo, where E_heter = {(wij, si) | wij ∈ Wi, si ∈ S} denotes the heterogeneous edges and E_homo = {(si, sj) | si, sj ∈ S} denotes the homogeneous edges.

(3) eij denotes the weight of the heterogeneous edge (wij, si), and e′ij denotes the weight of the homogeneous edge (si, sj).

Step S101: perform knowledge fusion between a preset knowledge base and the target text, acquire the word features and sentence features of the target text, and construct the text heterogeneous graph of the target text based on the word features and sentence features.

The first task is knowledge fusion, i.e., using an external knowledge base to enrich the word features so that the text representation is both semantics-aware and knowledge-aware. Specifically, a preset knowledge base is selected whose language must match that of the target text: when the target text is Chinese, a Chinese knowledge base is selected; when the target text is English, an English knowledge base is selected. Next, to integrate the knowledge of the preset knowledge base into the word features of the target text, the knowledge in the selected knowledge base and the content of the target text are first encoded and vectorized separately, yielding all knowledge vectors of the preset knowledge base and all word vectors of the target text. To simplify the description, let d denote the encoded and vectorized target text, W the collection of word-vector sets of the target text, and wi a word vector in W; further, let K denote the encoded and vectorized preset knowledge base and k a knowledge vector in K.

The word feature of each word vector in the target text is then computed. For each word vector w_i, a bilinear operation computes the attention weights between w_i and all knowledge vectors k in the preset knowledge base; the attention weight β_i of the word vector w_i is calculated as:

β_i = BiLinear(K, W_KB, w_i)    (1)

where W_KB is a trainable weight parameter.

After the attention weight of every word vector in the target text has been computed, the knowledge vectors in the preset knowledge base are weighted and merged, using each word vector's attention weights in turn, to obtain the knowledge weight of every word vector in the target text. The knowledge weight knowledge of a single word vector is obtained as:

knowledge = β_i K    (2)

At this point the knowledge weight knowledge of the word vector already contains the word-related knowledge.

The word feature of each word vector is then obtained from its knowledge weight: the knowledge weight is concatenated with the corresponding word vector, w_k = [w, knowledge], giving a word feature that is both semantics-aware and knowledge-aware.
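The fusion above, bilinear attention over the knowledge base (Eq. 1), weighted merging (Eq. 2), and concatenation, can be sketched in plain Python. The softmax normalisation of the bilinear scores is an assumption; the patent does not spell out how β_i is normalised:

```python
import math

def knowledge_fuse(w, K, W_KB):
    """Fuse external knowledge into a single word vector.

    w    : word vector, length d
    K    : list of knowledge vectors, each length d
    W_KB : d x d bilinear weight matrix (trainable in the real model,
           fixed here for illustration)
    Returns the knowledge-aware word feature [w, knowledge], length 2d.
    """
    d = len(w)
    # bilinear attention score of w against each knowledge vector k (Eq. 1)
    Ww = [sum(W_KB[r][c] * w[c] for c in range(d)) for r in range(d)]
    scores = [sum(k[r] * Ww[r] for r in range(d)) for k in K]
    # softmax-normalise the scores into attention weights beta_i (assumption)
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    beta = [e / sum(exp) for e in exp]
    # weighted merge of the knowledge vectors (Eq. 2)
    knowledge = [sum(b * k[c] for b, k in zip(beta, K)) for c in range(d)]
    # concatenation w_k = [w, knowledge]
    return list(w) + knowledge
```

Because β sums to one, the knowledge part is a convex combination of the knowledge vectors, so the fused feature stays on the same scale as the inputs.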

Once the word features of all word vectors of the target text are available, the sentence features of all its sentence vectors can be obtained. Specifically, for each sentence vector of the encoded target text, local features and global features are captured from the word features of the word vectors it contains, and the sentence feature of each sentence vector is then derived from its local and global features. Local features are extracted with a convolutional neural network (CNN); global features are extracted with a BiLSTM. After the sentence features of all sentence vectors of the encoded target text are obtained, the text feature of the encoded target text can also be computed. The sentence features and the text feature are calculated as follows:

s_i = [CNN([w_1, ..., w_n]); BiLSTM([w_1, ..., w_n])]    (3)

D = BiLSTM([s_1, ..., s_m])    (4)

With the word features and sentence features of the encoded target text in hand, the text heterogeneous graph of the target text can be constructed from semantic and syntactic relations; before that, the word-sentence heterogeneous edge weights and the sentence-sentence homogeneous edge weights must be obtained. The construction process of the text heterogeneous graph is shown in FIG. 3. Referring to FIG. 3, this embodiment of the present invention treats text summarization as a classification problem with the sentence as the smallest unit to be classified, so the relations between sentences are particularly important when generating a summary. Concretely, homogeneous edges are added between sentences to build a fully connected undirected sentence-sentence graph, where a homogeneous edge represents the semantic relevance between two sentences. The homogeneous edge weights are obtained as follows: based on the sentence features of the sentence vectors of the encoded target text, the homogeneous edge weight between every pair of sentence vectors is computed by cosine similarity. A heterogeneous edge between a word and a sentence represents their membership relation; to add more text-related information, this embodiment computes, via the TF-IDF algorithm, the heterogeneous edge weight between every word vector and the sentence vector it belongs to, based on the word feature of the word vector and the sentence feature of that sentence vector.

Taking the word vectors of the target text as word nodes and its sentence vectors as sentence nodes, the text heterogeneous graph of the target text is constructed from the word features of the word vectors, the sentence features of the sentence vectors, the homogeneous edge weights between sentence vectors, and the heterogeneous edge weights between each word vector and the sentence vector it belongs to.
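The two edge-weight computations can be sketched as follows. The exact TF-IDF variant is an assumption: here each sentence of the document is treated as one "document" for the IDF term, with add-one smoothing in the denominator:

```python
import math
from collections import Counter

def cosine(u, v):
    """Homogeneous edge weight: cosine similarity of two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tf_idf(word, sentence, all_sentences):
    """Heterogeneous edge weight: TF-IDF of a word w.r.t. the sentence
    containing it. Each sentence is one 'document' for the IDF term
    (an illustrative assumption, not the patent's exact formulation)."""
    tf = Counter(sentence)[word] / len(sentence)
    df = sum(1 for s in all_sentences if word in s)
    idf = math.log(len(all_sentences) / (1 + df))
    return tf * idf
```

In the graph, `cosine` is evaluated for every sentence pair, while `tf_idf` is evaluated only along the word-sentence membership edges.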

Step S102: based on the edge weights and attention weights, update the text heterogeneous graph through a graph attention network to obtain an updated text heterogeneous graph.

To further describe how the graph attention network updates the text heterogeneous graph, we denote the word features and sentence features of the encoded target text as the hidden states H_w of the word nodes and H_s of the sentence nodes, respectively, and denote the text feature as H_D.

FIG. 4 is a schematic diagram of a single-layer update of the text heterogeneous graph in the heterogeneous-graph-based text summarization method according to Embodiment 1 of the present invention. Referring to FIG. 4, the attention weights between every pair of sentence vectors of the target text are computed, the attention weights between every word vector and the sentence it belongs to are computed, and all homogeneous edge weights and all heterogeneous edge weights of the target text are obtained. The attention weight between sentence vectors is computed as shown below; the attention weight between a word vector and the sentence it belongs to can be computed with the same formula:

The attention weight between two sentence vectors is computed as:

α_ij = softmax_j( LeakyReLU( W_a [W_q h_i ; W_k h_j] ) )    (5)

where h_i and h_j denote the hidden states of two sentence nodes, W_a, W_q, W_k, W_v are trainable parameters, and α_ij is the attention weight between h_i and h_j.

The update increment u_i is then computed from the attention weights via formula (6):

u_i = σ( Σ_{j∈N_i} α_ij W_v h_j )    (6)

where u_i may denote either a word-node increment or a sentence-node increment, and N_i denotes the set of neighbour nodes of node i.

To let semantic relatedness also participate in the update, the heterogeneous edge weights e_ij and homogeneous edge weights e′_ij are introduced, controlling the degree of node updating from both the semantic and the attention-model perspective. The homogeneous and heterogeneous edge weights are computed as follows:

e_ij = TF-IDF(w_i, s_j)    (7)

e′_ij = cos(s_i, s_j)    (8)

Equation (6) can then be modified as:

u_i = σ( Σ_{j∈N_i} e_ij α_ij W_v h_j )    (9)

The above is how the graph attention network computes the update increments from the attention weights between sentence vectors, the attention weights between word vectors and their sentences, and the homogeneous and heterogeneous edge weights.
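A minimal sketch of formula (6) with the edge weights introduced, i.e. the edge-weighted attention aggregation around one centre node. The LeakyReLU-scored concatenation and the softmax over neighbours follow the standard graph-attention formulation, and the outer activation σ(·) is omitted; both are assumptions about details the patent does not spell out:

```python
import math

def update_increment(h_i, neighbors, edge_w, W_q, W_k, W_v, W_a):
    """Edge-weighted graph-attention update increment u_i for one node.

    h_i       : hidden state of the centre node (length d)
    neighbors : hidden states h_j of its neighbour nodes
    edge_w    : edge weights e_ij (homogeneous or heterogeneous)
    W_q/W_k/W_v : d x d projection matrices; W_a : length-2d scoring vector
    """
    matvec = lambda M, v: [sum(m * x for m, x in zip(row, v)) for row in M]
    q = matvec(W_q, h_i)
    # attention logits: LeakyReLU of a scored concatenation (assumed form)
    logits = []
    for h_j in neighbors:
        z = sum(a * x for a, x in zip(W_a, q + matvec(W_k, h_j)))
        logits.append(z if z > 0 else 0.2 * z)        # LeakyReLU
    m = max(logits)
    exp = [math.exp(z - m) for z in logits]
    alpha = [e / sum(exp) for e in exp]               # softmax over neighbours
    # edge weights e_ij modulate the attention weights alpha_ij
    u = [0.0] * len(h_i)
    for a, e, h_j in zip(alpha, edge_w, neighbors):
        v = matvec(W_v, h_j)
        u = [ui + a * e * vi for ui, vi in zip(u, v)]
    return u
```

Setting an edge weight to zero removes that neighbour's contribution entirely, which is how the semantic side (TF-IDF / cosine) gates the attention side.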

Updating the text heterogeneous graph through the graph attention network in fact means updating the word nodes and the sentence nodes of the graph. Each update iteration of the graph attention network comprises three processes: sentence nodes updating word nodes, word nodes updating sentence nodes, and sentence nodes updating one another.

Sentence nodes update word nodes as follows: with a word node as the centre node, the sentence features of the sentence nodes connected to it are aggregated, weighted by the product of the attention weight and the heterogeneous edge weight of each connected sentence node, thereby updating the word node. Word nodes update sentence nodes as follows: with a sentence node as the centre node, the word features of the word nodes connected to it are aggregated, weighted by the product of the attention weight and the heterogeneous edge weight of each connected word node, thereby updating the sentence node. Sentence nodes update one another as follows: with a sentence node as the centre node, the sentence features of the sentence nodes connected to it are aggregated, weighted by the product of the attention weight and the homogeneous edge weight of each connected sentence node, thereby updating the sentence node. At the sentence-node level, an LSTM can additionally be applied to update the text feature H_D. In every case the aggregation is a weighted aggregation using the corresponding attention weights and edge weights. To illustrate, the t-th update of the graph attention network is presented below:

U_{s←w}^t = GAT(G, H_s^{t-1}, H_w^{t-1})    (10)

H_s'^t = MLP(U_{s←w}^t + H_s^{t-1})    (11)

U_{s←s}^t = GAT(G, H_s'^t, H_s'^t)    (12)

H_s^t = MLP(U_{s←s}^t + H_s'^t)    (13)

H_w^t = MLP(GAT(G, H_w^{t-1}, H_s^t) + H_w^{t-1})    (14)

where GAT(G, H_s, H_w) denotes the graph attention update layer, G is the text graph, H_s are the sentence features, serving as the query matrix of the attention mechanism, and H_w are the word features, serving as the key and value matrices. The quantity aggregated by GAT(G, H_s, H_w) is the message passed from words to sentences, and the hidden states of the layer are updated through a multilayer perceptron (MLP). Preferably, the multilayer perceptron contains two linear hidden layers.
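One full iteration, the three sub-updates each aggregated by a GAT layer and refined by an MLP as described above, might be orchestrated as follows. `gat(queries, keyvals)` and `mlp` are stand-ins for the trained layers, and the exact ordering of the sub-updates within an iteration is an assumption:

```python
def gat_iteration(Hw, Hs, gat, mlp):
    """One update iteration of the heterogeneous graph: sentence nodes
    update word nodes, word nodes update sentence nodes, and sentence
    nodes update one another.

    Hw, Hs : lists of word-node / sentence-node states
    gat    : gat(queries, keyvals) -> one increment per query node
    mlp    : refines a list of states (residual added before the call)
    """
    u_w = gat(Hw, Hs)                                  # sentences -> words
    Hw = mlp([u + h for u, h in zip(u_w, Hw)])
    u_sw = gat(Hs, Hw)                                 # words -> sentences
    u_ss = gat(Hs, Hs)                                 # sentences <-> sentences
    Hs = mlp([a + b + h for a, b, h in zip(u_sw, u_ss, Hs)])
    return Hw, Hs
```

Stacking several such iterations lets information propagate across sentences both directly (homogeneous edges) and indirectly through shared words.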

After each iteration, the text feature H_D is also updated:

H_D^t = LSTM(H_s^t, H_D^{t-1})

Through the update iterations of this homogeneous-heterogeneous graph structure based on the graph attention network (GAT), sentences obtain more cross-sentence information through their indirect connections via words, and the homogeneous edges between sentence vectors let them capture long-distance associations, providing more information for summary extraction.

Step S103: compute multiple summary indicators for the sentence vectors in the updated text heterogeneous graph, and compute the classification weight of each sentence vector from its summary indicators.

Specifically, for each sentence vector in the updated text heterogeneous graph, a relevance score, a redundancy score, a new-information score, and a recall-oriented metric score are computed; from these, the classification weight of the corresponding sentence vector is computed via a Sigmoid function. That is, to extract suitable sentences for the summary, the present invention defines multi-angle sentence evaluation indicators, scoring sentences from four perspectives, relevance (Rel), redundancy (Red), new information (Info), and the Rouge-F1 score, weighting the sentence features with these scores, and selecting the best N sentences as the extraction result.

Relevance is a very intuitive metric expressing how related a sentence is to the full text; the higher its value, the better the sentence represents the topic of the article. The relevance score of a single sentence in the updated text heterogeneous graph is computed from the text feature of the updated graph and the sentence feature of that sentence vector via a bilinear function. Redundancy is the counterpart of relevance: a good summary should not only match the topic of the original text but also stay as concise as possible, i.e. the redundancy of the summary itself should be as low as possible. The redundancy score of a single sentence vector is computed from its sentence feature in the updated graph via a bilinear function. Relevance is a criterion that ignores background knowledge and other sources of information, whereas the amount of new information evaluates the text content in combination with background knowledge: readers of a summary hope to learn things they did not know before, and this new knowledge is the new information. The new-information score of a single sentence vector is computed from its sentence feature in the updated graph and the knowledge vectors of the preset knowledge base via a bilinear function. Rouge is a machine scoring method commonly used in text summarization, and using it as an evaluation indicator further improves the accuracy of the summary. The recall-oriented metric score of a single sentence vector is computed from the unencoded target text and the text content of that sentence vector via the recall-oriented metric function.

The four summary indicators are combined to score a sentence s; the specific formulas are:

Rel = h_s W_rel H_D    (15)

Red = h_s W_red A_s    (16)

Info = h_s W_info H_k    (17)

Rouge = R(s, ref)    (18)

Score = Sigmoid(Rel - Red + Info + Rouge)    (19)

where h_s is the feature vector of sentence s, H_D is the feature vector of the text it belongs to, H_k is the encoded knowledge base, s is the sentence itself, ref is the reference summary, W_rel, W_red, and W_info are learnable parameters, and R is the Rouge computation function. The combined indicator result is passed through the Sigmoid function to give the classification weight of the current sentence.
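The scoring of Eqs. (15)-(19) can be sketched as follows. As an illustrative simplification, the learnable bilinear weights are taken to be diagonal (plain lists), so each bilinear form reduces to an element-wise product; the Rouge term of Eq. (18) is passed in pre-computed:

```python
import math

def sentence_score(h_s, H_D, A_s, H_k, rouge, W_rel, W_red, W_info):
    """Classification weight of one sentence, combining the four
    summary indicators. Diagonal weight 'matrices' are an
    illustrative stand-in for the learnable bilinear forms."""
    bilinear = lambda W, v: sum(a * wi * b for a, wi, b in zip(h_s, W, v))
    rel = bilinear(W_rel, H_D)     # Eq. (15): relevance to the document
    red = bilinear(W_red, A_s)     # Eq. (16): redundancy
    info = bilinear(W_info, H_k)   # Eq. (17): new information vs. knowledge base
    # Eq. (19): Sigmoid of (Rel - Red + Info + Rouge)
    return 1.0 / (1.0 + math.exp(-(rel - red + info + rouge)))
```

Note the sign structure: relevance, new information, and Rouge push the weight up, while redundancy pushes it down.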

Step S104: weight each updated sentence feature by the classification weight of the corresponding sentence vector, obtain the corresponding sentence labels based on the weighted sentence features, and generate the text summary from the obtained sentence labels.

Specifically, after the classification weight of each sentence vector is obtained in the preceding step, the corresponding updated sentence features are weighted accordingly to obtain the weighted sentence features; a perceptron classifier then derives a sentence label from each weighted sentence feature, and the text summary is generated from the obtained sentence labels. When the perceptron classifier selects summary sentences, the best N sentences are chosen as the summary, and a trigram-blocking strategy is used to reduce summary redundancy.
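The "triple blocking" strategy is commonly realised as trigram blocking: a candidate sentence is skipped if it shares any word trigram with the sentences already selected. A minimal sketch under that assumption:

```python
def select_summary(sentences, scores, n):
    """Pick the top-n sentences by score, skipping any candidate that
    shares a word trigram with the sentences already chosen."""
    trigrams = lambda toks: {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}
    chosen, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tg = trigrams(sentences[idx].split())
        if tg & seen:
            continue                      # would repeat an existing trigram
        chosen.append(idx)
        seen |= tg
        if len(chosen) == n:
            break
    return sorted(chosen)                 # restore document order
```

Returning the indices in document order keeps the extracted summary readable, since extractive summaries normally preserve the original sentence order.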

To verify the effectiveness of the present invention, the heterogeneous-graph-based text summarization method of the invention is compared experimentally with other methods, and the Rouge evaluation method is used to assess the performance of each summarization method; the Rouge scores are shown in FIG. 5, where KHHGS denotes the heterogeneous-graph-based text summarization method of the invention.

In the comparative experiment, the CNN & Daily Mail data is used as the data set and compared against several other summarization methods, with results shown in FIG. 5. The results show that the KHHGS model (the method of the present application) has clear advantages over the RNN model BiGRU and also leads the Transformer model. The data in the table show that the Transformer score is close to that of the homogeneous-graph method, indicating that a Transformer can indeed be regarded as a fully connected graph at the sentence level. KHHGS also outperforms the previous heterogeneous-graph summarization model HSG, improving the three Rouge indicators by 0.14/0.46/0.97 respectively, which shows that the strategies proposed by the present invention can effectively improve the performance of heterogeneous-graph models on the text summarization task.

To better explain the effectiveness of the proposed strategies on the text summarization task, ablation studies are performed on the model: the external knowledge base, the homogeneous graph, and the summary-indicator computation are each removed from the model for experimental analysis. As shown in FIG. 6, each row of the table gives the experimental result after removing the corresponding module. Adding the knowledge base improves Rouge-2 and Rouge-L to some extent but does not markedly improve Rouge-1, possibly because the knowledge base used by the present invention is a general-purpose one whose effect on news corpora is weak, there being as yet no recognised mature news knowledge base. Adding the homogeneous graph significantly increases all Rouge indicators, especially the Rouge-L value, probably because the homogeneous graph strengthens the connections between sentences, so the model can better exploit inter-sentence relations, which affects the number of longest overlapping substrings in the final extracted summary. In addition, the multi-angle criteria for summaries effectively improve performance. Further experiments examine the influence of the individual indicators on the summary, as shown in FIG. 7: the horizontal axis is the range of an indicator's score and the vertical axis is the probability that a sentence scoring in that range belongs to the reference summary. The figure shows that relevance and the Rouge score strongly influence the summary result; sentences scoring high on these two indicators are very likely to belong to the reference summary. The amount of new information also influences the summary to some degree, though not significantly; since the present invention uses a general knowledge base as the background knowledge for computing new information, this indicator can be regarded as filtering out low-salience general knowledge, similar to removing stop words during data processing, letting the model focus on other key sentences. Redundancy has no obvious influence, possibly because the proposed method operates at the sentence level while the Rouge score used for model evaluation is a summary-level evaluation, so evaluating the redundancy of a single sentence cannot effectively improve the final result of the summarization method.

To show the role of the proposed indicators more intuitively, a test sample is selected to quantify them, as shown in FIG. 8: each row of the table is a sentence of the original text, with the formula-computed normalised scores and total score of each indicator on the right. Following the analysis above, redundancy does not play a key role in this model and is therefore not included in the quantification list. The data in the table show a certain link between sentence length and the relevance indicator: the longer a sentence, the more related it is to the original text and the more information it contains, so the longest sentence in the table, sentence 2, obtains the highest relevance score. It is also easy to see from the content of sentence 2 that its description is very similar to the reference summary, so it obtains a high Rouge score, and sentence 2 finally receives the highest overall score.

In summary, compared with existing common summarization methods, the heterogeneous-graph-based text summarization method proposed by the present invention has a clear advantage in capturing long-distance dependencies; moreover, the multi-angle indicators designed for text summarization make full use of the features of the text data, thereby improving summary quality.

The heterogeneous-graph-based text summarization method provided by the embodiment of the present invention connects the text into a text heterogeneous graph according to semantic and syntactic relations, updates the two types of node features, words and sentences, with a graph attention network, and designs multiple summary-related metrics for weighted evaluation of sentences for the final summary extraction, thus considering not only the information transfer between words and sentences but also the mutual influence between sentences. The additional external knowledge base helps the model understand the text better, and the multi-angle indicators designed for the summarization task add corresponding weights to sentences before classification, effectively improving the model's ability to exploit text features and producing more accurate and more readable summaries. More directly, the embodiment of the present invention builds a heterogeneous graph with sentences and words as two types of nodes and uses word nodes as intermediaries between sentences, enriching inter-sentence associations and passing information indirectly.

Embodiment 2

To solve the above technical problems in the prior art, an embodiment of the present invention provides a heterogeneous-graph-based text summarization apparatus.

FIG. 9 is a schematic structural diagram of the heterogeneous-graph-based text summarization apparatus according to Embodiment 2 of the present invention. Referring to FIG. 9, the apparatus comprises, connected in sequence, a text heterogeneous graph construction module, an update module, a classification weight acquisition module, and a summary generation module.

The text heterogeneous graph construction module is used to fuse the preset knowledge base with the target text, obtain the word features and sentence features of the target text, and construct the text heterogeneous graph of the target text based on the word features and sentence features.

The update module is used to update the text heterogeneous graph through the graph attention network based on the edge weights and attention weights, obtaining the updated text heterogeneous graph.

The classification weight acquisition module is used to compute the multiple summary indicators of the sentence vectors in the updated text heterogeneous graph and to compute the classification weight of each sentence vector from its summary indicators.

The summary generation module is used to weight the sentence features in the updated text heterogeneous graph by the classification weights of the sentence vectors, obtain the corresponding sentence labels based on the weighted sentence features, and generate the text summary from the obtained sentence labels.

The heterogeneous-graph-based text summarization apparatus provided by the embodiment of the present invention connects the text into a text heterogeneous graph according to semantic and syntactic relations, updates the word and sentence node features with a graph attention network, and designs multiple summary-related metrics for weighted evaluation of sentences for the final summary extraction, considering both the information transfer between words and sentences and the mutual influence between sentences. The additional external knowledge base helps the model understand the text better, and the multi-angle indicators designed for the summarization task add corresponding weights to sentences before classification, effectively improving the model's ability to exploit text features and producing more accurate and more readable summaries. More directly, the embodiment of the present invention builds a heterogeneous graph with sentences and words as two types of nodes and uses word nodes as intermediaries between sentences, enriching inter-sentence associations and passing information indirectly.

Embodiment 3

To solve the above technical problems in the prior art, an embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, implements all steps of the heterogeneous graph-based text summarization method of Embodiment 1.

The specific steps of the heterogeneous graph-based text summarization method, and the beneficial effects obtained by applying the readable storage medium provided by this embodiment of the present invention, are the same as in Embodiment 1 and are not repeated here.

It should be noted that the storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Embodiment 4

To solve the above technical problems in the prior art, an embodiment of the present invention further provides a terminal.

FIG. 10 is a schematic structural diagram of the terminal according to Embodiment 4 of the present invention. Referring to FIG. 10, the terminal of this embodiment includes a processor and a memory connected to each other; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that when the terminal runs the program, all steps of the heterogeneous graph-based text summarization method of Embodiment 1 are implemented.

The specific steps of the heterogeneous graph-based text summarization method, and the beneficial effects obtained by the terminal provided by this embodiment of the present invention, are the same as in Embodiment 1 and are not repeated here.

It should be noted that the memory may include a random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. Likewise, the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), or the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Although the embodiments of the present invention are disclosed above, the described content is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the present invention belongs may make modifications and changes in the form and details of the implementation without departing from the spirit and scope disclosed herein; however, the protection scope of the present invention shall remain subject to the scope defined by the appended claims.

Claims (7)

1. A text summarization method based on heterogeneous graphs comprises the following steps:
performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features;
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
respectively weighting sentence features in the updated version text heterogeneous graph based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels;
wherein constructing a text heterogeneous graph of the target text based on the word features and sentence features comprises:
based on sentence features of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all the sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
taking word vectors in the target text as word nodes and sentence vectors in the target text as sentence nodes, constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong,
and the multiclass abstract indexes of the sentence vector comprise a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric score,
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight, wherein the step of acquiring the updated text heterogeneous graph comprises the following steps:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version text heterogeneous graph,
wherein updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on attention weights between every two sentence vectors of all sentence vectors in the target text, attention weights between all word vectors and sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and carrying out weighted aggregation on word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of attention weight between the sentence nodes connected with the central nodes and homogeneous edge weight between the sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
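The graph construction and node-update steps of claim 1 can be illustrated with a small sketch: cosine similarity for the sentence-sentence (homogeneous) edges, TF-IDF for the word-sentence (heterogeneous) edges, and one weighted-aggregation step using the product of attention weight and edge weight as the coefficient. The smoothed IDF variant and the softmax normalisation of the coefficients are assumptions of this sketch; the patent does not fix these details.

```python
import math
from collections import Counter

import numpy as np

def cosine(u, v):
    """Cosine similarity for the homogeneous sentence-sentence edge weight."""
    den = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / den) if den else 0.0

def tfidf_edges(sent_tokens):
    """TF-IDF weights for the heterogeneous word-sentence edges: term
    frequency within the sentence times a smoothed inverse sentence
    frequency (the smoothing is an illustrative choice)."""
    n = len(sent_tokens)
    df = Counter(w for toks in sent_tokens for w in set(toks))
    edges = {}
    for j, toks in enumerate(sent_tokens):
        tf = Counter(toks)
        for w, c in tf.items():
            edges[(w, j)] = (c / len(toks)) * (math.log(n / df[w]) + 1.0)
    return edges

def update_node(center, nbr_feats, attn, edge_w):
    """One graph-attention update step: aggregate neighbor features with the
    product of attention weight and edge weight as (softmax-normalised)
    coefficients, then add the aggregate to the central node."""
    coef = np.asarray(attn) * np.asarray(edge_w)
    coef = np.exp(coef - coef.max())
    coef /= coef.sum()
    return center + coef @ np.asarray(nbr_feats)
```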
2. The method of claim 1, wherein performing knowledge fusion on a preset knowledge base and a target text, and obtaining word features and sentence features of the target text comprises:
respectively encoding and vectorizing knowledge in a preset knowledge base and content in a target text to acquire a knowledge vector in the preset knowledge base and a word vector in the target text;
respectively calculating the attention weight of each word vector in the target text and the attention weight of the knowledge vector in the preset knowledge base to obtain the attention weight of each word vector in the target text;
sequentially taking the attention weight of the word vector in the target text as a weight, and respectively weighting and combining the knowledge vectors in the preset knowledge base to obtain the knowledge weight of each word vector in the target text;
acquiring word features of corresponding word vectors based on knowledge weight of each word vector in the target text;
and respectively performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector in the target text to obtain the local features and the global features of each sentence vector, and respectively obtaining the sentence features of the corresponding sentences according to the local features and the global features of each sentence vector.
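The knowledge-fusion step of claim 2 can be sketched as follows: each word vector attends over the knowledge-base vectors, the knowledge vectors are combined with those attention weights to give the word's knowledge weight, and the combination is folded back into the word feature. Dot-product attention and the additive fusion are assumptions of this sketch; the patent does not fix the attention form.

```python
import numpy as np

def fuse_knowledge(word_vecs, knowledge_vecs):
    """Fuse knowledge-base vectors into word vectors via attention.

    word_vecs      : (n_words, d) encoded word vectors of the target text
    knowledge_vecs : (n_knowledge, d) encoded knowledge-base vectors
    """
    scores = word_vecs @ knowledge_vecs.T            # (n_words, n_knowledge)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # per-word softmax
    knowledge_weight = attn @ knowledge_vecs         # weighted knowledge mix
    return word_vecs + knowledge_weight              # fused word features
```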
3. The method of claim 1, wherein calculating multiple types of summarization indexes of sentence vectors in the updated text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multiple types of summarization indexes corresponding to each sentence vector respectively comprises:
calculating a relevance score, a redundancy score, a new information score and a recall rate evaluation-oriented metric score of each sentence vector in the updated text heterogeneous graph;
and calculating the classification weight of the corresponding sentence vector through a Sigmoid function based on the relevance score, the redundancy score, the new information score and the recall rate evaluation-oriented metric score of each sentence vector.
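A minimal sketch of the Sigmoid combination in claim 3 is shown below. The linear combination and the negative coefficient on the redundancy score (so that redundant sentences are down-weighted) are illustrative assumptions; in the patented method the combination parameters would be learned.

```python
import math

def classification_weight(relevance, redundancy, novelty, rouge,
                          coefs=(1.0, -1.0, 1.0, 1.0), bias=0.0):
    """Combine the four per-sentence metric scores into one classification
    weight in (0, 1) via a sigmoid over a linear combination."""
    z = (coefs[0] * relevance + coefs[1] * redundancy
         + coefs[2] * novelty + coefs[3] * rouge + bias)
    return 1.0 / (1.0 + math.exp(-z))
```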
4. The method of claim 3, wherein the step of calculating relevance scores, redundancy scores, new information scores, and recall-assessment-oriented metric scores for individual sentence vectors in the updated textual heterogeneous graph comprises:
calculating the relevance score of the sentence vector through a bilinear function based on the text characteristics of the updated version text heterogeneous graph and the sentence characteristics of the sentence vector;
calculating a redundancy score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph;
calculating a new information quantity score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph and the knowledge vector in the preset knowledge base;
and calculating the recall-rate evaluation-oriented metric score of the sentence vector through the recall-rate evaluation-oriented metric function based on the target text which is not coded and vectorized and the text content of the sentence vector.
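The bilinear function used for the relevance, redundancy, and new-information scores in claim 4 has the standard form u^T W v, sketched below. How u and v are paired for each score (sentence feature against the graph's text feature, another sentence's feature, or a knowledge vector) follows the claim; the learned matrix W is an assumption of the sketch.

```python
import numpy as np

def bilinear_score(u, v, W):
    """Bilinear form u^T W v: with a learned matrix W, scores a sentence
    feature u against another representation v (e.g. the text feature of
    the whole heterogeneous graph for the relevance score)."""
    return float(u @ W @ v)
```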
5. A text summarization device based on heterogeneous graphs is characterized by comprising:
the text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and building a text heterogeneous graph of the target text based on the word features and the sentence features;
the updating module is used for updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
the classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated version text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
the abstract generating module is used for weighting sentence characteristics in the updated version text heterogeneous graph respectively based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence characteristics and generating a text abstract according to the acquired sentence labels;
wherein constructing a text heterogeneous graph of the target text based on the word features and sentence features comprises:
based on sentence characteristics of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
taking word vectors in the target text as word nodes and sentence vectors in the target text as sentence nodes, constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong,
and the multiclass abstract indexes of the sentence vector comprise a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric score,
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight, and acquiring an updated text heterogeneous graph, wherein the step of acquiring the updated text heterogeneous graph comprises the following steps:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version text heterogeneous graph,
wherein, based on the attention weights between every two sentence vectors of all sentence vectors in the target text, the attention weights between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights, updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between the sentence nodes connected with the central nodes as weight, and performing weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and performing weighted aggregation on the word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of attention weight between the sentence nodes connected with the central nodes and homogeneous edge weight between the sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for text summarization based on heterogeneous maps according to any one of claims 1 to 4.
7. A terminal, comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the text summarization method based on heterogeneous graphs according to any one of claims 1 to 4.
CN202110533278.5A 2021-05-17 2021-05-17 Heterogeneous graph-based text summarization method and device, storage medium and terminal Active CN113127632B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110533278.5A CN113127632B (en) 2021-05-17 2021-05-17 Heterogeneous graph-based text summarization method and device, storage medium and terminal
PCT/CN2021/103504 WO2022241913A1 (en) 2021-05-17 2021-06-30 Heterogeneous graph-based text summarization method and apparatus, storage medium, and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110533278.5A CN113127632B (en) 2021-05-17 2021-05-17 Heterogeneous graph-based text summarization method and device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN113127632A CN113127632A (en) 2021-07-16
CN113127632B 2022-07-26

Family

ID=76783109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110533278.5A Active CN113127632B (en) 2021-05-17 2021-05-17 Heterogeneous graph-based text summarization method and device, storage medium and terminal

Country Status (2)

Country Link
CN (1) CN113127632B (en)
WO (1) WO2022241913A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779233B (en) * 2021-08-25 2025-06-13 上海浦东发展银行股份有限公司 Abstract extraction method, device, computer equipment and readable storage medium
CN113672706B (en) * 2021-08-31 2024-04-26 清华大学苏州汽车研究院(相城) Text abstract extraction method based on attribute heterogeneous network
CN114091429B (en) * 2021-10-15 2024-10-22 山东师范大学 Text abstract generation method and system based on heterogeneous graph neural network
CN113935314B (en) * 2021-10-22 2025-06-20 深圳平安智慧医健科技有限公司 Abstract extraction method, device, terminal equipment and medium based on heterogeneous graph network
CN114581540B (en) * 2022-03-03 2024-07-02 上海人工智能创新中心 Scene task processing method, device, equipment and computer readable storage medium
CN114722192A (en) * 2022-03-11 2022-07-08 内蒙古农业大学 Cycle inspection clue multi-label classification method based on heterogeneous graph neural network
CN114860920B (en) * 2022-04-20 2024-09-13 内蒙古工业大学 Method for generating single language theme abstract based on different composition
CN115906867B (en) * 2022-11-30 2023-10-31 华中师范大学 Test question feature extraction and knowledge point labeling method based on hidden knowledge space mapping
US12124493B2 (en) 2023-02-27 2024-10-22 International Business Machines Corporation Generating key point graphs using directional relation scores
CN116541514A (en) * 2023-04-27 2023-08-04 浙江大学 Knowledge map entity abstract generation method integrating user interaction and low redundancy
CN116450813B (en) * 2023-06-19 2023-09-19 深圳得理科技有限公司 Text key information extraction method, device, equipment and computer storage medium
CN117407488A (en) * 2023-10-17 2024-01-16 安徽大学 A method and system for mining key phrases in government hotline content based on heterogeneous graphs
CN117520995B (en) * 2024-01-03 2024-04-02 中国海洋大学 A method and system for abnormal user detection in network information platform
CN118332101B (en) * 2024-04-03 2025-05-16 中国科学院信息工程研究所 Hierarchical iteration-based long text extraction type abstract generation method and device
CN118069828B (en) * 2024-04-22 2024-06-28 曲阜师范大学 Article recommendation method based on heterogeneous graph and semantic fusion
CN119293239B (en) * 2024-12-09 2025-04-08 阿里云飞天(杭州)云计算技术有限公司 Data classification method and work order classification method
CN119415683B (en) * 2025-01-03 2025-04-11 中国人民解放军国防科技大学 Multi-document abstract generation system and method based on aspect guidance

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110046698A (en) * 2019-04-28 2019-07-23 北京邮电大学 Heterogeneous figure neural network generation method, device, electronic equipment and storage medium
CN112380435A (en) * 2020-11-16 2021-02-19 北京大学 Literature recommendation method and recommendation system based on heterogeneous graph neural network

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US9195941B2 (en) * 2013-04-23 2015-11-24 International Business Machines Corporation Predictive and descriptive analysis on relations graphs with heterogeneous entities
US10055486B1 (en) * 2014-08-05 2018-08-21 Hrl Laboratories, Llc System and method for real world event summarization with microblog data
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
US10824815B2 (en) * 2019-01-02 2020-11-03 Netapp, Inc. Document classification using attention networks
CN110334192B (en) * 2019-07-15 2021-09-24 河北科技师范学院 Text abstract generating method and system, electronic device and storage medium
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111858913A (en) * 2020-07-08 2020-10-30 北京嘀嘀无限科技发展有限公司 A method and system for automatically generating text summaries
CN112084331B (en) * 2020-08-27 2024-09-06 清华大学 Text processing, model training method, device, computer equipment and storage medium
CN112288091B (en) * 2020-10-30 2023-03-21 西南电子技术研究所(中国电子科技集团公司第十研究所) Knowledge inference method based on multi-mode knowledge graph
CN112464657B (en) * 2020-12-07 2022-07-08 上海交通大学 Hybrid text abstract generation method, system, terminal and storage medium
CN112818113A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Automatic text summarization method based on heteromorphic graph network


Also Published As

Publication number Publication date
CN113127632A (en) 2021-07-16
WO2022241913A1 (en) 2022-11-24

Similar Documents

Publication Publication Date Title
CN113127632B (en) Heterogeneous graph-based text summarization method and device, storage medium and terminal
Pei et al. Memory-attended recurrent network for video captioning
CN108717408B (en) A sensitive word real-time monitoring method, electronic equipment, storage medium and system
CN108304911B (en) Knowledge extraction method, system and equipment based on memory neural network
CN114218400B (en) Data lake query system and method based on semantics
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN104298776B (en) Search-engine results optimization system based on LDA models
CN107526799A (en) A kind of knowledge mapping construction method based on deep learning
Li et al. Residual attention-based LSTM for video captioning
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN104169948A (en) Methods, apparatus and products for semantic processing of text
Xie et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb
US11983205B2 (en) Semantic phrasal similarity
CN107391565B (en) Matching method of cross-language hierarchical classification system based on topic model
CN113672693B (en) Tag recommendation method for online question answering platform based on knowledge graph and tag association
Gao et al. Self-attention driven adversarial similarity learning network
CN106033426A (en) Image retrieval method based on latent semantic minimum hash
CN118411572B (en) Small sample image classification method and system based on multi-mode multi-level feature aggregation
Song et al. Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning
CN103473307A (en) Cross-media sparse Hash indexing method
Jia Music emotion classification method based on deep learning and improved attention mechanism
CN117807259A (en) Cross-modal hash retrieval method based on deep learning technology
Dourado et al. Bag of textual graphs (BoTG): A general graph‐based text representation model
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
Xu et al. Short text classification of chinese with label information assisting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant