CN114706972B - An automatic generation method of unsupervised scientific and technological information summaries based on multi-sentence compression - Google Patents
- Publication number: CN114706972B
- Application number: CN202210275509.1A
- Authority
- CN
- China
- Prior art keywords: sentence, node, word, text, nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/345 — Information retrieval of unstructured textual data; Summarisation for human users
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F18/2155 — Generating training patterns; Bootstrap methods characterised by the incorporation of unlabelled data, e.g. semi-supervised techniques
- G06F18/23213 — Clustering techniques with a fixed number of clusters, e.g. K-means clustering
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216 — Parsing using statistical methods
- G06F40/30 — Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method for automatically generating unsupervised scientific and technological intelligence summaries based on multi-sentence compression, and belongs to the technical field of natural language generation. For multi-document text generation in the field of scientific and technological intelligence, source data is first collected by a topic crawler that expands its topic vocabulary using LDA topic similarity. All text paragraphs are then ranked by a text-information value evaluation model built on three indicators: the authority, timeliness, and content relevance of the text. The highest-scoring paragraphs are selected as the source text for the final intelligence report. Finally, an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression generates the intelligence summary automatically. The method addresses two problems: the high demands on data timeliness and authority during data screening for intelligence generation, and the inapplicability of conventional neural-network multi-document generation methods in this field due to the lack of data sets.
Description
Technical Field
The invention relates to a method for automatically generating unsupervised scientific and technological intelligence summaries, in particular to such a method based on multi-sentence compression, and belongs to the technical field of natural language generation.
Background Art
Scientific and technological intelligence work plays a key role in formulating major national science and technology strategies, deploying major science and technology programs, and supporting economic and social development. It contributes to the progress of society, the economy, and science and technology, and is a key component of national science and technology planning.
In the field of scientific and technological intelligence, manually collecting, organizing, and screening valuable text data and manually writing intelligence reports in a big-data environment consumes enormous labor and time. The demand for intelligence is therefore no longer satisfied by the orderly acquisition of information resources or by document-level processing, organization, storage, and retrieval. Instead, deeper information analysis is required, including rapid evaluation and recommendation of data resources, extraction and analysis of knowledge units, multi-dimensional data fusion, fine-grained data analysis, and visualized, computable data presentation and analysis. The goal is to de-duplicate and classify big data, discard the coarse and the false while retaining the fine and the true, and achieve largely automated generation of intelligence summaries.
However, in the era of information explosion, the sources of scientific and technological intelligence are numerous and heterogeneous, and quickly and accurately locating the useful information one needs in a mass of data is a major challenge. The first step toward largely automated intelligence generation is the efficient collection of relevant information. In addition, because timeliness and authority are critical in intelligence research, they must be weighed carefully when selecting literature. Finally, because information from different sources is structurally inconsistent, integrating multiple heterogeneous documents and generating a final report is itself difficult. In summary, the main problems to be solved in automatically generating scientific and technological intelligence summaries are: comprehensive evaluation and recommendation of heterogeneous texts that incorporates factors such as time, and multi-document summarization.
For effective information collection, topic (focused) crawlers are currently among the better approaches. Most researchers combine link-based and content-based crawling strategies, with good results. In the intelligence field, however, material is usually obtained from authoritative think tanks at home and abroad, whose web pages contain relatively few links, so content-based crawling is more applicable. In multi-document summarization research, most recent work first ranks the documents, filters out the top-N most important ones, and then applies neural networks, possibly combined with graph models; some authors also integrate pre-trained models such as BERT. These methods perform well in supervised multi-document summarization. In the field of scientific and technological intelligence, however, the lack of data sets cannot be ignored, and it makes supervised methods practically unusable.
Summary of the Invention
The purpose of the invention is to solve the technical problems of manual collection, screening, and report writing in the field of scientific and technological intelligence, by proposing an automated intelligence-summary generation method that runs through data collection, data screening, and intelligence generation. The method addresses the high demands on data timeliness and authority during data screening, as well as the inapplicability of conventional neural-network multi-document generation methods caused by the lack of data sets in this field.
The innovation of the invention is as follows. For multi-document text generation in the field of scientific and technological intelligence, source data is first collected by a topic crawler based on an LDA (Latent Dirichlet Allocation, a three-layer Bayesian topic model over words, topics, and documents) topic-similarity vocabulary expansion method. All text paragraphs are ranked by a text-information value evaluation model built on three indicators: authority, timeliness, and content relevance. The highest-scoring paragraphs are selected as the source text for the final intelligence report. Finally, an unsupervised multi-document summarization method based on spectral clustering and multi-sentence compression generates the intelligence summary automatically.
The invention is realized through the following technical solution.
A method for automatically generating unsupervised scientific and technological intelligence summaries based on multi-sentence compression comprises the following steps:
Step 1: Crawl text content with a topic crawler based on the LDA topic-similarity vocabulary expansion method to obtain source data.
Starting from a given set of initial keywords, and even when the topic description is insufficient, the crawler uses its own collection of topic-related resources to continually enlarge the corpus, retrain the model in a loop, and refine, expand, and update the topic description, so that the desired content is obtained more comprehensively and accurately.
Step 2: Evaluate and rank the crawled texts by the relevance of their content to the keywords and by the timeliness and authority of their source texts. Select the text of the highest-scoring paragraphs (at most the top 40) as the source text for generating the final intelligence.
Step 3: Use the text obtained in Step 2 as the input of an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression, and obtain the summary.
Beneficial Effects
Compared with the prior art, the method of the invention has the following advantages:
1. The method proposes a text-information evaluation model for paper and patent texts and another for think-tank articles. The models are highly general and apply to all paper and patent texts and all think-tank articles.
2. The method provides automated generation of scientific and technological intelligence summaries from data acquisition through text generation. The topic crawler improves the relevance of the collected data to the topic keywords, reduces redundancy, and makes the acquisition and cleaning stages more efficient. In the generation stage, the combination of spectral clustering and multi-sentence compression improves the quality of unsupervised multi-document summarization.
Brief Description of the Drawings
FIG. 1 is the overall flow chart of the method of the invention;
FIG. 2 is the architecture diagram of the topic crawler module of Step 1 and of Embodiment 1;
FIG. 3 is the flow chart of the text-information value evaluation process of Step 2 and of Embodiment 1;
FIG. 4 is the flow chart of the multi-document summarization algorithm of Step 3 and of Embodiment 1;
FIG. 5 is the flow chart of the multi-sentence compression algorithm used in Step 3.4 and in the multi-document summarization process of Embodiment 1.
Detailed Description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
A method for automatically generating unsupervised scientific and technological intelligence summaries based on multi-sentence compression comprises the following steps:
Step 1: Crawl text content with a topic crawler based on the LDA topic-similarity vocabulary expansion method to obtain source data.
Because only a small number of keywords is given, the content an ordinary crawler retrieves does not fully match what is actually wanted. A topic crawler improves accuracy and broadens the crawling scope while keeping crawling as efficient as possible.
Starting from the given initial keywords, and even when the topic description is insufficient, the crawler uses its own collection of topic-related resources to continually enlarge the corpus, retrain the model in a loop, and refine, expand, and update the topic description, so that the desired content is obtained more comprehensively and accurately.
Specifically, Step 1 comprises the following steps:
Step 1.1: Crawl the result pages for the given initial keywords and extract summaries from the newly added pages as new LDA training corpus.
Step 1.2: Compute word embeddings for the training corpus, for example with a word2vec model.
Step 1.3: Combine the new material with the original corpus and train LDA to obtain new topic documents, which overwrite and update the topic documents of the topic crawler.
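The corpus-expansion loop of Steps 1.1–1.3 can be sketched as follows. This is a deliberately simplified stand-in: a real implementation would retrain an LDA model on the enlarged corpus, whereas here a candidate word's topic affinity is approximated by its co-occurrence with the seed keywords. All function names and data are illustrative, not from the patent.

```python
from collections import Counter

def expand_topic_vocabulary(seed_words, abstracts, top_n=3):
    """Expand a topic vocabulary from crawled abstracts.

    Simplified stand-in for the LDA step: a candidate word is scored
    by how often it co-occurs with seed words in the same abstract.
    """
    seeds = set(seed_words)
    cooccur = Counter()
    for text in abstracts:
        words = set(text.lower().split())
        hits = len(words & seeds)
        if hits == 0:
            continue  # abstract unrelated to the topic; contributes nothing
        for w in words - seeds:
            cooccur[w] += hits
    expansion = [w for w, _ in cooccur.most_common(top_n)]
    return list(seed_words) + expansion

# Toy crawled abstracts: the third is off-topic and is ignored.
abstracts = [
    "hypersonic missile defense radar systems",
    "hypersonic glide vehicle thermal protection",
    "stock market quarterly earnings report",
]
vocab = expand_topic_vocabulary(["hypersonic"], abstracts, top_n=2)
```

The updated `vocab` would then seed the next crawling round, closing the loop described above.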
Step 2: Evaluate and rank the crawled texts by the relevance of their content to the keywords and by the timeliness and authority of their source texts.
The value of text information is usually analyzed from the perspectives of its dissemination source, its dissemination characteristics, and its content. The dissemination source reflects the characteristics of the publisher, including the publication channel and the publisher's authority. Dissemination characteristics reflect the formal features of the dissemination process: only information that spreads widely, deeply, and quickly has the chance to realize its intrinsic value, and typical measures include the number of disseminators, the speed of dissemination, and the depth of the dissemination chain. In addition, information is clearly time-sensitive; outdated information often becomes worthless.
Therefore, a text-information value evaluation model is constructed from three feature dimensions of text information: authority, timeliness, and content relevance.
Specifically, Step 2 comprises the following steps:
Step 2.1: Split all texts into paragraphs. Subsequent computations are performed per paragraph.
Step 2.2: Evaluate the value of papers, patents, and journal articles, as follows:
For paper, patent, and journal texts, the impact factor, the first author's total number of publications and total downloads, the downloads of the text itself, and its citation count serve as authority indicators; the publication time serves as the timeliness indicator; and the similarity between the abstract and the topic vocabulary serves as the content-relevance indicator. Corresponding parameters are set for each indicator, a text-information value evaluation model is constructed, and a comprehensive value score is computed for the text.
Further, the invention proposes a value-score calculation method for paper, patent, and journal texts, comprising the following steps:
First step: Compute the authority x1.
Factors related to the authority x1 include the authority of the journal in which the text is published, the author's authority in the field, and the evaluation of the text by other researchers in the field.
The journal authority x11 is expressed as the ratio of the journal's impact factor to the maximum impact factor over all documents, as shown in Formula 1:
The paper or patent authority x12 is determined by the number of articles the author has published in the field as first author and by the total downloads of the articles published by that author as first author, as shown in Formula 2:
The value of the paper itself, x13, is determined by its downloads and citations, as shown in Formula 3:
Second step: Compute the timeliness x2.
Let μ be the decay coefficient of text-information value over time, and let Δt be the interval between the time the information is acquired and the time it was published. The change of information value over time is computed as in Formula 4:
x2 = e^(-μΔt) (4)
where e is the natural constant.
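Formula 4 can be computed directly. The unit of Δt (days, in the example) is an assumption for illustration; the patent does not fix one.

```python
import math

def timeliness(mu, delta_t):
    """Formula 4: x2 = e^(-mu * delta_t).

    Information value decays exponentially with the interval between
    acquisition time and publication time.
    """
    return math.exp(-mu * delta_t)

# A paragraph published 30 days ago, with an assumed decay
# coefficient of 0.01 per day:
x2 = timeliness(0.01, 30)
```

With μ = 0, value never decays (x2 = 1); larger μ or older texts push x2 toward 0.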
Third step: Compute the content relevance x3.
Specifically, the BM25 algorithm can be used to compute the relevance of the text content. Each word in the topic vocabulary obtained by the topic crawler is treated as a query term qi. For the abstract a of the text, the relevance of each qi to a is computed, and the weighted sum of these scores gives the relevance score Score(Q, a) of the current text with respect to the topic vocabulary, as shown in Formula 5:
where Wi is the weight of the i-th word qi, computed with the TF-IDF algorithm; n is the total number of words in the vocabulary; and R(qi, a) is the relevance of qi to a, computed with Formulas 6 and 7:
where tf_ta is the frequency of term t in a; La is the length of a and Lave is the average length of all texts; k is a positive parameter that normalizes the range of term frequencies in an article; b is a tunable parameter with 0 < b < 1 that determines how strongly document length scales the information content; and K is an intermediate result of the computation.
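Because Formulas 5–7 themselves are not reproduced in the text, the sketch below uses the standard BM25 form, which matches the quantities described (term frequency tf, parameters k and b, intermediate value K, and per-term weights Wi, here IDF-style); the patent's exact weighting may differ.

```python
import math

def bm25_score(query_terms, doc, corpus, k=1.5, b=0.75):
    """Score a tokenized document against topic-vocabulary terms.

    Standard BM25 form consistent with the described quantities:
        K = k * (1 - b + b * L_a / L_ave)
        R(q, a) = tf * (k + 1) / (tf + K)
        Score(Q, a) = sum_i W_i * R(q_i, a)
    """
    n_docs = len(corpus)
    l_ave = sum(len(d) for d in corpus) / n_docs
    K = k * (1 - b + b * len(doc) / l_ave)
    score = 0.0
    for q in query_terms:
        tf = doc.count(q)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if q in d)          # document frequency
        w = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # IDF-style weight
        score += w * tf * (k + 1) / (tf + K)
    return score

# Toy corpus of tokenized abstracts:
corpus = [
    "quantum computing error correction".split(),
    "quantum supremacy benchmark".split(),
    "classical computing history".split(),
]
s = bm25_score(["quantum", "error"], corpus[0], corpus)
```

An abstract containing none of the vocabulary terms scores 0; rarer terms (higher IDF) contribute more.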
Step 2.3: Evaluate the value of think-tank articles, as follows:
For think-tank article texts, the author's number of followers and number of published articles serve as authority indicators, the publication time serves as the timeliness indicator, and the similarity between the article abstract and the topic vocabulary serves as the content-relevance indicator. Corresponding parameters are set for each indicator, and a text-information value evaluation model for think-tank articles is constructed.
The invention proposes a value-score calculation method for think-tank article texts, comprising the following steps:
First step: Compute the authority x1.
Think-tank articles have no download or citation statistics, and there is no quantitative standard for the authority of a think tank, so the author's number of followers and number of published articles serve as the measures of authority, computed with Formulas 8 and 9:
Second step: Compute the timeliness x2.
The computation is the same as in the second step of Step 2.2.
Third step: Compute the content relevance x3.
The computation is the same as in the third step of Step 2.2.
Step 2.4: Compute the information value of the text.
The text-information value is defined as a linear combination of the authority, timeliness, and content-relevance features. Considering the multiplier effect of timeliness, the information value is computed as:
X = [δ1(α1x11 + α2x12 + α3x13) + δ2(βx3)]x2 (10)
where X is the value of this piece of text, and α1, α2, α3, δ1, δ2 are the influence factors of the different features on the text value, chosen according to actual needs. In the invention, α1 = α2 = 0.3, α3 = 0.4, and δ1 = δ2 = 0.5 can be used.
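A sketch of Formula 10 with the parameter values given above. The text does not fix a value for β, so β = 1.0 here is an assumed illustrative default.

```python
def information_value(x11, x12, x13, x3, x2,
                      a1=0.3, a2=0.3, a3=0.4, d1=0.5, d2=0.5, beta=1.0):
    """Formula 10: timeliness x2 multiplies the weighted combination of
    the authority sub-scores (x11, x12, x13) and content relevance x3.

    beta is not fixed in the text; 1.0 is an assumed illustrative value.
    """
    return (d1 * (a1 * x11 + a2 * x12 + a3 * x13) + d2 * (beta * x3)) * x2

# Two paragraphs with identical authority/relevance but different ages:
fresh = information_value(0.8, 0.6, 0.7, 0.9, x2=0.95)
stale = information_value(0.8, 0.6, 0.7, 0.9, x2=0.40)
```

Because timeliness enters as a multiplier, an old paragraph is penalized across all other features at once, which is the "multiplier effect" noted above.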
Step 2.5: Sort the paragraphs by their text-information value scores and select at most the top 40 paragraphs of the ranking as the text data for the subsequent multi-document summarization.
Step 3: Use an unsupervised multi-document summarization model based on spectral clustering and multi-sentence compression to obtain the summary.
Because no annotated data set resembling paper, patent, or think-tank article texts exists for multi-document summarization, the invention proposes an unsupervised machine-learning method based on spectral clustering and multi-sentence compression for summary generation. The method converts the original documents into a sentence graph that accounts for both linguistic and deep representations, applies spectral clustering to obtain several sentence clusters, and finally compresses each cluster to generate the final summary.
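The spectral-clustering stage can be illustrated with a minimal bipartition over a precomputed sentence-similarity matrix. This is a sketch, not the patent's implementation: it uses only the sign of the Fiedler vector to split sentences into two clusters, whereas a full spectral clustering would take k eigenvectors of the Laplacian and run k-means on them.

```python
import numpy as np

def spectral_bipartition(similarity):
    """Split sentences into two clusters from a similarity matrix,
    using the sign of the Fiedler vector (eigenvector of the
    second-smallest eigenvalue of the unnormalized graph Laplacian).
    """
    W = np.asarray(similarity, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W                       # graph Laplacian
    _, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Four sentences forming two obvious groups, {0, 1} and {2, 3},
# with only weak cross-group similarity:
sim = [[1.0, 0.9, 0.1, 0.0],
       [0.9, 1.0, 0.0, 0.1],
       [0.1, 0.0, 1.0, 0.8],
       [0.0, 0.1, 0.8, 1.0]]
labels = spectral_bipartition(sim)
```

Each resulting cluster of sentences is then handed to the multi-sentence compression step to produce one summary sentence.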
具体地,步骤3包括以下步骤:Specifically, step 3 includes the following steps:
步骤3.1:处理文本数据。Step 3.1: Process text data.
对于步骤2最终得到的与一个主题相关的段落集合P={p1,p2,…pn},最终目标是生成一个囊括原始文档中重要信息并且无冗余信息的摘要S。For the set of paragraphs P = {p 1 , p 2 , ... p n } related to a topic finally obtained in step 2, the ultimate goal is to generate a summary S that encompasses the important information in the original document and has no redundant information.
Sentences are the smallest processing unit of the text, and because sentence compression is performed in the last step, all stop words are retained. Concretely, a sentence list is generated (for example with the NLP module of SpaCy, a leading industrial-grade library for NLP tasks) and used as the input to the sentence graph constructed next.
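The invention calls SpaCy's NLP module for sentence splitting; as a dependency-free stand-in, a naive rule-based splitter can sketch the same preprocessing. Per step 3.1, stop words and punctuation are retained:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule: split after ., ! or ? when followed by whitespace
    # and an uppercase letter; nothing is removed from the sentences.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]
```

A production implementation would use SpaCy's sentence segmentation instead; this regex is only a rough approximation for English prose.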
Step 3.2: Build a structured sentence graph whose nodes correspond to the sentences generated in step 3.1, with edges drawn according to the lexical and deep semantic relations between sentences.
The goal of this step is to identify pairwise sentence connections that capture the discourse structure of the paragraph set P. The sentence graph is built on an Approximate Discourse Graph (ADG) combined with deep embedding techniques.
Specifically, a graph G = (V, E) is constructed, where each node vi ∈ V represents a sentence, V is the node set, ei,j ∈ E is the edge between nodes vi and vj, and E is the edge set. Two distinct nodes vi and vj are connected by an edge of value 1 (i.e., ei,j = 1) if the sentences they represent satisfy any of the following relations.
The construction rules of graph G are:
De-verbalized noun reference: In English grammar, when an event or entity is mentioned in a verb phrase, the following sentences usually refer to it with a noun or noun phrase derived from that verb. The noun form of the verb phrase is looked up in WordNet (an English lexical database grounded in cognitive linguistics that arranges words alphabetically and organizes them into a "network of words" by meaning). If the noun form of a sentence's verb phrase appears in a subsequent sentence, the nodes representing the two sentences are connected.
Entity continuation: This accounts for lexical relatedness. If sentences vi and vj contain entities of the same category (e.g., organizations, person names, products), the two nodes are connected.
Discourse markers: If adjacent sentences are semantically related, for example linked by connectives such as however, meanwhile, or furthermore, the nodes representing the two sentences are connected.
Sentence similarity: Each sentence is represented by the average of its word vectors, and the similarity score of two sentences is the cosine similarity of their sentence vectors. If the score reaches a set threshold, the two nodes are judged to be connected.
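The similarity rule can be sketched as follows. The toy dictionary `emb` stands in for real word embeddings, the 0.8 threshold is an illustrative choice rather than a value specified by the patent, and every word is assumed to have an embedding:

```python
import math

def mean_vector(words, emb):
    # Sentence representation: the average of its word vectors.
    vecs = [emb[w] for w in words if w in emb]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def connected(s1, s2, emb, threshold=0.8):
    # Edge rule: connect two sentence nodes when cosine similarity
    # of their averaged word vectors reaches the threshold.
    return cosine(mean_vector(s1, emb), mean_vector(s2, emb)) >= threshold
```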
Step 3.3: Apply graph clustering to partition the graph.
Most existing graph clustering methods identify node groups from the edges connecting them. The present invention instead adopts spectral clustering, as follows:
Step 1: Obtain the Laplacian matrix of the sentence graph built above (computed by subtracting the adjacency matrix from the degree matrix of the graph);
Step 2: Compute the first m eigenvectors of this matrix, which define a feature vector for each sentence;
Step 3: Partition the sentences into m categories with k-means clustering.
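The three steps above map directly onto a few lines of NumPy and scikit-learn; this sketch uses the unnormalized Laplacian L = D − A described in the first step:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(adj: np.ndarray, m: int) -> np.ndarray:
    """Cluster sentences from a symmetric adjacency matrix into m groups."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(lap)     # eigenvalues in ascending order
    feats = vecs[:, :m]                  # first m eigenvectors = sentence features
    return KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(feats)
```

On a graph with two disconnected sentence pairs, the method recovers the two components as the two clusters.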
This yields m sentence categories representing different key information. Multi-sentence compression is then applied to each category's sentence set to obtain m summary sentences; the compression procedure is described in step 3.4.
Step 3.4: Generate a summary from each extracted subgraph.
Multi-sentence compression (MSC) generates one summary sentence from each cluster of semantically related sentences. The classic approach builds a word graph and selects the sentence formed by a shortest path as the summary.
The present invention extends the classic approach with a new implementation, as follows:
Step 1: Build the word graph.
For the sentence set S = {s1, s2, ..., sn}, each word occurring in the sentences is first mapped to a node. Because polysemy is pervasive in natural language, each node is identified by a pair (token, tag), and whenever a recurring word is considered, the word graph is adjusted by the following rules:
For a word that is not a stop word or punctuation and has no candidate node (no (token, tag) pair in the current word graph corresponds to it), create a new node directly.
For a word that is not a stop word or punctuation and has exactly one candidate node, map the word directly onto that candidate node.
For a word that is not a stop word or punctuation and has multiple candidate nodes, map it to the node closest to its context while keeping the word graph acyclic: two identical words in the same sentence must not map to the same node. If no node satisfies these conditions, create a new one.
For stop words and punctuation, map the word onto an existing node with the same context if one exists; otherwise create a new node.
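A simplified sketch of the node-mapping rules above: it merges a repeated (token, tag) onto an existing node, but forces a fresh node for a duplicate within the same sentence so the per-sentence mapping stays injective. The context-closeness tie-breaking for multiple candidates and the stop-word context rule are omitted here:

```python
def build_word_graph(sentences):
    """Map each (token, tag) occurrence in a sentence list to a node id."""
    candidates = {}          # (token, tag) -> list of existing node ids
    mapped = []              # node-id path for each sentence
    next_id = 0
    for sent in sentences:
        used = set()         # ids already taken by this sentence
        path = []
        for token, tag in sent:
            key = (token.lower(), tag)
            # Reuse the first candidate not yet used in this sentence.
            nid = next((c for c in candidates.get(key, []) if c not in used), None)
            if nid is None:  # no reusable candidate: create a node
                nid = next_id
                next_id += 1
                candidates.setdefault(key, []).append(nid)
            used.add(nid)
            path.append(nid)
        mapped.append(path)
    return mapped, next_id
```

A shared word ("the") maps to one node across sentences, while a word repeated inside one sentence gets two nodes, as the acyclicity rule requires.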
The edge weights between nodes account for co-occurrence: the higher the co-occurrence probability of two nodes, the smaller the edge weight. When an edge exists between two nodes that are also connected by multi-hop paths, the edge weight is reinforced, and this reinforcement weakens as the path length grows. This is expressed by formula 11:
where w(ei,j) is the weight of the edge between nodes i and j; freq(i) and freq(j) are the numbers of words mapped to nodes i and j, respectively; and diff(s,i,j) is the distance between the offset positions in sentence s of the words mapped to node i and node j.
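The image of formula 11 is not reproduced in the text. Since the terms described (freq and diff) match the classic Filippova-style MSC edge weight, that form is sketched here as an assumption; the patent's multi-hop reinforcement term is likewise omitted:

```python
def edge_weight(freq_i, freq_j, diffs):
    """Filippova-style weight: smaller values mean stronger association.

    diffs holds the positional offsets diff(s, i, j) > 0 for each sentence
    in which the two words co-occur in order.
    """
    cohesion = sum(1.0 / d for d in diffs)
    if cohesion == 0:
        return float("inf")   # never co-occurring: effectively no edge
    return (freq_i + freq_j) / cohesion / (freq_i * freq_j)
```

Because the shortest-path search minimizes total weight, frequently co-occurring word pairs (small weight) are preferred, matching the description above.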
Step 2: Recall phase. Find the F shortest paths in the word graph; the sentence formed by each path is a candidate answer.
This step is in essence a constrained F-shortest-paths problem, which the present invention solves with Yen's algorithm. The algorithm has two parts: compute the first shortest path P(1), then compute the remaining F-1 shortest paths in turn. To obtain P(i+1), every node on P(i) except the terminal node is treated as a spur node; the shortest path from each spur node to the terminal node is computed and spliced with the prefix of P(i) from the start node to that spur node, forming candidate paths from which the shortest deviation path is taken. The top 100 paths are kept as candidate sentence paths.
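The recall phase amounts to a loopless K-shortest-paths search. NetworkX's `shortest_simple_paths` implements a Yen-style algorithm, so the candidate-path enumeration can be sketched as:

```python
import itertools
import networkx as nx

def k_shortest_paths(G, source, target, k):
    # shortest_simple_paths yields loopless paths in order of total weight.
    gen = nx.shortest_simple_paths(G, source, target, weight="weight")
    return list(itertools.islice(gen, k))
```

In the invention, `source` and `target` would be the dummy start/end nodes of the word graph and k = 100.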
Step 3: Re-rank the candidate answers and select the top-ranked candidate as the final answer.
Specifically, TextRank is used to extract key phrases, and a new score is designed for re-ranking. First, every node updates its score with formula 12 until convergence:
where S(ni) is the score of node ni in the word graph, d is the damping factor (a value of 0.85 may be used), adj(ni) is the set of nodes adjacent to ni, and w(ej,i) is the weight of the edge between nodes nj and ni.
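The image of formula 12 is not reproduced in the text; the standard weighted TextRank update with damping d = 0.85, which matches the terms described (adj(ni) and w(ej,i)), is assumed in this sketch:

```python
def textrank(weights, d=0.85, iters=50):
    """Weighted TextRank: S(i) = (1-d) + d * sum_j w(j,i)/sum_k w(j,k) * S(j).

    weights[j] is a dict {i: w(e_j,i)} of edges leaving node j.
    """
    nodes = list(weights)
    score = {n: 1.0 for n in nodes}
    for _ in range(iters):
        new = {}
        for i in nodes:
            rank = 0.0
            for j in nodes:
                if i in weights[j]:
                    out = sum(weights[j].values())   # total outgoing weight of j
                    rank += weights[j][i] / out * score[j]
            new[i] = (1 - d) + d * rank
        score = new
    return score
```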
Key phrases r are then formed from keyword combinations, with score(r) computed as:
where TextRank(w) is the score of word node w computed by the TextRank algorithm. The denominator is the weighted length length(r) of key phrase r; the score is normalized so as to favor longer phrases.
Finally, the candidate paths obtained in step 2 are re-ranked by combining each path's weighted length with the sum of the key-phrase scores it contains. Based on the key-phrase scores, the final score of each sentence is computed:
where length(c) is the weighted length of sentence c and path(c) is its complete path.
The sentence with the smallest score is selected as the generated summary sentence, and the summary sentences of the m categories are finally concatenated to obtain the complete summary.
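The images of the score formulas are not reproduced in the text. The description (path weight combined with weighted length and covered key-phrase scores, with the lowest score winning) matches a Boudin-Morin style normalization, which is assumed in this sketch:

```python
def final_score(path_weight, length, keyphrase_scores):
    # Assumed form: total path weight normalized by the weighted length
    # times the summed scores of the key phrases the sentence covers.
    return path_weight / (length * sum(keyphrase_scores))

def select_summary(candidates):
    # candidates: (sentence, path_weight, weighted_length, keyphrase_scores);
    # the lowest-scoring candidate becomes the cluster's summary sentence.
    return min(candidates, key=lambda c: final_score(c[1], c[2], c[3]))[0]
```

A lighter path that covers the same key phrases thus scores lower and is preferred, consistent with the shortest-path intuition of step 2.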
Example
This section describes a specific embodiment of the method of the present invention.
The overall implementation flow is shown in Figure 1. The present invention provides the complete pipeline for generating scientific and technological intelligence summaries, from text acquisition through data processing to summary generation. In a concrete implementation, the topic crawler module first runs, retrieving the data needed for analysis according to the user-supplied keyword library; the text-information value evaluation module then analyzes and ranks the retrieved data; finally, the ranking result is fed into the summary generation module to obtain the final result.
First, using the user-supplied keywords, the topic crawler module collects data from Google Scholar, DARPA, IARPA, and the RAND Corporation. Figure 2 shows the data-acquisition flow of the unsupervised scientific and technological intelligence generation method based on sentence-graph clustering. Following step 1 of the present invention, a number of web pages are crawled from the given initial keywords; abstracts extracted from the newly added pages serve as fresh training corpus for LDA; word2vec computes word embeddings over this training corpus; and finally, combined with the original corpus, LDA training yields new topic documents that overwrite and update the topic documents of the original topic crawler.
After the required text data are obtained, their value is evaluated from the text attribute data, as shown in Figure 3. All texts are first split into paragraphs; the authority, timeliness, and content relevance of each text are then computed in turn from attributes such as journal, author, and download count; and these are combined to calculate the value of the text information. Finally, the texts are ranked by this value and the top 40 are selected as the text data for subsequent multi-document summarization.
The last stage is text generation; the flow is shown in Figure 4. The text data are processed first: the paragraphs obtained in the previous step are split into sentences, and the NLP module of the SpaCy library generates the sentence list. An undirected sentence graph is then built by the rules of step 3.2, the graph is clustered into m classes with the spectral clustering method of step 3.3, and a summary is generated for each class by multi-sentence compression. The compression flow, shown in Figure 5, first builds a word graph by the rules of the first step of step 3.4, then finds the 100 shortest paths in the graph with Yen's algorithm, and finally re-ranks them. Re-ranking extracts key phrases with the TextRank algorithm, recomputes sentence scores from the key phrases, and sorts the scores of the 100 paths; the sentence formed by the words on the lowest-scoring path is the summary for that class. The summaries generated for the m classes are finally concatenated into the final summary.
The above is only a preferred embodiment of the present invention, which should not be limited to what is disclosed in the embodiment and the drawings. Any equivalent or modification completed without departing from the spirit of the present disclosure falls within the scope of protection of the present invention.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210275509.1A CN114706972B (en) | 2022-03-21 | 2022-03-21 | An automatic generation method of unsupervised scientific and technological information summaries based on multi-sentence compression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114706972A CN114706972A (en) | 2022-07-05 |
CN114706972B true CN114706972B (en) | 2024-06-18 |
Family
ID=82169773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210275509.1A Active CN114706972B (en) | 2022-03-21 | 2022-03-21 | An automatic generation method of unsupervised scientific and technological information summaries based on multi-sentence compression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114706972B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115687960B (en) * | 2022-12-30 | 2023-07-11 | 中国人民解放军61660部队 | Text clustering method for open source security information |
CN116127321A (en) * | 2023-02-16 | 2023-05-16 | 广东工业大学 | Training method, pushing method and system for ship news pushing model |
CN116541505B (en) * | 2023-07-05 | 2023-09-19 | 华东交通大学 | A dialogue summary generation method based on adaptive dialogue segmentation |
CN117951357A (en) * | 2024-03-25 | 2024-04-30 | 中国标准化研究院 | Dynamic scientific and technological standard monitoring method and system based on big data |
CN118364093B (en) * | 2024-06-19 | 2024-10-01 | 粤港澳大湾区数字经济研究院(福田) | Text-based content recommendation method, system, intelligent terminal and medium |
CN119046477B (en) * | 2024-11-01 | 2025-01-28 | 新瑞数城技术有限公司 | Global knowledge graph management method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106570171A (en) * | 2016-11-03 | 2017-04-19 | 中国电子科技集团公司第二十八研究所 | Semantics-based sci-tech information processing method and system |
CN108399194A (en) * | 2018-01-29 | 2018-08-14 | 中国科学院信息工程研究所 | A kind of Cyberthreat information generation method and system |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11551567B2 (en) * | 2014-08-28 | 2023-01-10 | Ideaphora India Private Limited | System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter |
CN108959312B (en) * | 2017-05-23 | 2021-01-29 | 华为技术有限公司 | Method, device and terminal for generating multi-document abstract |
MY189086A (en) * | 2018-11-14 | 2022-01-25 | Mimos Berhad | System and method for dynamic entity sentiment analysis |
CN111435375A (en) * | 2018-12-25 | 2020-07-21 | 南京知常容信息技术有限公司 | Threat information automatic labeling method based on FastText |
US11874882B2 (en) * | 2019-07-02 | 2024-01-16 | Microsoft Technology Licensing, Llc | Extracting key phrase candidates from documents and producing topical authority ranking |
CN111177365B (en) * | 2019-12-20 | 2022-08-02 | 山东科技大学 | Unsupervised automatic abstract extraction method based on graph model |
CN111782810A (en) * | 2020-06-30 | 2020-10-16 | 湖南大学 | Text abstract generation method based on theme enhancement |
CN112632228A (en) * | 2020-12-30 | 2021-04-09 | 深圳供电局有限公司 | Text mining-based auxiliary bid evaluation method and system |
CN113743113A (en) * | 2021-09-01 | 2021-12-03 | 武汉长江通信产业集团股份有限公司 | Emotion abstract extraction method based on TextRank and deep neural network |
CN114117035A (en) * | 2021-11-25 | 2022-03-01 | 北京航空航天大学 | Unsupervised cantonese forum extraction type abstract method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
GR01 | Patent grant | | |