
CN118152559A - News element extraction method based on large model - Google Patents

News element extraction method based on large model

Info

Publication number
CN118152559A
Authority
CN
China
Prior art keywords
news
elements
model
format
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311715799.8A
Other languages
Chinese (zh)
Inventor
李丕绩
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202311715799.8A priority Critical patent/CN118152559A/en
Publication of CN118152559A publication Critical patent/CN118152559A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a news element extraction method based on a large model. By annotating the 5W1H information in news, a high-quality dataset for news element extraction is constructed and used to train a news element extraction model. Annotators extract the elements from each news item, and a training format is then constructed from the extraction question, the original text, and the extracted elements. The models LLaMA, Vicuna, and Guanaco are selected and trained on data annotated in different domains so as to learn extraction capability for each news domain. A scoring selection module computes ROUGE and BLEU scores between the elements the model extracts from the news and the elements labeled by the annotators to judge the extraction effect. Experiments on the CNN/DailyMail dataset yield higher scores than ChatGPT. Addressing the lack of high-quality manually annotated 5W1H datasets, the learning capability of large models is further exploited by constructing high-quality data, thereby improving the level of news element extraction.

Description

A method for extracting news elements based on large models

Technical Field

The invention relates to a method for extracting news elements based on a large model, and belongs to the field of natural language processing within computer science.

Background Art

The extraction of news elements (5W1H) is very important for event extraction and text summarization tasks. These questions cover the most basic information that people care about in a thing or event; by answering them, the full course of what happened can be understood. How to guarantee the accuracy of the extracted content while understanding the news has long been a challenging problem. Most previous studies extracted from the original text using semantic and grammatical rules; such extraction not only costs a great deal of time but also misses information in the article. With the continuous development of deep learning, model performance has improved greatly. In particular, the emergence of ChatGPT has brought new directions for extraction tasks. Although ChatGPT shows strong capability in semantic understanding and information extraction, for long texts such as news the content extracted by large models often suffers from factual inconsistency and omission of important information. It is therefore especially difficult to extract news elements relying solely on the performance of the large model itself.

Recent studies have shown that fine-tuning large models on manually annotated domain-specific datasets can improve performance, so the quality and quantity of the dataset are especially important for the fine-tuning task. However, there is currently no high-quality dataset for news element extraction. We therefore selected 3,500 news items in total from datasets in four different domains, manually annotated the 5W1H elements in each item to build a dataset suitable for news element extraction, and efficiently fine-tuned a large model on the annotated data; the resulting model is applicable to news element extraction tasks across different domains.

Summary of the Invention

The present invention aims to solve the following technical problems:

The present invention addresses the following major difficulties in news element extraction, so as to extract comprehensive and accurate event elements from news:

- High-quality datasets: existing news element extraction datasets are missing important elements or are too small in number, and there is no high-quality dataset that covers 5W1H news element extraction.

- Consistency: the extracted content must be consistent with what the original news text states; there must be no misstatement of news facts.

- Comprehensiveness: the extracted elements must cover all the content of the news event; for example, if the original text gives multiple reasons for an event, none may be omitted.

To solve these technical problems, the present invention adopts the following technical solution:

A method for extracting news elements based on a large model comprises the following steps:

1) Construct a dataset comprising multiple news items; annotators classify each item and extract from it all sentences or phrases related to 5W1H as the extracted elements, where 5W1H refers to the reason Why, the object What, the place Where, the time When, the person Who, and the method How;

2) The elements extracted in step 1) serve as labels for model training, and the fine-tuning instruction dataset format is constructed;

3) The constructed instruction dataset is fed into the large model for training, so that it learns the ability to extract news elements.

Preferably, the steps of constructing the fine-tuning instruction dataset format in step 2) are:

First, construct the required question template:

Question:"Below is an instruction that describes a task."即问题:“下面是描述任务的指令;”Question: "Below is an instruction that describes a task."

"Write a response that appropriately completes the request."即“写一个适当完成请求的响应;”"Write a response that appropriately completes the request."

"Instruction:please extract what when where why who and how from thenews."即“说明:请从新闻中抽取对象、时间、地点、原因、人员和如方法;”"Instruction: please extract what when where why who and how from the news."

Then the news is kept in the format text: <news content>; finally, the extracted elements are saved in JSON, with Response: "output": {elements} as the format of the model output. The resulting instruction dataset format is:

{"input": <Question>, "text": <news>, "Response": <"output": {elements}>}.

Preferably, in step 3), the constructed dataset is fed into the large model for training; here we select three large models for training on the annotated data, namely LLaMA, Vicuna, and Guanaco. The dataset of multiple news items from step 1) is then divided into training, test, and validation sets in a ratio of 8:1:1, and QLoRA is used to fine-tune the large model during training;

In the inference stage, to speed up inference, the CT2 (CTranslate2) framework is used; int8 quantization slightly sacrifices model quality but further reduces latency;

In the inference stage, the Top-p sampling algorithm is used to generate sentences: during decoding, random sampling is performed only over the smallest set of words whose cumulative probability exceeds the threshold p, where p is 0.9. At the same time, max_token is set to 2000, so that the large model outputs as many elements extracted from the news as possible.

Preferably, ROUGE and BLEU are computed between the elements extracted by the LLaMA, Vicuna, and Guanaco models and the elements labeled by the annotators; higher ROUGE and BLEU scores indicate closer agreement with the annotated sentences and a better extraction effect.

Compared with the prior art, the above technical solution of the present invention has the following beneficial effects:

- Addressing the lack of high-quality datasets for news element extraction, the present invention manually annotates news elements in different domains and, combined with the learning ability of efficient large-model fine-tuning, trains the model to extract news elements in each domain, thereby completing the news element extraction task well.

- The present invention verifies that a large model fine-tuned on manually annotated data is stronger than ChatGPT at extracting news elements.

- The present invention tests transfer across datasets from four different domains; the extraction ability also shows in news from different domains, so we believe it can be extended to other news domains.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the overall pipeline of the proposed news element extraction framework.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

The news element extraction method proposed in the present invention consists of three modules; the overall architecture of the method is shown in Figure 1.

(1) Dataset construction module

We used four public news datasets (CNN/DailyMail, XSum, NYT, RA-MADS) as the annotation sources, selecting 1,000 items from each of the first three datasets and 450 from the last, 3,500 items in total. To make annotation easier, the selected news items are around 550 words long.

Annotators first judge which category a news item belongs to. We divide news roughly into the following six categories: ① accidents and natural disasters, ② attacks (criminal/terrorist), ③ new technologies, ④ health and safety, ⑤ endangered resources, and ⑥ investigations and trials (criminal/legal/other).

Based on the content of the article, annotators extract from the original text all sentences or phrases related to 5W1H. To ensure accuracy, all extracted elements are taken directly from the original text without any semantic summarization. Second, each sentence that is to be extracted must correspond to exactly one element: if a span is labeled What, it cannot be labeled as anything else, which avoids ambiguity. Likewise, to avoid redundant information within a sentence, we extract only a fragment of the sentence; for example, extracting a time or place only requires finding the relevant span in the sentence rather than the whole sentence. During extraction, some kinds of news are hard to assign to a specific element; to recognize such elements, we define them broadly according to the meaning of the original text.

(2) Model training and inference module

The extracted elements produced by the dataset construction module serve as the labels for model training. To follow the large-model fine-tuning paradigm, a fine-tuning instruction dataset format must be constructed.

First, we construct the required question template; we designed the following:

-Question:"Below is an instruction that describes a task."-Question: "Below is an instruction that describes a task."

"Write a response that appropriately completes the request.""Write a response that appropriately completes the request."

"Instruction:please extract what when where why who and how from thenews.""Instruction: please extract what when where why who and how from the news."

To simplify collecting the model's output, we do not ask about each element to be extracted in a dialogue; instead we directly have the model extract all the elements at once. Subsequent experiments show that the model achieves the extraction effect with this style of questioning as well.

Next comes the original input text. Due to constraints on compute and time, we truncate each news article to 750 words. The retained news is kept in the format text: <news content>.

Finally come the extracted elements: they are saved in JSON, with Response: "output": {elements} as the format of the output the model learns. The final instruction dataset format is as follows:

-{"input": <Question>, "text": <news>, "Response": <"output": {elements}>}
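As an illustration, the record construction can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper `truncate_words`, the example news string, and the example element dictionary are hypothetical and not part of the disclosed method.

```python
import json

# The question template described above, concatenated into one prompt string.
QUESTION = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "Instruction: please extract what when where why who and how from the news."
)

def truncate_words(text: str, limit: int = 750) -> str:
    """Keep at most `limit` whitespace-separated words, matching the truncation above."""
    return " ".join(text.split()[:limit])

def build_record(news_text: str, elements: dict) -> dict:
    """Assemble one fine-tuning record in the {input, text, Response} format."""
    return {
        "input": QUESTION,
        "text": truncate_words(news_text),
        "Response": json.dumps({"output": elements}, ensure_ascii=False),
    }

# Hypothetical annotated example.
record = build_record(
    "A magnitude 6.0 earthquake struck the coastal city early Monday ...",
    {"What": "a magnitude 6.0 earthquake", "When": "early Monday",
     "Where": "the coastal city", "Who": "residents", "Why": "", "How": ""},
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```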

1) Fine-tuning the large models

The constructed dataset is fed into large models for training so that they learn to extract news elements. We selected three large language models: LLaMA, Vicuna, and Guanaco.

LLaMA is an open and efficient large foundation language model released by Meta AI; its full training dataset contains roughly 1.4T tokens after tokenization, all drawn from public sources. Vicuna is obtained by supervised fine-tuning of LLaMA-13B on 70K user conversations collected from ShareGPT.com. Guanaco is an advanced instruction-following language model built on Meta's LLaMA-7B.

We divide the annotated dataset into training, test, and validation sets in a ratio of 8:1:1. The RA-MADS dataset has only 450 items, so instead of splitting it we add it to the other three datasets during training to strengthen the training effect. We use QLoRA to fine-tune the models: QLoRA quantizes the pretrained model to 4-bit with a new high-precision technique, greatly reducing memory requirements while preserving the model's predictive performance, which has made it a preferred method for fine-tuning large models.
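The QLoRA setup described here can be sketched with the Hugging Face transformers/peft/bitsandbytes stack; the checkpoint name, LoRA rank, and target modules below are illustrative assumptions rather than values disclosed by the method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "huggyllama/llama-7b"  # assumed base checkpoint (LLaMA/Vicuna/Guanaco in the method)

# QLoRA: load the frozen base model quantized to 4-bit.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Train only low-rank adapters on top of the quantized, frozen weights.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed hyperparameters
    target_modules=["q_proj", "v_proj"],      # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```

From here the instruction records can be fed to any standard causal-LM trainer.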

In the inference stage, to speed up inference we use the CT2 (CTranslate2) framework: the LoRA weights of the fine-tuned model are merged with the base model weights and converted into CT2 format, and int8 quantization further reduces latency at the cost of a slight loss in output quality, without harming the model's inference performance.
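A sketch of this merge-and-convert flow, assuming the CTranslate2 toolchain (paths and checkpoint names are placeholders):

```python
# Merge the trained LoRA weights into the base model, then convert to CT2 format.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "out/lora-5w1h").merge_and_unload()
merged.save_pretrained("out/merged-5w1h")
AutoTokenizer.from_pretrained("huggyllama/llama-7b").save_pretrained("out/merged-5w1h")

# Shell step (CTranslate2's converter CLI), quantizing weights to int8:
#   ct2-transformers-converter --model out/merged-5w1h \
#       --output_dir out/ct2-5w1h --quantization int8

import ctranslate2
import transformers

generator = ctranslate2.Generator("out/ct2-5w1h", device="cuda", compute_type="int8")
tok = transformers.AutoTokenizer.from_pretrained("out/merged-5w1h")

prompt = "Below is an instruction that describes a task. ... text: <news> ..."
tokens = tok.convert_ids_to_tokens(tok.encode(prompt))
results = generator.generate_batch(
    [tokens],
    sampling_topp=0.9,  # nucleus sampling threshold, as described below
    max_length=2000,    # matches the max_token setting
)
print(tok.decode(tok.convert_tokens_to_ids(results[0].sequences_ids[0])))
```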

Large-model generation is non-deterministic; in the inference stage we generate with the Top-p sampling algorithm. During decoding, random sampling is performed only over the smallest set of words whose cumulative probability exceeds a threshold p, ignoring the remaining low-probability words: the algorithm focuses on the core of the probability distribution and ignores its tail. Here we set p to 0.9, so a word is chosen only from the smallest word set whose cumulative probability reaches 0.9, and words outside that set are not considered. This avoids sampling inappropriate or irrelevant words while still retaining some creative ones. We also set max_token to 2000, so that the model outputs as much element information extracted from the original text as possible.
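To make the decoding rule concrete, here is a self-contained sketch of a single Top-p sampling step over a toy next-token distribution (an illustration of the algorithm, not the framework's internal code):

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample one token id from the smallest set of tokens whose
    cumulative probability reaches p (nucleus sampling)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]           # token ids sorted by probability, descending
    cum = np.cumsum(probs[order])
    keep = cum < p
    keep[np.argmax(cum >= p)] = True          # include the token that crosses the threshold
    nucleus = order[keep]
    renorm = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=renorm))

# Toy distribution over a 5-token vocabulary: only tokens 0-2 fall in the 0.9 nucleus.
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
print(top_p_sample(probs, p=0.9))
```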

Even with Top-p sampling, generated answers can deviate from the training format. Our strategy is to search the generated result and split the 5W1H elements by the generated marker words: for example, content belonging to the What part is preceded by "What" followed by a colon, indicating that the span extracted from the original text belongs to What. In this way each element can be separated out. If no such marker appears anywhere in the answer, the model output is considered invalid and is regenerated. Finally, the results are saved to a JSON file.
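The keyword-based post-processing can be sketched as follows; the regular expression and the retry convention are illustrative assumptions:

```python
import json
import re
from typing import Optional

KEYS = ["What", "When", "Where", "Why", "Who", "How"]
MARKER = re.compile(rf"({'|'.join(KEYS)})\s*:", re.IGNORECASE)

def parse_elements(generated: str) -> Optional[dict]:
    """Split the generated text on 'What:', 'When:', ... markers.
    Returns None when no marker is found, so the caller can regenerate."""
    pieces = MARKER.split(generated)
    if len(pieces) < 3:  # no recognizable marker anywhere in the answer
        return None
    # pieces looks like [prefix, key1, value1, key2, value2, ...]
    return {
        key.capitalize(): value.strip().strip(",;")
        for key, value in zip(pieces[1::2], pieces[2::2])
    }

out = parse_elements("What: a 6.0 earthquake, When: early Monday, Where: the coast")
if out is None:
    pass  # invalid output: regenerate with the model, as described above
else:
    print(json.dumps({"output": out}, ensure_ascii=False))  # persist to the JSON file
```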

2) Zero-shot and few-shot

We tested the ability of ChatGPT and GPT-4 to extract news elements in zero-shot and few-shot settings. The experimental results show that the fine-tuned model surpasses ChatGPT in both zero-shot and few-shot settings and approaches GPT-4 with 5 examples.

(3) Scoring selection module

ROUGE and BLEU are computed between the news elements extracted by the model and the elements labeled by the annotators; the higher the result, the closer the output is to the annotated sentences and the better the effect.

ROUGE is a family of metrics for automatically evaluating text generation tasks. It focuses on the overlap between the generated text and the reference text, measuring their similarity by recall. Concretely, ROUGE computes similarity from the words, phrases, and sequences shared by the generated and reference texts; its main variants are ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-N measures the overlap between the N-grams (sequences of N adjacent words) of the automatically generated output and those of the reference summaries. The formula is:

$$\mathrm{ROUGE\mbox{-}N}=\frac{\sum_{s\in S}\sum_{n\in s}\mathrm{Count}_{\mathrm{match}}(n,s)}{\sum_{s\in S}\sum_{n\in s}\mathrm{Count}(n)}$$

where S is the set of reference summaries, n ranges over the N-grams of a reference summary s, Count_match(n, s) is the number of N-grams n co-occurring in the generated output and the reference summary s, and Count(n) is the number of times the N-gram n occurs in the reference summaries.

ROUGE-L measures the longest common subsequence (LCS) between the automatically generated summary and the reference summary. With LCS(s, r) the length of the longest common subsequence between a generated summary s and a reference summary r, |s| the length of s, and |r| the length of r, ROUGE-L is the LCS-based F-measure:

$$R_{\mathrm{lcs}}=\frac{\mathrm{LCS}(s,r)}{|r|},\qquad P_{\mathrm{lcs}}=\frac{\mathrm{LCS}(s,r)}{|s|},\qquad \mathrm{ROUGE\mbox{-}L}=\frac{(1+\beta^{2})\,R_{\mathrm{lcs}}P_{\mathrm{lcs}}}{R_{\mathrm{lcs}}+\beta^{2}P_{\mathrm{lcs}}}$$

BLEU was first used for machine translation tasks, to evaluate how reasonable the translated sentences are. Concretely, BLEU is computed by measuring the overlap between the generated sequence and the reference sequence:

$$\mathrm{BLEU}=BP\cdot\exp\Big(\sum_{n=1}^{N}w_{n}\log p_{n}\Big),\qquad BP=\begin{cases}1 & l_{c}>l_{s}\\ e^{\,1-l_{s}/l_{c}} & l_{c}\le l_{s}\end{cases}$$

where p_n is the n-gram precision, w_n the corresponding weight, l_c the length of the machine-translated candidate, and l_s the effective length of the reference answer. When multiple reference translations exist, the reference length closest to the candidate is chosen; when the candidate is longer than the reference, the brevity penalty BP is 1.

The original BLEU system uses uniform weights (w_n = 1/N) with N capped at 4, i.e., precision is computed for at most 4-grams.

Among the resulting BLEU and ROUGE scores, the model with the higher scores is selected as the baseline model for extracting elements from news in other domains.
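A minimal scoring sketch, assuming the `rouge-score` and `nltk` packages (the method specifies only the ROUGE and BLEU metrics, not these libraries):

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def score_pair(reference: str, hypothesis: str) -> dict:
    """Score one extracted element string against the annotator's label."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {k: v.fmeasure for k, v in scorer.score(reference, hypothesis).items()}
    # BLEU with uniform 1..4-gram weights, as in the original BLEU system.
    bleu = sentence_bleu(
        [reference.split()], hypothesis.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,  # smoothing for short strings
    )
    return {**rouge, "bleu": bleu}

print(score_pair("a magnitude 6.0 earthquake struck early Monday",
                 "a 6.0 earthquake struck early Monday"))
```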

The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (4)

1. A method for extracting news elements based on a large model, which is characterized by comprising the following steps:
1) Constructing a dataset comprising a plurality of news items, classifying each news item by an annotator, and extracting from each news item all sentences or phrases related to 5W1H as the extracted elements, wherein 5W1H refers to the reason Why, the object What, the place Where, the time When, the person Who, and the method How;
2) The elements extracted in step 1) are used as labels for model training, and a fine-tuning instruction dataset format is constructed;
3) Putting the constructed instruction dataset into a large model for training, to learn the capability of extracting news elements.
2. The method of claim 1, wherein the step 2) of constructing the fine-tuning instruction dataset format is:
Firstly, constructing the required question template:
Question: "Below is an instruction that describes a task:" i.e. problem: "following are instructions describing tasks; "
"Write a response that appropriately completes the request" i.e. "write a response to the appropriate completion request; "
"Construction: simple extract WHAT WHEN WHERE WHY who and how from the news," i.e. "description: please extract objects, time, place, reason, personnel and method from news; "
Then the news is kept in the format text: <news content>; finally, the extracted elements are stored in JSON format, with Response: "output": {elements} as the format of the model output result, and the finally obtained instruction dataset format is:
{ "input": < Question >, "text": < news >, "Response": < "output": { element }.
3. The method of large-model-based news element extraction as claimed in claim 2, wherein in step 3), the constructed dataset format is put into a large model for training; here three large models are selected for training on the annotated data, namely LLaMA, Vicuna and Guanaco; the dataset comprising a plurality of news items in step 1) is then divided into a training set, a test set and a validation set in a ratio of 8:1:1; QLoRA is used in the training process to fine-tune the large model;
In order to accelerate inference, a CT2 framework is used in the reasoning stage, and time consumption is further reduced through int8 quantization;
In the reasoning stage, a Top-p sampling algorithm is adopted to generate sentences; in the decoding process, random sampling is only carried out from the minimum word set whose cumulative probability exceeds a threshold p, wherein the threshold p is 0.9; at the same time, max_token is set to 2000.
4. The method of large-model-based news element extraction according to claim 3, wherein ROUGE and BLEU are calculated between the elements extracted by the large models LLaMA, Vicuna and Guanaco and the elements marked by the annotator; higher ROUGE and BLEU results indicate closer agreement with the annotated sentences or phrases and a better effect.
CN202311715799.8A 2023-12-14 2023-12-14 News element extraction method based on large model Pending CN118152559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311715799.8A CN118152559A (en) 2023-12-14 2023-12-14 News element extraction method based on large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311715799.8A CN118152559A (en) 2023-12-14 2023-12-14 News element extraction method based on large model

Publications (1)

Publication Number Publication Date
CN118152559A true CN118152559A (en) 2024-06-07

Family

ID=91289328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311715799.8A Pending CN118152559A (en) 2023-12-14 2023-12-14 News element extraction method based on large model

Country Status (1)

Country Link
CN (1) CN118152559A (en)

Citations (6)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5365463A (en) * 1990-12-21 1994-11-15 International Business Machines Corporation Method for evaluating the timing of digital machines with statistical variability in their delays
CN112307336A (en) * 2020-10-30 2021-02-02 中国平安人寿保险股份有限公司 Hotspot information mining and previewing method and device, computer equipment and storage medium
US20220035877A1 (en) * 2021-10-19 2022-02-03 Intel Corporation Hardware-aware machine learning model search mechanisms
WO2023103308A1 (en) * 2021-12-07 2023-06-15 苏州浪潮智能科技有限公司 Model training method and apparatus, text prediction method and apparatus, and electronic device and medium
CN117033603A (en) * 2023-08-28 2023-11-10 北京易华录信息技术股份有限公司 Construction method, device, equipment and storage medium of large model in vertical field
CN116976640A (en) * 2023-08-30 2023-10-31 中电科东方通信集团有限公司 Automatic service generation method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Junfeng (李俊峰): "A news clustering similarity calculation method with multi-feature fusion", Software (软件), no. 12, 15 December 2017 (2017-12-15) *

Similar Documents

Publication Publication Date Title
US11449556B2 (en) Responding to user queries by context-based intelligent agents
US12222970B2 (en) Generative event extraction method based on ontology guidance
US20220309357A1 (en) Knowledge graph (kg) construction method for eventuality prediction and eventuality prediction method
CN103605492B (en) A kind of self adaptation speech training method and platform
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN101609672B (en) Speech recognition semantic confidence feature extraction method and device
CN112905736B (en) An unsupervised text sentiment analysis method based on quantum theory
CN116127056A (en) A Multi-Level Feature Enhanced Method for Medical Dialogue Summarization
CN116342167A (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN114756663A (en) A kind of intelligent question answering method, system, device and computer readable storage medium
CN117992614A (en) A method, device, equipment and medium for sentiment classification of Chinese online course reviews
Briscoe et al. Automated assessment of ESOL free text examinations
Cao et al. 5W1H Extraction With Large Language Models
CN111723583B (en) Statement processing method, device, equipment and storage medium based on intention role
CN118468964A (en) A method, system, device and storage medium for accelerating reasoning of a language model
CN118152559A (en) News element extraction method based on large model
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
Murugathas et al. Domain specific question & answer generation in tamil
CN116205220A (en) Method, system, equipment and medium for extracting trigger words and argument
CN115688799A (en) A Chinese self-supervised word meaning understanding method and system
CN115292492A (en) Method, device and equipment for training intention classification model and storage medium
CN114548108A (en) Multi-feature-fused power scheduling text entity identification method and device
Wu et al. $\rho $-hot Lexicon Embedding-based Two-level LSTM for Sentiment Analysis
CN114357964A (en) Subjective question scoring method, model training method, computer equipment and storage medium
CN109325225B (en) A general association-based part-of-speech tagging method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination