CN118535682A

CN118535682A - A retrieval enhancement method combining keyword extraction and semantic analysis

Info

Publication number: CN118535682A
Application number: CN202410745468.7A
Authority: CN
Inventors: 刘晓玉
Original assignee: Xiaoduo Intelligent Technology Beijing Co ltd
Current assignee: Xiaoduo Intelligent Technology Beijing Co ltd
Priority date: 2024-06-11
Filing date: 2024-06-11
Publication date: 2024-08-23

Abstract

The present invention provides a retrieval enhancement method combining keyword extraction and semantic analysis, including: receiving user query information and converting the query information into text data; using a preset keyword extraction model and a semantic analysis model to screen out key words of the text data; inputting the key words into a pre-trained natural language processing model as training samples to determine a set of candidate key sentences; based on a vector retrieval module, returning the knowledge information most relevant to the text data from a vector database by calculating cosine similarity; answering questions in the text data according to the knowledge information to obtain answers. The present invention can more accurately understand the information queried by the user, which greatly improves the retrieval efficiency and accuracy of the information, thereby better meeting the needs of the user.

Description

A retrieval enhancement method combining keyword extraction and semantic analysis

Technical Field

The present invention relates to the technical field of information retrieval, and in particular to a retrieval enhancement method combining keyword extraction and semantic analysis.

Background Art

With the rapid development of the Internet, the amount of information has grown explosively. However, traditional search engines cannot meet users' diverse needs or their expectations of a personalized experience. A technology that improves the diversity, accuracy, and personalization of retrieval results is therefore needed to meet this challenge.

Keyword extraction is an important task in natural language processing. It aims to extract representative words or phrases from text data as keywords for query matching in order to retrieve relevant knowledge. Commonly used methods include term-frequency-based statistical methods such as the BM25 algorithm and deep learning models such as BERT-based keyword extraction models. However, these methods often consider only the importance of individual words or phrases and ignore the relationships between them, so the extracted keywords may be incomplete or inaccurate. In addition, because the field of government question answering involves a large amount of professional knowledge and terminology, existing keyword extraction methods adapt poorly to it. A new technical solution is therefore needed to solve these problems.

Summary of the Invention

In order to overcome the deficiencies of the prior art, the object of the present invention is to provide a retrieval enhancement method combining keyword extraction and semantic analysis.

To achieve the above object, the present invention provides the following solution:

A retrieval enhancement method combining keyword extraction and semantic analysis, comprising:

receiving query information from a user and converting the query information into text data;

filtering out key words of the text data using a preset keyword extraction model and a semantic analysis model;

inputting the key words as training samples into a pre-trained natural language processing model to determine a set of candidate key sentences;

returning, by a vector retrieval module, the knowledge information most relevant to the text data from a vector database by calculating cosine similarity;

answering the questions in the text data according to the knowledge information to obtain answers.

Preferably, the method for constructing the keyword extraction model includes:

constructing a first fine-tuning dataset, the first fine-tuning dataset including questions raised by users and the keywords corresponding to the questions;

fine-tuning the pre-trained natural language processing model on the first fine-tuning dataset to obtain the keyword extraction model.

Preferably, the method for constructing the semantic analysis model includes:

constructing a second fine-tuning dataset, the second fine-tuning dataset including questions raised by users and the true meanings of the questions;

fine-tuning the pre-trained natural language processing model on the second fine-tuning dataset to obtain the semantic analysis model.

Preferably, inputting the key words as training samples into the pre-trained natural language processing model to determine the set of candidate key sentences includes:

classifying the key words into three types of questions by manual annotation, the three types being: correct answers, wrong answers, and repeated answers;

fine-tuning the pre-trained natural language processing model separately on the key words corresponding to each of the three types of questions, so that each key word shows a higher accuracy rate in its corresponding category;

calculating a relevance score for each key word using a statistical method, sorting by relevance score, and selecting the top 3% most relevant key words as the set of candidate key sentences.

According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:

The present invention provides a retrieval enhancement method combining keyword extraction and semantic analysis, including: receiving query information from a user and converting the query information into text data; filtering out key words of the text data using a preset keyword extraction model and a semantic analysis model; inputting the key words as training samples into a pre-trained natural language processing model to determine a set of candidate key sentences; returning, by a vector retrieval module, the knowledge information most relevant to the text data from a vector database by calculating cosine similarity; and answering the questions in the text data according to the knowledge information to obtain answers. The present invention can understand the information queried by the user more accurately, which greatly improves the efficiency and accuracy of information retrieval and thereby better meets the needs of the user.

Brief Description of the Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

The purpose of the present invention is to provide a retrieval enhancement method combining keyword extraction and semantic analysis that can understand the information queried by users more accurately, greatly improving the efficiency and accuracy of information retrieval and thereby better meeting the needs of users.

In order to make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flow chart of a method provided by an embodiment of the present invention. As shown in FIG. 1, the present invention provides a retrieval enhancement method combining keyword extraction and semantic analysis, including:

Step 100: receiving query information from a user and converting the query information into text data;

Step 200: filtering out key words of the text data using a preset keyword extraction model and a semantic analysis model;

Step 300: inputting the key words as training samples into a pre-trained natural language processing model to determine a set of candidate key sentences;

Step 400: returning, by a vector retrieval module, the knowledge information most relevant to the text data from a vector database by calculating cosine similarity;

Step 500: answering the questions in the text data according to the knowledge information to obtain answers.
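As an illustration only, Steps 100 to 500 can be read as a chain of five stages. The following minimal Python sketch wires such a chain together. The class name `RetrievalPipeline`, its field names, and the toy lambda stand-ins are all hypothetical and are not taken from the specification; in a real system each stage would call the corresponding fine-tuned model or vector database.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalPipeline:
    """Hypothetical wiring of Steps 100-500; every callable is a placeholder."""

    to_text: Callable[[object], str]                      # Step 100
    extract_keywords: Callable[[str], List[str]]          # Step 200
    select_candidates: Callable[[List[str]], List[str]]   # Step 300
    retrieve: Callable[[str, List[str]], List[str]]       # Step 400
    generate_answer: Callable[[str, List[str]], str]      # Step 500

    def answer(self, raw_query: object) -> str:
        text = self.to_text(raw_query)
        keywords = self.extract_keywords(text)
        candidates = self.select_candidates(keywords)
        knowledge = self.retrieve(text, candidates)
        return self.generate_answer(text, knowledge)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
pipeline = RetrievalPipeline(
    to_text=lambda q: str(q).strip(),
    extract_keywords=lambda t: [w for w in t.lower().split() if len(w) > 3],
    select_candidates=lambda kws: kws[:2],
    retrieve=lambda t, c: [f"doc about {k}" for k in c],
    generate_answer=lambda t, k: f"Q: {t} | sources: {', '.join(k)}",
)

print(pipeline.answer("  How to renew passport  "))
```

Keeping each stage behind a plain callable makes it possible to swap a toy stand-in for a model call without changing the control flow.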

Specifically, the keyword extraction model of this embodiment is constructed as follows:

First, this embodiment creates a fine-tuning dataset containing questions raised by users and their keywords. The keywords are selected on the basis of factors such as word frequency, part of speech, and semantics, so that the most representative keywords are chosen. The keyword extraction large model is obtained by fine-tuning the Qwen large model and extracts keywords by evaluating the importance of words.

Furthermore, the semantic analysis large model of this embodiment is constructed as follows:

This embodiment creates a fine-tuning dataset containing user questions and their true meanings. This helps the model understand and handle typos, repetition, and verbose or hard-to-understand language. The semantic analysis large model is likewise obtained by fine-tuning the Qwen large model (the pre-trained natural language processing model), in order to improve its language understanding and the accuracy of its answers.
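The specification does not give a storage format for the two fine-tuning datasets. As one possible sketch, each sample could be stored as an instruction-style JSON record, one per line. The field names `instruction`, `input`, and `output`, the instruction strings, and the helper names below are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical instruction strings; the specification does not define them.
KEYWORD_INSTRUCTION = "Extract the keywords from the user's question."
MEANING_INSTRUCTION = "Rewrite the user's question to state its true meaning."


def keyword_record(question: str, keywords: List[str]) -> Dict[str, str]:
    """One sample for the keyword extraction dataset (question -> keywords)."""
    return {
        "instruction": KEYWORD_INSTRUCTION,
        "input": question,
        "output": ", ".join(keywords),
    }


def meaning_record(question: str, true_meaning: str) -> Dict[str, str]:
    """One sample for the semantic analysis dataset (question -> true meaning)."""
    return {
        "instruction": MEANING_INSTRUCTION,
        "input": question,
        "output": true_meaning,
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


samples = [
    keyword_record("How do I renew my residence permit?", ["renew", "residence permit"]),
    meaning_record("permit renew how how??", "How do I renew my permit?"),
]
print(to_jsonl(samples))
```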

Furthermore, the vector retrieval module of this embodiment uses an embedding model to encode unstructured data into vectors and retrieves knowledge from the vector library. By calculating cosine similarity, it returns the top-k pieces of knowledge most relevant to the user input.
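The cosine-similarity ranking described above can be sketched in a few lines of Python. This is a minimal brute-force illustration that assumes the embeddings have already been computed; the function names `cosine_similarity` and `top_k` and the in-memory list standing in for the vector library are hypothetical, and a production vector database would typically use an index rather than a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored (text, vector) pair and keep the k most similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" (assumed values for illustration).
store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.0], store, k=2))
```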

In addition, this embodiment includes a dialogue large model that serves as the output module of the system. The dialogue large model receives the user input and the retrieved knowledge and generates an answer. It is obtained by fine-tuning the Qwen large model, namely the qwen1.5-14b model, on 100,000 professional question-and-answer knowledge entries.
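How the user input and the retrieved knowledge are combined before being passed to the dialogue model is not specified. One common approach, shown here purely as an assumption, is to place the retrieved entries in a numbered list ahead of the question. The function name `build_prompt` and the template wording are hypothetical.

```python
from typing import List


def build_prompt(user_input: str, knowledge: List[str]) -> str:
    """Assemble a single prompt string from the question and retrieved entries."""
    numbered = "\n".join(
        f"[{i}] {item}" for i, item in enumerate(knowledge, start=1)
    )
    return (
        "Answer the question using only the reference knowledge below.\n"
        f"Reference knowledge:\n{numbered}\n"
        f"Question: {user_input}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What are the office hours?",
    ["Open 9-17 on weekdays.", "Closed on holidays."],
)
print(prompt)
```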

Optionally, the method further comprises the following steps:

receiving the user's query information and converting it into text form; filtering out the key words of the query using factors such as word frequency, part of speech, and semantics; submitting these key words as training samples to the qwen1.5-14b large pre-trained natural language processing model to obtain corresponding weight values, thereby obtaining a set of candidate key sentences with higher weights; returning the top-k pieces of knowledge most relevant to the query from the vector database by calculating cosine similarity; and answering the questions in the query on the basis of the returned knowledge and sending the answers to the user terminal.

Furthermore, this embodiment first obtains a fine-tuning dataset containing questions raised by users and their true meanings, and then divides the questions into three categories by manual annotation: correct answers, wrong answers, and long-winded or repetitive answers. Next, the qwen1.5-14b large pre-trained natural language processing model is fine-tuned separately on the sentences of each of the three categories, so that each sentence shows a higher accuracy rate in its corresponding category. Finally, a statistical method is used to calculate a relevance score for each sentence; the sentences are sorted by relevance score, and the top 3% most relevant sentences are selected as the set of candidate key sentences.
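The final selection step (sort by relevance score, keep the top 3%) can be sketched as follows. The specification does not say how the relevance scores are computed or how the 3% cut-off is rounded, so the scores are taken as given inputs, and rounding up while keeping at least one sentence is an assumption. The function name `select_top_fraction` is hypothetical.

```python
import math
from typing import List, Tuple


def select_top_fraction(
    scored: List[Tuple[str, float]],
    fraction: float = 0.03,
) -> List[str]:
    """Return the highest-scoring sentences, keeping ceil(n * fraction) of them.

    Assumption: the count is rounded up and at least one sentence is kept,
    so small inputs still yield a non-empty candidate set.
    """
    if not scored:
        return []
    keep = max(1, math.ceil(len(scored) * fraction))
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [sentence for sentence, _ in ranked[:keep]]


# 50 sentences with distinct scores; 3% of 50 is 1.5, rounded up to 2.
scored_sentences = [(f"s{i}", i / 50) for i in range(50)]
print(select_top_fraction(scored_sentences))
```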

Furthermore, this embodiment traverses every sentence in all of the candidate key sentence sets and determines whether it meets the conditions. If a sentence meets the conditions, it is given a higher score and the information it provides is given priority in subsequent processing; if it does not, it is given a lower score and its contribution is disregarded in subsequent processing.

Furthermore, this embodiment achieves this by scoring each sentence in each candidate key sentence set in turn, where the scoring criterion is the degree to which the sentence can provide effective help. Next, when more than a certain number of candidate key sentence sets have failed to achieve the expected result within a specific time interval, the current task is stopped and all previously completed tasks are reviewed to determine whether anything needs improvement. Finally, once something needing improvement is found, it is adjusted immediately so that overall performance improves.

Furthermore, while scoring each sentence in each candidate key sentence set in turn, this embodiment also records the original scores for subsequent comparative analysis. Next, when more than a certain number of candidate key sentence sets whose average score is below a threshold have accumulated within a specific time interval without achieving the desired result, the current task is stopped and a review is conducted to determine whether the existing strategy needs to be optimized. Finally, once it is found that the existing strategy needs to be optimized, corresponding measures are taken promptly.
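The stop-and-review rule described above leaves the time interval, the score threshold, and the permitted number of low-scoring sets unspecified. The sketch below shows one way such a rule could be implemented with a sliding time window. The class name `ReviewMonitor` and all three parameters are hypothetical, and the values used in the example are arbitrary.

```python
from collections import deque
from typing import Deque, List, Tuple


class ReviewMonitor:
    """Signals a review when too many low-scoring sets fall in one time window."""

    def __init__(self, window_seconds: float, score_threshold: float,
                 max_failures: int) -> None:
        self.window_seconds = window_seconds
        self.score_threshold = score_threshold
        self.max_failures = max_failures
        # Raw per-sentence scores are kept for later comparative analysis.
        self.history: List[Tuple[float, List[float]]] = []
        self._failures: Deque[float] = deque()

    def record(self, timestamp: float, sentence_scores: List[float]) -> bool:
        """Log one candidate set; return True if the task should stop for review."""
        self.history.append((timestamp, list(sentence_scores)))
        average = (sum(sentence_scores) / len(sentence_scores)
                   if sentence_scores else 0.0)
        if average < self.score_threshold:
            self._failures.append(timestamp)
        # Drop failures that are older than the window.
        while self._failures and timestamp - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) > self.max_failures


monitor = ReviewMonitor(window_seconds=60, score_threshold=0.5, max_failures=2)
signals = [
    monitor.record(0, [0.2, 0.3]),
    monitor.record(10, [0.9]),
    monitor.record(20, [0.1]),
    monitor.record(30, [0.4]),   # third low-scoring set inside 60 s
    monitor.record(100, [0.1]),  # earlier failures have left the window
]
print(signals)
```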

Corresponding to the above method, this embodiment also provides a government dialogue system based on the Qwen large model, including: a fine-tuning dataset; a large model used to build the keyword extraction model and a semantic analysis large model used for language understanding of the user input; a vector library and a vector retrieval module; and a dialogue large model that generates answers. The method further includes the following steps: receiving the user's query information and converting it into text form; filtering out the key words of the query using factors such as word frequency, part of speech, and semantics; submitting these key words as training samples to the qwen1.5-14b large pre-trained natural language processing model to obtain corresponding weight values, thereby obtaining a set of candidate key sentences with higher weights; returning the top-k pieces of knowledge most relevant to the query from the vector database by calculating cosine similarity; and answering the questions in the query on the basis of the returned knowledge and sending the answers to the user terminal.

The beneficial effects of the present invention are as follows:

(1) By fine-tuning the Qwen large model, the present invention can understand the government information queried by users more accurately. This greatly improves the efficiency and accuracy of government information retrieval, thereby better meeting the needs of users;

(2) The technical method of the present invention, which combines keyword extraction and semantic analysis, can be applied to information retrieval in other fields, such as healthcare or education, to improve information search capabilities in those fields and provide a better service experience;

(3) The present invention integrates multiple modules and technical means to process different types of input data, including unstructured text and images, making the overall system more comprehensive and more flexible. This helps resolve the tension between the massive amount of information produced by the rapid development of the Internet and users' personalized needs, and provides Internet users with an efficient and convenient data acquisition channel and service platform.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another.

Specific examples are used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core ideas. A person of ordinary skill in the art may, following the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A retrieval enhancement method combining keyword extraction and semantic analysis, characterized by comprising:

receiving query information from a user and converting the query information into text data;

filtering out key words of the text data using a preset keyword extraction model and a semantic analysis model;

inputting the key words as training samples into a pre-trained natural language processing model to determine a set of candidate key sentences;

returning, by a vector retrieval module, the knowledge information most relevant to the text data from a vector database by calculating cosine similarity;

answering the questions in the text data according to the knowledge information to obtain answers.

2. The retrieval enhancement method combining keyword extraction and semantic analysis according to claim 1, characterized in that the method for constructing the keyword extraction model comprises:

constructing a first fine-tuning dataset, the first fine-tuning dataset including questions raised by users and the keywords corresponding to the questions;

fine-tuning the pre-trained natural language processing model on the first fine-tuning dataset to obtain the keyword extraction model.

3. The retrieval enhancement method combining keyword extraction and semantic analysis according to claim 1, characterized in that the method for constructing the semantic analysis model comprises:

constructing a second fine-tuning dataset, the second fine-tuning dataset including questions raised by users and the true meanings of the questions;

fine-tuning the pre-trained natural language processing model on the second fine-tuning dataset to obtain the semantic analysis model.

4. The retrieval enhancement method combining keyword extraction and semantic analysis according to claim 1, characterized in that inputting the key words as training samples into the pre-trained natural language processing model to determine the set of candidate key sentences comprises:

classifying the key words into three types of questions by manual annotation, the three types being: correct answers, wrong answers, and repeated answers;

fine-tuning the pre-trained natural language processing model separately on the key words corresponding to each of the three types of questions, so that each key word shows a higher accuracy rate in its corresponding category;

calculating a relevance score for each key word using a statistical method, sorting by relevance score, and selecting the top 3% most relevant key words as the set of candidate key sentences.