CN116775846A - Domain knowledge question and answer method, system, equipment and medium - Google Patents
- Publication number: CN116775846A
- Application number: CN202310942972.1A
- Authority: CN (China)
- Prior art keywords: vector, domain knowledge, question, sentence, query
- Prior art date: 2023-07-28
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a domain knowledge question answering method, system, device, and medium, belonging to the technical field of natural language processing. The method includes: obtaining a query sentence; inputting the query sentence into a semantic similarity model to obtain the query vector output by the model, the semantic similarity model being obtained by training on domain knowledge; determining the similarity between the query vector and each question vector based on the query vector and at least one question vector in a knowledge base; determining at least one target vector from the question vectors based on those similarities; and determining the answer to the query sentence based on the answers corresponding to the target vectors. This improves the accuracy of the matching results and the scalability of the question answering system, and the vector-matching approach improves the system's response speed.
Description
Technical Field
The present invention relates to the technical field of natural language processing, and in particular to a domain knowledge question answering method, system, device, and medium.
Background
In intelligent operation and maintenance (O&M) scenarios, O&M personnel frequently need to look up O&M knowledge in their daily work, for example querying a particular O&M procedure or a particular O&M command. An O&M robot with text-matching capability can conveniently provide this function: the O&M personnel store the knowledge to be queried in the robot's knowledge base and enter a question through a dialogue interface; the robot parses and searches using a series of natural language processing (NLP) algorithms and returns the best-matching answer to the user.
Existing text-matching techniques use either text classification algorithms or text similarity matching algorithms. Among classification algorithms, traditional methods include support vector machines and logistic regression, while deep neural network methods include LSTM, TextCNN, and the like. Text similarity matching algorithms are generally interaction-based.
Interaction-based text matching is computationally expensive. Classification algorithms are supervised, so every model training run requires well-annotated labels, which makes knowledge expansion and real-time querying difficult for an O&M robot; in addition, low-quality training samples degrade question-answer matching accuracy. Existing domain knowledge question answering systems therefore suffer from slow response, low matching accuracy, and poor extensibility.
Summary of the Invention
The present invention provides a domain knowledge question answering method, system, device, and medium to overcome the slow response, low matching accuracy, and poor extensibility of knowledge question answering systems in the prior art, and to comprehensively improve the performance of the question answering system.
The present invention provides a domain knowledge question answering method, including:
obtaining a query sentence;
inputting the query sentence into a semantic similarity model to obtain a query vector output by the semantic similarity model, the semantic similarity model being obtained by training on domain knowledge;
determining, based on the query vector and at least one question vector in a knowledge base, the similarity between the query vector and each question vector;
determining at least one target vector from the question vectors based on the similarity between the query vector and each question vector;
determining the answer to the query sentence based on the answers corresponding to the target vectors.
According to the domain knowledge question answering method provided by the present invention, the semantic similarity model is trained by the following steps:
performing weighted training of a Word2Vec model with a neural network algorithm based on a domain knowledge dictionary, to obtain domain knowledge sentence vectors weighted by inverse document frequency;
expanding the domain knowledge sentence vectors to obtain domain knowledge sentence pairs;
training the semantic similarity model on the domain knowledge sentence pairs.
According to the domain knowledge question answering method provided by the present invention, the method further includes:
segmenting at least one domain knowledge entry into words, and computing the inverse document frequency weight of each segmented word;
constructing the domain knowledge dictionary based on the inverse document frequency weights obtained after word segmentation.
According to the domain knowledge question answering method provided by the present invention, expanding the domain knowledge sentence vectors to obtain domain knowledge sentence pairs includes:
generating, by synonym replacement and/or sentence repetition, an expanded sentence vector corresponding to each sentence vector;
determining the domain knowledge sentence pairs based on the sentence vectors and the expanded sentence vectors;
computing the similarity between each sentence vector and its expanded sentence vector;
labeling the domain knowledge sentence pairs based on the similarity.
According to the domain knowledge question answering method provided by the present invention, labeling the domain knowledge sentence pairs based on the similarity includes:
if the similarity is greater than or equal to a first preset threshold, labeling the domain knowledge sentence pair as similar;
or, if the similarity is less than or equal to a second preset threshold, labeling the domain knowledge sentence pair as dissimilar.
According to the domain knowledge question answering method provided by the present invention, before training the semantic similarity model on the domain knowledge sentence pairs, the method further includes:
pre-training the semantic similarity model on Chinese prior knowledge.
According to the domain knowledge question answering method provided by the present invention, the method further includes:
using a real-time distributed search and analytics engine component to store and retrieve the question vectors in the knowledge base in a distributed manner.
The present invention also provides a domain knowledge question answering system, including:
a query sentence acquisition module, configured to obtain a query sentence;
a query vector acquisition module, configured to input the query sentence into a semantic similarity model and obtain the query vector output by the semantic similarity model, the semantic similarity model being obtained by training on domain knowledge;
a similarity determination module, configured to determine the similarity between the query vector and each question vector based on the query vector and at least one question vector in a knowledge base;
a target vector determination module, configured to determine at least one target vector from the question vectors based on the similarity between the query vector and each question vector;
an answer determination module, configured to determine the answer to the query sentence based on the answers corresponding to the target vectors.
The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing any of the domain knowledge question answering methods described above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the domain knowledge question answering methods described above.
With the domain knowledge question answering method, system, device, and medium provided by the present invention, the obtained query sentence is input into a semantic similarity model trained on domain knowledge to obtain the corresponding query vector; the query vector is matched against a knowledge base in which a number of question vectors are stored in advance; the question vectors with the highest similarity to the query vector are selected as target vectors; and the answer to the query sentence is determined from the target vectors. Training the semantic similarity model on domain knowledge optimizes the training samples, and answering queries through vector matching between query sentences and question sentences improves the response speed of the question answering system; because the training samples of the semantic similarity model are optimized with domain knowledge, the accuracy of the system's matching results is also improved.
Brief Description of the Drawings
In order to describe the technical solutions of the present invention or of the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is the first schematic flowchart of the domain knowledge question answering method provided by the present invention;
Figure 2 is the second schematic flowchart of the domain knowledge question answering method provided by the present invention;
Figure 3 is the first training diagram of the domain knowledge question answering method provided by the present invention;
Figure 4 is the second training diagram of the domain knowledge question answering method provided by the present invention;
Figure 5 is the third training diagram of the domain knowledge question answering method provided by the present invention;
Figure 6 is the third schematic flowchart of the domain knowledge question answering method provided by the present invention;
Figure 7 is a preprocessing diagram of the domain knowledge question answering method provided by the present invention;
Figure 8 is the fourth schematic flowchart of the domain knowledge question answering method provided by the present invention;
Figure 9 is a schematic structural diagram of the domain knowledge question answering system provided by the present invention;
Figure 10 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description of the Embodiments
To make the purpose, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The domain knowledge question answering method, system, device, and medium of the present invention are described below with reference to Figures 1 to 10.
Figure 1 is the first schematic flowchart of the domain knowledge question answering method provided by the present invention. As shown in Figure 1, the method includes the following steps S1 to S5:
S1. Obtain a query sentence.
Optionally, in this step, the application scenario of the domain knowledge question answering system is intelligent operation and maintenance. A user enters a query sentence on the interactive front-end page of the system, and the query sentence is obtained from the front-end page by the O&M robot device; the domain knowledge question answering system is integrated into the knowledge base of the O&M robot device. A query sentence generally contains keywords from the O&M domain and may ask about the procedure of a certain O&M operation, the definition of a certain O&M term, or a certain O&M command, for example "What is the execution command of oos-job?" or "What are the common redis commands?".
S2. Input the query sentence into the semantic similarity model to obtain the query vector output by the model.
The semantic similarity model is obtained by training on domain knowledge.
Optionally, in this embodiment, the semantic similarity model is obtained through pre-training followed by fine-tuning, using supervised training, meaning that the training samples carry preset labels. In this embodiment the labels are "0" and "1", indicating whether the similarity between a pair of input training samples reaches a preset threshold. Fine-tuning the model on domain knowledge makes its vector representation of O&M domain knowledge more accurate.
S3. Determine the similarity between the query vector and each question vector based on the query vector and at least one question vector in the knowledge base.
Optionally, in this step, the knowledge base stores a number of question vectors, each obtained by passing a question sentence through the semantic similarity model described above. After the user's query sentence is received, it is passed through the semantic similarity model to obtain the corresponding query vector, which is then compared one by one with the question vectors in the knowledge base by computing the similarity between the query vector and each question vector; the similarity can be computed with cosine similarity, the Jaccard coefficient, the Pearson correlation coefficient, and so on.
S4. Determine at least one target vector from the question vectors based on the similarity between the query vector and each question vector.
Optionally, in this step, the similarity between the query vector and each question vector in the knowledge base is computed, and the TOP-N matching target vectors are output in descending order of similarity: the similarities are sorted from high to low, and question vectors whose similarity is greater than or equal to a preset similarity threshold are taken as target vectors. In this embodiment the similarity threshold can be set to 80%. The TOP-N question vectors are used as target vectors; if only one question vector in the knowledge base has a similarity greater than or equal to the threshold, only one target vector is output. In a concrete implementation, if the similarity between the query vector of the user's question and every question vector stored in the knowledge base is very low, i.e. below the preset similarity threshold, no target vector is returned and the question answering interface displays "No answer was found for your question".
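As an illustration of steps S3 and S4, the following minimal sketch ranks stored question vectors by cosine similarity against a query vector and keeps the TOP-N hits that reach the 80% threshold mentioned above; the function name, the NumPy implementation, and the default arguments are assumptions for illustration, not part of the patent. An empty result corresponds to the "no answer was found" branch described above.

```python
import numpy as np

def top_n_matches(query_vec, question_vecs, n=3, threshold=0.8):
    """Rank knowledge-base question vectors by cosine similarity to the query
    vector and keep at most n hits whose similarity reaches the threshold."""
    q = np.asarray(query_vec, dtype=np.float32)
    m = np.asarray(question_vecs, dtype=np.float32)   # shape: (num_questions, dim)
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:n]                      # highest similarity first
    hits = [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]
    return hits                                        # empty list -> "no answer found"
```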
S5. Determine the answer to the query sentence based on the answers corresponding to the target vectors.
In the knowledge base, the answer string corresponding to each question sentence is also stored in vector form with a one-to-one mapping. After a query vector is successfully matched to a question vector, the system performs a distributed lookup of the answer corresponding to that question vector in the knowledge base, returns it to the O&M robot, and passes it to the front-end interactive page, where it is highlighted.
In the domain knowledge question answering method provided by this embodiment, the obtained query sentence is input into a semantic similarity model trained on domain knowledge to obtain the corresponding query vector, which is matched against a knowledge base in which a number of question vectors are stored in advance; the question vectors with the highest similarity to the query vector are selected as target vectors, and the answer to the query sentence is determined from them. The semantic similarity model trained on domain knowledge optimizes the training samples: after the knowledge question sentences in the knowledge base and the client's query sentences pass through this model, each obtains its corresponding vector, and query sentences are matched to question sentences through vector similarity computation. Using this semantic similarity model for domain knowledge question answering improves the response speed of the system, and because the model's training samples are optimized on domain knowledge, the accuracy of the system's matching results is also improved.
Based on the above embodiment, the semantic similarity model in step S2 is trained by the following steps. Figure 2 is the second schematic flowchart of the domain knowledge question answering method provided by the present invention; as shown in Figure 2, training the semantic similarity model includes the following steps S21 to S23:
S21. Based on the domain knowledge dictionary, perform weighted training of the Word2Vec model with a neural network algorithm to obtain domain knowledge sentence vectors weighted by inverse document frequency.
Optionally, in this embodiment, Word2Vec is a family of models for producing word vectors. A Word2Vec model maps each word to a vector that can express relationships between words, i.e. distributed word vectors. Exploiting the property of language that "a word can be described by the words around it", Word2Vec learns a comprehensive representation of a word from a large training corpus.
In this embodiment, Word2Vec is trained with a neural network algorithm to obtain a vector representation for each word or individual character; the sentence vector used in this embodiment is obtained by summing and averaging these vectors.
Figure 3 is the first training diagram of the domain knowledge question answering method provided by the present invention. As shown in Figure 3, the Word2Vec model in this embodiment is trained with the Continuous Bag of Words (CBOW) architecture. Training features are built jointly from words and characters, and the vector length of each word or character is set to 300 dimensions. The input of the CBOW model is a span of text whose length is usually set to 5, although this can be changed as needed. In Figure 3, w(t) denotes a word or character in a sample; CBOW predicts w(t) from its context w(t-2), w(t-1), w(t+1), w(t+2), so that words that occur in similar contexts in the training samples obtain similar vector representations. The basic idea of CBOW is to predict a word from its context. For example, for the sentence "The cat climbed up the tree" with a text length of 5, when the center word is "climbed" the context words are "The", "cat", "up", and "the", and the CBOW model is required to compute the probability distribution of "climbed" from these four context words. Compared with traditional count-based or matrix-factorization methods, CBOW can be trained on large-scale corpora and learns high-quality, low-dimensional dense vectors, usually between 50 and 300 dimensions, which saves storage space and computing resources, and it can capture complex relationships between words such as synonyms, antonyms, and analogies.
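A minimal gensim-based sketch of the CBOW training described above; gensim, the toy corpus, and the window interpretation (a 5-token span, i.e. two context tokens on each side) are assumptions for illustration.

```python
from gensim.models import Word2Vec

# toy segmented O&M corpus; in practice this is the full segmented knowledge base
tokenized_corpus = [
    ["oos-job", "的", "执行", "命令", "是", "什么"],
    ["有", "哪些", "redis", "的", "常用", "命令"],
]

w2v = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=300,   # 300-dimensional word/character vectors, as in the description
    window=2,          # two context tokens on each side, i.e. a 5-token span
    sg=0,              # sg=0 selects the CBOW architecture
    min_count=1,
)
redis_vector = w2v.wv["redis"]   # 300-dimensional vector for a single token
```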
Figure 4 is the second training diagram of the domain knowledge question answering method provided by the present invention. As shown in Figure 4, when computing the sentence vector representation, the word vector of a word that exists in the domain knowledge dictionary is retrieved directly; for a word not included in the dictionary, the vectors of its individual characters are retrieved, summed, and averaged to obtain its representation. As shown in Figure 4, the vector for each knowledge word is 300-dimensional; after multiplying by the corresponding IDF value and then summing and averaging, the resulting sentence vector is also 300-dimensional.
If the sentence vector were obtained simply by summing and averaging the word vectors produced by Word2Vec, the algorithm would treat every word in the sentence as having equal weight, which leads to poor matching between query vectors and question vectors. For example, if the user enters "What are the common redis commands" and every word carries equal weight, then apart from the O&M keyword "redis" the remaining wording ("what", "common", "commands", and so on) matches other sentences almost perfectly, so the algorithm would rank "What are the common webgate commands" first and "What redis commands are there" second. Introducing inverse document frequency (IDF) values to increase the weight of O&M keywords such as "redis" and "webgate" and reduce the weight of words such as "common" and "command" corrects this ranking. Note that if a word encountered in actual use is not in the domain knowledge dictionary, the top four IDF values in the dictionary are used to set the IDF value of the unlisted word; that is, an unlisted word is assumed to occur rarely, so its weight should be somewhat larger.
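The IDF weighting just described can be sketched as follows; the character-level fallback for out-of-dictionary words follows Figure 4, while the function name and the default_idf argument (standing in for the default weight given to unlisted words) are assumptions.

```python
import numpy as np

def sentence_vector(tokens, wv, idf_dict, default_idf, dim=300):
    """IDF-weighted average of word vectors; characters are used as a fallback
    for tokens missing from the Word2Vec vocabulary."""
    weighted = []
    for tok in tokens:
        idf = idf_dict.get(tok, default_idf)      # unlisted words get the default weight
        if tok in wv:                             # word found in the trained vectors
            vec = wv[tok]
        else:                                     # fall back to averaging character vectors
            chars = [wv[c] for c in tok if c in wv]
            if not chars:
                continue
            vec = np.mean(chars, axis=0)
        weighted.append(idf * vec)
    return np.mean(weighted, axis=0) if weighted else np.zeros(dim)
```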
S22. Expand the domain knowledge sentence vectors to obtain domain knowledge sentence pairs.
In this step, domain knowledge sentence pairs are created from the domain knowledge sentence vectors obtained in step S21, in preparation for subsequent model training.
Optionally, in this step, expanding a domain knowledge sentence vector means constructing a sentence similar to it, also represented as a sentence vector; the original sentence vector and its expanded counterpart together form a domain knowledge sentence pair.
S23. Train the semantic similarity model on the domain knowledge sentence pairs.
In this embodiment, the semantic similarity model is trained by pre-training followed by fine-tuning. The base model is the SBert model (Siamese network BERT) with a twin-network architecture; both sub-networks are BERT models that share parameters. When comparing the similarity of two sentences, the input domain knowledge sentence pair is fed into the two BERT networks respectively, and the output is two vectors representing the sentences. Supervised training is used, meaning that the input training samples carry labels.
Figure 5 is the third training diagram of the domain knowledge question answering method provided by the present invention. As shown in Figure 5, the Siamese network (SBert) architecture has two input channels whose inputs are the domain knowledge sentence pairs obtained in step S22, i.e. a sentence vector and its similarity-expanded counterpart. The two channels share the same semantic representation model (BERT); after the pooling layer, the vectors u and v corresponding respectively to the sentence and its expansion are output, and cosine similarity is then computed. The loss function used during training is cosine similarity or the L2 distance.
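A minimal fine-tuning sketch of the Siamese setup in Figure 5 using the sentence-transformers library; the library choice, the bert-base-chinese checkpoint, and the toy sentence pairs are illustrative assumptions, while the shared-parameter BERT, pooling layer, and cosine-similarity objective mirror the description above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# a plain Chinese BERT checkpoint; sentence-transformers adds a pooling layer on top
model = SentenceTransformer("bert-base-chinese")

train_examples = [
    InputExample(texts=["有哪些redis的常用命令", "有什么redis的惯用命令"], label=1.0),
    InputExample(texts=["有哪些redis的常用命令", "oos-job的执行命令是什么"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)   # cosine-similarity objective, as in Figure 5

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

query_vector = model.encode("有哪些redis的常用命令")   # sentence embedding for retrieval
```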
In the domain knowledge question answering method provided by this embodiment, the semantic similarity model is trained as follows: based on the domain knowledge dictionary, the Word2Vec model is trained with a neural network algorithm using weighting to obtain domain knowledge sentence vectors weighted by inverse document frequency. This conveniently and effectively solves the problem of low matching accuracy caused by Word2Vec treating all words with equal weight and improves the representational quality of the sentence vectors.
Based on the above embodiment, the domain knowledge dictionary in step S21 is constructed as follows. Figure 6 is the third schematic flowchart of the domain knowledge question answering method provided by the present invention; as shown in Figure 6, constructing the domain knowledge dictionary includes the following steps S211 and S212:
S211. Segment at least one domain knowledge entry into words and compute the inverse document frequency weight of each segmented word.
The constructed domain knowledge dictionary is an O&M domain knowledge dictionary. Building this dictionary prepares for more accurate word segmentation later; most of its contents are computing terms such as "hbase", "hive", "html5", and "css", and the remainder are O&M proper nouns. All O&M knowledge entries are segmented into words, and the inverse document frequency (IDF) value of each word is computed. For example, the O&M knowledge entry "What is the execution command of oos-job" is segmented into "oos-job / 的 / 执行 / 命令 / 是 / 什么"; because "oos-job" exists in the dictionary, the segmenter can split it out completely and accurately. The IDF value of every word is then computed according to formula (1):
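Reconstructed from the description in the following paragraph (the original equation image is not included in this text), formula (1) reads:

$$ \mathrm{IDF}_w = \log\frac{N}{\mathrm{DF}_w + 1} \tag{1} $$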
Here N is the total number of domain knowledge entries in the knowledge base and DF is the number of entries containing the word; adding 1 to the denominator prevents division by zero when DF = 0. Clearly, the more entries a word appears in, the larger the denominator and the smaller the logarithm, meaning the word's weight, i.e. its importance, across all knowledge is smaller. For instance, in "oos-job / 的 / 执行 / 命令 / 是 / 什么", "oos-job" appears in fewer knowledge entries than the other words, so its IDF value is larger. The IDF values are computed so that different word vectors can be given different weights in step S21 above.
S212. Construct the domain knowledge dictionary based on the inverse document frequency weights obtained after word segmentation.
Figure 7 is a preprocessing diagram of the domain knowledge question answering method provided by the present invention; as shown in Figure 7, the constructed domain knowledge dictionary contains the segmented domain knowledge words and their corresponding IDF values.
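A minimal sketch of steps S211 and S212, building the word-to-IDF dictionary of Figure 7; the jieba segmenter and the helper names are assumptions, since the patent does not name a specific segmentation tool.

```python
import math
from collections import Counter

import jieba

# register domain terms so the segmenter keeps them intact, as in the "oos-job" example
for term in ["oos-job", "redis", "webgate", "hbase", "hive", "html5", "css"]:
    jieba.add_word(term)

def build_idf_dict(knowledge_texts):
    """Segment every knowledge entry and compute IDF = log(N / (DF + 1)), formula (1)."""
    n = len(knowledge_texts)
    df = Counter()
    for text in knowledge_texts:
        df.update(set(jieba.lcut(text)))   # count each word at most once per entry
    return {word: math.log(n / (count + 1)) for word, count in df.items()}

idf_dict = build_idf_dict(["oos-job的执行命令是什么", "有哪些redis的常用命令"])
```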
In the domain knowledge question answering method provided by this embodiment, constructing an O&M domain knowledge dictionary containing the segmented domain words and their corresponding IDF values improves the quality of the training samples fed to the semantic similarity model in the subsequent steps, and thus the matching accuracy of the question answering system.
Based on the above embodiment, Figure 8 is the fourth schematic flowchart of the domain knowledge question answering method provided by the present invention; as shown in Figure 8, step S22 is carried out through the following steps S221 to S224:
S221. Generate, by synonym replacement and/or sentence repetition, an expanded sentence vector corresponding to each sentence vector.
In this step, the obtained sentence vectors are expanded by synonym replacement and/or sentence repetition, automated with a thesaurus and a Python script; Table 1 shows how an expanded sentence is generated from an original sentence:
Table 1

Original sentence | Expanded sentence
---|---|
有哪些redis的常用命令 | 有什么redis的常用命令
有哪些redis的常用命令 | 有哪些redis的惯用命令
Synonym replacement and sentence repetition can be applied together or individually. In the example, the original sentence "有哪些redis的常用命令" ("What are the common redis commands") is expanded through the thesaurus by replacing "哪些" with "什么" or "常用" with "惯用"; the resulting sentence expresses the same meaning as the original, and the O&M keyword "redis" is never replaced.
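A minimal sketch of the synonym-replacement / sentence-repetition expansion in step S221; the tiny SYNONYMS dictionary stands in for the thesaurus mentioned above, and the repetition strategy (appending the sentence to itself) is one possible reading of "sentence repetition".

```python
import random

# tiny stand-in for the thesaurus; domain keywords such as "redis" are deliberately absent
SYNONYMS = {"哪些": ["什么"], "常用": ["惯用"]}

def expand(tokens, repeat_prob=0.3):
    """Paraphrase a segmented sentence by synonym replacement and/or repetition,
    leaving domain keywords untouched."""
    new_tokens = [random.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]
    if random.random() < repeat_prob:      # optional sentence-repetition variant
        new_tokens = new_tokens * 2
    return "".join(new_tokens)

pair = ("有哪些redis的常用命令",
        expand(["有", "哪些", "redis", "的", "常用", "命令"]))
```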
S222. Determine the domain knowledge sentence pairs based on the sentence vectors and the expanded sentence vectors.
S223. Compute the similarity between each sentence vector and its expanded sentence vector.
Optionally, in this step, cosine similarity is used to compute the similarity between the sentence vector and the expanded sentence vector.
S224. Label the domain knowledge sentence pairs based on the similarity.
In the domain knowledge question answering method provided by this embodiment, sentence pairs are generated automatically from the obtained sentence vectors by synonym replacement and/or sentence repetition using a thesaurus together with a Python script, which makes it possible to mass-produce high-quality expanded O&M question sentences. In subsequent model optimization, only the keyword updates in the O&M domain knowledge dictionary need attention, since the existing domain knowledge sentences are expanded automatically; this improves the extensibility and usability of the question answering system.
Based on the above embodiment, step S224 further includes:
if the similarity is greater than or equal to a first preset threshold, labeling the domain knowledge sentence pair as similar;
or, if the similarity is less than or equal to a second preset threshold, labeling the domain knowledge sentence pair as dissimilar.
Optionally, in this embodiment, the sentence pairs are labeled so that supervised training can be used. Cosine similarity is computed between the sentences of each pair. If the similarity is greater than or equal to the first preset threshold, which in this embodiment is 95%, the domain knowledge sentence pair is labeled as similar and its label can be set to 1; if the similarity is less than or equal to the second preset threshold, which in this embodiment is 30%, the pair is labeled as dissimilar and its label is set to 0. Table 2 shows labeled domain knowledge sentence pairs:
Table 2
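A minimal labeling sketch for step S224 with the 95% / 30% thresholds given above; returning None for pairs that fall between the two thresholds is an assumption, since the text does not say how such pairs are handled.

```python
import numpy as np

def label_pair(vec_a, vec_b, upper=0.95, lower=0.30):
    """Label a domain knowledge sentence pair from the cosine similarity of its vectors:
    1 if similarity >= 0.95, 0 if similarity <= 0.30, otherwise left unlabeled."""
    sim = float(np.dot(vec_a, vec_b) /
                (np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-12))
    if sim >= upper:
        return 1
    if sim <= lower:
        return 0
    return None   # ambiguous pair, not used for supervised training
```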
In the domain knowledge question answering method provided by this embodiment, labeling the domain knowledge sentence pairs facilitates the supervised training used in subsequent model training; feeding in sentence pairs with similarity labels improves the quality of the training samples for the semantic similarity model and thus the matching accuracy of the question answering system.
Based on the above embodiment, before step S23 the method further includes:
pre-training the semantic similarity model on Chinese prior knowledge.
This step is the pre-training stage of the semantic similarity model. First, Chinese Semantic Textual Similarity (STS) data sets from various domains are selected for training; these data sets cover many domains, including casual conversation, finance, fiction, and sports. The purpose of pre-training is for the SBert model to first learn broad semantic knowledge of the Chinese language in general. The pre-training data consist of sentence pairs and the Pearson correlation score between them, which generally ranges from 0 to 5; a larger score means the two sentences are more similar. For example, the score between the pair "一个小孩在骑马" ("a child is riding a horse") and "孩子在骑马" ("the child is riding a horse") is 4, while the score between "一个女人弹吉他" ("a woman is playing the guitar") and "一个人在看书" ("a person is reading a book") is 2. Pre-training the semantic similarity model helps the subsequent fine-tuning on O&M domain knowledge. After repeated training and testing, the optimal model is obtained when the data set contains 100,000 sentence pairs and the BERT parameter dropout_rate = 0.2; during training, several initial network layers of BERT are frozen so that the Chinese prior knowledge from the pre-training stage is preserved to the greatest extent.
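A sketch of how the frozen initial layers and dropout_rate = 0.2 could be set up with sentence-transformers; the bert-base-chinese checkpoint and the choice of freezing the embedding layer plus the first six encoder layers are illustrative assumptions (the text only says "several initial network layers").

```python
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer(
    "bert-base-chinese",
    model_args={"hidden_dropout_prob": 0.2},   # dropout_rate = 0.2, as in the description
)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

bert = word_embedding.auto_model               # the underlying HuggingFace BERT
for param in bert.embeddings.parameters():     # freeze initial layers to preserve the
    param.requires_grad = False                # Chinese prior knowledge from pre-training
for layer in bert.encoder.layer[:6]:           # number of frozen layers is illustrative
    for param in layer.parameters():
        param.requires_grad = False
```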
In the domain knowledge question answering method provided by this embodiment, pre-training the semantic similarity model on Chinese prior knowledge optimizes the model's training foundation, improves the effect of the subsequent fine-tuning on O&M domain knowledge, and improves the output accuracy of the final semantic similarity model.
Based on the above embodiment, the domain knowledge question answering method further includes:
using a real-time distributed search and analytics engine component to store and retrieve the question vectors in the knowledge base in a distributed manner.
In the domain knowledge question answering method provided by this embodiment, a high-performance vector retrieval store is built on a real-time distributed search and analytics engine (Elasticsearch) component. After the SBert model is obtained, all knowledge question sentences are converted into vectors by the model and stored in ES. ES 7.3 and later provide the dense_vector index type, which supports computing cosine similarity between vectors. The vector retrieval performance of both a single-node ES instance and an ES cluster was tested; in this embodiment, the test uses a single node and a vector length of 300 dimensions. In a concrete implementation, an optimized 3-node ES cluster can retrieve similar O&M knowledge from 1,000,000 records within 100 milliseconds, which, while preserving the model's text matching accuracy, improves the system's ability to respond and retrieve quickly.
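A minimal Elasticsearch sketch of the dense_vector storage and cosine-similarity retrieval described above (an ES 7.3+ feature); the index name, field names, local URL, and use of the 8.x Python client are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# dense_vector index for the 300-dimensional question vectors
es.indices.create(index="om_knowledge", mappings={
    "properties": {
        "question":        {"type": "keyword"},
        "answer":          {"type": "keyword"},
        "question_vector": {"type": "dense_vector", "dims": 300},
    }
})

query_vector = [0.0] * 300   # placeholder for the 300-dim output of the SBert model

# brute-force cosine-similarity ranking against all stored question vectors
resp = es.search(index="om_knowledge", size=3, query={
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.q, 'question_vector') + 1.0",
            "params": {"q": query_vector},
        },
    },
})
```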
The present invention also provides a domain knowledge question answering system. Figure 9 is a schematic structural diagram of the domain knowledge question answering system provided by the present invention; as shown in Figure 9, the system includes:
a query sentence acquisition module 91, configured to obtain a query sentence;
a query vector acquisition module 92, configured to input the query sentence into a semantic similarity model and obtain the query vector output by the semantic similarity model, the semantic similarity model being obtained by training on domain knowledge;
a similarity determination module 93, configured to determine the similarity between the query vector and each question vector based on the query vector and at least one question vector in a knowledge base;
a target vector determination module 94, configured to determine at least one target vector from the question vectors based on the similarity between the query vector and each question vector;
an answer determination module 95, configured to determine the answer to the query sentence based on the answers corresponding to the target vectors.
In the domain knowledge question answering system of this embodiment, the modules cooperate as follows: the query sentence acquisition module obtains a query sentence and inputs it into the semantic similarity model trained on domain knowledge; the query vector acquisition module obtains the corresponding query vector; the query vector is matched against a knowledge base in which a number of question vectors are stored in advance; the question vectors with the highest similarity to the query vector are selected as target vectors; and the answer to the query sentence is determined from them. The semantic similarity model trained on domain knowledge optimizes the training samples: the knowledge question sentences in the knowledge base and the client's query sentences each obtain their corresponding vectors after passing through the model, and query sentences are matched to question sentences through vector similarity computation. Using this semantic similarity model for domain knowledge question answering improves the response speed of the system, and because the model's training samples are optimized on domain knowledge, the accuracy of the matching results is also improved.
Figure 10 is a schematic structural diagram of the electronic device provided by the present invention. As shown in Figure 10, the electronic device may include a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communications interface 1020, and the memory 1030 communicate with one another through the communication bus 1040. The processor 1010 can invoke logic instructions in the memory 1030 to execute the domain knowledge question answering method, which includes: obtaining a query sentence; inputting the query sentence into a semantic similarity model to obtain the query vector output by the semantic similarity model, the semantic similarity model being obtained by training on domain knowledge; determining the similarity between the query vector and each question vector based on the query vector and at least one question vector in a knowledge base; determining at least one target vector from the question vectors based on those similarities; and determining the answer to the query sentence based on the answers corresponding to the target vectors. In addition, the logic instructions in the memory 1030 can be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks.
In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the domain knowledge question answering method provided above, including: obtaining a query sentence; inputting the query sentence into a semantic similarity model to obtain the query vector output by the semantic similarity model, the semantic similarity model being obtained by training on domain knowledge; determining the similarity between the query vector and each question vector based on the query vector and at least one question vector in a knowledge base; determining at least one target vector from the question vectors based on those similarities; and determining the answer to the query sentence based on the answers corresponding to the target vectors.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without creative effort.
From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and includes a number of instructions for causing a computer device (a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced with equivalents; such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310942972.1A CN116775846A (en) | 2023-07-28 | 2023-07-28 | Domain knowledge question and answer method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116775846A true CN116775846A (en) | 2023-09-19 |
Family
ID=87986063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310942972.1A Pending CN116775846A (en) | 2023-07-28 | 2023-07-28 | Domain knowledge question and answer method, system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116775846A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20190133931A (en) * | 2018-05-24 | 2019-12-04 | 한국과학기술원 | Method to response based on sentence paraphrase recognition for a dialog system |
US20230069935A1 (en) * | 2019-11-20 | 2023-03-09 | Korea Advanced Institute Of Science And Technology | Dialog system answering method based on sentence paraphrase recognition |
CN113821622A (en) * | 2021-09-29 | 2021-12-21 | 平安银行股份有限公司 | Answer retrieval method and device based on artificial intelligence, electronic equipment and medium |
CN114064855A (en) * | 2021-11-10 | 2022-02-18 | 国电南瑞南京控制系统有限公司 | Information retrieval method and system based on transformer knowledge base |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117312519A (en) * | 2023-09-28 | 2023-12-29 | 国网北京市电力公司 | Power supply service question-answering method, device, equipment and medium |
CN117290492A (en) * | 2023-11-27 | 2023-12-26 | 深圳市灵智数字科技有限公司 | Knowledge base question-answering method and device, electronic equipment and storage medium |
CN118132735A (en) * | 2024-05-07 | 2024-06-04 | 支付宝(杭州)信息技术有限公司 | Medical rule base generation method and device |
CN118940725A (en) * | 2024-07-01 | 2024-11-12 | 合肥霍因科技有限公司 | A standardized semantic mapping method and system based on artificial intelligence and big data |
CN118940725B (en) * | 2024-07-01 | 2025-02-25 | 合肥霍因科技有限公司 | A standardized semantic mapping method and system based on artificial intelligence and big data |
CN119106666A (en) * | 2024-07-12 | 2024-12-10 | 太极计算机股份有限公司 | Thematic indexing method, device and equipment based on domain knowledge |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109885672B (en) | Question-answering type intelligent retrieval system and method for online education | |
CN110059160B (en) | End-to-end context-based knowledge base question-answering method and device | |
CN116775846A (en) | Domain knowledge question and answer method, system, equipment and medium | |
CN109960786A (en) | Chinese word similarity calculation method based on fusion strategy | |
CN110457708B (en) | Vocabulary mining method and device based on artificial intelligence, server and storage medium | |
CN108595696A (en) | A kind of human-computer interaction intelligent answering method and system based on cloud platform | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN111259127A (en) | Long text answer selection method based on transfer learning sentence vector | |
CN112836027A (en) | Method for determining text similarity, question answering method and question answering system | |
CN112182145B (en) | Text similarity determination method, device, equipment and storage medium | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111143574A (en) | Query and visualization system construction method based on minority culture knowledge graph | |
CN111881264B (en) | A method and electronic device for long text retrieval in open domain question answering tasks | |
CN111858896A (en) | A Knowledge Base Question Answering Method Based on Deep Learning | |
CN111309891A (en) | System for reading robot to automatically ask and answer questions and application method thereof | |
CN113468311B (en) | Knowledge graph-based complex question and answer method, device and storage medium | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN115795018A (en) | Multi-strategy intelligent searching question-answering method and system for power grid field | |
CN118277509A (en) | Knowledge graph-based data set retrieval method | |
CN112989813A (en) | Scientific and technological resource relation extraction method and device based on pre-training language model | |
CN116541535A (en) | Automatic knowledge graph construction method, system, equipment and medium | |
CN110728135A (en) | Text theme indexing method and device, electronic equipment and computer storage medium | |
CN118797005A (en) | Intelligent question-answering method, device, electronic device, storage medium and product | |
CN119088915A (en) | Integrated retrieval dialogue method, system and medium based on large model knowledge base | |
CN116401344A (en) | Method and device for searching table according to question |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 