CN111259127A

CN111259127A - Long text answer selection method based on transfer learning sentence vector

Info

Publication number: CN111259127A
Application number: CN202010043764.4A
Authority: CN
Inventors: 张引; 王炜
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-09
Anticipated expiration: 2040-01-15
Also published as: CN111259127B

Abstract

The invention discloses a long text answer selection method based on the transfer learning sentence vector. A two-stage method is used to construct a transfer learning sentence vector network and a training prediction network. The transfer learning sentence vector network includes a twin network structure, an attention aggregation structure and a Classification layer; training prediction network includes Siamese network structure and distance metric layer. First of all, the present invention does not need to perform word segmentation on the text sequence of the data set, and directly uses the complete question answer sentence as the input, thereby avoiding the error propagation caused by the word segmentation tool. Second, the training prediction network in the second stage has a simple structure and high computational efficiency. Finally, the transfer learning method is introduced combined with the twin network structure and attention mechanism to obtain sentence vector model weights with more similar semantics, and the second-stage training prediction network provides sentence-level semantic vectors, which is better than traditional methods and ordinary deep learning. The network method has better effect, especially for long text data, its effect is more prominent.

Description

A long text answer selection method based on transfer learning sentence vector

技术领域technical field

本发明涉及自然语言处理、深度学习中的预训练语言模型、注意力机制。具体为一种基于迁移学习句向量的长文本答案选择方法。The present invention relates to natural language processing, pre-training language model in deep learning, and attention mechanism. Specifically, it is a long text answer selection method based on transfer learning sentence vectors.

背景技术Background technique

互联网在这些年高速发展，各种信息平台以“井喷”的方式暴增。据Hootsuite网站和wearesocial两个网站不完全统计，截至2019年，世界上的网民数量已经突破3.5亿人，而且全球45％的人口都是社交媒体的使用者。数据显示，从2018年到2019年网络用户新增了439万人次，而且社交媒体用户在这一年中增长了348万人次。大量的数据显示，目前全球网络已经到达了一个非常发达的盛世，随着网络带来的是无数的互联网知识信息。大量的网站承载着网络信息充斥在互联网环境中，带来的问题是如何有效的搜索利用，因此搜索引擎的存在就显得非常重要。现在计算机的存储和计算速度已经迎来了黄金时代，以前计算机的计算能力和存储能力成为阻碍搜索引擎发展的绊脚石，这些问题随着高性能计算，高性能存储的到来，如何高效精准搜索到最相关的检索结果成为搜索引擎的研究重点。With the rapid development of the Internet in recent years, various information platforms have exploded in a "blowout" manner. According to incomplete statistics from Hootsuite and wearesocial, as of 2019, the number of Internet users in the world has exceeded 350 million, and 45% of the world's population are users of social media. The data shows that from 2018 to 2019, there were 4.39 million new online users, and social media users increased by 3.48 million during the year. A large amount of data shows that the current global network has reached a very developed and prosperous age, and with the network brings countless Internet knowledge and information. A large number of websites carry network information and are flooded in the Internet environment, which brings about the problem of how to search and utilize effectively. Therefore, the existence of search engines is very important. Now the storage and computing speed of computers have ushered in a golden age. In the past, the computing power and storage capacity of computers had become a stumbling block to the development of search engines. With the advent of high-performance computing and high-performance storage, how to efficiently and accurately search for the most Relevant search results have become the focus of search engine research.

针对这个研究重点，我们必须攻克的是精准检索海量文档中最相关的信息这一难题。纵观搜索发展历史，第一代搜索引擎Archie 3主要用来搜索分布在各个主机中的文件。当万维网出现后，出现了EINet Galaxy(Tradewave Galaxy)4，功能相当于是最早的门户网站。中间经过历代搜索引擎技术更新，在以百度、谷歌、微软等大型互联网公司主导的Baidu搜索引擎、Google搜索引擎、Bing搜索引擎为主的竞争下，如何精准搜索将仍是今后一直持续的研究热点。随着人工智能浪潮兴起，机器学习，深度学习方法为图像识别、自然语言处理、语音识别处理等领域带来了新的解决思路。面对搜索引擎检索召回的结果不理想的现状，很多检索结果需要搜索人进行二次筛选过滤，因此自动问答技术应运而生。For this research focus, we must overcome the problem of accurately retrieving the most relevant information in massive documents. Throughout the history of search development, the first-generation search engine Archie 3 was mainly used to search for files distributed in various hosts. When the World Wide Web appeared, EINet Galaxy (Tradewave Galaxy) 4 appeared, which functioned as the earliest portal website. After successive generations of search engine technology updates, under the competition of Baidu, Google, and Bing search engines dominated by large Internet companies such as Baidu, Google, and Microsoft, how to search accurately will continue to be a research hotspot in the future. . With the rise of artificial intelligence, machine learning and deep learning methods have brought new solutions to the fields of image recognition, natural language processing, and speech recognition processing. Faced with the unsatisfactory status of search engine retrieval and recall results, many retrieval results require secondary screening and filtering by searchers, so automatic question-answering technology emerges as the times require.

答案选择技术是自动问答技术中的一个重要步骤，在生活中有着广泛应用，如小米的小爱同学、Iphone的Siri、微软的小冰以及百度度秘等等都是自动问答技术的实际落地产物。在任务型自动问答领域，自动问答技术成就的机器人助手能够极大的解放双手，只用语音命令即可控制完成一系列任务。在闲聊型自动问答领域，闲聊机器人可以为乏味枯燥的生活中增添一丝人生乐趣。在现代医学领域，自动问答技术可以为医生患者构建更加方便高效的沟通方式。因此，如何改进自动问答领域的问答精准性变得尤为重要，针对检索型自动问答领域，在其中扮演着非常重要角色的答案选择技术，在上文中介绍的搜索引擎中同样占据着非常重要的角色。Answer selection technology is an important step in automatic question answering technology, which is widely used in life, such as Xiaomi's Xiao Ai, Iphone's Siri, Microsoft's Xiaobing and Baidu Dubi, etc. are the actual products of automatic question answering technology. . In the field of task-based automatic question answering, the robotic assistants achieved by automatic question answering technology can greatly liberate their hands and can control and complete a series of tasks with only voice commands. In the field of chat-type automatic question and answer, chatbots can add a touch of joy to the boring life. In the field of modern medicine, automatic question answering technology can build a more convenient and efficient way of communication for doctors and patients. Therefore, how to improve the accuracy of question answering in the field of automatic question answering has become particularly important. For the field of automatic question answering of retrieval, the answer selection technology that plays a very important role in it also plays a very important role in the search engine introduced above. .

现有的答案选择方法通常使用孪生网络结构，对问题文本和答案文本分别建模，最后通过余弦距离等相似度度量方法分辨问题和答案是否匹配。但是传统方法主要是聚焦在短文本匹配任务上，缺乏对长文本应用场景的研究，难以解决长文本应用领域的“语义迁移”和“语义鸿沟”等问题。而且由于医疗领域的问答数据普遍具有“问题短答案长”的特性，其使用现有的答案选择方法匹配效果和召回精度都无法满足上线需要，因此为了更好地对长文本数据进行答案选择，主要涉及的技术难点如下：Existing answer selection methods usually use a siamese network structure to model the question text and the answer text separately, and finally distinguish whether the question and answer match through similarity measures such as cosine distance. However, traditional methods mainly focus on short text matching tasks, lack of research on long text application scenarios, and it is difficult to solve the problems of "semantic transfer" and "semantic gap" in the field of long text applications. Moreover, since the question-and-answer data in the medical field generally has the characteristics of "short questions and long answers", the matching effect and recall precision of the existing answer selection methods cannot meet the online needs. Therefore, in order to better select answers for long text data, The main technical difficulties involved are as follows:

1.如何设计模型建模长文本序列；1. How to design a model to model long text sequences;

2.如何利用外部知识，引入迁移学习方法提升召回精度；2. How to use external knowledge to introduce transfer learning methods to improve recall accuracy;

3.如何设计评价指标量化模型的效果。3. How to design the evaluation index to quantify the effect of the model.

发明内容SUMMARY OF THE INVENTION

为了解决上述问题，本发明提出了一种基于迁移学习句向量的长文本答案选择方法，使用BERT作为特征提取层建模长文本数据，采取迁移学习+训练预测两阶段任务。首先，问题和答案文本序列作为输入，使用BERT输入格式进行处理，不需要额外分词，避免了分词造成的错误传播。其次，使用迁移学习方法并辅以孪生网络结构和注意力聚合结构，使迁移学习得到的问题和答案句向量更加语义相似。最后，在训练预测过程中使用迁移学习的模型权重参数初始化得到文本的句向量，并简单地通过距离度量方法计算问题和答案句向量的语义相似度，由于简化了训练预测网络结构，获得了更高的召回效率及更低的显存占用，本发明采用两阶段方式，相对于直接使用BERT的[CLS]语义向量，获得了更高的召回精度。In order to solve the above problems, the present invention proposes a long text answer selection method based on transfer learning sentence vectors, uses BERT as a feature extraction layer to model long text data, and adopts a two-stage task of transfer learning + training prediction. First, the question and answer text sequences are used as input and processed using the BERT input format, which does not require additional word segmentation and avoids error propagation caused by word segmentation. Second, using the transfer learning method supplemented by the Siamese network structure and the attention aggregation structure, the question and answer sentence vectors obtained by transfer learning are more semantically similar. Finally, in the training prediction process, the model weight parameters of transfer learning are used to initialize the sentence vector of the text, and the semantic similarity of the question and answer sentence vectors is simply calculated by the distance measurement method. High recall efficiency and lower video memory occupation, the present invention adopts a two-stage method, and obtains higher recall precision compared to the [CLS] semantic vector that directly uses BERT.

为了实现上述目的，本发明采用如下的技术方案：In order to achieve the above object, the present invention adopts the following technical scheme:

一种基于迁移学习句向量的长文本答案选择方法，步骤如下：A long text answer selection method based on transfer learning sentence vector, the steps are as follows:

1)使用XPATH设计爬虫爬取问诊论坛医患问答数据，并做数据清洗；将医患问答数据中的答案作为正样本；针对医患问答数据中的问题，使用Lucene索引工具进行相关性答案的检索召回，将相关性答案作为负样本；根据获得的正样本和负样本构造点式答案选择数据集，并按照27:1～8:1的比例划分迁移学习数据集和训练预测数据集；1) Use XPATH to design a crawler to crawl the doctor-patient question and answer data of the consultation forum, and do data cleaning; use the answers in the doctor-patient question and answer data as a positive sample; for the questions in the doctor-patient question and answer data, use the Lucene index tool to make relevant answers According to the retrieval and recall, the correlation answers are regarded as negative samples; the point answer selection data set is constructed according to the obtained positive samples and negative samples, and the transfer learning data set and the training prediction data set are divided according to the ratio of 27:1 to 8:1;

2)建立迁移学习句向量网络，包括孪生网络结构、注意力聚合结构和分类层，所述孪生网络结构包括成对的输入层、特征提取层、池化层，所述的注意力聚合结构包括注意力层、聚合网络层；所述的特征提取层采用BERT模型，加载全词遮盖权重BERT参数进行初始化，特征提取后取均值池化输出，并依次经过注意力层、聚合网络层对特征进行聚合输出；聚合输出向量同BERT池化输出向量进行拼接并输入到分类层进行二分类输出；2) Establish a transfer learning sentence vector network, including a twin network structure, an attention aggregation structure and a classification layer. The twin network structure includes a paired input layer, a feature extraction layer, and a pooling layer. The attention aggregation structure includes Attention layer and aggregation network layer; the feature extraction layer adopts the BERT model, and loads the full word masking weight BERT parameter for initialization. After feature extraction, the average pooling output is taken, and the features are processed through the attention layer and the aggregation network layer in turn. Aggregated output; the aggregated output vector is spliced with the BERT pooling output vector and input to the classification layer for binary output;

利用步骤1)得到的迁移学习数据集，对迁移学习句向量网络进行训练，采用MRR、Precision@K评价指标方法，将问题和答案是否匹配的二分类值同真实标签进行匹配，选择匹配分数最高的模型对应的网络参数，得到BertAttTL迁移学习句向量模型；Using the transfer learning data set obtained in step 1), the transfer learning sentence vector network is trained, and the MRR and Precision@K evaluation index methods are used to match the binary value of whether the question and the answer match with the real label, and select the highest matching score. The network parameters corresponding to the model are obtained, and the BertAttTL transfer learning sentence vector model is obtained;

3)建立训练预测网络，包括孪生网络结构和距离度量层，所述孪生网络结构包括成对的输入层、特征提取层、池化层；所述的特征提取层采用BERT模型，使用步骤2)得到的BertAttTL迁移学习句向量模型的权重参数对训练预测网络中的BERT模型和池化层参数进行初始化，经池化层输出问题句向量和答案句向量，将两种句向量输入到距离度量层获取语义相似度，依照相似度以阈值进行划分得到是否相似的二分类值作为预测内容输出；利用步骤1)得到的训练预测数据集对训练预测网络进行训练，采用MRR、Precision@K评价指标方法，将最终得到的二分类值同真实标签进行匹配，选取匹配分数最高的模型对应的网络参数，得到训练好的训练预测网络；3) Establish a training prediction network, including a twin network structure and a distance measurement layer, and the twin network structure includes a paired input layer, a feature extraction layer, and a pooling layer; the feature extraction layer adopts the BERT model, using step 2) The obtained weight parameters of the BertAttTL transfer learning sentence vector model initialize the BERT model and pooling layer parameters in the training prediction network, output the question sentence vector and the answer sentence vector through the pooling layer, and input the two sentence vectors to the distance measurement layer. Obtain the semantic similarity, and divide the similarity according to the threshold to obtain the similar binary value as the prediction content output; use the training prediction data set obtained in step 1) to train the training prediction network, and use the MRR and Precision@K evaluation index methods , match the final binary classification value with the real label, select the network parameters corresponding to the model with the highest matching score, and obtain a trained training prediction network;

4)将待处理的问题和答案文本输入步骤3)得到的训练预测网络中，输出所有候选答案的二分类值，得到待处理问题的最终答案。4) Input the question and answer text to be processed into the training prediction network obtained in step 3), output the binary classification values of all candidate answers, and obtain the final answer to the question to be processed.

进一步的，所述的MRR、Precision@K评价指标方法具体为：Further, the described MRR and Precision@K evaluation index methods are specifically:

将迁移学习句向量网络或训练预测网络的输出表示为pred＝[p₁,p₂,...,p_n]，其中p_i表示第i个候选答案的预测值0或1，0表示不相似，1表示相似，n表示样本集中的测试样例的个数；真实标签数据表示为label＝[t₁,t₁,,...,t_n]，其中t_i表示第i个候选答案的真实标签0或1，0表示不相似，1表示相似，n表示样本集中的测试样例的个数；针对一个问题的所有候选答案，通过迁移学习句向量网络或训练预测网络获取二分类值之后进行排序，得到针对第i个问题的正确答案的排名rank_i；Denote the output of the transfer learning sentence vector network or the training prediction network as pred=[p ₁ ,p ₂ ,...,p _n ], where pi represents the predicted value of the _ith candidate answer, 0 or 1, and 0 represents no Similar, 1 represents similarity, n represents the number of test samples in the sample set; the real label data is represented as label=[t ₁ ,t ₁ ,,...,t _n ], where t _i represents the ith candidate answer The true label of 0 or 1, 0 indicates dissimilarity, 1 indicates similarity, and n indicates the number of test samples in the sample set; for all candidate answers of a question, obtain binary classification values through transfer learning sentence vector network or training prediction network Then sort to get the rank _i of the correct answer for the i-th question;

MRR计算公式为：The formula for calculating MRR is:

其中，Q为问题集合，|Q|表示所有问题的数量；Among them, Q is the set of questions, and |Q| represents the number of all questions;

Precision@K计算公式为：The formula for Precision@K is:

其中，Precisiin表示精度，K表示指标中考虑的答案的个数，在本发明中取值为1、2和3，Num(True Answers)表示正确答案的个数，Sum(related K Answers)表示召回的相关答案总个数。Among them, Precisiin represents precision, K represents the number of answers considered in the index, and in the present invention, the values are 1, 2, and 3, Num (True Answers) represents the number of correct answers, and Sum (related K Answers) represents recall The total number of relevant answers.

进一步的，所述的迁移学习句向量网络包括孪生网络结构、注意力聚合结构和分类层，所述孪生网络结构包括成对的输入层、特征提取层、池化层，所述的注意力聚合结构包括注意力层、聚合网络层，注意力层主要是在孪生网络结构中加入了注意力机制，使用问题的上下文来丰富答案文本的语义表征，同时使用答案的上下文来丰富问题文本的语义表征，通过问题和答案的语义交互，其能有效提高匹配效果；聚合网络层主要是在注意力机制后对问题融合表征和答案融合表征通过比较层和聚合层，进一步加深模型对问题和答案的相似部分及不相似部分的特征建模，其能在注意力机制基础上有效提升匹配效果。所述的特征提取层采用BERT进行建模，并且BERT采用全词覆盖的BERT权重参数进行初始化；Further, the transfer learning sentence vector network includes a twin network structure, an attention aggregation structure and a classification layer, and the twin network structure includes a paired input layer, a feature extraction layer, and a pooling layer. The attention aggregation layer. The structure includes an attention layer and an aggregation network layer. The attention layer mainly adds an attention mechanism to the Siamese network structure, uses the context of the question to enrich the semantic representation of the answer text, and uses the context of the answer to enrich the semantic representation of the question text. , through the semantic interaction of questions and answers, it can effectively improve the matching effect; the aggregation network layer mainly fuses the question and answer representations after the attention mechanism. The comparison layer and the aggregation layer further deepen the similarity between the question and the answer. Feature modeling of parts and dissimilar parts, which can effectively improve the matching effect based on the attention mechanism. The feature extraction layer is modeled with BERT, and BERT is initialized with the BERT weight parameter covered by the whole word;

将配对样本输入到孪生网络结构中，成对的输入层对应问题和答案两个文本序列，按照BERT的输入格式[CLS]+Question+[SEP]、[CLS]+Answer+[SEP]分别对问题文本和答案文本进行处理；BERT特征建模后将12层池化层输出取均值，分别得到维度统一的池化输出：问题池化输出Q pool和答案池化输出A pool，维度长度为768维；Input the paired samples into the Siamese network structure. The paired input layers correspond to two text sequences of question and answer. According to the input format of BERT [CLS]+Question+[SEP], [CLS]+Answer+[SEP], the question text is respectively And the answer text is processed; after the BERT feature modeling, the output of the 12-layer pooling layer is averaged, and the pooled outputs with unified dimensions are obtained respectively: the question pooling output Q pool and the answer pooling output A pool, the dimension length is 768 dimensions;

将问题池化输出Q pool和答案池化输出A pool输入到注意力层，通过注意力机制分别得到问题语义对齐向量Z₂和答案语义对齐向量Z₂'；将Q pool、A pool、Z₂和Z₂'输入到聚合网络层，针对问题，Q pool和Z₂通过[Q pool,Z₂],[Q pool,Q pool-Z₂],[Q pool,Qpool*Z₂]变换，然后经过一层线性变换拼接得到拼接向量[O₁,O₂,O₃]，所述的拼接向量经过一层线性变换并使用DropOut机制，得到问题注意力聚合输出Fused_Q；同理，针对答案，Apool和Z₂'经过聚合网络层，得到答案注意力聚合输出Fused_A；The question pooling output Q pool and the answer pooling output A pool are input to the attention layer, and the question semantic alignment vector Z ₂ and the answer semantic alignment vector Z ₂ ' are obtained respectively through the attention mechanism; Q pool, A pool, Z ₂ and Z ₂ ' are input to the aggregation network layer, for the problem, Q pool and Z ₂ are transformed by [Q pool, Z ₂ ], [Q pool, Q pool-Z ₂ ], [Q pool, Qpool*Z ₂ ], and then The splicing vector [O ₁ , O ₂ , O ₃ ] is obtained by splicing through a layer of linear transformation. The splicing vector is subjected to a layer of linear transformation and the DropOut mechanism is used to obtain the problem attention aggregation output Fused _Q ; Similarly, for the answer, Apool and Z ₂ ' go through the aggregation network layer to get the answer attention aggregation output Fused _A ;

将Fused_Q、Fused_A、Q pool、A pool进一步拼接得到[Q pool,A pool,|Q pool-Apool|,Q pool*A pool,Fused Q,Fused A]，将拼接向量输入到分类层，通过Softmax分类得到预测输出pred＝[p₁,p₂,...,p_n]，其中p_i表示第i个候选答案的预测值0或1，0表示不相似，1表示相似，n表示样本集中的测试样例的个数。Further splicing Fused _Q , Fused _A , Q pool, and A pool to obtain [Q pool,A pool,|Q pool-Apool|,Q pool*A pool,Fused Q,Fused A], and input the splicing vector into the classification layer, The predicted output pred=[p ₁ ,p ₂ ,...,p _n ] is obtained by Softmax classification, where p _i represents the predicted value of the ith candidate answer, 0 or 1, 0 means dissimilar, 1 means similar, and n means The number of test examples in the sample set.

进一步的，所述的步骤3)中语义相似度的计算方法采用余弦距离、曼哈顿距离、欧拉度量、点乘度量中的任意一种。Further, the method for calculating the semantic similarity in the step 3) adopts any one of cosine distance, Manhattan distance, Euler metric, and dot product metric.

本发明具备的有益效果：The beneficial effects that the present invention has:

(1)本发明使用自然语言处理技术中的预训练语言模型BERT来获得长文本数据的字表征，不需要额外的数据分词阶段，避免了分词工具造成的分词不准确问题，从而避免了因分词不准确造成的语义错误传播问题；(1) The present invention uses the pre-trained language model BERT in the natural language processing technology to obtain the word representation of the long text data, does not require an additional data word segmentation stage, avoids the inaccurate word segmentation problem caused by the word segmentation tool, and thus avoids word segmentation due to word segmentation. Semantic error propagation problems caused by inaccuracy;

(2)设计了两阶段方法，第一阶段有效使用迁移学习方法利用到大规模平行语料知识，第二阶段使用简单的训练预测网络具有更高的模型推理效率，综合两阶段任务，具有更高的答案选择召回精度；(2) A two-stage method is designed. In the first stage, the transfer learning method is effectively used to utilize the knowledge of large-scale parallel corpus. In the second stage, a simple training prediction network is used to have higher model inference efficiency. The answer chooses the recall precision;

(3)针对大批量答案搜索场景，本发明提出的直接获得全部文本序列句向量的方式，能有效避免预训练语言模型的多个文本对之间的耗时计算，效率更高。例如：预训练语言模型计算同一个问题和m个答案的匹配分数时，每次都需要将问题和其中一个答案配对送入模型中去进行计算，因此问题被重复编码了m次，问题和答案总共编码了2*m次，这在大规模搜索场景下，m值非常大，所带来的的额外时间开销是非常巨大的，而本发明仅需获得问题和所有答案的句向量，仅需要编码问题一次和答案m次共m+1次，相比于2*m次编码工作，本发明的方法减少了将近一半的编码时间，因此效率更高；(3) For a large-scale answer search scenario, the method of directly obtaining all text sequence sentence vectors proposed by the present invention can effectively avoid time-consuming computation among multiple text pairs of the pre-trained language model, and is more efficient. For example: when the pre-trained language model calculates the matching score of the same question and m answers, the question and one of the answers need to be paired into the model for calculation each time, so the question is repeatedly encoded m times, the question and the answer A total of 2*m times are encoded. In the large-scale search scenario, the value of m is very large, and the extra time overhead brought by it is very huge. However, the present invention only needs to obtain the sentence vectors of the question and all the answers, and only needs The coding question is once and the answer is m times a total of m+1 times. Compared with 2*m times of coding work, the method of the present invention reduces the coding time by nearly half, so the efficiency is higher;

(4)本发明采取预训练语言模型BERT作为特征提取器，能有效对长文本数据进行语义建模，避免了现有答案选择方法在长文本数据上的“语义迁移”和“语义鸿沟”现象。(4) The present invention adopts the pre-trained language model BERT as the feature extractor, which can effectively carry out semantic modeling on long text data, and avoids the phenomenon of "semantic migration" and "semantic gap" in the long text data of the existing answer selection methods .

附图说明Description of drawings

图1为基于迁移学习句向量的长文本答案选择方法的迁移学习模型结构图；Fig. 1 is the transfer learning model structure diagram of the long text answer selection method based on the transfer learning sentence vector;

图2为基于迁移学习句向量的长文本答案选择方法的训练预测模型结构图。Fig. 2 is the structure diagram of the training prediction model of the long text answer selection method based on the transfer learning sentence vector.

具体实施方式Detailed ways

以下结合具体实例对本发明做详细说明。The present invention will be described in detail below with reference to specific examples.

由于医疗领域的问答数据普遍具有“问题短答案长”的特性，其使用现有的答案选择方法匹配效果和召回精度都无法满足上线需要，因此本发明提出的基于迁移学习句向量的长文本答案选择方法经实验验证，能够有效处理长文本答案选择问题。Since the question and answer data in the medical field generally has the characteristics of "short questions and long answers", the matching effect and recall accuracy of the existing answer selection methods cannot meet the online needs. Therefore, the long text answer based on the transfer learning sentence vector proposed by the present invention The selection method has been experimentally verified and can effectively deal with long text answer selection problems.

如图1所示，本发明提出的一种基于迁移学习句向量的长文本答案选择方法，所采用的迁移学习句向量网络包括包括输入层、特征提取层、注意力聚合网络层、分类层，所述的特征提取层采用BERT进行建模，并且BERT采用全词覆盖的BERT权重参数进行初始化；As shown in Figure 1, a method for selecting long text answers based on transfer learning sentence vectors proposed by the present invention, the adopted transfer learning sentence vector network includes an input layer, a feature extraction layer, an attention aggregation network layer, and a classification layer, The feature extraction layer is modeled with BERT, and BERT is initialized with the BERT weight parameter covered by the whole word;

输入层对应问题和答案两个文本序列，按照BERT的输入格式[CLS]+Question+[SEP],[CLS]+Answer+[SEP]对两个文本进行处理。BERT特征建模后将12层池化层输出取均值得到维度统一的池化输出，维度长度为768维；注意力聚合网络层将两个文本序列通过注意力机制得到语义对齐输出，对齐向量Z2和池化输出Z1通过[Z1,Z2],[Z1,Z1-Z2],[Z1,Z1*Z2]变换，并分别经过一层线性变换拼接得到[O1,O2,O3],拼接向量经过一层线性变换并使用DropOut机制，得到问题注意力聚合输出FusedQ和答案注意力聚合输出FusedA，将这两者同池化输出拼接得到[Q pool,A pool,|Q pool-A pool|,Q pool*A pool,Fused Q,FusedA]，将其通过Softmax分类得到预测输出，经过迁移学习句向量网络训练，得到语义更加相似的句向量。The input layer corresponds to two text sequences of question and answer, and processes the two texts according to the BERT input format [CLS]+Question+[SEP], [CLS]+Answer+[SEP]. After the BERT feature modeling, the output of the 12-layer pooling layer is averaged to obtain a pooled output with a unified dimension, and the dimension length is 768 dimensions; the attention aggregation network layer obtains the semantic alignment output of the two text sequences through the attention mechanism, and the alignment vector Z2 And the pooled output Z1 is transformed by [Z1, Z2], [Z1, Z1-Z2], [Z1, Z1*Z2], and is spliced through a layer of linear transformation to obtain [O1, O2, O3], and the splicing vector goes through a Layer linear transformation and use the DropOut mechanism to obtain the question attention aggregation output FusedQ and the answer attention aggregation output FusedA, and splicing the two pooled outputs to obtain [Q pool,A pool,|Q pool-A pool|,Q pool *A pool, Fused Q, FusedA], classify it through Softmax to get the predicted output, and after the transfer learning sentence vector network training, get the sentence vector with more similar semantics.

如图2所示，本发明提出的一种基于迁移学习句向量的长文本答案选择方法，所采用的训练预测网络包括的训练预测网络包括输入层、特征提取层，距离度量层，所述的特征提取层采用BERT，并用步骤3)训练好的迁移学习权重参数进行初始化；As shown in Figure 2, a method for selecting long text answers based on transfer learning sentence vectors proposed by the present invention, the training prediction network included in the adopted training prediction network includes an input layer, a feature extraction layer, and a distance measurement layer. The feature extraction layer adopts BERT, and is initialized with the transfer learning weight parameters trained in step 3);

输入层对应问题和答案两个文本序列，按照BERT的输入格式[CLS]+Question+[SEP],[CLS]+Answer+[SEP]对两个文本进行处理。BERT特征建模后将12层池化层输出取均值得到维度统一的池化输出，维度长度为768维；使用步骤3)训练的迁移学习权重参数进行初始化，得到了语义更加相似的句向量，采用余弦距离、曼哈顿距离、欧拉度量、点乘度量计算两个句向量的相似度，使用阈值对相似度进行分割得到是否相似的二分类值。The input layer corresponds to two text sequences of question and answer, and processes the two texts according to the BERT input format [CLS]+Question+[SEP], [CLS]+Answer+[SEP]. After the BERT feature modeling, the output of the 12-layer pooling layer is averaged to obtain a pooled output with a unified dimension, and the dimension length is 768 dimensions; use the transfer learning weight parameters trained in step 3) to initialize, and obtain sentence vectors with more similar semantics. The cosine distance, Manhattan distance, Euler metric, and dot product metric are used to calculate the similarity of two sentence vectors, and a threshold is used to segment the similarity to obtain the binary value of whether they are similar.

在本发明的一个具体实施实例中，采用上述迁移学习句向量网络和训练预测网络对长文本问答数据进行答案选择，步骤如下：In a specific implementation example of the present invention, the above-mentioned transfer learning sentence vector network and training prediction network are used to select an answer on the long text question and answer data, and the steps are as follows:

步骤一、通过Python和XPATH构建爬虫框架，对三九健康网等医疗问诊平台抓取医患问答数据，采取一定规则方法去除文本之外的网页标签，如<div>等，对数据进行去重，经过处理，最终得到约575万条医患问答数据，按(问题，病情描述，病情回答)三元组的形式入库存储。Step 1. Build a crawler framework through Python and XPATH, grab the doctor-patient question-and-answer data from medical consultation platforms such as Sanjiu Health.com, and take certain rules and methods to remove webpage tags other than text, such as <div>, etc., to remove the data. After processing, about 5.75 million doctor-patient question and answer data are finally obtained, which are stored in the database in the form of (question, condition description, condition answer) triplet.

步骤二、入库的病情回答为正确答案，使用Lucene工具对该问题进行相关性答案召回，召回500条按相关度排序的负样本答案集合，在第1至5条负样本中抽取一条、第5至50条负样本中抽取一条、第50至100条负样本中抽取一条和第100至500条负样本中抽取一条。对于召回的相关负样本答案少于100条的样本，候选答案集合构造中减少最后第100至500间的采样。4354417条全量数据集按话题类别抽样出小样本数据集作为训练预测数据集，包含120000条训练集、20000条验证集以及20000条测试集，按总量的8:1取标注数据作为迁移学习数据集，其中迁移学习数据集不与训练预测数据集有交叉部分。Step 2. The answer of the condition in the warehouse is the correct answer. Use the Lucene tool to recall the relevant answer to the question, recall 500 negative sample answer sets sorted by relevance, and select one negative sample from the 1st to 5th negative samples. One out of 5 to 50 negative samples, one out of 50 to 100 negative samples, and one out of 100 to 500 negative samples. For the recalled samples with less than 100 relevant negative sample answers, the final sampling between the 100th and 500th is reduced in the construction of the candidate answer set. 4,354,417 full data sets are sampled according to topic categories, and small sample data sets are used as training prediction data sets, including 120,000 training sets, 20,000 validation sets and 20,000 test sets. The labeled data is taken as the transfer learning data at 8:1 of the total. set, where the transfer learning dataset does not intersect with the training prediction dataset.

在本发明的一个具体实施例中，语料格式如下：In a specific embodiment of the present invention, the corpus format is as follows:

其中Question表示问题文本，Answer表示答案文本。Where Question represents the question text and Answer represents the answer text.

步骤三、使用Pytorch搭建迁移学习句向量网络，使用全词遮盖BERT权重参数进行初始化，网络包括输入层、特征提取层、注意力聚合网络层、分类层，在步骤二得到的迁移学习数据集上进行训练预测，最终得到语义向量更加相似的句向量模型权重文件。Step 3. Use Pytorch to build a transfer learning sentence vector network, and use the whole word to cover the BERT weight parameters for initialization. The network includes an input layer, a feature extraction layer, an attention aggregation network layer, and a classification layer. On the transfer learning data set obtained in step 2 Carry out training prediction, and finally get the weight file of sentence vector model with more similar semantic vectors.

迁移学习句向量网络训练的损失函数采用交叉熵损失：The loss function trained by the transfer learning sentence vector network adopts the cross entropy loss:

loss＝-y*logy′loss=-y*logy'

其中y表示问题答案是否匹配的真实标签，y′为样例数据是否匹配的模型预测向量。where y represents the true label of whether the answer to the question matches, and y′ is the model prediction vector of whether the sample data matches.

在测试集中，针对一个问题q和3个答案[a₁,a₂,a₃]对于预测向量pred＝[0.71,0.68.0.35]和真实标签label＝[0,1,0]，依照如下MRR计算公式，|Q|＝1，以阈值0.5作为划分预测结果得到pred＝[1,1,0]，根据真实标签可知正确答案标签预测正确，同时对答案按照预测概率进行排序得知第二条答案预测概率最高排在第二位，即rank_i＝2，则MRR＝1/2＝0.5。依照Precision-K计算公式，K取1、2和3，可知当k＝1时，Num(True Answers)＝0,则Precision@1＝0；当k＝2时，Num(True Answers)＝1，Sum(related K Answers)＝2，则Precision@2＝0.5；当k＝3时，Num(True Answers)＝1，Sum(related K Answers)＝3，则Precision@3＝1/3＝0.33。本实例仅针对一个问题和多个答案进行解释，测试集中存在多个问题，需要将最后结果指标按照问题数量取均值计算即可。In the test set, for a question q and 3 answers [a ₁ , a ₂ , a ₃ ] for the prediction vector pred=[0.71, 0.68.0.35] and the true label label=[0, 1, 0], according to the following MRR The calculation formula, |Q|=1, takes the threshold value of 0.5 as the division prediction result to get pred=[1,1,0], according to the real label, we can see that the correct answer label is correctly predicted, and at the same time, the answers are sorted according to the predicted probability to get the second The answer prediction probability is the second highest, that is, rank _i = 2, then MRR = 1/2 = 0.5. According to the calculation formula of Precision-K, K takes 1, 2 and 3. It can be known that when k=1, Num(True Answers)=0, then Precision@1=0; when k=2, Num(True Answers)=1 , Sum(related K Answers)=2, then Precision@2=0.5; when k=3, Num(True Answers)=1, Sum(related K Answers)=3, then Precision@3=1/3=0.33 . This example only explains one question and multiple answers. There are multiple questions in the test set, and the final result indicator needs to be calculated by taking the average of the number of questions.

步骤四、使用Pytorch搭建训练预测网络，使用步骤三迁移学习句向量网络权重模型进行初始化，包括输入层、特征提取层，距离度量层，在步骤二得到的训练预测小样本数据集上进行训练预测。Step 4. Use Pytorch to build a training prediction network, use the step 3 transfer learning sentence vector network weight model to initialize, including the input layer, feature extraction layer, distance measurement layer, and perform training prediction on the training prediction small sample data set obtained in step 2. .

训练预测网络的损失函数采用均方误差损失：The loss function for training the prediction network uses the mean squared error loss:

loss＝(y-y＇)² loss=(yy') ²

得到问题句向量和答案句向量之后，使用余弦相似度分类器计算两个句向量的语义相似度，公式如下，例如:问题句向量为[1,1,0,0,1],答案句向量为[0,1,1,0,0]，则依照余弦相似度公式计算相似度为

针对测试集中所有样例得出pred预测结果，同时与真实标签label进行比较，按照MRR和Precision@K(k取值1,2,3)计算公式得到测试集上的指标。After obtaining the question sentence vector and the answer sentence vector, use the cosine similarity classifier to calculate the semantic similarity of the two sentence vectors, the formula is as follows, for example: the question sentence vector is [1,1,0,0,1], the answer sentence vector is [0,1,1,0,0], then the similarity calculated according to the cosine similarity formula is

The pred prediction results are obtained for all samples in the test set, and compared with the real label label, and the indicators on the test set are obtained according to the calculation formula of MRR and Precision@K (k value is 1, 2, 3).

步骤五、使用步骤四训练好的模型在测试集数据上进行推理，最终得到的预测值按阈值进行分割，即可获得问题答案是否语义相似。Step 5: Use the model trained in Step 4 to perform inference on the test set data, and the final predicted value is divided according to the threshold to obtain whether the answer to the question is semantically similar.

与现有的技术相比，首先，本发明不需要对数据集文本序列进行分词，直接以完整的问题答案句子作为输入，避免了分词工具造成的错误传播。其次，第二阶段的训练预测网络结构简单，计算效率高。最后，引入迁移学习方法结合孪生网络结构及注意力机制获得语义更加相似的句向量模型权重，对第二阶段的训练预测网络提供了代表句子级的语义向量，获得了比传统方法及普通深度学习网络方法更好的效果，尤其对于长文本数据，其效果更突出。为了客观地评价本发明的模型的性能，将本发明的模型与其他模型进行比较，对比模型包括Siamese RNN、QACNN、DEATT、Cam、Seq Match Seq、ESIM。本实施例采用的评价指标为MRR、Precision@1、Precision@2、Precision@3。这些指标用于评价问题和召回答案之间的相似度。数值越大，效果越好。如表1所示，本发明综合两阶段任务，具有更高的答案选择召回精度，本发明的模型效果优于所有的对比模型。如表2所示，本发明与预训练语言模型BERT相比，推理阶段的耗时只有0.5秒，效率高。Compared with the prior art, firstly, the present invention does not need to perform word segmentation on the text sequence of the data set, and directly uses the complete question answer sentence as the input, thereby avoiding the error propagation caused by the word segmentation tool. Second, the training prediction network in the second stage has a simple structure and high computational efficiency. Finally, the transfer learning method is introduced combined with the twin network structure and attention mechanism to obtain sentence vector model weights with more similar semantics, and the second-stage training prediction network provides sentence-level semantic vectors, which is better than traditional methods and ordinary deep learning. The network method has better effect, especially for long text data, its effect is more prominent. In order to objectively evaluate the performance of the model of the present invention, the model of the present invention is compared with other models, and the comparative models include Siamese RNN, QACNN, DEATT, Cam, Seq Match Seq, and ESIM. The evaluation indicators used in this embodiment are MRR, Precision@1, Precision@2, and Precision@3. These metrics are used to evaluate the similarity between questions and recalled answers. The larger the value, the better the effect. As shown in Table 1, the present invention integrates two-stage tasks and has higher recall precision for answer selection, and the model effect of the present invention is better than all comparison models. As shown in Table 2, compared with the pre-trained language model BERT, the time-consuming of the inference stage is only 0.5 seconds, and the efficiency is high.

表1对比实验的召回精度结果Table 1. Recall precision results of comparative experiments

模型Model MRRMRR Precision@1Precision@1 Precision@2Precision@2 Precision@3Precision@3 Siamese RNNSiamese RNN 0.5717690.571769 0.3111370.311137 0.5804830.580483 0.8334330.833433 QACNNQACNN 0.6128440.612844 0.3633270.363327 0.6504700.650470 0.8732250.873225 DEATTDEATT 0.5259450.525945 0.2583480.258348 0.5080980.508098 0.7450510.745051 CamCam 0.6363390.636339 0.4159170.415917 0.6564690.656469 0.8276340.827634 Seq Match SeqSeq Match Seq 0.6313400.631340 0.4075180.407518 0.6510700.651070 0.8288340.828834 ESIMESIM 0.5235290.523529 0.2547490.254749 0.5052990.505299 0.7432510.743251 本发明this invention 0.7391360.739136 0.5434910.543491 0.8186360.818636 0.9714060.971406

表2本发明和预训练语言模型的计算耗时结果对比Table 2 Comparison of the calculation time-consuming results of the present invention and the pre-trained language model

模型Model 推理阶段耗时(答案数量m＝4)Time-consuming inference stage (number of answers m=4) 预训练语言模型BERTPre-trained language model BERT 4.5秒4.5 seconds 本发明this invention 0.5秒0.5 seconds

以上实施例仅表达了本发明的一种具体实施方式，其描述较为具体和详细，但并不能因此而理解为对本发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进，这些都属于本发明的保护范围。因此，本发明专利的保护范围应以所附权利要求为准。The above embodiment only expresses a specific embodiment of the present invention, and its description is relatively specific and detailed, but it should not be construed as a limitation on the patent scope of the present invention. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of the present invention, several modifications and improvements can also be made, which all belong to the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention should be subject to the appended claims.

Claims

1. a long text answer selection method based on transfer learning sentence vector, is characterized in that, step is as follows:

1) Obtain authoritative doctor-patient question and answer data, and use the answers in the doctor-patient question and answer data as positive samples; for the questions in the doctor-patient question and answer data, use the Lucene index tool to retrieve and recall relevant answers, and use the relevant answers as negative samples; Construct the answer selection data set according to the obtained positive samples and negative samples, and divide the transfer learning data set and the training prediction data set according to the ratio of 27:1 to 8:1;

2) Establish a transfer learning sentence vector network, including a twin network structure, an attention aggregation structure and a classification layer. The twin network structure includes a paired input layer, a feature extraction layer, and a pooling layer. The attention aggregation structure includes Attention layer and aggregation network layer; the feature extraction layer adopts the BERT model, and loads the full word masking weight BERT parameter for initialization. After feature extraction, the average pooling output is taken, and the features are processed through the attention layer and the aggregation network layer in turn. Aggregated output; the aggregated output vector is spliced with the BERT pooled output vector and input to the classification layer for binary output;

Using the transfer learning data set obtained in step 1), the transfer learning sentence vector network is trained, and the MRR and Precision@K evaluation index methods are used to match the binary value of whether the question and the answer match with the real label, and select the highest matching score. The network parameters corresponding to the model are obtained, and the BertAttTL transfer learning sentence vector model is obtained;

3) Establish a training prediction network, including a twin network structure and a distance measurement layer, and the twin network structure includes a paired input layer, a feature extraction layer, and a pooling layer; the feature extraction layer adopts the BERT model, using step 2) The obtained weight parameters of the BertAttTL transfer learning sentence vector model initialize the BERT model and pooling layer parameters in the training prediction network, output the question sentence vector and the answer sentence vector through the pooling layer, and input the two sentence vectors to the distance measurement layer. Obtain the semantic similarity, and divide the similarity according to the threshold to obtain the similar binary value as the prediction content output; use the training prediction data set obtained in step 1) to train the training prediction network, and use the MRR and Precision@K evaluation index methods , match the final binary classification value with the real label, select the network parameters corresponding to the model with the highest matching score, and obtain a trained training prediction network;

4) Input the question and answer text to be processed into the training prediction network obtained in step 3), output the binary classification values of all candidate answers, and obtain the final answer to the question to be processed.

2. a kind of long text answer selection method based on transfer learning sentence vector as claimed in claim 1, is characterized in that, described MRR, Precision@K evaluation index method is specifically:

Denote the output of the transfer learning sentence vector network or the training prediction network as pred=[p ₁ ,p ₂ ,...,p _n ], where pi represents the predicted value of the _ith candidate answer, 0 or 1, and 0 represents no Similar, 1 represents similarity, n represents the number of test samples in the sample set; the real label data is represented as label=[t ₁ ,t ₂ ,,...,t _n ], where t _i represents the ith candidate answer The true label of 0 or 1, 0 means dissimilar, 1 means similar; for all candidate answers of a question, after obtaining the binary classification value through the transfer learning sentence vector network or training the prediction network, sort them, and get the correct answer for the i-th question. the answer's rank rank _i ;

The formula for calculating MRR is:

Among them, Q is the set of questions, and |Q| represents the number of all questions;

The formula for Precision@K is:

Among them, Precision represents precision, K represents the number of answers considered in the index, and in the present invention, the values are 1, 2, and 3, Num (True Answers) represents the number of correct answers, and Sum (related K Answers) represents recall The total number of relevant answers.

3. a kind of long text answer selection method based on transfer learning sentence vector as claimed in claim 1 is characterized in that, described transfer learning sentence vector network comprises twin network structure, attention aggregation structure and classification layer, described The Siamese network structure includes a paired input layer, a feature extraction layer, and a pooling layer. The attention aggregation structure includes an attention layer and an aggregation network layer; the feature extraction layer is modeled by BERT, and BERT uses a full The BERT weight parameters covered by the word are initialized;

Input the paired samples into the Siamese network structure. The paired input layers correspond to two text sequences of question and answer. According to the input format of BERT [CLS]+Question+[SEP], [CLS]+Answer+[SEP], the question text is respectively And the answer text is processed; after the BERT feature modeling, the output of the 12-layer pooling layer is averaged, and the pooled outputs with unified dimensions are obtained respectively: the question pooling output Q pool and the answer pooling output A pool, the dimension length is 768 dimensions;

The question pooling output Q pool and the answer pooling output A pool are input to the attention layer, and the question semantic alignment vector Z ₂ and the answer semantic alignment vector Z ₂ ' are obtained respectively through the attention mechanism; Q pool, A pool, Z ₂ And Z ₂ ' is input to the aggregation network layer, for the problem, Q pool and Z ₂ are transformed by [Q pool, Z ₂ ], [Q pool, Q pool-Z ₂ ], [Q pool, Q pool*Z ₂ ], Then, the splicing vector [O ₁ , O ₂ , O ₃ ] is obtained by splicing through a layer of linear transformation. The splicing vector is subjected to a layer of linear transformation and the DropOut mechanism is used to obtain the problem attention aggregation output Fused _Q ; Similarly, for the answer , A pool and Z ₂ ' go through the aggregation network layer to get the answer attention aggregation output Fused _A ;

Further splicing Fused _Q , Fused _A , Q pool, and A pool to obtain [Q pool,A pool,|Q pool-A pool|,Q pool*A pool,Fused Q,Fused A], and input the splicing vector into the classification layer , the predicted output pred=[p ₁ ,p ₂ ,...,p _n ] is obtained by Softmax classification, where p _i represents the predicted value of the ith candidate answer 0 or 1, 0 means dissimilar, 1 means similar, n Indicates the number of test samples in the sample set.

4. a kind of long text answer selection method based on transfer learning sentence vector as claimed in claim 1, is characterized in that, in described step 3), the calculation method of semantic similarity adopts cosine distance, Manhattan distance, Euler metric , any of the point product metrics.