CN111552773A - Method and system for finding key sentences of yes/no questions in reading comprehension tasks - Google Patents

Method and system for finding key sentences of yes/no questions in reading comprehension tasks
- Publication number
- CN111552773A
- Authority
- CN
- China
- Prior art keywords
- question
- word
- sentences
- sentence
- key
- Prior art date: 2020-04-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a method and system for finding the key sentences of yes/no questions in reading comprehension tasks, including: selecting existing reading comprehension question-answering data and preprocessing it to obtain a data set; mining, with an encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the data set to obtain a word-embedding representation of each word; building an algorithm model in which the questions and passage paragraphs encoded by that network are processed by a neural network model together with TF-IDF computation to obtain the key sentences of yes/no questions; and feeding the reading comprehension data to be processed into the trained algorithm model to predict those key sentences. The invention can provide more key-sentence support; by combining a bidirectional gated recurrent network with TF-IDF to compute the weights of candidate key sentences, it improves both the efficiency and the accuracy of answering yes/no questions.
Description
Technical Field

The invention belongs to the technical field of natural language data processing, and in particular relates to a method and system for finding the key sentences of yes/no questions in reading comprehension tasks.
Background Art

With the explosive growth of online information, all kinds of information flood the network environment. People have become accustomed to searching the Internet for solutions to their problems, but when users are not familiar with search techniques they often spend a great deal of time filtering the results returned by search engines. Reading comprehension systems were created to address this information overload: they use natural language processing to analyze a question submitted by a user, retrieve the relevant answer, and return it to the user.

Finding the key sentences of yes/no questions has always been a central problem in reading comprehension tasks. Locating the key sentence of a question is essential to answering it and is also an important criterion for judging the quality of a reading comprehension system. The task combines natural language processing with natural language understanding. How a machine can automatically and correctly find the key sentence of a yes/no question without human involvement, and then answer it correctly, is a difficulty currently shared by reading comprehension researchers; once these problems are solved, reading comprehension question-answering systems can be applied to all aspects of daily life. At present, such systems lack the support of relevant questions, so their content is too limited for current needs. Moreover, the accuracy of existing methods in finding the sentences relevant to a question is too low, so a model cannot locate the key information that matches the question, and traditional methods scale poorly.
Summary of the Invention

Existing rule-based methods achieve low accuracy when searching for the sentences relevant to a question, so the model cannot find the key information that matches the question; traditional methods also scale poorly, which limits the field of key-sentence retrieval. To solve these problems, the present invention proposes a method for finding the key sentences of yes/no questions in reading comprehension tasks, including:

selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and then splitting the data set into a training set, a validation set, and a test set;

mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

building an algorithm model in which the questions and passage paragraphs mined by the encoding-layer network are processed by a neural network model together with TF-IDF computation to obtain the key sentences of yes/no questions;

feeding the reading comprehension question-answering data to be processed into the trained algorithm model, and predicting the key sentences of the yes/no questions in that data.
Preferably, preprocessing the question-answering data includes:

performing multi-scale, multi-granularity small-sample data processing on the question-answering data;

the data processing also covers web pages, URLs, images, and noisy text.
Preferably, splitting the data set into a training set, a validation set, and a test set includes:

segmenting the data set based on semantic understanding, randomly selecting 80% of the samples as the training set, 10% of the samples as the validation set, and the remaining 10% of the samples as the test set;

generating, at the same time, the word vectors required by the model;

the segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected.
Preferably, computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF, includes:

feeding the questions and passage paragraphs mined by the encoding-layer network into the neural network for training;

computing TF-IDF over the questions and passage paragraphs at the same time;

multiplying each word weight obtained from TF-IDF by the final vector representation of each sentence output by the neural network, and then computing cosine similarity;

selecting the sentence with the highest similarity as the key sentence of the yes/no question.

Preferably, multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network includes:

combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by a bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
Preferably, the word embedding is given by:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.
Preferably, the contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network; $h_i$ is the hidden vector output for the $i$-th word; ";" is the concatenation operation; $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text; $\overrightarrow{h_{i-1}}$ is the forward hidden vector of the preceding word; $\overleftarrow{h_{i+1}}$ is the backward hidden vector of the following word; $\overrightarrow{h_i}$ is the forward (left-to-right) vector and $\overleftarrow{h_i}$ the backward (right-to-left) vector.
Preferably, TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
Preferably, after the mined questions and passage paragraphs are fed into the neural network for training while TF-IDF is computed on them, and the most suitable sentence is obtained as the key sentence of the yes/no question, the method further includes:

correcting the predicted results to obtain correct results;

saving the corrected results, training the model with the corrected data, and updating the algorithm model.
Based on the same inventive concept, the present invention also proposes a system for finding the key sentences of yes/no questions in reading comprehension tasks, including:

a data preparation module for selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set;

a word and sentence encoding module for mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

a key sentence matching module for computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF;

a key sentence determination module for feeding the reading comprehension question-answering data to be processed into the trained algorithm model and predicting the key sentences of the yes/no questions in that data.
The beneficial effects of the present invention are as follows:

The invention provides a method and system for finding the key sentences of yes/no questions in reading comprehension tasks, including: selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set; mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs to obtain a word-embedding representation of each word in each sentence; building an algorithm model in which the encoded questions and passage paragraphs are processed by a neural network model together with TF-IDF to obtain the key sentences of yes/no questions; and feeding the reading comprehension data to be processed into the trained model to predict those key sentences. The technical solution of the invention can provide more key-sentence support. Because the invention requires no manual annotation, it is better suited to real-world scenarios and improves the efficiency and accuracy of question-answering systems.

In the technical solution of the invention, the key-sentence finding method is based on a neural network and TF-IDF: the weights of candidate key sentences are computed by combining a bidirectional gated recurrent network with TF-IDF, which further improves answering efficiency and greatly raises the accuracy on yes/no questions. It not only greatly reduces manual annotation but also improves the accuracy of reading comprehension.
Brief Description of the Drawings

Fig. 1 is a flowchart of the method for finding the key sentences of yes/no questions in reading comprehension tasks provided by the present invention;

Fig. 2 is a schematic diagram of the steps of the TF-IDF-based key-sentence finding method of the present invention;

Fig. 3 is a structural diagram of the system for finding the key sentences of yes/no questions in reading comprehension tasks provided by the present invention.
Detailed Description of Embodiments

The object of the present invention is to provide a method, based on a neural network and TF-IDF, for finding the key sentences of yes/no questions in reading comprehension tasks, in which key sentences are located by combining a bidirectional gated recurrent network with TF-IDF. TF-IDF (term frequency-inverse document frequency) is a weighting technique used in information retrieval and data mining; TF stands for term frequency and IDF for inverse document frequency.

The present invention is described in detail below with reference to specific embodiments.
Embodiment 1:

As shown in Fig. 1, the present invention provides a method for finding the key sentences of yes/no questions in reading comprehension tasks, carried out in the following steps:

S1: select existing reading comprehension question-answering data, preprocess the data to obtain a data set, and split the data set into a training set, a validation set, and a test set;

S2: based on the constructed encoding-layer network, mine the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, and obtain a word-embedding representation of each word in each sentence;

S3: build an algorithm model in which the questions and passage paragraphs mined by the encoding-layer network are processed by the neural network model together with TF-IDF to obtain the key sentences of yes/no questions;

S4: feed the reading comprehension question-answering data to be processed into the trained algorithm model, and predict the key sentences of the yes/no questions in that data.
As shown in Fig. 2, the technical solution adopted by the present invention proceeds in the following steps:
Step 1: select existing reading comprehension question-answering data and preprocess it.

The selected reading comprehension question-answering data share the same properties, for example the same data format, the same subject area, and the presence of yes/no questions. Preprocessing performs multi-scale, multi-granularity small-sample data processing on the data, including sentence splitting, word segmentation, normalization, and cleaning, and also covers web pages, URLs, images, and noisy text. Its purpose is to optimize and expand the existing data.
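For illustration only, the following sketch shows what such cleaning and word segmentation could look like in Python; the jieba library is assumed for Chinese word segmentation, and the regular expressions and function names are hypothetical choices rather than anything prescribed by the patent.

```python
import re
import jieba  # assumed Chinese word-segmentation library

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # hypothetical URL pattern
TAG_RE = re.compile(r"<[^>]+>")                # strips leftover HTML tags

def clean_text(text: str) -> str:
    """Remove URLs, HTML remnants, and extra whitespace from raw QA text."""
    text = URL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(passage: str) -> list[str]:
    """Split a passage into sentences on Chinese end-of-sentence punctuation."""
    return [s for s in re.split(r"[。！？]", passage) if s]

def tokenize(sentence: str) -> list[str]:
    """Segment one sentence into words, e.g. '北京是中国的首都' -> ['北京', '是', ...]."""
    return list(jieba.cut(sentence))

passage = clean_text("北京是中国的首都。今年寒假延长至11月。")
tokenized_sentences = [tokenize(s) for s in split_sentences(passage)]
```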
Step 2: split the training and test sets, pre-train the embedding-layer model, and generate the word vectors required by the model.

Further, the data set is segmented based on semantic understanding: 80% of the samples are randomly selected as the training set, 10% of the samples as the validation set, and the remaining 10% as the test set; at the same time, the word vectors required by the model are generated. The segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected. Segmenting the data set based on semantic understanding gives the training and test sets a well-balanced proportion of yes/no information, which greatly improves training speed and convergence.
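A minimal sketch of the 80/10/10 random split, assuming each sample is simply an item in a list (the record structure is an illustrative assumption, not specified by the patent):

```python
import random

def split_dataset(samples: list, seed: int = 42):
    """Randomly split samples into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```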
Step 3: the encoding-layer network mines the semantic information of the questions and of the sentences in the passage paragraphs.

The encoding-layer network uses word embeddings to represent the questions and passage paragraphs:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.
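As one possible realization, word vectors of the form w_i = word2vec(t_i) could be pre-trained with the gensim library; the skip-gram setting and the hyperparameter values below are assumptions made for this sketch, not parameters fixed by the invention.

```python
from gensim.models import Word2Vec

# tokenized_sentences: token lists as produced by the preprocessing sketch
tokenized_sentences = [["北京", "是", "中国", "的", "首都"],
                       ["今年", "寒假", "延长", "至", "11月"]]

# Train a small skip-gram model (assumed hyperparameters).
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100,
               window=5, min_count=1, sg=1)

w_i = w2v.wv["北京"]  # embedding of the word t_i = '北京'
```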
Step 4: improve the model through TF-IDF fusion. While the questions and passage paragraphs are fed into the neural network, TF-IDF is computed on them. TF-IDF controls the weight of each sentence in the question and the passage paragraphs; cosine similarity between the question and the passage paragraphs is then computed, and the most suitable sentence is taken as the key sentence of the yes/no question.

Specifically:

the questions and passage paragraphs mined by the encoding-layer network are fed into the neural network for training;

TF-IDF is computed on the questions and passage paragraphs at the same time;

each word weight obtained from TF-IDF is multiplied by the final vector representation of each sentence output by the neural network, after which cosine similarity is computed;

the sentence with the highest similarity is selected as the key sentence of the yes/no question.

Here, multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network means combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by the bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
The contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network; $h_i$ is the hidden vector output for the $i$-th word; ";" is the concatenation operation; $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text; $\overrightarrow{h_{i-1}}$ is the forward hidden vector of the preceding word; $\overleftarrow{h_{i+1}}$ is the backward hidden vector of the following word; $\overrightarrow{h_i}$ is the forward (left-to-right) vector and $\overleftarrow{h_i}$ the backward (right-to-left) vector.
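A sketch of the bidirectional GRU encoder in PyTorch (an illustrative framework choice; the patent does not name one). The embedding and hidden sizes are assumed, and the output at position i is the concatenation of the forward and backward states produced by the bidirectional layer.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Encode word embeddings w_1..w_n into contextual hidden vectors h_i."""
    def __init__(self, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, seq_len, embed_dim)
        # h: (batch, seq_len, 2 * hidden_dim) -- the forward and backward
        # states concatenated, i.e. h_i = [h_fwd_i ; h_bwd_i]
        h, _ = self.gru(word_embeddings)
        return h

encoder = BiGRUEncoder()
w = torch.randn(1, 6, 100)   # one sentence of 6 word embeddings
h = encoder(w)               # h[0, i] is the hidden vector of word i
```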
TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
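The TF-IDF statistics above can be computed directly from the tokenized sentences. In the sketch below, each sentence is treated as a "document" for the IDF counts, which is an assumption made for this example rather than a detail fixed by the patent.

```python
import math
from collections import Counter

def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """Per-document word weights: tfidf(i, j) = tf(i, j) * idf(i)."""
    D = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)  # sum_k n_kj: total number of words in document j
        weights.append({w: (c / total) * math.log(D / df[w])
                        for w, c in counts.items()})
    return weights
```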
Step 5: correct the predicted results to obtain correct results.

Abnormal generated results are discarded and not used in subsequent updates and upgrades; results are generated for the normal questions.

Step 6: save the corrected results, train the model with the corrected data, and update the algorithm model. Using the corrected algorithm model greatly improves training speed and accuracy.

Step 7: input new reading comprehension question-answering data, and predict the key sentences of the yes/no questions in the data.

First, check whether the properties of the new data match those of the reading comprehension question-answering data selected in Step 1; if they differ, make them as consistent as possible with the Step 1 data during the subsequent preprocessing. Then preprocess the reading comprehension question-answering text.
Embodiment 2

Given a question and a sentence: "Is Beijing the capital of China?" (北京是否是中国的首都？) and "Beijing is the capital of China" (北京是中国的首都).

Word segmentation: <北京><是否><是><中国><的><首都> and <北京><是><中国><的><首都>.

Feed them into the model and obtain the word-embedding representations through the neural network and TF-IDF.

Compute the cosine similarity between the question and the passage sentence.

Return the result: the sentence is a key sentence for the question.
Given a question and a sentence: "Is Beijing the capital of China?" (北京是否是中国的首都？) and "This year's winter vacation has been extended to November" (今年寒假延长至11月).

Word segmentation: <北京><是否><是><中国><的><首都> and <今年><寒假><延长><至><11月>.

Feed them into the model and obtain the word-embedding representations through the neural network and TF-IDF.

Compute the cosine similarity between the question and the passage sentence.

Return the result: the sentence is not a key sentence for the question.
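Tying the sketches above together, the matching in this embodiment might be carried out as follows. Reusing tfidf_weights from the Step 4 sketch and the w2v model from the Step 3 sketch, and substituting raw word vectors for the BiGRU hidden states, are simplifications made for this illustration only.

```python
import numpy as np

def sentence_vector(tokens, weights, w2v):
    """TF-IDF-weighted sum of word vectors as the sentence representation."""
    vecs = [weights.get(t, 0.0) * w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

question = ["北京", "是否", "是", "中国", "的", "首都"]
candidates = [["北京", "是", "中国", "的", "首都"],
              ["今年", "寒假", "延长", "至", "11月"]]

weights = tfidf_weights([question] + candidates)
q_vec = sentence_vector(question, weights[0], w2v)
scores = [cosine(q_vec, sentence_vector(c, w, w2v))
          for c, w in zip(candidates, weights[1:])]
best = candidates[int(np.argmax(scores))]  # expected: the 'capital' sentence
```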
Furthermore, the model update can be performed either as an automatic upgrade or as a manual one-click upgrade.

Embodiment 3
To implement the above method, the present invention also proposes a system for finding the key sentences of yes/no questions in reading comprehension tasks, as shown in the figure, including:

a data preparation module for selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set;

a word and sentence encoding module for mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

a key sentence matching module for computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF;

a key sentence determination module for feeding the reading comprehension question-answering data to be processed into the trained algorithm model and predicting the key sentences of the yes/no questions in that data.
Specifically, preprocessing the question-answering data includes:

performing multi-scale, multi-granularity small-sample data processing on the question-answering data;

the data processing also covers web pages, URLs, images, and noisy text.

Splitting the data set into a training set, a validation set, and a test set includes:

segmenting the data set based on semantic understanding, randomly selecting 80% of the samples as the training set, 10% of the samples as the validation set, and the remaining 10% of the samples as the test set;

generating, at the same time, the word vectors required by the model;

the segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected.
Computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF, includes:

feeding the questions and passage paragraphs mined by the encoding-layer network into the neural network for training;

computing TF-IDF over the questions and passage paragraphs at the same time;

multiplying each word weight obtained from TF-IDF by the final vector representation of each sentence output by the neural network, and then computing cosine similarity;

selecting the sentence with the highest similarity as the key sentence of the yes/no question.

Multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network includes:

combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by the bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
The word embedding is given by:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.

The contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network, $h_i$ is the hidden vector output for the $i$-th word, ";" is the concatenation operation, and $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text.
TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
While the mined questions and passage paragraphs are fed into the neural network for training, TF-IDF is computed on them and the most suitable sentence is obtained as the key sentence of the yes/no question; the method then further includes:

correcting the predicted results to obtain correct results;

saving the corrected results, training the model with the corrected data, and updating the algorithm model.

The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the present invention falls within the scope of the technical solution of the invention.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010330141.5A | 2020-04-24 | 2020-04-24 | Method and system for finding key sentences of yes/no questions in reading comprehension tasks |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111552773A | 2020-08-18 |
Family
ID=72002493
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010330141.5A | Method and system for finding key sentences of yes/no questions in reading comprehension tasks | 2020-04-24 | 2020-04-24 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN111552773A |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190341036A1 (en) * | 2018-05-02 | 2019-11-07 | International Business Machines Corporation | Modeling multiparty conversation dynamics: speaker, response, addressee selection using a novel deep learning approach |
CN109271524A (en) * | 2018-08-02 | 2019-01-25 | 中国科学院计算技术研究所 | Entity link method in knowledge base question answering system |
- CN109977199A (en) * | 2019-01-14 | 2019-07-05 | 浙江大学 | A reading comprehension method based on an attention pooling mechanism |
CN110532557A (en) * | 2019-08-29 | 2019-12-03 | 北京计算机技术及应用研究所 | A kind of unsupervised Text similarity computing method |
CN110688491A (en) * | 2019-09-25 | 2020-01-14 | 暨南大学 | Machine reading understanding method, system, device and medium based on deep learning |
CN110826337A (en) * | 2019-10-08 | 2020-02-21 | 西安建筑科技大学 | A Short Text Semantic Training Model Acquisition Method and Similarity Matching Algorithm |
Non-Patent Citations (1)

| Title |
|---|
| 林旭鸣: "机器阅读理解中答案排序问题的研究" (Research on Answer Ranking in Machine Reading Comprehension), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science & Technology) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347756A (en) * | 2020-09-29 | 2021-02-09 | 中国科学院信息工程研究所 | A method and system for reasoning reading comprehension based on serialized evidence extraction |
CN112347756B (en) * | 2020-09-29 | 2023-12-22 | 中国科学院信息工程研究所 | Inference reading understanding method and system based on serialization evidence extraction |
CN112035652A (en) * | 2020-10-30 | 2020-12-04 | 杭州云嘉云计算有限公司 | Intelligent question-answer interaction method and system based on machine reading understanding |
CN112307773A (en) * | 2020-12-02 | 2021-02-02 | 上海交通大学 | Automatic generation method of custom problem data of machine reading understanding system |
CN112307773B (en) * | 2020-12-02 | 2022-06-21 | 上海交通大学 | Automatic generation method of custom question data for machine reading comprehension system |
CN112784579A (en) * | 2020-12-31 | 2021-05-11 | 山西大学 | Reading comprehension choice question answering method based on data enhancement |
CN112784579B (en) * | 2020-12-31 | 2022-05-27 | 山西大学 | Reading understanding choice question answering method based on data enhancement |
CN112836034A (en) * | 2021-02-25 | 2021-05-25 | 北京润尼尔网络科技有限公司 | Virtual teaching method and device and electronic equipment |
CN114398467A (en) * | 2021-12-06 | 2022-04-26 | 北京理工大学 | An Automatic Extraction Method of Evidence Sentences Based on Meta Erasure |
CN114398467B (en) * | 2021-12-06 | 2025-02-14 | 北京理工大学 | A method for automatic extraction of evidence sentences based on meta-erasure |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200818 |