CN111552773A - Method and system for finding key sentences of yes/no questions in reading comprehension tasks - Google Patents

Method and system for finding key sentences of yes/no questions in reading comprehension tasks
- Publication number
- CN111552773A
- Authority
- CN
- China
- Prior art keywords
- question
- word
- sentences
- sentence
- key
- Prior art date: 2020-04-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a method and system for finding the key sentences of yes/no questions in reading comprehension tasks, including: selecting existing reading comprehension question-answering data and preprocessing it to obtain a data set; mining, with an encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the data set to obtain a word-embedding representation of each word; building an algorithm model in which the questions and passage paragraphs encoded by that network are processed by a neural network model together with TF-IDF computation to obtain the key sentences of yes/no questions; and feeding the reading comprehension data to be processed into the trained algorithm model to predict those key sentences. The invention can provide more key-sentence support; by combining a bidirectional gated recurrent network with TF-IDF to compute the weights of candidate key sentences, it improves both the efficiency and the accuracy of answering yes/no questions.
Description
Technical Field

The invention belongs to the technical field of natural language data processing, and in particular relates to a method and system for finding the key sentences of yes/no questions in reading comprehension tasks.
Background Art

With the explosive growth of online information, all kinds of information flood the network environment. People have become accustomed to searching the Internet for solutions to their problems, but when users are not familiar with search techniques they often spend a great deal of time filtering the results returned by search engines. Reading comprehension systems were created to address this information overload: they use natural language processing to analyze a question submitted by a user, retrieve the relevant answer, and return it to the user.

Finding the key sentences of yes/no questions has always been a central problem in reading comprehension tasks. Locating the key sentence of a question is essential to answering it and is also an important criterion for judging the quality of a reading comprehension system. The task combines natural language processing with natural language understanding. How a machine can automatically and correctly find the key sentence of a yes/no question without human involvement, and then answer it correctly, is a difficulty currently shared by reading comprehension researchers; once these problems are solved, reading comprehension question-answering systems can be applied to all aspects of daily life. At present, such systems lack the support of relevant questions, so their content is too limited for current needs. Moreover, the accuracy of existing methods in finding the sentences relevant to a question is too low, so a model cannot locate the key information that matches the question, and traditional methods scale poorly.
Summary of the Invention

Existing rule-based methods achieve low accuracy when searching for the sentences relevant to a question, so the model cannot find the key information that matches the question; traditional methods also scale poorly, which limits the field of key-sentence retrieval. To solve these problems, the present invention proposes a method for finding the key sentences of yes/no questions in reading comprehension tasks, including:

selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and then splitting the data set into a training set, a validation set, and a test set;

mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

building an algorithm model in which the questions and passage paragraphs mined by the encoding-layer network are processed by a neural network model together with TF-IDF computation to obtain the key sentences of yes/no questions;

feeding the reading comprehension question-answering data to be processed into the trained algorithm model, and predicting the key sentences of the yes/no questions in that data.
Preferably, preprocessing the question-answering data includes:

performing multi-scale, multi-granularity small-sample data processing on the question-answering data;

the data processing also covers web pages, URLs, images, and noisy text.
Preferably, splitting the data set into a training set, a validation set, and a test set includes:

segmenting the data set based on semantic understanding, randomly selecting 80% of the samples as the training set, 10% of the samples as the validation set, and the remaining 10% of the samples as the test set;

generating, at the same time, the word vectors required by the model;

the segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected.
Preferably, computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF, includes:

feeding the questions and passage paragraphs mined by the encoding-layer network into the neural network for training;

computing TF-IDF over the questions and passage paragraphs at the same time;

multiplying each word weight obtained from TF-IDF by the final vector representation of each sentence output by the neural network, and then computing cosine similarity;

selecting the sentence with the highest similarity as the key sentence of the yes/no question.

Preferably, multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network includes:

combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by a bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
Preferably, the word embedding is given by:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.
Preferably, the contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network; $h_i$ is the hidden vector output for the $i$-th word; ";" is the concatenation operation; $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text; $\overrightarrow{h_{i-1}}$ is the forward hidden vector of the preceding word; $\overleftarrow{h_{i+1}}$ is the backward hidden vector of the following word; $\overrightarrow{h_i}$ is the forward (left-to-right) vector and $\overleftarrow{h_i}$ the backward (right-to-left) vector.
Preferably, TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
Preferably, after the mined questions and passage paragraphs are fed into the neural network for training while TF-IDF is computed on them, and the most suitable sentence is obtained as the key sentence of the yes/no question, the method further includes:

correcting the predicted results to obtain correct results;

saving the corrected results, training the model with the corrected data, and updating the algorithm model.
Based on the same inventive concept, the present invention also proposes a system for finding the key sentences of yes/no questions in reading comprehension tasks, including:

a data preparation module for selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set;

a word and sentence encoding module for mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

a key sentence matching module for computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF;

a key sentence determination module for feeding the reading comprehension question-answering data to be processed into the trained algorithm model and predicting the key sentences of the yes/no questions in that data.
The beneficial effects of the present invention are as follows:

The invention provides a method and system for finding the key sentences of yes/no questions in reading comprehension tasks, including: selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set; mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs to obtain a word-embedding representation of each word in each sentence; building an algorithm model in which the encoded questions and passage paragraphs are processed by a neural network model together with TF-IDF to obtain the key sentences of yes/no questions; and feeding the reading comprehension data to be processed into the trained model to predict those key sentences. The technical solution of the invention can provide more key-sentence support. Because the invention requires no manual annotation, it is better suited to real-world scenarios and improves the efficiency and accuracy of question-answering systems.

In the technical solution of the invention, the key-sentence finding method is based on a neural network and TF-IDF: the weights of candidate key sentences are computed by combining a bidirectional gated recurrent network with TF-IDF, which further improves answering efficiency and greatly raises the accuracy on yes/no questions. It not only greatly reduces manual annotation but also improves the accuracy of reading comprehension.
Brief Description of the Drawings

Fig. 1 is a flowchart of the method for finding the key sentences of yes/no questions in reading comprehension tasks provided by the present invention;

Fig. 2 is a schematic diagram of the steps of the TF-IDF-based key-sentence finding method of the present invention;

Fig. 3 is a structural diagram of the system for finding the key sentences of yes/no questions in reading comprehension tasks provided by the present invention.
Detailed Description of Embodiments

The object of the present invention is to provide a method, based on a neural network and TF-IDF, for finding the key sentences of yes/no questions in reading comprehension tasks, in which key sentences are located by combining a bidirectional gated recurrent network with TF-IDF. TF-IDF (term frequency-inverse document frequency) is a weighting technique used in information retrieval and data mining; TF stands for term frequency and IDF for inverse document frequency.

The present invention is described in detail below with reference to specific embodiments.
Embodiment 1:

As shown in Fig. 1, the present invention provides a method for finding the key sentences of yes/no questions in reading comprehension tasks, carried out in the following steps:

S1: select existing reading comprehension question-answering data, preprocess the data to obtain a data set, and split the data set into a training set, a validation set, and a test set;

S2: based on the constructed encoding-layer network, mine the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, and obtain a word-embedding representation of each word in each sentence;

S3: build an algorithm model in which the questions and passage paragraphs mined by the encoding-layer network are processed by the neural network model together with TF-IDF to obtain the key sentences of yes/no questions;

S4: feed the reading comprehension question-answering data to be processed into the trained algorithm model, and predict the key sentences of the yes/no questions in that data.
As shown in Fig. 2, the technical solution adopted by the present invention proceeds in the following steps:
Step 1: select existing reading comprehension question-answering data and preprocess it.

The selected reading comprehension question-answering data share the same properties, for example the same data format, the same subject area, and the presence of yes/no questions. Preprocessing performs multi-scale, multi-granularity small-sample data processing on the data, including sentence splitting, word segmentation, normalization, and cleaning, and also covers web pages, URLs, images, and noisy text. Its purpose is to optimize and expand the existing data.
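For illustration only, the following sketch shows what such cleaning and word segmentation could look like in Python; the jieba library is assumed for Chinese word segmentation, and the regular expressions and function names are hypothetical choices rather than anything prescribed by the patent.

```python
import re
import jieba  # assumed Chinese word-segmentation library

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # hypothetical URL pattern
TAG_RE = re.compile(r"<[^>]+>")                # strips leftover HTML tags

def clean_text(text: str) -> str:
    """Remove URLs, HTML remnants, and extra whitespace from raw QA text."""
    text = URL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(passage: str) -> list[str]:
    """Split a passage into sentences on Chinese end-of-sentence punctuation."""
    return [s for s in re.split(r"[。！？]", passage) if s]

def tokenize(sentence: str) -> list[str]:
    """Segment one sentence into words, e.g. '北京是中国的首都' -> ['北京', '是', ...]."""
    return list(jieba.cut(sentence))

passage = clean_text("北京是中国的首都。今年寒假延长至11月。")
tokenized_sentences = [tokenize(s) for s in split_sentences(passage)]
```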
Step 2: split the training and test sets, pre-train the embedding-layer model, and generate the word vectors required by the model.

Further, the data set is segmented based on semantic understanding: 80% of the samples are randomly selected as the training set, 10% of the samples as the validation set, and the remaining 10% as the test set; at the same time, the word vectors required by the model are generated. The segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected. Segmenting the data set based on semantic understanding gives the training and test sets a well-balanced proportion of yes/no information, which greatly improves training speed and convergence.
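A minimal sketch of the 80/10/10 random split, assuming each sample is simply an item in a list (the record structure is an illustrative assumption, not specified by the patent):

```python
import random

def split_dataset(samples: list, seed: int = 42):
    """Randomly split samples into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```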
Step 3: the encoding-layer network mines the semantic information of the questions and of the sentences in the passage paragraphs.

The encoding-layer network uses word embeddings to represent the questions and passage paragraphs:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.
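As one possible realization, word vectors of the form w_i = word2vec(t_i) could be pre-trained with the gensim library; the skip-gram setting and the hyperparameter values below are assumptions made for this sketch, not parameters fixed by the invention.

```python
from gensim.models import Word2Vec

# tokenized_sentences: token lists as produced by the preprocessing sketch
tokenized_sentences = [["北京", "是", "中国", "的", "首都"],
                       ["今年", "寒假", "延长", "至", "11月"]]

# Train a small skip-gram model (assumed hyperparameters).
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100,
               window=5, min_count=1, sg=1)

w_i = w2v.wv["北京"]  # embedding of the word t_i = '北京'
```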
Step 4: improve the model through TF-IDF fusion. While the questions and passage paragraphs are fed into the neural network, TF-IDF is computed on them. TF-IDF controls the weight of each sentence in the question and the passage paragraphs; cosine similarity between the question and the passage paragraphs is then computed, and the most suitable sentence is taken as the key sentence of the yes/no question.

Specifically:

the questions and passage paragraphs mined by the encoding-layer network are fed into the neural network for training;

TF-IDF is computed on the questions and passage paragraphs at the same time;

each word weight obtained from TF-IDF is multiplied by the final vector representation of each sentence output by the neural network, after which cosine similarity is computed;

the sentence with the highest similarity is selected as the key sentence of the yes/no question.

Here, multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network means combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by the bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
The contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network; $h_i$ is the hidden vector output for the $i$-th word; ";" is the concatenation operation; $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text; $\overrightarrow{h_{i-1}}$ is the forward hidden vector of the preceding word; $\overleftarrow{h_{i+1}}$ is the backward hidden vector of the following word; $\overrightarrow{h_i}$ is the forward (left-to-right) vector and $\overleftarrow{h_i}$ the backward (right-to-left) vector.
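A sketch of the bidirectional GRU encoder in PyTorch (an illustrative framework choice; the patent does not name one). The embedding and hidden sizes are assumed, and the output at position i is the concatenation of the forward and backward states produced by the bidirectional layer.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Encode word embeddings w_1..w_n into contextual hidden vectors h_i."""
    def __init__(self, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, seq_len, embed_dim)
        # h: (batch, seq_len, 2 * hidden_dim) -- the forward and backward
        # states concatenated, i.e. h_i = [h_fwd_i ; h_bwd_i]
        h, _ = self.gru(word_embeddings)
        return h

encoder = BiGRUEncoder()
w = torch.randn(1, 6, 100)   # one sentence of 6 word embeddings
h = encoder(w)               # h[0, i] is the hidden vector of word i
```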
TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
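The TF-IDF statistics above can be computed directly from the tokenized sentences. In the sketch below, each sentence is treated as a "document" for the IDF counts, which is an assumption made for this example rather than a detail fixed by the patent.

```python
import math
from collections import Counter

def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """Per-document word weights: tfidf(i, j) = tf(i, j) * idf(i)."""
    D = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)  # sum_k n_kj: total number of words in document j
        weights.append({w: (c / total) * math.log(D / df[w])
                        for w, c in counts.items()})
    return weights
```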
Step 5: correct the predicted results to obtain correct results.

Abnormal generated results are discarded and not used in subsequent updates and upgrades; results are generated for the normal questions.

Step 6: save the corrected results, train the model with the corrected data, and update the algorithm model. Using the corrected algorithm model greatly improves training speed and accuracy.

Step 7: input new reading comprehension question-answering data, and predict the key sentences of the yes/no questions in the data.

First, check whether the properties of the new data match those of the reading comprehension question-answering data selected in Step 1; if they differ, make them as consistent as possible with the Step 1 data during the subsequent preprocessing. Then preprocess the reading comprehension question-answering text.
Embodiment 2

Given a question and a sentence: "Is Beijing the capital of China?" (北京是否是中国的首都？) and "Beijing is the capital of China" (北京是中国的首都).

Word segmentation: <北京><是否><是><中国><的><首都> and <北京><是><中国><的><首都>.

Feed them into the model and obtain the word-embedding representations through the neural network and TF-IDF.

Compute the cosine similarity between the question and the passage sentence.

Return the result: the sentence is a key sentence for the question.
Given a question and a sentence: "Is Beijing the capital of China?" (北京是否是中国的首都？) and "This year's winter vacation has been extended to November" (今年寒假延长至11月).

Word segmentation: <北京><是否><是><中国><的><首都> and <今年><寒假><延长><至><11月>.

Feed them into the model and obtain the word-embedding representations through the neural network and TF-IDF.

Compute the cosine similarity between the question and the passage sentence.

Return the result: the sentence is not a key sentence for the question.
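Tying the sketches above together, the matching in this embodiment might be carried out as follows. Reusing tfidf_weights from the Step 4 sketch and the w2v model from the Step 3 sketch, and substituting raw word vectors for the BiGRU hidden states, are simplifications made for this illustration only.

```python
import numpy as np

def sentence_vector(tokens, weights, w2v):
    """TF-IDF-weighted sum of word vectors as the sentence representation."""
    vecs = [weights.get(t, 0.0) * w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

question = ["北京", "是否", "是", "中国", "的", "首都"]
candidates = [["北京", "是", "中国", "的", "首都"],
              ["今年", "寒假", "延长", "至", "11月"]]

weights = tfidf_weights([question] + candidates)
q_vec = sentence_vector(question, weights[0], w2v)
scores = [cosine(q_vec, sentence_vector(c, w, w2v))
          for c, w in zip(candidates, weights[1:])]
best = candidates[int(np.argmax(scores))]  # expected: the 'capital' sentence
```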
Furthermore, the model update can be performed either as an automatic upgrade or as a manual one-click upgrade.

Embodiment 3
To implement the above method, the present invention also proposes a system for finding the key sentences of yes/no questions in reading comprehension tasks, as shown in the figure, including:

a data preparation module for selecting existing reading comprehension question-answering data, preprocessing the data to obtain a data set, and splitting the data set into a training set, a validation set, and a test set;

a word and sentence encoding module for mining, based on the constructed encoding-layer network, the semantic information of the questions and of the sentences in the passage paragraphs in the training and test sets, to obtain a word-embedding representation of each word in each sentence;

a key sentence matching module for computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF;

a key sentence determination module for feeding the reading comprehension question-answering data to be processed into the trained algorithm model and predicting the key sentences of the yes/no questions in that data.
Specifically, preprocessing the question-answering data includes:

performing multi-scale, multi-granularity small-sample data processing on the question-answering data;

the data processing also covers web pages, URLs, images, and noisy text.

Splitting the data set into a training set, a validation set, and a test set includes:

segmenting the data set based on semantic understanding, randomly selecting 80% of the samples as the training set, 10% of the samples as the validation set, and the remaining 10% of the samples as the test set;

generating, at the same time, the word vectors required by the model;

the segmentation includes segmenting the data after the original passage paragraphs and questions have been corrected.
Computing the key sentences of yes/no questions from the questions and passage paragraphs mined by the encoding-layer network, using the neural network model and TF-IDF, includes:

feeding the questions and passage paragraphs mined by the encoding-layer network into the neural network for training;

computing TF-IDF over the questions and passage paragraphs at the same time;

multiplying each word weight obtained from TF-IDF by the final vector representation of each sentence output by the neural network, and then computing cosine similarity;

selecting the sentence with the highest similarity as the key sentence of the yes/no question.

Multiplying each TF-IDF word weight by the final vector representation of each sentence output by the neural network includes:

combining the TF-IDF weight of each word in the questions and passage paragraphs with the contextual features learned by the bidirectional gated recurrent unit (GRU) network, to obtain the final vector representation of each sentence.
The word embedding is given by:

$$w_i = \mathrm{word2vec}(t_i)$$

where $w_i$ is the word-embedding representation of the $i$-th word $t_i$ in the question or text, and word2vec() is the word-embedding function.

The contextual features are learned by the bidirectional GRU network as follows:

$$\overrightarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overrightarrow{h_{i-1}}\right), \qquad \overleftarrow{h_i} = \mathrm{GRU}\!\left(w_i,\ \overleftarrow{h_{i+1}}\right), \qquad h_i = \left[\overrightarrow{h_i};\ \overleftarrow{h_i}\right]$$

where GRU denotes the gated recurrent unit network, $h_i$ is the hidden vector output for the $i$-th word, ";" is the concatenation operation, and $w_i$ is the word embedding of the $i$-th word $t_i$ in the question or text.
TF-IDF over the questions and passage paragraphs is computed as:

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i), \qquad \mathrm{tf}(i,j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}(i) = \log\frac{|D|}{\left|\{\,d_j : t_i \in d_j\,\}\right|}$$

where $\mathrm{tf}(i,j)$ is the term frequency of word $i$ in document $j$; $n_{i,j}$ is the number of occurrences of word $i$ in document $j$, and $\sum_k n_{k,j}$ is the total number of words in document $j$; $\mathrm{idf}(i)$ is obtained by dividing the total number of documents $|D|$ by the number of documents that contain word $i$ ($t_i$ denotes word $i$ and $d_j$ a document containing it) and taking the logarithm of the quotient.
While the mined questions and passage paragraphs are fed into the neural network for training, TF-IDF is computed on them and the most suitable sentence is obtained as the key sentence of the yes/no question; the method then further includes:

correcting the predicted results to obtain correct results;

saving the corrected results, training the model with the corrected data, and updating the algorithm model.

The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the present invention falls within the scope of the technical solution of the invention.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010330141.5A | 2020-04-24 | 2020-04-24 | Method and system for finding key sentences of yes/no questions in reading comprehension tasks |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111552773A | 2020-08-18 |
Family
ID=72002493
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010330141.5A | Method and system for finding key sentences of yes/no questions in reading comprehension tasks | 2020-04-24 | 2020-04-24 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN111552773A |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190341036A1 (en) * | 2018-05-02 | 2019-11-07 | International Business Machines Corporation | Modeling multiparty conversation dynamics: speaker, response, addressee selection using a novel deep learning approach |
CN109271524A (en) * | 2018-08-02 | 2019-01-25 | 中国科学院计算技术研究所 | Entity link method in knowledge base question answering system |
- CN109977199A (en) * | 2019-01-14 | 2019-07-05 | 浙江大学 | A reading comprehension method based on an attention pooling mechanism |
CN110532557A (en) * | 2019-08-29 | 2019-12-03 | 北京计算机技术及应用研究所 | A kind of unsupervised Text similarity computing method |
CN110688491A (en) * | 2019-09-25 | 2020-01-14 | 暨南大学 | Machine reading understanding method, system, device and medium based on deep learning |
CN110826337A (en) * | 2019-10-08 | 2020-02-21 | 西安建筑科技大学 | A Short Text Semantic Training Model Acquisition Method and Similarity Matching Algorithm |
Non-Patent Citations (1)

| Title |
|---|
| 林旭鸣: "机器阅读理解中答案排序问题的研究" (Research on Answer Ranking in Machine Reading Comprehension), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science & Technology) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347756A (en) * | 2020-09-29 | 2021-02-09 | 中国科学院信息工程研究所 | A method and system for reasoning reading comprehension based on serialized evidence extraction |
CN112347756B (en) * | 2020-09-29 | 2023-12-22 | 中国科学院信息工程研究所 | Inference reading understanding method and system based on serialization evidence extraction |
CN112035652A (en) * | 2020-10-30 | 2020-12-04 | 杭州云嘉云计算有限公司 | Intelligent question-answer interaction method and system based on machine reading understanding |
CN112307773A (en) * | 2020-12-02 | 2021-02-02 | 上海交通大学 | Automatic generation method of custom problem data of machine reading understanding system |
CN112307773B (en) * | 2020-12-02 | 2022-06-21 | 上海交通大学 | Automatic generation method of custom question data for machine reading comprehension system |
CN112784579A (en) * | 2020-12-31 | 2021-05-11 | 山西大学 | Reading comprehension choice question answering method based on data enhancement |
CN112784579B (en) * | 2020-12-31 | 2022-05-27 | 山西大学 | Reading understanding choice question answering method based on data enhancement |
CN112836034A (en) * | 2021-02-25 | 2021-05-25 | 北京润尼尔网络科技有限公司 | Virtual teaching method and device and electronic equipment |
CN114398467A (en) * | 2021-12-06 | 2022-04-26 | 北京理工大学 | An Automatic Extraction Method of Evidence Sentences Based on Meta Erasure |
CN114398467B (en) * | 2021-12-06 | 2025-02-14 | 北京理工大学 | A method for automatic extraction of evidence sentences based on meta-erasure |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200818 |