
CN113836895A - An unsupervised machine reading comprehension method based on large-scale problem self-learning - Google Patents


Info

Publication number
CN113836895A
Authority
CN
China
Prior art keywords
model
data
training
domain
machine reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111151305.9A
Other languages
Chinese (zh)
Other versions
CN113836895B (en)
Inventor
赵天成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Linker Technology Co ltd
Honglong Technology Hangzhou Co ltd
Original Assignee
Honglong Technology Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honglong Technology Hangzhou Co ltd
Publication of CN113836895A
Application granted
Publication of CN113836895B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an unsupervised machine reading comprehension method based on large-scale problem self-learning. Data is first divided into four types, and the following steps are then performed: S1. Unlabeled general data is trained with a standard pre-training model to obtain a pre-trained language model. S2. Labeled general data is used to train the pre-trained language model to obtain a question generator, and to produce a task-specific general-domain model. S3. For unlabeled in-domain data, the question generator generates synthetic in-domain data, which is filtered with the task-specific general-domain model; the high-quality synthetic in-domain data set obtained by filtering is then used for training to obtain a new pre-trained model. S4. The labeled in-domain data is mixed with the low-quality synthetic data set obtained by filtering, answers are labeled, and training continues from the new pre-trained model to obtain the final model. Input data is fed to the final model to obtain the machine reading comprehension result.

Figure 202111151305

Description

Unsupervised machine reading comprehension method based on large-scale problem self-learning
Technical Field
The invention relates to the field of machine reading comprehension, and in particular to an unsupervised machine reading comprehension method based on large-scale problem self-learning.
Background
Many state-of-the-art algorithms for natural language processing (NLP) tasks require manually labeled data. At an early stage, no domain-specific labeled data set is usually available, and annotating a sufficient amount of such data is expensive and laborious. As a result, for many NLP applications, even resource-rich languages (such as English) have labeled data in only a few domains.
In many NLP applications it is very difficult to obtain a large amount of labeled data, so in many cases the model is trained on a small amount of data. However, a model trained this way usually over-fits, while it needs to generalize to unseen data. Researchers have therefore exploited large unlabeled data sets through pre-trained language models, which generally alleviate the problems of randomly initialized network weights, find better local optima, and improve robustness in unseen environments.
Significant advances in machine reading comprehension (MRC) have recently been achieved by pre-training Transformer language models on large amounts of unlabeled text data and fine-tuning the pre-trained models on manually labeled QA datasets. In the context of pre-trained language models, Gururangan et al. showed the importance of additional pre-training on in-domain data for improving performance on specific downstream tasks.
Disclosure of Invention
The invention mainly provides an unsupervised machine reading comprehension method based on large-scale problem self-learning, which enables a cold start in a completely new domain.
The invention solves the above technical problem mainly through the following technical scheme. Data is first classified into four types: unlabeled general data, labeled general data, unlabeled in-domain data, and labeled in-domain data. The following steps are then performed:
S1. For the unlabeled general data, train with a standard pre-training model to obtain a Transformer-based pre-trained language model, which serves as the bottom layer of the architecture;
S2. For the labeled general data, train the pre-trained language model obtained in step S1 to obtain a question generator, and use the labeled general data to produce a task-specific general-domain model;
S3. For the unlabeled in-domain data, generate synthetic in-domain data with the question generator built in step S2, then filter it with the task-specific general-domain model to obtain a high-quality synthetic in-domain data set and a low-quality synthetic data set, and train on the high-quality synthetic in-domain data set to obtain a new pre-trained model;
S4. For the labeled in-domain data, mix in the low-quality synthetic data set obtained by filtering and label its answers, then train from the new pre-trained model to obtain the final machine reading comprehension model;
Input data is then fed to the final machine reading comprehension model to obtain the machine reading comprehension result.
Preferably, in step S1, a GPT-2 model or a T5 model is used for model learning.
Preferably, question generation based on the trained T5 model is specifically as follows: extract answers; generate a question from each extracted answer; receive the question and produce an answer; compare the extracted answer with the produced answer to judge whether the generated question is correct;
question generation based on the trained GPT-2 model is specifically as follows: given the natural order of language, the joint probability of the sequence $s = (s_1, \ldots, s_n)$ is decomposed into a product of conditional probabilities:
$$p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$$
after the GPT-2 model has been trained, for each new word the model computes the probability of the next word conditioned on all existing tokens; the K words with the highest probability are then selected, and one of these K candidates is drawn by random sampling; this process is repeated until a special symbol or an end-of-sentence symbol appears;
for the question generation scenario, the position of the potential answer in the source text is marked with a special symbol; a paragraph $C = [c_1, \ldots, c_n]$ and one of its potential answers $A = [a_1, \ldots, a_n]$ are represented as:
X = ([CLS], C, [SEP], A)
given the above X, it is input into the trained GPT-2 model or the trained T5 model to obtain the hidden vector:
H = Model(X)
$$H \in \mathbb{R}^{|X| \times h}$$
where |X| is the input length and h is the size of the hidden vector; finally, H is passed through one fully connected layer to obtain the final result:
$$P(w) = \mathrm{softmax}(H W + b)$$
$$\hat{w} = \arg\max_{w} P(w)$$
where w is a word, W is a weight matrix, and b is a bias term; the final result is the best word selected by argmax. Both W and b are learned.
Preferably, in step S3, active learning with round-trip consistency is applied to the generated data: weak spots in the training data distribution are actively identified according to the strengths and weaknesses of the existing model along different dimensions, and the next batch of data to be labeled is suggested.
Preferably, in step S3, data is filtered by round-trip consistency, and learning efficiency is improved by active learning.
The substantial effect of the invention is that it is applicable when there is no labeled data or only a very small amount of labeled data, and that it markedly improves the accuracy of the model.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Example: We use several pre-trained language models (e.g., GPT-2 and T5) to generate a large amount of potential question-answer data from unlabeled in-domain text paragraphs, which allows a cold start in a completely new domain. We then pre-train the model on these generated samples and finally fine-tune it on the specific labeled data set.
Although a model trained on the SQuAD1.1 training dataset achieved state-of-the-art performance on the SQuAD1.1 dev dataset (EM score of 85%), it could not reach the same level in a completely new domain, NewsQA (EM score of 32%). We found that when a synthetic dataset is used to pre-train a model, it is important to prevent over-fitting to the synthetic dataset, since it typically contains many noisy samples. Nevertheless, these synthetic datasets are very useful at an early stage, when the domain has no or very little training data, because the method automatically generates machine-labeled training data in a completely new domain.
With this method, 80% of the final performance is obtained without any labeled data. Moreover, when a small amount of labeled data (10% of the original data) is injected, the pre-trained model quickly reaches 94% of the final performance. Finally, we evaluate Data Dream with the NLP CheckList test framework, which is used to rigorously test NLP models. Our method reduces errors by 18% on the general linguistic capability tests in NLP CheckList (e.g., synonyms, question typos, temporal change).
Question generation is a long-standing research topic, and using generated question-answer pairs to improve question answering (QA) systems has shown large gains in low-resource settings with only a few samples. However, verifying and improving the accuracy of these generated QA pairs remains relatively unexplored.
In machine translation, modeling consistency between the two translation directions through dual learning or back-translation can improve the quality of translation models. Back-translation, which adds synthetically generated parallel data as training examples, is an inspiration for this work and achieves the best performance in both supervised and unsupervised settings. The joint distribution of questions and answers given the context can also be modeled and used directly, whereas our work uses generative models to produce synthetic data for pre-training. Combining these two approaches may be a fruitful direction for future work.
Question generation (QG) has been used to augment training data for question answering, with a focus on text-based QA tasks that aim to select one or more answer sentences from a text for a given input question. When ranking the sentences, the weight of each data point during training is set by comparing the generated question with the original question.
Translation-based data augmentation can also be applied to question answering. However, such methods depend heavily on the availability and quality of the translation system. Although machine translation (MT) lets us add more training data, it did not bring a significant improvement, because domain-specific data in other languages is difficult to find.
Synthetic QA corpora can improve the overall MRC task through round-trip consistency. For round-trip consistency to work, a trained model must already exist. The main difference from our work is that we assume the dataset is small and an initial model is difficult to build, whereas that line of work assumes a model already exists and studies how to refine it further. It is therefore difficult for that approach to show improvements on new-domain datasets or to improve cross-domain performance consistently.
The main contributions of Data Dream are four-fold:
1. We propose four steps for building an NLP system in the small-sample setting.
2. We construct synthetic QA corpora using several heterogeneous pre-trained language models and show performance improvements on a new domain.
3. We test on NLP CheckList, an evaluation method for rigorously testing NLP models; the proposed method exceeds the accuracy of the baseline, and the error rate on general linguistic capabilities is greatly reduced.
4. When the predicted answers differ, we further improve performance by applying active learning to the generated questions.
Our overall process is divided into four phases based on different data sets. First, for any NLP domain or task, we can classify datasets into four types:
1. Unlabeled general data (e.g., BookCorpus, Wikipedia, etc.).
2. Labeled general data (or out-of-domain data) (e.g., SQuAD, TriviaQA, HotpotQA, etc.).
3. Unlabeled in-domain data (e.g., judicial cases, insurance clauses, technical specifications, etc.).
4. Labeled in-domain data (e.g., a manually annotated legal portfolio).
The four steps follow these data types, which differ in size, and each type is processed differently:
First step (unlabeled general data): Unlabeled general-domain datasets have been studied actively over the past three years. Large amounts of text data are used to build Transformer-based pre-trained language models such as BERT, GPT-2, and T5, which have become standard in NLP. We use a Transformer-based language model as the bottom layer of our architecture.
Second step (labeled general data): Our goal is to build a machine reading comprehension model, for which many public datasets are available. We use these datasets to build a synthetic data generator that produces a large-scale in-domain dataset. We also use the labeled general-domain datasets to build a task-specific (the MRC task in this work) general-domain model.
Third step (unlabeled industry data): We generate a large amount of synthetic in-domain data with the question generator built in the second step. We then filter the generated data with the general-domain model, borrowing the idea of round-trip consistency from question generation. The high-quality samples are used to build the pre-trained model, and we apply further filtering to improve performance. The pre-trained model can also serve as an annotation assistant when this data is labeled manually.
Fourth step (labeled industry data): In the last step we apply active learning: the negative synthetic data rejected by the general model is sent to human annotators, who label the answers. If a generated question is ungrammatical or hard to understand, we ask the annotator to revise it as far as possible and then annotate the answer. Finally, we train the final model on the labeled in-domain dataset.
In the following sections, we will explain in detail the implementation of each step.
Step 1: Self-learning on unlabeled general data
In this step we use two different strategies for model learning on unlabeled general data. The first strategy is GPT-2. GPT-2 is a large-scale Transformer-based language model released by OpenAI in February 2019; it contains 1.5 billion parameters and was trained on a dataset of 8 million web pages. It is a direct scale-up of the GPT model, trained on more than 10 times the data with more than 10 times the parameters. In terms of performance, the model can produce coherent text passages and achieves state-of-the-art results on many language modeling benchmarks. It also achieves rudimentary reading comprehension, machine translation, question answering, and automatic summarization without task-specific training.
The second strategy is T5. The training data for T5 is the Colossal Clean Crawled Corpus (C4), which contains hundreds of gigabytes of clean English text scraped from Common Crawl. T5 is a standard Transformer-based encoder-decoder model, and its parameter count reaches 11 billion.
Step 2: Training the generation model on labeled general data
Question generation is the task of automatically generating questions from text paragraphs. The most straightforward approach is answer-aware question generation, in which the model is given an answer and a paragraph and is asked to generate a question for that answer by considering the paragraph context. One reason question generation has seen limited practical use is that most earlier papers relied on complex models or processing pipelines, and no pre-trained models were available. Machine-generated questions were therefore often ungrammatical and hard to understand, which made the generated data difficult to use in practical applications. Recent advances in text generation supported by pre-trained Transformer models, however, make it possible to generate reasonable synthetic data. We use the most robust generation methods available: generation based on T5 and generation based on GPT-2.
Question generation based on T5: T5 is a very large neural network model that is trained on a mixture of unlabeled text and labeled data from popular natural language processing tasks, and then fine-tuned individually for each task its authors address. It is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted to a text-to-text format. Answer-aware question generation usually needs three models: the first extracts answer-like spans, the second generates a question for each answer, and the third is a QA model that takes the question and produces an answer; the two answers are then compared to check whether the generated question is correct. Maintaining three models for a single task is complex, so the goal is a single multi-task model that can perform all three tasks.
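The three tasks handled by one text-to-text model can be illustrated with a minimal sketch. The task prefixes and function names below are hypothetical, since the source does not give the exact input strings; the sketch only shows how one model could be addressed for answer extraction, question generation, and question answering, and how the final comparison might be made.

```python
# Hypothetical task prefixes: the source does not specify the exact strings.

def format_answer_extraction(passage: str) -> str:
    """Input asking the multi-task model to extract answer-like spans."""
    return f"extract answers: {passage}"


def format_question_generation(passage: str, answer: str) -> str:
    """Input asking the model to write a question for a given answer."""
    return f"generate question: answer: {answer} context: {passage}"


def format_question_answering(passage: str, question: str) -> str:
    """Input asking the model to answer a generated question."""
    return f"question: {question} context: {passage}"


def answers_agree(extracted_answer: str, predicted_answer: str) -> bool:
    """Compare the extracted answer with the answer produced by the QA task."""
    return extracted_answer.strip().lower() == predicted_answer.strip().lower()
```

Because every task is plain text in and plain text out, a single model checkpoint can serve all three roles; only the prefix changes.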
Question generation based on GPT-2: To generate questions with GPT-2, we follow the original standard text generation strategy. Given the natural order of language, the joint probability of the sequence $s = (s_1, \ldots, s_n)$ can be decomposed into a product of conditional probabilities:
$$p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$$
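The decomposition of the joint probability into conditionals can be made concrete with a small sketch. The conditional table below is a made-up toy model, not GPT-2; it only shows that summing the log conditionals gives the log joint probability of the sequence.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """log p(s) as the sum over i of log p(s_i | s_1, ..., s_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(cond_prob(token, tuple(tokens[:i])))
    return total


def toy_cond_prob(token, prefix):
    """Toy conditional distribution (hypothetical numbers)."""
    table = {
        (): {"who": 0.5, "what": 0.5},
        ("who",): {"is": 0.8, "was": 0.2},
        ("who", "is"): {"he": 0.25, "she": 0.75},
    }
    return table[prefix][token]


# Joint probability of the three-word sequence: 0.5 * 0.8 * 0.75
joint = math.exp(sequence_log_prob(["who", "is", "she"], toy_cond_prob))
```

Working in log space avoids numerical underflow for long sequences, which is why the sketch sums logarithms instead of multiplying probabilities.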
Once this probabilistic model has been trained, questions can be generated with various random sampling strategies, including top-k sampling. For each new word, the model computes the probability of the next word conditioned on all existing tokens. The K words with the highest probability are then selected, and one of these K candidates is drawn by random sampling. This process is repeated until a special symbol such as "?" or an end-of-sentence symbol appears.
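The sampling loop described above can be sketched as follows. The next-word distribution here is a toy stand-in for the trained model, and drawing candidates in proportion to their probability is an assumption; the source only says that the K candidates are sampled at random.

```python
import random


def top_k_sample(next_word_probs, k, rng):
    """Keep the k most probable words and draw one of them at random."""
    top = sorted(next_word_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [prob for _, prob in top]
    # Assumption: candidates are drawn in proportion to their probability.
    return rng.choices(words, weights=weights, k=1)[0]


def generate_question(next_word_dist, k, rng, stop_symbols=("?", "</s>"), max_len=32):
    """Repeat top-k sampling until a stop symbol appears or max_len is reached."""
    words = []
    while len(words) < max_len:
        word = top_k_sample(next_word_dist(tuple(words)), k, rng)
        words.append(word)
        if word in stop_symbols:
            break
    return words


def toy_next_word_dist(prefix):
    """Toy stand-in for the trained language model (hypothetical numbers)."""
    if len(prefix) >= 3:
        return {"?": 0.9, "the": 0.05, "a": 0.05}
    return {"who": 0.4, "wrote": 0.3, "it": 0.2, "zebra": 0.1}


question = generate_question(toy_next_word_dist, k=2, rng=random.Random(0))
```

With k=2 the low-probability word "zebra" can never be produced, which is the point of truncating to the top K candidates before sampling.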
In addition, for the question generation scenario, we mark the position of the potential answer in the source text with a special symbol. For example, a paragraph $C = [c_1, \ldots, c_n]$ and one of its potential answers $A = [a_1, \ldots, a_n]$ are represented as:
X = ([CLS], C, [SEP], A)
Given the above X, we input it into GPT-2 or T5 to obtain the hidden vector:
H = Model(X)
$$H \in \mathbb{R}^{|X| \times h}$$
where |X| is the input length and h is the size of the hidden vector. Finally, H is passed through one fully connected layer to obtain the final result:
$$P(w) = \mathrm{softmax}(H W + b)$$
$$\hat{w} = \arg\max_{w} P(w)$$
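The input construction and the final fully connected layer can be sketched in plain Python. This is a minimal illustration under stated assumptions: the encoder itself is omitted, a single hidden vector (for one position) is used instead of the full matrix H, and the vocabulary, weights, and bias are made-up toy values.

```python
import math


def build_input(context_tokens, answer_tokens):
    """Build X = ([CLS], C, [SEP], A) as a flat token list."""
    return ["[CLS]", *context_tokens, "[SEP]", *answer_tokens]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_word(hidden, weight, bias, vocab):
    """Apply one fully connected layer to a hidden vector and take the argmax.

    hidden: list of h floats; weight: h rows of len(vocab) floats; bias: len(vocab) floats.
    """
    logits = [
        sum(hidden[i] * weight[i][j] for i in range(len(hidden))) + bias[j]
        for j in range(len(vocab))
    ]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


# Toy values (hypothetical): h = 2, vocabulary of three words.
x = build_input(["Ada", "wrote", "it"], ["Ada"])
word, probs = predict_word(
    hidden=[1.0, 0.0],
    weight=[[0.1, 2.0, 0.3], [0.0, 0.0, 0.0]],
    bias=[0.0, 0.0, 0.0],
    vocab=["who", "what", "when"],
)
```

In a real system W and b would be learned jointly with the language model rather than set by hand.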
Step 3: Labeling industry data with the model from the second step
Training AI models requires a large amount of manual annotation, which is costly. In addition, in machine reading comprehension it is difficult for annotators to decide what to ask, and human annotators produce many duplicate items. An active learner queries the oracle to label a data sample when its predictions on that sample diverge the most, which can be measured by entropy and KL divergence; a high variance in the output predictions marks the most informative data samples. In this work, we apply active learning with round-trip consistency to the generated data: weak spots in the training data distribution are actively identified according to the strengths and weaknesses of the existing model along different dimensions, and the next batch of data to be labeled is suggested. This reduces the cost of data labeling and increases the value of each manually labeled data point.
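The two uncertainty measures named above can be sketched directly from their definitions. The prediction distributions below are made-up toy values, and the rule of sending the highest-entropy sample to the annotator is one simple way to use the measure, stated here as an assumption.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q is positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


# Toy answer distributions for three generated questions (hypothetical numbers).
predictions = {
    "q1": [0.90, 0.05, 0.05],
    "q2": [0.34, 0.33, 0.33],
    "q3": [0.60, 0.30, 0.10],
}

# The sample with the highest predictive entropy is treated as most informative.
most_informative = max(predictions, key=lambda name: entropy(predictions[name]))

# Disagreement between two models' predictions on the same question.
disagreement = kl_divergence(predictions["q1"], predictions["q2"])
```

Entropy needs only one model's output, while KL divergence compares two distributions, for example two generators or two checkpoints answering the same question.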
Once the T5-based or GPT-2-based question generation model is available, we can label any unlabeled industry data with it, automatically generating potentially relevant questions to train the question answering model of the fourth step. However, using all generated questions directly for training is not ideal, because they contain a lot of noise. We therefore use a round-trip consistency method to control data quality.
Data filtering by round-trip consistency: Round-trip consistency can be used to filter data. If the model cannot answer a generated question, the example is filtered out. We also use this method to filter the data, but our setting differs from existing work:
We assume that no training data exists, so our MRC model is trained directly on the generated data.
Existing approaches assume that training data exists and aim to improve performance on top of it.
During training, we use the indicator function I(q):
$$I(q) = \begin{cases} 1, & \text{if the MRC model's answer to } \hat{q} \text{ equals } \hat{a} \\ 0, & \text{otherwise} \end{cases}$$
where $\hat{q}$ is the generated question and $\hat{a}$ is the given answer. I(q) decides whether a data point is kept or filtered out.
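The filtering role of I(q) can be sketched as a function that splits the generated samples into a consistent (high-quality) set and an inconsistent (low-quality) set. The answering function below is a toy stand-in for the general-domain MRC model, and the sample fields and string normalization are assumptions made for the example.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def round_trip_filter(samples, answer_fn):
    """Split samples by whether the model's answer matches the given answer."""
    high_quality, low_quality = [], []
    for sample in samples:
        predicted = answer_fn(sample["context"], sample["question"])
        consistent = normalize(predicted) == normalize(sample["answer"])
        (high_quality if consistent else low_quality).append(sample)
    return high_quality, low_quality


def toy_answer_fn(context, question):
    """Toy stand-in for the MRC model: answers 'who' questions with the first word."""
    return context.split()[0] if question.lower().startswith("who") else ""


samples = [
    {"context": "Ada wrote the first program",
     "question": "Who wrote the first program?", "answer": "Ada"},
    {"context": "Ada wrote the first program",
     "question": "Program the what?", "answer": "first"},
]
high, low = round_trip_filter(samples, toy_answer_fn)
```

Keeping the rejected samples in a separate list matters here, because the method later sends the low-quality set to human annotators instead of discarding it.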
Improving learning efficiency through active learning: First, we generate a question for each named entity or noun phrase, and then run the MRC model trained on the general domain. If the model cannot predict the answer, we keep the sample for active learning. We implement active learning by selecting the data on which the model is least confident, using the following strategy.
[Equation image BDA0003287219750000101: least-confidence selection criterion]
where $\hat{q}$ is the generated question, and $\hat{a}$ and $\hat{c}$ are the given answer and context. I(q) decides whether a data point is used.
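Least-confidence selection can be sketched as a simple ranking. The exact criterion in the source is contained in an equation image that is not available, so the scoring below (the model's confidence in the given answer, lowest first) and the labeling budget are assumptions.

```python
def select_least_confident(samples, confidence_fn, budget):
    """Return the `budget` samples on which the model is least confident."""
    return sorted(samples, key=confidence_fn)[:budget]


# Toy confidences the MRC model assigns to the given answer (hypothetical numbers).
toy_confidence = {"q1": 0.95, "q2": 0.10, "q3": 0.40}

to_label = select_least_confident(list(toy_confidence), toy_confidence.get, budget=2)
```

The selected samples are the ones a human annotator would see first, so a small labeling budget is spent where the model is weakest.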
Step 4: Fine-tuning with labeled industry data
Training details: In most real-world cases a large amount of unlabeled in-domain data is available, and our approach can perform task-specific pre-training on these large unlabeled datasets. The training process consists of the following steps:
1. On publicly available QA datasets (e.g., SQuAD, NQ, and MARCO), build multiple question generators from multiple pre-trained language models (e.g., GPT-2 and T5).
2. Generate a large number of questions with the question generators.
3. Pre-train on the generated dataset.
4. Fine-tune the model from the previous step on the labeled dataset.
We use a large number of generated QA datasets for pre-training, and the SpanBERT framework for both pre-training and fine-tuning. The objective of the fine-tuning process is to reduce the training error using only labeled data. The main purpose of the fine-tuning step is to readjust weights that may have been trained incorrectly because of generation errors.
The final model is evaluated on SQuAD and NewsQA. SQuAD is used to explore the effect of QG pre-training in-domain, meaning that the same dataset is used for the question generation and span prediction models. To validate on a new domain that is completely different from the source of the QG model, we treat the NewsQA dataset as the new-domain dataset; none of it is used for training, either for question generation or for pre-training. The evaluation metrics are the standard MRC metrics: EM and F1 scores.
Exact Match (EM): whether the span of the top-1 answer exactly matches the correct answer.
F1 score: the word-level overlap between the returned span and the ground-truth answer.
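Both metrics can be sketched in a few lines. The normalization here (lowercasing and whitespace tokenization) is a simplifying assumption; common MRC evaluation scripts also strip punctuation and articles.

```python
from collections import Counter


def tokenize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, truth):
    """1.0 if the predicted span equals the ground truth after normalization."""
    return float(tokenize(prediction) == tokenize(truth))


def f1_score(prediction, truth):
    """Word-level F1 between the predicted span and the ground-truth answer."""
    pred_tokens, truth_tokens = tokenize(prediction), tokenize(truth)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Moninder Singh", "Moninder Singh Pandher")  # 0.0
f1 = f1_score("Moninder Singh", "Moninder Singh Pandher")     # 0.8
```

The example shows why both metrics are reported: a partially correct span scores zero on EM but still earns credit on F1.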
In-domain vs. out-of-domain: Recent natural language processing models achieve impressive performance when the training and test examples come from the same dataset, but tend to perform poorly on out-of-domain (OOD) examples, because many unseen cases can occur at test time.
We use the SpanBERT architecture, which focuses on pre-training span representations and achieves state-of-the-art results, to show the performance gap between in-domain and out-of-domain datasets. We treat the SQuAD1.1 training dataset as the in-domain dataset, used both for training the question generator and for pre-training. We use the NewsQA dataset as the out-of-domain corpus, from which no training samples are taken. We found that the EM score decreased by 78.5% (80.40% → 17.26%). However, with the help of the question generator, we can generate in-domain QA data from unlabeled samples. Without any labeled data we reach 75% of the final performance on SQuAD1.1 and 60% of the final performance on NewsQA. The higher figure for SQuAD1.1 arises because its training data was included when the question generator was built.
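The 78.5% figure is a relative decrease, not a difference in percentage points. A short check, using only the two EM scores quoted above (the function name is introduced here for illustration):

```python
def relative_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


in_domain_em = 80.40      # EM in-domain, as quoted in the text
out_of_domain_em = 17.26  # EM out-of-domain, as quoted in the text

drop = relative_drop(in_domain_em, out_of_domain_em)
absolute_gap = in_domain_em - out_of_domain_em
```

The absolute gap is about 63 EM points; dividing it by the in-domain score gives the relative drop of about 78.5%.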
CheckList evaluation: Although measuring held-out accuracy has been the primary way of evaluating generalization, it often overestimates the performance of NLP models, while alternative evaluation approaches focus on individual tasks or specific behaviors. Inspired by the principles of behavioral testing in software engineering, CheckList was introduced as a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation. Its authors demonstrated the utility of CheckList on three tasks, identifying critical failures in both commercial and state-of-the-art models. The proposed method, based on question-generation pre-training, reduces the failure rate by 18%, in particular on animal vs vehicle v2 (39% reduction), fairness (44% reduction), and temporal tests (93% reduction).
Influence of annotation data size: To explore how the effect of pre-training depends on data size, we tested QG pre-training with 10% and 100% of the dataset. The results show that when enough data is available, the model converges faster, but the final score differs little. This indicates that QG pre-training is more useful in the early stages than in later ones.
Influence of generated data size: We found that the model pre-trained on T5-generated data performed better than the one pre-trained on GPT-2-generated data. However, when both sets of generated data are used together, performance improves considerably. The generated questions are typically longer than human-written ones, and by using both GPT-2 and T5 generation we can add more diverse questions and answers to the training set. For the same answer "Moninder Singh Pandher", T5, GPT-2, and humans raised completely different questions (T5: who was deceased by a lower court). Thus, the diversity of the generation models improves the generalization of the subsequent MRC model.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although terms such as label and domain are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; construing them as any additional limitation would be contrary to the spirit of the invention.

Claims (5)

1. An unsupervised machine reading comprehension method based on large-scale question self-learning, characterized in that the data is first divided into four types: unlabeled general data, labeled general data, unlabeled in-domain data, and labeled in-domain data; the method then proceeds as follows:
S1. For the unlabeled general data, train a standard pre-training model to obtain a Transformer-based pre-trained language model that serves as the bottom layer of the architecture;
S2. For the labeled general data, train the pre-trained language model obtained in step S1 to obtain a question generator, and also use the labeled general data to produce a task-specific general-domain model;
S3. For the unlabeled in-domain data, use the question generator built in step S2 to generate synthetic in-domain data, then filter it with the task-specific general-domain model to obtain a high-quality synthetic in-domain dataset and a low-quality synthetic dataset, and train on the high-quality synthetic in-domain dataset to obtain a new pre-trained model;
S4. For the labeled in-domain data, mix it with the low-quality synthetic dataset obtained by filtering and label the answers, then train the new pre-trained model on the result to obtain the final machine reading comprehension model;
based on the final machine reading comprehension model, input data is processed to obtain the machine reading comprehension result.
2. The unsupervised machine reading comprehension method based on large-scale question self-learning according to claim 1, characterized in that, in step S1, the standard pre-training model is a GPT-2 model or a T5 model.
3. The unsupervised machine reading comprehension method based on large-scale question self-learning according to claim 2, characterized in that question generation based on the trained T5 model comprises: extracting an answer; generating a question from the extracted answer; taking the question and producing an answer; and comparing the extracted answer with the produced answer to judge whether the generated question is correct; question generation based on the trained GPT-2 model comprises: given the natural ordering of language, factorizing the joint probability of the sequence s = (s_1, …, s_n) into a product of conditional probabilities:
$$p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$$
after the GPT-2 model has been trained, for each new word the model computes the probability of the next word conditioned on all existing tokens; it then selects the K highest-probability words and samples at random among these K candidates; this process is repeated until a special symbol or an end-of-sentence symbol appears; for the question generation scenario, the position of a potential answer in the source text is marked with special symbols: a paragraph C = [c_1, …, c_n] and a potential answer within it A = [a_1, …, a_n] are represented as: X = ([CLS], C, [SEP], A); given the above X, it is fed into the trained GPT-2 model or the trained T5 model to obtain the hidden vectors:
Figure FDA0003287219740000022
where X is the input length and h is the size of the hidden vector; finally, H is fed into a fully connected layer to obtain the final result:
Figure FDA0003287219740000023
Figure FDA0003287219740000024
where w is a word, W is a matrix, b is a coefficient, and the final result is the best word selected by argmax.
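The GPT-2 decoding procedure recited in claim 3 (keep the K most probable next words, sample one of them, stop at an end symbol) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `make_toy_model` is a hypothetical stand-in for the trained GPT-2 model, the `<eos>` symbol and the `max_len` cap are invented for the example, and sampling among the K candidates is assumed to be weighted by their renormalised probabilities, which the claim does not specify.

```python
import random
from typing import Callable, Dict, List

EOS = "<eos>"  # hypothetical end-of-sentence symbol (assumption)


def build_input(context: List[str], answer: List[str]) -> List[str]:
    """Build X = ([CLS], C, [SEP], A) as a flat token list."""
    return ["[CLS]", *context, "[SEP]", *answer]


def top_k_sample(probs: Dict[str, float], k: int, rng: random.Random) -> str:
    """Keep the k most probable words and sample one of them.

    Assumption: candidates are drawn in proportion to their probabilities;
    random.choices treats the weights as relative, so no explicit
    renormalisation is needed.
    """
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [p for _, p in top]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(
    next_word_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    k: int,
    rng: random.Random,
    max_len: int = 32,
) -> List[str]:
    """Repeat top-k sampling until EOS appears or max_len words are produced."""
    out: List[str] = []
    while len(out) < max_len:
        word = top_k_sample(next_word_probs(prompt + out), k, rng)
        if word == EOS:
            break
        out.append(word)
    return out


def make_toy_model(prompt_len: int) -> Callable[[List[str]], Dict[str, float]]:
    """Hypothetical stand-in for the trained model: prefers EOS after 3 words."""
    def next_word_probs(tokens: List[str]) -> Dict[str, float]:
        generated = len(tokens) - prompt_len
        if generated >= 3:
            return {EOS: 0.9, "what": 0.05, "is": 0.05}
        return {"what": 0.5, "is": 0.3, "Paris": 0.15, EOS: 0.05}
    return next_word_probs


if __name__ == "__main__":
    x = build_input(["Paris", "is", "the", "capital", "of", "France"], ["Paris"])
    print(generate(make_toy_model(len(x)), x, k=2, rng=random.Random(0)))
```

With `k=1` the loop reduces to greedy decoding, which corresponds to taking the argmax word at each step; larger values of `k` trade determinism for more varied questions.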
4. The unsupervised machine reading comprehension method based on large-scale question self-learning according to claim 1 or 3, characterized in that, in step S3, active learning is performed on the generated data that has round-trip consistency, so that, based on the strengths and weaknesses of the existing model along different dimensions, weak spots in the training data distribution are actively identified and the next batch of data to be labeled is recommended.
5. The unsupervised machine reading comprehension method based on large-scale question self-learning according to claim 4, characterized in that, in step S3, data filtering is performed by round-trip consistency and learning efficiency is improved by active learning.
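The round-trip consistency filter used in step S3 and in claims 3 to 5 can be sketched as follows. This is a hedged illustration only: `generate_question` and `answer_question` are hypothetical callables standing in for the question generator and the task-specific general-domain model, and exact match after whitespace and case normalisation is an assumed comparison rule, since the claims only state that the two answers are compared.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class QAExample:
    context: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def round_trip_filter(
    candidates: Iterable[Tuple[str, str]],
    generate_question: Callable[[str, str], str],
    answer_question: Callable[[str, str], str],
) -> Tuple[List[QAExample], List[QAExample]]:
    """Split synthetic examples into (high_quality, low_quality).

    An example is high quality when the answer predicted from the generated
    question matches the answer the question was generated from.
    """
    high: List[QAExample] = []
    low: List[QAExample] = []
    for context, answer in candidates:
        question = generate_question(context, answer)
        predicted = answer_question(context, question)
        example = QAExample(context, question, answer)
        if normalize(predicted) == normalize(answer):
            high.append(example)
        else:
            low.append(example)
    return high, low


# Hypothetical toy stand-ins, used only to make the sketch runnable.
def toy_generate_question(context: str, answer: str) -> str:
    return f"Which term in the passage is '{answer}'?"


def toy_answer_question(context: str, question: str) -> str:
    # A deliberately weak "model": always answers with the first word.
    return context.split()[0]


if __name__ == "__main__":
    passage = "Paris is the capital of France"
    high, low = round_trip_filter(
        [(passage, "Paris"), (passage, "France")],
        toy_generate_question,
        toy_answer_question,
    )
    print(len(high), len(low))
```

In the claimed method the high-quality set is used to train the new pre-trained model, while the low-quality set is mixed with the labeled in-domain data and relabeled in step S4, so the function returns both sets rather than discarding the failures.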
CN202111151305.9A 2021-02-08 2021-09-29 An unsupervised machine reading comprehension method based on large-scale question self-learning Active CN113836895B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021101731076 2021-02-08
CN202110173107 2021-02-08

Publications (2)

Publication Number Publication Date
CN113836895A true CN113836895A (en) 2021-12-24
CN113836895B CN113836895B (en) 2025-05-09

Family

ID=78967302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151305.9A Active CN113836895B (en) 2021-02-08 2021-09-29 An unsupervised machine reading comprehension method based on large-scale question self-learning

Country Status (1)

Country Link
CN (1) CN113836895B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996424A (en) * 2022-06-01 2022-09-02 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN116229185A (en) * 2023-04-03 2023-06-06 南京大学 A Continuous Learning Image Classification Method for Open Environment
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, device and medium based on refrigerator field
CN116663679A (en) * 2023-07-25 2023-08-29 南栖仙策(南京)高新技术有限公司 Language model training method, device, equipment and storage medium
CN116701930A (en) * 2023-05-29 2023-09-05 北京零点远景网络科技有限公司 A blockchain data acquisition method and device based on a multi-architecture NLP pre-training model
CN117291245A (en) * 2023-09-25 2023-12-26 北京声智科技有限公司 Model training methods, devices, computer equipment and storage media
CN117540021A (en) * 2023-11-28 2024-02-09 中关村科学城城市大脑股份有限公司 Large language model training method, device, electronic equipment and computer readable medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090276764A1 (en) * 2008-05-05 2009-11-05 Ghorbani Ali-Akbar High-level hypermedia synthesis for adaptive web
US20170105683A1 (en) * 2015-10-16 2017-04-20 General Electric Company System and method of adaptive interpretation of ecg waveforms
CN106844356A (en) * 2017-01-17 2017-06-13 中译语通科技(北京)有限公司 A kind of method that English-Chinese mechanical translation quality is improved based on data selection
CN108962224A (en) * 2018-07-19 2018-12-07 苏州思必驰信息科技有限公司 Speech understanding and language model joint modeling method, dialogue method and system
CN109992669A (en) * 2019-04-08 2019-07-09 浙江大学 A kind of keyword answering method based on language model and intensified learning
CN111079431A (en) * 2019-10-31 2020-04-28 北京航天云路有限公司 Entity relation joint extraction method based on transfer learning
CN111444677A (en) * 2020-02-21 2020-07-24 平安科技(深圳)有限公司 Reading model optimization method, device, equipment and medium based on big data
CN111611361A (en) * 2020-04-01 2020-09-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Extractive Machine Intelligence Reading Comprehension Question Answering System
CN111797194A (en) * 2020-05-20 2020-10-20 北京三快在线科技有限公司 Text risk detection method and device, electronic equipment and storage medium
CN112309528A (en) * 2020-10-27 2021-02-02 上海交通大学 A method for generating medical image report based on visual question answering method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
琚心怡;: "基于深层双向Transformer编码器的早期谣言检测", 《信息通信》, no. 5, 15 May 2020 (2020-05-15), pages 17 - 22 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996424A (en) * 2022-06-01 2022-09-02 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN116229185A (en) * 2023-04-03 2023-06-06 南京大学 A Continuous Learning Image Classification Method for Open Environment
CN116701930A (en) * 2023-05-29 2023-09-05 北京零点远景网络科技有限公司 A blockchain data acquisition method and device based on a multi-architecture NLP pre-training model
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, device and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116663679A (en) * 2023-07-25 2023-08-29 南栖仙策(南京)高新技术有限公司 Language model training method, device, equipment and storage medium
CN117291245A (en) * 2023-09-25 2023-12-26 北京声智科技有限公司 Model training methods, devices, computer equipment and storage media
CN117540021A (en) * 2023-11-28 2024-02-09 中关村科学城城市大脑股份有限公司 Large language model training method, device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
CN113836895B (en) 2025-05-09

Similar Documents

Publication Publication Date Title
Gupta et al. Abstractive summarization: An overview of the state of the art
CN113836895B (en) An unsupervised machine reading comprehension method based on large-scale question self-learning
US10482115B2 (en) Providing question and answers with deferred type evaluation using text with limited structure
US8332394B2 (en) System and method for providing question and answers with deferred type evaluation
US9535898B2 (en) Natural language question expansion and extraction
US20110125734A1 (en) Questions and answers generation
Team et al. Fanar: An arabic-centric multimodal generative ai platform
Cheng et al. Research on automatic error correction method in English writing based on deep neural network
Kusuma et al. Automatic question generation with various difficulty levels based on knowledge ontology using a query template
Ahmed et al. On the application of sentence transformers to automatic short answer grading in blended assessment
Johnsi et al. Enhancing automated essay scoring by leveraging LSTM networks with hyper-parameter tuned word embeddings and fine-tuned LLMs
Ramesh et al. Coherence‐based automatic short answer scoring using sentence embedding
Kumar et al. A novel approach for text generation using RNN for language modeling
Thakkar Finetuning transformer models to build asag system
Kowsher et al. Knowledge-base optimization to reduce the response time of bangla chatbot
Kilmen et al. Shortening psychological scales: semantic similarity matters
Sendra et al. Enhanced latent semantic analysis by considering mistyped words in automated essay scoring
Ramesh et al. Coherence based automatic essay scoring using sentence embedding and recurrent neural networks
Bo et al. Bug question answering with pretrained encoders
Hendricks et al. Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Seneviratne et al. Inductive logic programming in an agent system for ontological relation extraction
Kulkarni Named Entity Recognition on Kannada Low Resource Language using Deep Learning Models
Egaña Azpiazu Exploration of aunnotation strategies for entailment-based Automatic Short Answer Grading
Hodzic Automated Extraction of Data from Insurance Websites
Abumansour Check-worthy Arabic Claim Detection across topics for Automated Fact-checking Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221026

Address after: 310000 Room 303, building 3, No. 399, Qiuyi Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Honglong Technology (Hangzhou) Co.,Ltd.

Applicant after: HANGZHOU LINKER TECHNOLOGY CO.,LTD.

Address before: 310000 room 31191, 3 / F, building 1, No. 88, Puyan Road, Puyan street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: Honglong Technology (Hangzhou) Co.,Ltd.

CB02 Change of applicant information

Country or region after: China

Address after: 3rd Floor, Building 4, No. 399 Qiuyi Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province 311000

Applicant after: Qiongjie Intelligent Technology (Hangzhou) Co.,Ltd.

Applicant after: HANGZHOU LINKER TECHNOLOGY CO.,LTD.

Address before: Room 303, Building 3, No. 399 Qiuyi Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: Honglong Technology (Hangzhou) Co.,Ltd.

Country or region before: China

Applicant before: HANGZHOU LINKER TECHNOLOGY CO.,LTD.

GR01 Patent grant