Disclosure of Invention
To overcome the defects of the prior art, the invention provides a noise reduction device and method for remote supervision relation classification data sets based on natural language inference. The method converts the original relation classification data set into a natural language inference data set. When a large amount of correctly labeled supervised data whose distribution matches that of the full remote supervision relation classification data set can be provided, the model is trained by supervised learning; otherwise, the model is trained by reinforcement learning, which does not depend on the labeled data required by supervised learning. Finally, the trained natural language inference model scores the relation classification data set, and high-quality data are selected as the optimized data set according to the scores.
The technical scheme of the invention is specifically introduced as follows:
The invention provides a noise reduction device for a remote supervision relation classification data set based on natural language inference, which comprises:
a data format conversion module for converting the relation classification data set into a natural language inference data set;
a natural language inference model training module for training a natural language inference model on the converted natural language inference data set, wherein the module trains the model by supervised learning when a large amount of correctly labeled supervised data whose distribution matches that of the full remote supervision relation classification data set can be provided, and trains the model by a reinforcement learning method when such supervised data cannot be provided;
and a data set noise reduction module for optimizing the remote supervision relation classification data set by using the trained natural language inference model.
A noise reduction method using the noise reduction device for a remote supervision relation classification data set based on natural language inference comprises the following steps:
data set format conversion: converting the relation classification data set into a natural language inference data set;
natural language inference model training: training the model by supervised learning when high-quality supervised data can be provided, and training the model by a reinforcement learning method when high-quality supervised data cannot be provided;
and data set noise reduction: optimizing the remote supervision relation classification data set through the trained natural language inference model.
The noise reduction device mainly comprises a data format conversion module, a natural language inference model training module and a data set noise reduction module, wherein: the data format conversion module constructs a template for each type of relation in the relation classification data set, uses the template to convert each triple in the data set into a hypothesis in natural language inference, and converts the corresponding text corpus into a premise in natural language inference; the natural language inference model training module handles two cases: in the first case, high-quality labeled data can be separated from the original data set, and these data are used directly as a training set to train the natural language inference model by supervised learning; in the second case, the current data set has a large proportion of noise and manual labeling is costly, and the noise reduction effect of the current model on a validation set is used as feedback to train the parameters of the natural language inference model by a reinforcement learning method; and the data set noise reduction module scores the relation classification data set obtained by remote supervision with the trained natural language inference model, and selects the data with high confidence as the noise-reduced data set according to the scores.
Compared with the prior art, the invention has the beneficial effects that:
1. The method provided by the invention directly eliminates the noise in the remote supervision relation classification data set and does not depend on a bag-level data format, so the data set noise reduction device provided by the invention can be used to optimize relation classification data sets of any form.
2. Other noise reduction methods that use reinforcement learning encode the data simply by concatenating a sentence encoding and a category encoding, whereas the invention converts the noise discovery problem into a natural language inference problem: a sample is treated as noise when its hypothesis cannot be inferred from its premise. Many effective models already exist for natural language inference problems, so the method has higher computational efficiency and a better effect.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the drawings and the embodiments.
A noise reduction device for a remote supervision relation classification data set based on natural language inference comprises the following modules:
(1) A data format conversion module: corresponding templates are constructed according to the semantics of each relation in the relation classification data set; the triples in the relation classification data set are converted into hypotheses in natural language inference, and the texts in the relation classification data set are taken as the premises in natural language inference, thereby constructing a natural language inference training set.
(2) A natural language inference model training module: when the original data can provide a large amount of correctly labeled supervised data whose distribution matches that of the full remote supervision relation classification data set, the model can be trained by supervised learning; when the original data contain a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, the natural language inference model is trained by a reinforcement learning method, which does not depend on the labeled data required by supervised learning.
(3) A data set noise reduction module: the relation classification data set obtained by remote supervision is scored by the trained natural language inference model, and data with high confidence are selected as the optimized data set according to the scores.
A noise reduction method using the noise reduction device for a remote supervision relation classification data set specifically comprises the following steps:
Step one: data set format conversion. Corresponding templates are constructed according to the semantics of each relation in the relation classification data set; the texts in the relation classification data set are then taken as the premises in natural language inference, and the triples corresponding to the texts are converted into the hypotheses in natural language inference through the templates, thereby constructing a natural language inference training set.
Step two: natural language inference model training. When the original data can provide a large amount of correctly labeled supervised data whose distribution matches that of the full remote supervision relation classification data set, the model can be trained by supervised learning; when the original data contain a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, the natural language inference model is trained by a reinforcement learning method.
Step three: data set noise reduction. The remote supervision relation classification data set is scored by the natural language inference model trained in step two, and data with high confidence are then selected as the optimized data set according to the scores.
The invention converts the noise discovery problem of the remote supervision relational classification data set into the natural language inference problem, scores the relational classification data set by using the trained natural language inference model, and filters the data set based on the scores to obtain a clean data set.
A large-scale relational classification data set can be obtained through a remote supervision mode. Although the way in which the data set is constructed using remote supervision is quite efficient, the data set obtained in this way usually contains a lot of noise. The invention aims to find noise data in a data set and eliminate the noise data so as to obtain a high-quality relational classification data set.
Each module is explained in detail below.
1. Data format conversion module
The format of each sample in the relation classification data set obtained by remote supervision is (h, t, r, text), where text is a passage of text, h and t are two entities in the text, and r is the relation that the entity pair expresses in the text. The relation classification data are converted into the natural language inference input format (P, H), where P is a premise and H is a hypothesis. The specific conversion method is as follows: first, corresponding templates are constructed according to the semantics of each relation in the relation classification data set, as shown in fig. 1; then the triple (h, t, r) in the original sample is converted into the hypothesis H through the corresponding template, and the text in the original sample is taken as the premise P. In this way, the natural language inference input format corresponding to every sample in the original data set can be obtained.
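The conversion described above can be illustrated with a minimal Python sketch. The relation names, the template wording and the names `RelationSample`, `TEMPLATES` and `to_nli` are hypothetical and chosen only for illustration; the actual templates of the invention are those shown in fig. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationSample:
    """One remote supervision sample in the (h, t, r, text) format."""
    h: str     # head entity
    t: str     # tail entity
    r: str     # relation label
    text: str  # text in which both entities appear


# Hypothetical templates, one per relation, written from the relation's semantics.
TEMPLATES = {
    "place_of_birth": "{h} was born in {t}.",
    "founder_of": "{h} founded {t}.",
}


def to_nli(sample: RelationSample) -> tuple[str, str]:
    """Return (P, H): the text is the premise, the filled template the hypothesis."""
    template = TEMPLATES[sample.r]  # raises KeyError for a relation without a template
    hypothesis = template.format(h=sample.h, t=sample.t)
    return sample.text, hypothesis
```

For example, a sample with h = "Marie Curie", t = "Warsaw" and r = "place_of_birth" yields the hypothesis "Marie Curie was born in Warsaw.", while the premise is the sample text unchanged.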
2. Natural language inference model training module
When a sufficiently high quality supervised data set is available, the natural language inference model can be trained by supervised learning directly using the natural language inference data set obtained by the data format conversion module.
When a large amount of correctly labeled supervised data whose distribution matches that of the full remote supervision relation classification data set cannot be obtained, the natural language inference model is trained by reinforcement learning, which does not depend on the labeled data required by supervised learning. The specific training method is as follows: as shown in fig. 2, a batch of samples is selected from the original relation classification data set and format-converted; the converted samples are scored by the natural language inference model, and the corresponding relation classification samples are selected according to the scores. A relation classifier is trained on the selected samples, its performance on the validation set is taken as the feedback of reinforcement learning, and the natural language inference model is updated according to the selection result and the corresponding feedback. The specific parameter updating formula is as follows:
wherein the formula is expressed in terms of the parameters of the natural language inference model; β is the learning rate; x_v and r_v are the inputs and annotations of the validation set, respectively; D_B is a batch of data in the training set, and B_s is the size of that batch; the selection made by the current natural language inference model on D_B takes the value 0 or 1 for each sample, where 1 indicates that the sample is kept and 0 indicates that it is discarded; f_θ is the relation classification model trained on the data set screened from D_B by this selection; F1(f_θ(x_v), r_v) is the F1 value of the relation classification model f_θ on the validation set; δ is the moving average of the F1 values; and the probability term is the probability that the natural language inference model, under its current parameters, makes this selection on D_B.
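The symbol definitions above are consistent with a policy-gradient (REINFORCE-style) update. As a sketch only, and not as the confirmed formula of the invention, one such update can be written as φ ← φ + β · (1/B_s) · (F1(f_θ(x_v), r_v) - δ) · Σ_i ∇_φ log π_φ(a_i | x_i), where φ (the parameters of the natural language inference model), a_i (the 0/1 selection for sample i) and π_φ (the selection probability) are symbols assumed here. The Python code below implements this update for a stand-in policy; the logistic scorer over feature vectors and all function names are hypothetical, and a real system would use a natural language inference model instead.

```python
import math
import random


def keep_probability(w, x):
    """Stand-in policy: probability of keeping a sample with features x.

    A logistic scorer replaces the natural language inference model here
    (an assumption made only to keep the sketch self-contained).
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def sample_selection(w, batch, rng):
    """Draw a 0/1 decision per sample: 1 keeps the sample, 0 discards it."""
    return [1 if rng.random() < keep_probability(w, x) else 0 for x in batch]


def reinforce_update(w, batch, actions, f1, delta, beta):
    """One policy-gradient step on the policy parameters w.

    f1    : F1 of the relation classifier trained on the kept samples,
            measured on the validation set (computed by the caller)
    delta : moving average of earlier F1 values, used as a baseline
    beta  : learning rate
    """
    advantage = f1 - delta
    grad = [0.0] * len(w)
    for x, a in zip(batch, actions):
        p = keep_probability(w, x)
        # Gradient of log pi(a | x) for a Bernoulli-logistic policy is (a - p) * x.
        for j, xj in enumerate(x):
            grad[j] += (a - p) * xj
    scale = beta * advantage / len(batch)  # len(batch) plays the role of B_s
    return [wj + scale * gj for wj, gj in zip(w, grad)]
```

In a training loop, the caller would draw a selection with `sample_selection`, train the relation classifier on the kept samples, measure its F1 value on the validation set, call `reinforce_update`, and then refresh the moving average δ. When the F1 value exceeds δ, the update makes the selection just taken more likely; when it falls below δ, the update makes it less likely.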
3. Data set noise reduction module
The data set noise reduction module scores and screens the full remote supervision relation classification data set. The relation classification data set is converted into the corresponding natural language inference input format by the data format conversion module, a fully trained natural language inference model is obtained from the natural language inference model training module, and the data set is scored with this model. The lower the score, the less the hypothesis can be inferred from the premise, that is, the less the original triple is supported by the corresponding text. Data with higher scores are therefore selected as the optimized data set, and data with lower scores are removed from the original data set as noise. A specific example of this module is shown in fig. 3.
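The screening step can be sketched in Python as follows. The text only states that high-scoring data are kept and low-scoring data are removed, so the fixed cut-off `threshold` and the function name `denoise` are assumptions; keeping a top-ranked fraction of the data would be an equally plausible reading.

```python
def denoise(samples, scores, threshold):
    """Split samples into (kept, removed) by their inference scores.

    samples   : relation classification samples, in any representation
    scores    : one score per sample from the trained inference model;
                a higher score means the hypothesis is better supported
    threshold : hypothetical cut-off; samples scoring below it count as noise
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    kept, removed = [], []
    for sample, score in zip(samples, scores):
        (kept if score >= threshold else removed).append(sample)
    return kept, removed
```

Returning the removed samples alongside the kept ones makes it possible to inspect what was filtered out as noise before the optimized data set is used for training.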