CN111797609B - Model training method and device - Google Patents
Model training method and device
- Publication number
- CN111797609B (application CN202010639435.6A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- model
- training
- data
- difference
- Prior art date
- Legal status
- Active
Classifications
- G06F40/205 — Handling natural language data; Natural language analysis; Parsing
- G06F16/355 — Information retrieval of unstructured textual data; Clustering/Classification; Creation or modification of classes or clusters
- G06F40/295 — Handling natural language data; Recognition of textual entities; Phrasal analysis; Named entity recognition
Abstract
The application relates to the technical field of natural language processing, and provides a model training method and device. The model training method comprises: obtaining a general model, wherein the general model is a pre-trained, task-independent language model; acquiring a first corpus and a second corpus, wherein the first corpus is a corpus in the general domain and the second corpus is a corpus in a target domain related to a target task; determining a first data ratio based on the difference between the first corpus and the second corpus, and mixing the two corpora according to the first data ratio to obtain first training data; and training, with the first training data, a specialized model for performing the target task, the specialized model comprising the general model and an adaptation structure related to the target task. The method can be regarded as a solution for continuing to train the general model to achieve domain adaptation; by reasonably proportioning the first corpus and the second corpus in the first training data, the performance of the trained specialized model is improved.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a model training method and device.
Background
In recent years, machine reading comprehension has been widely applied to dynamically extracting information from articles of all kinds and to assisting various question-answering robots. In the field of natural language processing, a trained neural network model is generally used to perform reading comprehension tasks in specific domains (such as the financial insurance domain, the policy and regulation domain, the education domain, and the communication and IT domain). However, existing pre-trained models are all trained on the general domain, and a certain degree of accuracy is lost when a general-domain pre-trained model is used to perform a reading comprehension task in a specific domain. The general-domain pre-trained model therefore needs to be trained further to achieve domain adaptation, but the prior art provides no good solution for how this continued training should be carried out.
Disclosure of Invention
The embodiments of the present application aim to provide a model training method and device to address the above technical problem.
In order to achieve the above purpose, the present application provides the following technical solutions:
In a first aspect, an embodiment of the present application provides a model training method, including: obtaining a general model, wherein the general model is a pre-trained, task-independent language model; acquiring a first corpus and a second corpus, wherein the first corpus is a corpus in the general domain, the second corpus is a corpus in a target domain related to a target task, the target task is a natural language processing task, and the target domain is the domain to which the target task belongs; determining a first data ratio based on the difference between the first corpus and the second corpus, and mixing the first corpus and the second corpus according to the first data ratio to obtain first training data, wherein the difference between the first corpus and the second corpus is inversely related to the first data ratio; and training, using the first training data, a specialized model for performing the target task, wherein the specialized model comprises the general model and an adaptation structure related to the target task.
In the method, the general model is a task-independent pre-trained model, and the specialized model is a model for executing a target task in a specific domain. Since the specialized model comprises the general model, the scheme for training the specialized model can also be regarded as a solution for continuing to train the general model to achieve domain adaptation. The target task is a natural language processing task; it may be, but is not limited to, a reading comprehension task, and may also be a text classification task, a named entity recognition task, or the like.
In the scheme, first, the first corpus (general-domain data) and the second corpus (domain-specific data) are mixed to generate the first training data, so that the first training data has good knowledge strength without being so biased toward domain knowledge that it abandons general-domain language expression or even causes problems such as overfitting, which improves the performance of the trained specialized model. Second, the scheme determines the proportion of the two kinds of data in the first training data based on the difference between the first corpus and the second corpus, which helps to balance the domain specificity and generality of the model reasonably and further improves the performance of the trained specialized model.
In an implementation manner of the first aspect, determining the first data ratio based on the difference between the first corpus and the second corpus includes: acquiring a first difference coefficient, wherein the first difference coefficient is positively correlated with the occurrence frequency, in a test corpus of the target domain, of keywords of the target domain; calculating a second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus, wherein the second difference coefficient is positively correlated with the difference between the text lengths; and determining the first data ratio according to the first difference coefficient and the second difference coefficient.
In the above implementation manner, the keywords of the target domain may be specified by domain experts; such keywords essentially occur only in corpora of the target domain and rarely occur in the general domain, so the first difference coefficient can represent the difference between the first corpus and the second corpus at the knowledge level, i.e., it mainly describes the difference caused by the two corpora belonging to different domains. The second difference coefficient is calculated from the difference between the text lengths in the first corpus and the second corpus, so it represents the difference between the two corpora at the level of language structure, i.e., it mainly describes the difference caused by their linguistic features. Together, the two coefficients comprehensively and effectively reflect the difference between the first corpus and the second corpus.
In an implementation manner of the first aspect, the target task is an extractive reading comprehension task, and calculating the second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus includes: calculating the average article length L1, average question length L2, and average answer length L3 for reading comprehension in the first corpus; calculating the average article length P1, average question length P2, and average answer length P3 for reading comprehension in the second corpus; and calculating the second difference coefficient according to the difference between P1 and L1, the difference between P2 and L2, and the difference between P3 and L3, wherein the second difference coefficient is positively correlated with each of these differences.
An extractive reading comprehension task is a reading comprehension task in which the answer to a question is a span of the source article, and its three elements are the article, the question, and the answer. Thus, for an extractive reading comprehension task, the text lengths in a corpus include the lengths of the articles, the questions, and the answers, and when calculating the second difference coefficient of the first corpus and the second corpus, the difference in article length, the difference in question length, and the difference in answer length between the two corpora all need to be considered.
In an implementation manner of the first aspect, training, using the first training data, a specialized model for performing the target task includes: in the process of training the specialized model with the first training data, periodically evaluating the convergence degree of the specialized model using a validation set, and setting the learning rate used in the training process according to the convergence degree, wherein the learning rate is set to be inversely related to the convergence degree.
In the above implementation manner, the learning rate is adjusted dynamically according to the convergence degree of the model, that is, according to the actual performance of the model during training, especially in the later stage of training, so that the learning rate decays as the model converges. This adjustment manner is flexible and effective.
In an implementation manner of the first aspect, the learning rate is a decaying learning rate with warm-up, and periodically evaluating the convergence degree of the specialized model using a validation set during training with the first training data and setting the learning rate according to the convergence degree includes: in the learning-rate decay stage of training the specialized model with the first training data, periodically evaluating the convergence degree of the specialized model using the validation set, and reducing the value of the learning rate according to the convergence degree, wherein the amount by which the learning rate is reduced is inversely related to the convergence degree.
A decaying learning rate with warm-up uses a small learning rate at the initial stage of training and switches to a preset learning rate once the model has stabilized (the so-called learning-rate warm-up), which helps to accelerate convergence and improve the training effect. After training with the preset learning rate for a period of time, the learning rate begins to decay, which avoids problems such as overfitting caused by an overly large learning rate when the model approaches convergence. During the decay stage, the learning rate is adjusted dynamically using the validation set, which makes the adjustment flexible and objective and thereby improves the training effect.
In an implementation manner of the first aspect, the target task is an extractive reading comprehension task whose answers satisfy a first statistical rule, and obtaining the second corpus includes: searching for and/or constructing corpora for extractive reading comprehension according to the target domain, and screening out, from these corpora, the corpus whose answers satisfy the first statistical rule as the second corpus.
In this implementation manner, the process of obtaining the second corpus is highly adapted to the target task that the trained specialized model will actually execute, so the obtained corpus is more targeted and the training effect (for the target task to be executed) is better.
In an implementation manner of the first aspect, obtaining the general model includes: acquiring an original general model, wherein the original general model is a pre-trained language model of the general domain; acquiring a third corpus and a fourth corpus, wherein the third corpus is a corpus in the general domain and the fourth corpus is a corpus in the target domain; determining a second data ratio based on the difference between the third corpus and the fourth corpus, and mixing the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and training the original general model with the second training data to obtain the general model.
The original general model may be an existing public model, such as an ERNIE model or a BERT model. This implementation manner fine-tunes the parameters of the original general model through training to obtain the general model, then builds the specialized model on top of the general model and trains it further. The second training data used to fine-tune the original general model is similar to the first training data: it is also composed of general-domain data and target-domain data, mixed in a similarly determined ratio, which improves the quality of the trained general model.
It should be noted that the second training data and the first training data differ in form, because the first training data is directed at a downstream target task, e.g., a reading comprehension task or a text classification task, while the second training data is independent of the target task and is used by upstream pre-training tasks, e.g., Masked LM or Next Sentence Prediction.
In a second aspect, an embodiment of the present application provides a model training method, including: acquiring an original general model, wherein the original general model is a pre-trained language model of the general domain; acquiring a third corpus and a fourth corpus, wherein the third corpus is a corpus in the general domain, the fourth corpus is a corpus in the target domain to which a target task belongs, and the target task is a natural language processing task; determining a second data ratio based on the difference between the third corpus and the fourth corpus, and mixing the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and training the original general model with the second training data to obtain a general model, wherein the general model is used to form, together with an adaptation structure related to the target task, a specialized model for executing the target task.
The method provided by the second aspect is similar to the way the general model is obtained in the last implementation of the first aspect, but in the method of the second aspect the specialized model may be built directly on the general model and put into use without further training (it may of course also be trained before being put into use). Because the general-domain data and the target-domain data are mixed reasonably when the general model is generated through training, the obtained general model strikes a good balance between domain specificity and generality, which improves the performance of the specialized model built on it.
In a third aspect, an embodiment of the present application provides a model training apparatus, including: a first model acquisition module, configured to acquire a general model, wherein the general model is a pre-trained, task-independent language model; a first data acquisition module, configured to acquire a first corpus and a second corpus, wherein the first corpus is a corpus in the general domain, the second corpus is a corpus in a target domain related to a target task, the target task is a natural language processing task, and the target domain is the domain to which the target task belongs; a first data mixing module, configured to determine a first data ratio based on the difference between the first corpus and the second corpus and to mix the first corpus and the second corpus according to the first data ratio to obtain first training data, wherein the difference between the first corpus and the second corpus is inversely related to the first data ratio; and a first training module, configured to train, using the first training data, a specialized model for performing the target task, wherein the specialized model comprises the general model and an adaptation structure related to the target task.
In a fourth aspect, an embodiment of the present application provides a model training apparatus, including: a second model acquisition module, configured to acquire an original general model, wherein the original general model is a pre-trained language model of the general domain; a second data acquisition module, configured to acquire a third corpus and a fourth corpus, wherein the third corpus is a corpus in the general domain, the fourth corpus is a corpus in the target domain to which a target task belongs, and the target task is a natural language processing task; a second data mixing module, configured to determine a second data ratio based on the difference between the third corpus and the fourth corpus and to mix the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and a second training module, configured to train the original general model with the second training data to obtain a general model, wherein the general model is used to form, together with an adaptation structure related to the target task, a specialized model for executing the target task.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon computer program instructions which, when read and executed by a processor, perform the method provided by the first aspect or any one of the possible implementations of the first aspect.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory having stored therein computer program instructions which, when read and executed by the processor, perform the method of the first aspect or any one of the possible implementations of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a flow chart of a model training method provided by an embodiment of the application;
FIG. 2 shows a block diagram of a model training apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram of another model training apparatus according to an embodiment of the present application;
FIG. 4 shows a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It should be noted that like reference numerals and letters denote like items in the following figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. The terms "first," "second," "third," and the like are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Fig. 1 shows a flowchart of a model training method according to an embodiment of the present application. The method may be, but is not limited to being, performed by an electronic device; a possible structure of such an electronic device is shown in FIG. 4 and described in detail later. Referring to FIG. 1, the method includes:
step S110: a generic model is obtained.
The general model is a language model that is independent of any specific natural language processing task; its design purpose is mainly to learn the expression patterns of the language itself, and in most cases it is not (or cannot directly be) used to execute a specific task. In contrast, the specialized model (which appears in step S140) is a model for executing a specific natural language processing task, hereinafter referred to as the target task for simplicity; the target task may be a reading comprehension task, a text classification task, a named entity recognition task, or the like.
The target task often relates to a specific domain, for example, the financial insurance domain, the policy and regulation domain, the education domain, or the communication and IT domain, which may be called the target domain. In contrast, some text, for example text from daily conversations, does not belong to any particular domain and can be considered to be in the general domain.
Structurally, both the general model and the specialized model may be implemented as neural networks. The specialized model may be formed from the general model and an adaptation structure related to the target task; for example, a specialized model capable of executing the target task can be formed by adding a fully connected layer, a softmax layer, and the like after the general model, where the fully connected layer and the softmax layer are the adaptation structure related to the target task and can be designed differently for different target tasks. In most cases, the general model is the main structure of the specialized model, and the adaptation structure is relatively simple.
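As an illustration only, a minimal PyTorch-style sketch of such a composition is given below; the encoder interface, layer sizes, and names are assumptions rather than part of the present application.

```python
import torch.nn as nn

class SpecializedModel(nn.Module):
    """A specialized model: a pre-trained general encoder plus a task-related adaptation structure."""

    def __init__(self, general_model, hidden_size, num_labels):
        super().__init__()
        self.general_model = general_model                 # pre-trained, task-independent encoder
        self.adapter = nn.Linear(hidden_size, num_labels)  # task-related adaptation structure

    def forward(self, input_ids, attention_mask=None):
        # Assumes the encoder returns per-token hidden states as its first output.
        hidden_states = self.general_model(input_ids, attention_mask=attention_mask)[0]
        logits = self.adapter(hidden_states)
        return nn.functional.softmax(logits, dim=-1)
```

As in the description above, the pre-trained encoder remains the main structure and the adaptation structure is a thin, task-dependent head.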
The general model is a pre-trained model, and the task used to train it may be referred to as a pre-training task or an upstream task; correspondingly, the target task may be referred to as a downstream task. Pre-training tasks are generally independent of any specific domain; commonly used pre-training tasks include Masked LM, Next Sentence Prediction, and the like. The general model can be obtained in two main ways:
Mode one: obtain it directly from public channels, e.g., the BERT Chinese model published by Google, the ERNIE model published by Baidu, and the like. Such models are usually generated by training on general-domain corpora; the model input can be a sentence and the output a vectorized representation of the sentence, i.e., the model can be regarded as extracting features of the sentence, based on which the corresponding downstream task can be performed. By comparison, ERNIE has some advantages for Chinese processing, specifically:
(1) Compared with the BERT Chinese model, the Masked LM task in ERNIE's pre-training adopts word-level masking, which is better suited to Chinese, so ERNIE performs better on Chinese tasks.
(2) Compared with the BERT Chinese model, the quantity and quality of ERNIE's pre-training corpus are better, so ERNIE achieves a better effect on Chinese tasks.
Therefore, if the target task is a Chinese task, the ERNIE model may be preferred. It should be appreciated, however, that language models are constantly being improved, so the model selection strategy is not fixed; it is not excluded that the BERT Chinese model could be improved beyond the ERNIE model.
Mode two: first obtain a model from a public channel (the model obtained at this point is the original general model), and then obtain the general model by training the original general model to fine-tune its parameters. For example, the parameters of the original general model may be fine-tuned using corpora of the target domain so as to bias the original general model appropriately toward the target domain. Mode two is further illustrated later and is not described in detail here.
By comparison, obtaining the general model in mode one is simpler, while the general model obtained in mode two is of better quality owing to the parameter fine-tuning, which facilitates the later training of the specialized model.
Step S120: acquiring a first corpus and a second corpus.
The first corpus is a corpus in the general domain, and the second corpus is a corpus in the target domain related to the target task. It can be understood that the better the second corpus is adapted to the target task, the better the trained specialized model performs when executing the target task.
Taking the target task as an extractive reading comprehension task as an example, an extractive reading comprehension task means: after an article is read, a question about the article is answered, and the answer must be a span of the original text (e.g., a word or sentence in the original text). If the answers of the target task satisfy a first statistical rule, corpora for extractive reading comprehension can be searched for and/or constructed according to the target domain, and the corpus whose answers satisfy the first statistical rule can be screened out from them as the second corpus. The meaning is explained below through specific examples (a sketch of such screening follows the examples):
(1) If the answers of the target task are short (here, "the answer is short" is the first statistical rule), for example the answer is a named entity or a time (e.g., in a flight-booking scenario, the destination and departure time are to be extracted from a passage spoken by the customer), corpora for extractive reading comprehension can first be searched for and/or constructed according to the target domain, and then the corpora whose answers are short and/or are named entities or times can be screened out as the second corpus.
(2) If the answers of the target task are long (here, "the answer is long" is the first statistical rule), for example the answer is an explanation of some concept or a statement of some fact (e.g., in a history teaching scenario, a passage of text is required to answer what influence feudal autocratic monarchy had on Chinese society), corpora for extractive reading comprehension can be searched for and/or constructed according to the target domain, and then the corpora whose answers are long and/or are explanations of concepts or statements of facts can be screened out as the second corpus.
(3) If some other rule can be found by statistically analyzing the answers of the target task, for example when designing a question-answering robot dedicated to answering science questions, it is difficult to find a statistical rule in the length of the answers, but it can be determined that the answers all belong to science subjects, so "the answer belongs to science subjects" is the first statistical rule. In this case, corpora for extractive reading comprehension can be searched for and/or constructed according to the target domain, and then the corpora whose answers belong to science subjects can be screened out as the second corpus.
Corpus acquisition can be performed automatically by a program, performed manually, or combined: initial screening by a program followed by manual verification or further screening.
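By way of example, the following Python sketch shows how such screening might be automated; the sample format (article, question, answer) and the length thresholds are assumptions, not part of the present application.

```python
def screen_by_answer_rule(candidate_corpus, satisfies_rule):
    """Keep only the samples whose answers satisfy the first statistical rule.

    candidate_corpus: iterable of (article, question, answer) triples (assumed sample format).
    satisfies_rule:   a predicate encoding the first statistical rule.
    """
    return [sample for sample in candidate_corpus if satisfies_rule(sample[2])]

# Example rule for case (1): short answers such as named entities or times.
def short_answer_rule(answer):
    return len(answer) <= 10      # the length threshold is illustrative

# Example rule for case (2): long, explanation-style answers.
def long_answer_rule(answer):
    return len(answer) >= 50      # the length threshold is illustrative
```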
Step S130: determining a first data ratio based on the difference between the first corpus and the second corpus, and mixing the first corpus and the second corpus according to the first data ratio to obtain first training data.
The generic model is mentioned before as being task-independent, so to relate the resulting specialized model to the target task, a second corpus related to the target task must be introduced into the training data of the specialized model to achieve the domain bias.
However, in the scheme of the present application, the first training data does not consist only of the second corpus; the first corpus is mixed with the second corpus. The reason is that domain-specific data (the second corpus) carries high-strength domain knowledge but covers language expression insufficiently, while general-domain data (the first corpus) has little or no domain knowledge but strong language expression capability. Mixing the two kinds of corpora gives the first training data good knowledge strength without biasing it so far toward domain knowledge that general-domain language expression is abandoned or overfitting occurs, which helps to improve the performance of the trained specialized model.
Further, the ratio of the first corpus to the second corpus in the first training data (i.e., the first data ratio, which may, for example, be the ratio of the number of first-corpus samples to the number of second-corpus samples in the first training data) is very important, because it directly determines the balance of the specialized model between domain specificity and generality; setting the first data ratio reasonably therefore helps to improve the performance of the trained specialized model.
In the scheme of the present application, the first data ratio is determined based on the difference between the first corpus and the second corpus, and the difference is inversely related to the first data ratio. If the difference between the first corpus and the second corpus is larger, the target domain is more strongly domain-specific, and the share of the first corpus in the first training data should be smaller, because more domain data is needed to reflect this strong domain character; conversely, if the difference is smaller, the target domain is less domain-specific, and the share of the first corpus in the first training data can be larger. The present application does not limit the specific form of the negative correlation; in the simplest case it may be an inverse proportion, but other relationships are possible.
In some implementations, the first data ratio may be calculated as follows:
Step A: obtain a first difference coefficient (denoted λ1); the first difference coefficient is positively correlated with the occurrence frequency of keywords of the target domain in a test corpus of the target domain. The present application does not limit the specific form of the positive correlation; in the simplest case it may be a direct proportion, but other relationships are possible.
Keywords of the target domain may be specified by domain experts; they are representative, occur essentially only in corpora of the target domain, and rarely occur in the general domain. For example, keywords such as "modulate", "demodulate", and "frequency division multiplexing" in the communication and IT domain are rarely used in the general domain. The first difference coefficient can therefore represent the difference between the first corpus and the second corpus at the knowledge level, i.e., it mainly describes the difference caused by the two corpora belonging to different domains. The way the test corpus of the target domain is selected is not limited, and it does not have to be related to the second corpus. The first difference coefficients calculated by the above method for some domains are listed below:
| Domain | Financial insurance | Policy and regulation | Education | Communication and IT |
|---|---|---|---|---|
| First difference coefficient | 0.75 | 0.91 | 0.32 | 1 |
The first difference coefficient of the communication and IT domain is set to 1 (its domain specificity is considered very strong), and the first difference coefficients of the other domains are calculated correspondingly with the communication and IT domain as the reference (so they do not exceed 1).
The value of the first difference coefficient depends only on the domain to which the corpus belongs, so once calculated it can be used for a long time and does not need to be updated, or at least not frequently. For example, the first difference coefficients may be calculated in advance and stored in a table like the one above, and the table is queried directly when the first data ratio needs to be calculated.
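One possible Python realization of the first difference coefficient is sketched below; the application only fixes the positive correlation with keyword frequency and the normalization against the reference domain, so the exact formula here is an assumption.

```python
def first_difference_coefficient(test_corpus, domain_keywords, reference_rate):
    """One possible realization of the first difference coefficient lambda1 (assumed formula).

    test_corpus:     list of texts from the target domain.
    domain_keywords: keywords specified by domain experts for the target domain.
    reference_rate:  keyword occurrence rate of the reference domain (communication and IT),
                     whose coefficient is defined as 1.
    """
    total_chars = sum(len(text) for text in test_corpus)
    hits = sum(text.count(keyword) for text in test_corpus for keyword in domain_keywords)
    frequency = hits / max(total_chars, 1)
    return min(frequency / reference_rate, 1.0)   # positively correlated with frequency, capped at 1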
Step B: calculate a second difference coefficient (denoted λ2) according to the difference between the text length in the first corpus and the text length in the second corpus; the second difference coefficient is positively correlated with the difference between the text lengths. The present application does not limit the specific form of the positive correlation; in the simplest case it may be a direct proportion, but other relationships are possible.
The second difference coefficient is calculated according to the difference between the text lengths in the first corpus and the second corpus, so that the difference between the first corpus and the second corpus on the language structure level is represented, or the difference caused by the language features of the two corpora is mainly described.
For example, when the target task is an extractive reading comprehension task, the second difference coefficient may be calculated as follows:
first, calculate the average article length L1, average question length L2, and average answer length L3 for reading comprehension in the first corpus (for example, the SQuAD dataset);
then, calculate the average article length P1, average question length P2, and average answer length P3 for reading comprehension in the second corpus;
finally, calculate the second difference coefficient according to the difference between P1 and L1, the difference between P2 and L2, and the difference between P3 and L3; the second difference coefficient is positively correlated with each of these differences.
For example, in an alternative, the second coefficient of difference is calculated as follows:
λ2=g(P1/L1)*g(P2/L2)*g(P3/L3)
where g(·) is a predefined function: g(x/y) = x/y if x > y, otherwise g(x/y) = y/x (so each factor is at least 1).
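A minimal Python sketch of this formula is given below; the average lengths L1 to L3 and P1 to P3 are assumed to have been computed beforehand.

```python
def g(x, y):
    # g(x/y) = x/y if x > y, otherwise y/x, so every factor is at least 1.
    return x / y if x > y else y / x

def second_difference_coefficient(L1, L2, L3, P1, P2, P3):
    # lambda2 = g(P1/L1) * g(P2/L2) * g(P3/L3)
    return g(P1, L1) * g(P2, L2) * g(P3, L3)
```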
It will be appreciated that the above formula for calculating the second difference coefficient is only an example; different schemes may use different formulas, for example multiplying the right-hand side by some coefficient, taking a weighted sum of the three factors instead of their product, or measuring the text-length difference not as a ratio but, say, as the absolute value of the difference.
It should further be understood that the extractive reading comprehension task is only an example; if the target task is not of this type, the corpus may not contain articles, questions, and answers, and the text-length difference can then be computed in another way, for example from the difference in sentence lengths in the corpora.
Step C: determine the first data ratio according to the first difference coefficient and the second difference coefficient.
For example, the first data ratio may be defined as 1/(λ1×λ2), but other definitions, such as 1/(λ1+λ2), may also be adopted; the present application is not limited in this respect. The specific definition can be chosen according to the actual performance of the trained specialized model.
In addition, it is not excluded that in some implementations the first data ratio is determined based solely on the first difference coefficient or solely on the second difference coefficient. For example, for reading comprehension tasks there is the SQuAD dataset, which is widely accepted in academia, but for other natural language processing tasks there is not necessarily such a well-recognized dataset; in that case the first data ratio may be determined directly from the first difference coefficient alone.
As for how to mix the two kinds of corpora according to the first data ratio, the present application is not limited; for example, the corpora may be interleaved uniformly according to the first data ratio, or mixed randomly according to the first data ratio.
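The following Python sketch shows one way to determine the ratio and mix the corpora; it assumes the ratio definition 1/(λ1×λ2) from step C and represents each corpus as a list of samples.

```python
import random

def mix_corpora(first_corpus, second_corpus, lambda1, lambda2, seed=0):
    """Mix the two corpora according to the first data ratio (assumed to be 1 / (lambda1 * lambda2),
    i.e. the number of first-corpus samples per second-corpus sample)."""
    rng = random.Random(seed)
    ratio = 1.0 / (lambda1 * lambda2)
    n_first = min(len(first_corpus), int(round(ratio * len(second_corpus))))
    mixed = rng.sample(list(first_corpus), n_first) + list(second_corpus)
    rng.shuffle(mixed)   # random mixing; uniform interleaving would also satisfy the scheme
    return mixed
```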
The first training data may also be cleaned after it is obtained, to improve data quality. The cleaning step is of course optional; for example, if the quality of the first training data already meets the training requirements, no cleaning is needed. In addition, data cleaning may be performed right after the first corpus and the second corpus are obtained; there is no need to wait until the first training data is obtained. Data cleaning items include, but are not limited to:
(1) Converting full-width and half-width characters in the first training data into a unified form, e.g., all converted to half-width or all converted to full-width;
(2) Converting punctuation marks in the first training data into a unified form. For example, the same punctuation mark may be written differently in Chinese and English text (e.g., the Chinese period is a small circle while the English period is a dot); unifying the form prevents the computer from treating them as symbols with different meanings.
When performing data cleansing, one or more of the above may be selected for execution as desired.
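A Python sketch covering both cleaning items is given below; the punctuation mapping is illustrative rather than exhaustive.

```python
def clean_text(text):
    """Illustrative cleaning: unify full-width/half-width characters, then unify punctuation."""
    chars = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                    # full-width space -> half-width space
            ch = ' '
        elif 0xFF01 <= code <= 0xFF5E:        # full-width ASCII forms -> half-width
            ch = chr(code - 0xFEE0)
        chars.append(ch)
    # Unify punctuation not covered above, e.g. the Chinese period and quotation marks.
    punctuation_map = {'。': '.', '、': ',', '“': '"', '”': '"', '‘': "'", '’': "'"}
    return ''.join(punctuation_map.get(ch, ch) for ch in chars)
```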
The form of the samples in the first training data is related to the target task; for example, if the target task is an extractive reading comprehension task, each sample in the first training data may include an article, a question, and two labels (indicating the start and end positions of the answer in the article).
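For illustration, one possible sample layout is sketched below; the field names and the offsets in the example are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExtractiveRCSample:
    """One possible sample layout for the extractive reading comprehension case."""
    article: str
    question: str
    answer_start: int   # label: start position of the answer span in the article
    answer_end: int     # label: end position (exclusive) of the answer span in the article

sample = ExtractiveRCSample(
    article="... the departure time is 9:00 on May 1 ...",
    question="What is the departure time?",
    answer_start=26,
    answer_end=39,      # article[26:39] == "9:00 on May 1"; offsets are illustrative
)
```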
Step S140: the specialized model is trained using the first training data.
Before step S140 is performed, the specialized model should be built from the general model obtained in step S110 and the adaptation structure related to the target task. The trained specialized model can be used to perform the target task; for example, if the target task is an extractive reading comprehension task, the model inputs may be an article and a question, and the outputs may be the start and end positions of the answer in the article.
How model parameters are updated from training data is well known in the art and is not described in detail here. The following mainly describes how the learning rate parameter is set during training of the specialized model:
In some implementations of the present application, in the process of training the specialized model with the first training data, the convergence degree of the specialized model is periodically evaluated using a validation set, and the learning rate used in the training process is set according to the convergence degree.
Here, "periodically" may mean every specified number of training steps (updating the model parameters once with one batch of data is defined as one step) or every specified duration. The convergence degree of the model may be defined in terms of its prediction error: for example, it may be defined directly as the prediction error or as a mapped value of the prediction error; alternatively, a reference error may be preset as the error of the model in an ideal converged state, the difference between the model's current error and the reference error may be computed, and that difference or a mapped value of it may be used as the convergence degree.
The learning rate is set to be inversely related to the convergence degree, that is, when the specialized model is still far from convergence, the learning rate should be set larger to accelerate convergence, and when the specialized model is close to convergence, the learning rate should be set smaller to avoid problems such as model instability and overfitting.
It should be noted that in the above implementation manner, the learning rate is adjusted dynamically according to the convergence degree of the model, that is, according to the model's actual performance on the validation set, rather than being adjusted by a fixed amount each time. This adjustment manner is flexible, objective, and effective, and helps to improve the quality of the trained specialized model.
Furthermore, the learning rate may be set as a decaying learning rate with warm-up, that is, a small learning rate is used at the initial stage of training and a preset learning rate is adopted once the model has stabilized (the so-called learning-rate warm-up), which helps to accelerate convergence and improve the training effect. After training with the preset learning rate for a period of time, the learning rate begins to decay, avoiding problems such as instability and overfitting caused by an overly large learning rate when the model approaches convergence. During the decay stage, the convergence degree of the specialized model can be evaluated periodically with the validation set in the manner described above, and the value of the learning rate is reduced according to the convergence degree, with the reduction inversely related to the convergence degree.
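A Python sketch of such a schedule is given below; the application only fixes the monotonic relations, so the concrete mapping from convergence degree to learning rate is an assumption.

```python
def scheduled_learning_rate(step, warmup_steps, base_lr, convergence_degree):
    """Sketch of the warm-up-then-decay schedule described above.

    convergence_degree (>= 0) is evaluated periodically on the validation set and is assumed
    to grow as the model approaches convergence; the exact mapping below is illustrative.
    """
    if step < warmup_steps:
        # Warm-up: ramp up from a small learning rate to the preset base learning rate.
        return base_lr * (step + 1) / warmup_steps
    # Decay stage: the learning rate is inversely related to the convergence degree.
    return base_lr / (1.0 + convergence_degree)
```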
As mentioned above, step S110 of obtaining the general model includes at least two modes; the second mode is further illustrated below:
in some implementations, the generic model can be obtained as follows:
Step A: acquire an original general model; the original general model is a pre-trained language model of the general domain, for example the BERT Chinese model published by Google or the ERNIE model published by Baidu, and in a Chinese environment the ERNIE model may be preferred.
Step B: acquire a third corpus and a fourth corpus; the third corpus is a corpus in the general domain, and the fourth corpus is a corpus in the target domain. The third corpus and the fourth corpus can be obtained with reference to step S120; note, however, that since the general model is not related to a specific task, the fourth corpus may simply be data of the target domain, and no adaptation to the target task is involved.
Step C: determine a second data ratio based on the difference between the third corpus and the fourth corpus, and mix the third corpus and the fourth corpus according to the second data ratio to obtain second training data.
This step can be implemented with reference to step S130, and the description will not be repeated. It should be noted, however, that the second training data and the first training data are still different in form, as the first training data is directed to the downstream target task and the second training data is used by the upstream pre-training task, e.g., each sample in the second training data may comprise two consecutive sentences.
Step D: train the original general model with the second training data to obtain the general model.
This step may be implemented with reference to step S140. Alternatively, the learning rate may be adjusted in a simpler manner, for example attenuating it to 70% of its previous value after each training round (one pass over all the training data is defined as one round).
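A short Python sketch of this simpler schedule follows; the helper train_one_round and the numeric values other than the 70% factor are illustrative assumptions.

```python
def finetune_original_model(original_model, second_training_data, train_one_round,
                            initial_lr=3e-5, num_rounds=3, decay=0.7):
    """Simple alternative schedule for step D: decay the learning rate to 70% of its
    previous value after each round (assumed reading of the example above)."""
    lr = initial_lr
    for _ in range(num_rounds):
        train_one_round(original_model, second_training_data, lr)  # one pass over the data
        lr *= decay
    return original_model
```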
In mode two, the general model is obtained by fine-tuning the parameters of the original general model. The second training data it uses is similar to the first training data: it is also composed of general-domain data and target-domain data, and the proportion of the two kinds of data is determined in a similar way. The general model obtained in this way therefore balances domain specificity and generality, its quality is higher, and the subsequent training of the specialized model is facilitated.
In the above scheme, the obtained general model is used to construct the specialized model, which is then trained further to finally obtain the model for executing the target task. However, it is not excluded that in other embodiments the specialized model is put into use directly after being constructed from the general model obtained, for example, through the above steps A to D, without further parameter adjustment (it may, of course, also be put into use after parameter adjustment). Because these schemes reasonably mix general-domain data and target-domain data when the general model is generated through training, the obtained general model strikes a good balance between domain specificity and generality, which improves the performance of the specialized model built on it. Such an approach is therefore not excluded, for example when training time is limited.
Fig. 2 shows a functional block diagram of a model training apparatus 200 according to an embodiment of the present application. Referring to fig. 2, the model training apparatus 200 includes:
A first model obtaining module 210, configured to obtain a generic model, where the generic model is a pre-trained, task-independent language model;
a first data obtaining module 220, configured to obtain a first corpus and a second corpus; the first corpus is a corpus in a general field, the second corpus is a corpus related to a target task in a target field, the target task is a natural language processing task, and the target field is a field to which the target task belongs;
A first data mixing module 230, configured to determine a first data ratio based on a difference between the first corpus and the second corpus, and mix the first corpus and the second corpus according to the first data ratio to obtain first training data; wherein the difference between the first corpus and the second corpus is inversely related to the first data proportion;
a first training module 240 for training a dedicated model for performing the target task using the first training data; wherein the special model comprises the universal model and an adaptation structure related to the target task.
In one implementation of the model training apparatus 200, the first data mixing module 230 determines the first data ratio based on the difference between the first corpus and the second corpus by: acquiring a first difference coefficient, wherein the first difference coefficient is positively correlated with the occurrence frequency of keywords of the target domain in a test corpus of the target domain; calculating a second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus, wherein the second difference coefficient is positively correlated with the difference between the text lengths; and determining the first data ratio according to the first difference coefficient and the second difference coefficient.
In one implementation of the model training apparatus 200, the target task is an extractive reading comprehension task, and the first data mixing module 230 calculates the second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus by: calculating the average article length L1, average question length L2, and average answer length L3 for reading comprehension in the first corpus; calculating the average article length P1, average question length P2, and average answer length P3 for reading comprehension in the second corpus; and calculating the second difference coefficient according to the difference between P1 and L1, the difference between P2 and L2, and the difference between P3 and L3, wherein the second difference coefficient is positively correlated with each of these differences.
In one implementation of the model training apparatus 200, the first training module 240 trains a dedicated model for performing the target task using the first training data by: in the process of training the dedicated model with the first training data, periodically evaluating the convergence degree of the dedicated model using a validation set, and setting the learning rate used in the training process according to the convergence degree, wherein the learning rate is set to be inversely related to the convergence degree.
In one implementation of the model training apparatus 200, the learning rate is a decaying learning rate with warm-up, and the first training module 240, in training the dedicated model using the first training data, periodically evaluates the convergence degree of the dedicated model using a validation set and sets the learning rate used in the training process according to the convergence degree by: in the learning-rate decay stage of training the dedicated model with the first training data, periodically evaluating the convergence degree of the dedicated model using the validation set, and reducing the value of the learning rate according to the convergence degree, wherein the amount by which the learning rate is reduced is inversely related to the convergence degree.
In one implementation of the model training apparatus 200, the target task is an extractive reading comprehension task whose answers satisfy a first statistical rule, and the first data obtaining module 220 obtains the second corpus by: searching for and/or constructing corpora for extractive reading comprehension according to the target domain, and screening out from them the corpus whose answers satisfy the first statistical rule as the second corpus.
In one implementation of the model training apparatus 200, the first model acquisition module 210 acquires a generic model, including: acquiring an original universal model, wherein the original universal model is a pre-trained language model in the universal field; acquiring a third corpus and a fourth corpus; the third corpus is the corpus in the general field, and the fourth corpus is the corpus in the target field; determining a second data ratio based on the difference between the third corpus and the fourth corpus, and mixing the third corpus and the fourth corpus according to the second data ratio to obtain second training data; wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and training the original universal model by using the second training data to obtain the universal model.
The model training apparatus 200 of this embodiment has already been described in the foregoing method embodiment; for brevity, for matters not mentioned in the apparatus embodiment, reference may be made to the corresponding content of the method embodiment.
Fig. 3 shows a functional block diagram of a model training apparatus 300 according to an embodiment of the present application. Referring to fig. 3, the model training apparatus 300 includes:
a second model acquisition module 310, configured to acquire an original general model, the original general model being a pre-trained language model in the general field;
a second data obtaining module 320, configured to obtain a third corpus and a fourth corpus, the third corpus being a corpus in the general field, the fourth corpus being a corpus in the target field to which a target task belongs, and the target task being a natural language processing task;
a second data mixing module 330, configured to determine a second data ratio based on the difference between the third corpus and the fourth corpus, and to mix the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio;
a second training module 340, configured to train the original general model using the second training data to obtain a general model, wherein the general model, together with an adaptation structure related to the target task, forms a dedicated model, and the dedicated model is used for executing the target task.
The model training apparatus 300 of this embodiment has already been described in the foregoing method embodiment; for brevity, for matters not mentioned in the apparatus embodiment, reference may be made to the corresponding content of the method embodiment.
Fig. 4 shows a possible structure of an electronic device 400 according to an embodiment of the application. Referring to fig. 4, an electronic device 400 includes: processor 410, memory 420, and communication interface 430, which are interconnected and communicate with each other by a communication bus 440 and/or other forms of connection mechanisms (not shown).
The memory 420 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like. The processor 410, as well as other possible components, may access the memory 420 to read and/or write data.
The processor 410 includes one or more processors (only one is shown), each of which may be an integrated circuit chip with signal processing capabilities. The processor 410 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP), or another conventional processor; it may also be a special-purpose processor, including a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. When there are multiple processors 410, some of them may be general-purpose processors and others may be special-purpose processors.
The communication interface 430 includes one or more communication interfaces (only one is shown), which may be used to communicate, directly or indirectly, with other devices for data exchange. The communication interface 430 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in the memory 420 and read and executed by the processor 410 to implement the model training method provided by the embodiments of the present application, as well as other desired functions.
It is to be understood that the configuration shown in fig. 4 is merely illustrative, and that electronic device 400 may also include more or fewer components than shown in fig. 4, or have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof. The electronic device 400 may be a physical device such as a PC, a notebook, a tablet, a cell phone, a server, an embedded device, etc., or may be a virtual device such as a virtual machine, a virtualized container, etc. The electronic device 400 is not limited to a single device, and may be a combination of a plurality of devices or a cluster of a large number of devices.
The embodiments of the application also provide a computer-readable storage medium storing computer program instructions that, when read and executed by a processor of a computer, perform the model training method provided by the embodiments of the application. For example, the computer-readable storage medium may be implemented as the memory 420 in the electronic device 400 of fig. 4.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in another form.
Further, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, the functional modules in the various embodiments of the present application may be integrated together to form an independent part, each module may exist alone, or two or more modules may be integrated into an independent part.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (7)
1. A method of model training, comprising:
obtaining a general model, wherein the general model is a task-independent pre-trained language model;
acquiring a first corpus and a second corpus, wherein the first corpus is a corpus in a general field, the second corpus is a corpus in a target field related to a target task, the target task is a natural language processing task, and the target field is the field to which the target task belongs;
determining a first data ratio based on the difference between the first corpus and the second corpus, and mixing the first corpus and the second corpus according to the first data ratio to obtain first training data, wherein the difference between the first corpus and the second corpus is inversely related to the first data ratio; and
training, using the first training data, a dedicated model for performing the target task, wherein the dedicated model comprises the general model and an adaptation structure related to the target task;
wherein obtaining the general model comprises:
acquiring an original general model, wherein the original general model is a pre-trained language model in the general field;
acquiring a third corpus and a fourth corpus, wherein the third corpus is a corpus in the general field and the fourth corpus is a corpus in the target field;
determining a second data ratio based on the difference between the third corpus and the fourth corpus, and mixing the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and
training the original general model using the second training data to obtain the general model.
2. The training method of claim 1, wherein determining the first data ratio based on the difference between the first corpus and the second corpus comprises:
acquiring a first difference coefficient, wherein the first difference coefficient is positively correlated with the occurrence frequency of keywords of the target field in a test corpus of the target field;
calculating a second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus, wherein the second difference coefficient is positively correlated with the text-length difference; and
determining the first data ratio according to the first difference coefficient and the second difference coefficient.
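Claim 2 constrains the first difference coefficient only to be positively correlated with the frequency of target-field keywords in a target-field test corpus. One minimal reading, using plain relative keyword frequency, is sketched below; the formula and all identifiers are illustrative assumptions.

```python
def first_difference_coefficient(test_corpus_tokens, field_keywords):
    """Relative frequency of target-field keywords in a tokenized target-field test corpus;
    trivially positively correlated with their occurrence frequency, as claim 2 requires."""
    keywords = set(field_keywords)
    total = sum(len(doc) for doc in test_corpus_tokens)
    hits = sum(1 for doc in test_corpus_tokens for tok in doc if tok in keywords)
    return hits / max(total, 1)
```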
3. The training method of claim 2, wherein the target task is an extractive reading comprehension task, and calculating the second difference coefficient according to the difference between the text length in the first corpus and the text length in the second corpus comprises:
calculating the average article length L1, the average question length L2, and the average answer length L3 of the reading comprehension samples in the first corpus;
calculating the average article length P1, the average question length P2, and the average answer length P3 of the reading comprehension samples in the second corpus; and
calculating the second difference coefficient according to the difference between P1 and L1, the difference between P2 and L2, and the difference between P3 and L3, wherein the second difference coefficient is positively correlated with each of the difference between P1 and L1, the difference between P2 and L2, and the difference between P3 and L3.
4. The training method of claim 1, wherein training the dedicated model for performing the target task using the first training data comprises:
periodically evaluating, on a verification set, the convergence degree of the dedicated model during training with the first training data, and setting the learning rate used in training according to the convergence degree; wherein the learning rate is set to be inversely related to the convergence degree.
5. The training method according to claim 4, wherein the learning rate is a decaying learning rate with warm-up, and wherein periodically evaluating the convergence degree of the dedicated model on a verification set during training with the first training data and setting the learning rate used in training according to the convergence degree comprises:
in the learning-rate decay stage of training the dedicated model with the first training data, periodically evaluating the convergence degree of the dedicated model on the verification set and reducing the value of the learning rate according to the convergence degree; wherein the amount by which the learning rate is reduced is inversely related to the convergence degree.
6. The training method of claim 1, wherein the target task is an extractive reading comprehension task, the answers of the target task satisfy a first statistical rule, and obtaining the second corpus comprises:
retrieving and/or constructing corpora for extractive reading comprehension according to the target field, and screening out, as the second corpus, the corpora whose answers satisfy the first statistical rule.
7. A model training device, comprising:
a first model acquisition module, configured to acquire a general model, wherein the general model is a task-independent pre-trained language model;
a first data acquisition module, configured to acquire a first corpus and a second corpus, wherein the first corpus is a corpus in a general field, the second corpus is a corpus in a target field related to a target task, the target task is a natural language processing task, and the target field is the field to which the target task belongs;
a first data mixing module, configured to determine a first data ratio based on the difference between the first corpus and the second corpus, and to mix the first corpus and the second corpus according to the first data ratio to obtain first training data, wherein the difference between the first corpus and the second corpus is inversely related to the first data ratio; and
a first training module, configured to train, using the first training data, a dedicated model for performing the target task, wherein the dedicated model comprises the general model and an adaptation structure related to the target task;
wherein the first model acquisition module acquires the general model by: acquiring an original general model, wherein the original general model is a pre-trained language model in the general field; acquiring a third corpus and a fourth corpus, wherein the third corpus is a corpus in the general field and the fourth corpus is a corpus in the target field; determining a second data ratio based on the difference between the third corpus and the fourth corpus, and mixing the third corpus and the fourth corpus according to the second data ratio to obtain second training data, wherein the difference between the third corpus and the fourth corpus is inversely related to the second data ratio; and training the original general model using the second training data to obtain the general model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010639435.6A CN111797609B (en) | 2020-07-03 | 2020-07-03 | Model training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111797609A CN111797609A (en) | 2020-10-20 |
CN111797609B true CN111797609B (en) | 2024-10-25 |
Family
ID=72810315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010639435.6A Active CN111797609B (en) | 2020-07-03 | 2020-07-03 | Model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797609B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257458A (en) * | 2020-10-21 | 2021-01-22 | 阳光保险集团股份有限公司 | Intention recognition model training method, intention recognition method, device and equipment |
CN117540021B (en) * | 2023-11-28 | 2024-07-02 | 中关村科学城城市大脑股份有限公司 | Large language model training method, device, electronic equipment and computer readable medium |
CN117390142B (en) * | 2023-12-12 | 2024-03-12 | 浙江口碑网络技术有限公司 | Training method and device for large language model in vertical field, storage medium and equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427627A (en) * | 2019-08-02 | 2019-11-08 | 北京百度网讯科技有限公司 | Task processing method and device based on semantic expressiveness model |
CN111079447A (en) * | 2020-03-23 | 2020-04-28 | 深圳智能思创科技有限公司 | Chinese-oriented pre-training method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654945B (en) * | 2015-10-29 | 2020-03-06 | 乐融致新电子科技(天津)有限公司 | Language model training method, device and equipment |
CN111159416B (en) * | 2020-04-02 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111797609A (en) | 2020-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||