CN111950302B - Knowledge distillation-based machine translation model training method, device, equipment and medium - Google Patents
Knowledge distillation-based machine translation model training method, device, equipment and medium
- Publication number
- CN111950302B (application CN202010843014.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- module
- student
- loss function
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/47—Machine-assisted translation, e.g. using translation memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a knowledge distillation-based machine translation model training method, device, equipment and medium, wherein the method comprises the following steps: obtaining a teacher model and a student model; acquiring a sample data set comprising training corpus; inputting the training corpus into the teacher model to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model; inputting the training corpus into the student model to obtain intermediate content output by the simplified module in the student model and a final result output by the student model; determining a model loss function according to the labeling translation tags of the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model; and iteratively training the student model according to the model loss function. According to the invention, the teacher model is used to train the student model, and the performance of the model is preserved even though the model structure is simplified.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a knowledge distillation-based machine translation model training method, device, equipment and medium.
Background
Machine translation (machine translation), also known as automatic translation, is the process of using a computer to convert one natural source language into another natural target language, and generally refers to the translation of sentences and text between natural languages. Machine translation is a branch of natural language processing (Natural Language Processing) and is closely and inseparably related to computational linguistics (Computational Linguistics) and natural language understanding (Natural Language Understanding). The idea of using a machine for translation was first proposed by Warren Weaver in 1949. For a long time (from the 1950s to the 1980s), machine translation was accomplished by studying the linguistic information of the source language and the target language, i.e., generating translations based on dictionaries and grammars; this is called rule-based machine translation (RBMT). As statistical methods developed, researchers began to apply statistical models to machine translation, generating translation results based on the analysis of bilingual text corpora. This method is called statistical machine translation (SMT); it performs better than RBMT and dominated the field from the 1980s to the 2000s. In 1997, Ramon Neco and Mikel Forcada proposed the idea of using an Encoder-Decoder architecture for machine translation. A few years later, in 2003, a research team led by Yoshua Bengio at the University of Montreal developed a neural-network-based language model that alleviated the data sparsity problem of conventional SMT models. Their work laid the foundation for the future application of neural networks in machine translation.
In 2017, Google proposed the Transformer model in the paper Attention Is All You Need. This self-attention-based model handles sequence modeling well and, when applied to machine translation tasks, greatly improves translation quality. However, on the one hand, as the Transformer family has evolved from BERT to GPT-2 to XLNet, the capacity of translation models has kept increasing; although this can improve translation quality to a certain extent, the online inference performance (latency and throughput) of the translation model becomes worse and worse, and improving the inference performance of the online translation model is a key factor in determining whether the translation model can be deployed well and provide user-friendly service. On the other hand, with the rapid increase in the number of supported foreign languages, how to effectively compress the model without degrading its translation quality, so that the model is convenient to store and release, is an important problem facing the engineering deployment of algorithm models.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a knowledge distillation-based machine translation model training method, device, equipment and medium, so that a simplified student model can be trained from a teacher model with as little impact on model effect as possible, thereby improving the throughput of the model when deployed online, reducing model latency, and further improving user experience.
In order to achieve the above object, the present invention provides a knowledge distillation-based machine translation model training method, comprising:
obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and carrying out iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model includes:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and determining the model loss function according to the first loss function, the second loss function and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively include an embedding module, an encoding module, a decoding module, and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the invention, after acquiring the sample dataset, the method further comprises: and preprocessing the training corpus.
In a preferred embodiment of the present invention, the preprocessing the corpus includes:
converting the characters in the training corpus into corresponding numerical values;
dividing the training corpus into different batches, and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
In order to achieve the above object, the present invention further provides a machine translation model training device based on knowledge distillation, comprising:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
the teacher model processing module is used for inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
The student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and the model training module is used for carrying out iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the model loss function determination module includes:
a first loss function determining unit, configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
the second loss function determining unit is used for determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
A third loss function determining unit, configured to determine a third loss function according to a final result output by the teacher model and a final result output by the student model;
and the model loss function determining unit is used for determining the model loss function according to the first loss function, the second loss function and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively include an embedding module, an encoding module, a decoding module, and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the invention, the device further comprises: and the preprocessing module is used for preprocessing the training corpus after the sample data set is acquired.
In a preferred embodiment of the present invention, the preprocessing module includes:
the numerical value conversion unit is used for converting the characters in the training corpus into corresponding numerical values;
The length adjusting unit is used for dividing the training corpus into different batches and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
To achieve the above object, the present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the machine translation model training method described above when executing the computer program.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned machine translation model training method.
By adopting the technical scheme, the invention has the following beneficial effects:
according to the labeling translation tags corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model, a model loss function is determined, and the student model is iteratively trained according to the model loss function. Compared with the teacher model, the student model obtained through training has a simplified model structure, and the intermediate content and final results output by the teacher model are used for supervision during training, so the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters. Because of its simplified structure, the student model has higher throughput and lower latency when deployed online, thereby improving user experience.
Drawings
FIG. 1 is a flow chart of a knowledge distillation-based machine translation model training method in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a knowledge distillation-based machine translation model training method in example 1 of the present invention;
FIG. 3 is a block diagram of a knowledge distillation-based machine translation model training apparatus according to embodiment 2 of the present invention;
fig. 4 is a hardware architecture diagram of an electronic device in embodiment 3 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Example 1
The embodiment provides a machine translation model training method based on knowledge distillation, as shown in fig. 1, which specifically comprises the following steps:
s1, obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model.
It should be explained that knowledge distillation is a network model compression method: by constructing a teacher model-student model framework, the teacher model guides the training of the student model, distilling out the "knowledge" about feature representations learned by the teacher model (which has a complex model structure and a large number of parameters) and transferring that "knowledge" to the student model (which has a simple model structure, few parameters and weaker learning ability). Knowledge distillation can thus improve the performance of the student model without increasing its complexity.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and the student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means its model parameters are frozen, i.e. the model parameters of the teacher model cannot be modified in the subsequent training process; the student model is in training mode, and its model parameters can be modified during the training process.
For example, the teacher model and the student model in this embodiment may be translation models based on the Transformer. As shown in fig. 2, the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, wherein the embedding module may include a corpus embedding layer and a language-type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference time, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model; they are not reduced, and their parameters can be shared. That is, this embodiment simplifies and compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the numbers of neurons in the embedding module and the output module of the student model are kept consistent with those in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model so as to perform loss function calculation later, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
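For illustration only, a minimal PyTorch-style sketch of the model pairing described above is given below. It is not part of the patent disclosure: the class name TranslationModel, the dimensions (d_model=512, vocab_size=32000), the layer counts and all variable names are assumptions; only the overall structure (shared embedding/encoding/output module structure, a simplified decoding module, a full connection layer between the two decoding modules, and a frozen teacher) follows the description above.

```python
# Hedged sketch: a Transformer-based teacher/student pair in which only the decoding
# module is simplified, plus the full connection layer used later for the loss terms.
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, vocab_size, num_langs=8, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # corpus embedding layer
        self.lang_embed = nn.Embedding(num_langs, d_model)   # language-type embedding layer
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_encoder_layers)                               # encoding module
        self.decoder_layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_decoder_layers)])             # decoding module
        self.output = nn.Linear(d_model, vocab_size)          # output module

    def forward(self, src_ids, lang_ids, tgt_ids):
        # Combine the corpus embedding and the language-type embedding, then encode.
        src = self.embed(src_ids) + self.lang_embed(lang_ids).unsqueeze(1)
        memory = self.encoder(src)
        x = self.embed(tgt_ids)
        intermediates = []                                    # per-layer intermediate content
        for layer in self.decoder_layers:
            x = layer(x, memory)
            intermediates.append(x)
        return intermediates, self.output(x)                  # intermediate content, final result

teacher = TranslationModel(vocab_size=32000, num_decoder_layers=6)  # trained teacher (K layers)
student = TranslationModel(vocab_size=32000, num_decoder_layers=3)  # simplified student decoder
proj = nn.Linear(512, 512)   # full connection layer between the two decoding modules
for p in teacher.parameters():
    p.requires_grad = False  # teacher is in prediction mode: its parameters are frozen
```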
S2, acquiring a sample data set, wherein the sample data set comprises a plurality of training corpuses and labeling translation tags corresponding to the training corpuses, and the training corpuses can also carry corresponding language types.
S3, preprocessing the sample data set. This specifically comprises the following steps: first, the characters in the training corpus are converted into corresponding numerical values, and the training corpus is divided into different batches. Since the training corpora differ in length, the training corpora of each batch are adjusted to the same length by zero-value padding. Zero-value padding takes the longest sentence in the same batch of training corpus as a reference and fills the missing positions of the other sentences with 0, so that their lengths are adjusted to be consistent with the longest sentence. In this way, input data of size [Batch_size, Sequence_length] is obtained, where Batch_size refers to the number of training corpora in the same batch and Sequence_length refers to the length of the longest corpus in that batch.
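As a concrete illustration of step S3, the following minimal sketch shows one way the numericalization and zero-value padding could be implemented; the vocabulary, the helper names (numericalize, pad_batch) and the <unk> fallback are assumptions of this sketch, not details prescribed by the patent.

```python
# Hedged sketch of S3: convert tokens to numeric ids, then zero-pad each batch to the
# length of its longest sentence, giving a [Batch_size, Sequence_length] tensor.
import torch

def numericalize(sentence, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence]  # assumed <unk> fallback

def pad_batch(batch_sentences, vocab):
    ids = [numericalize(s, vocab) for s in batch_sentences]
    seq_len = max(len(x) for x in ids)                     # longest sentence in this batch
    padded = [x + [0] * (seq_len - len(x)) for x in ids]   # fill missing positions with 0
    return torch.tensor(padded)                            # shape: [Batch_size, Sequence_length]

vocab = {"<pad>": 0, "<unk>": 1, "machine": 2, "translation": 3, "is": 4, "fun": 5}
batch = [["machine", "translation", "is", "fun"], ["machine", "translation"]]
print(pad_batch(batch, vocab))   # second row is zero-padded to length 4
```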
S4, inputting the preprocessed training corpus into the teacher model for processing, and obtaining intermediate content output by the simplified module in the teacher model and a final result output by the teacher model.
For example, when the teacher model is in the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then input to the encoding module for feature encoding, then the decoding module performs feature decoding, and intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that a final result output by the teacher model is obtained.
S5, inputting the training corpus into the student model for processing, and obtaining intermediate content output by the simplified module in the student model and a final result output by the student model.
For example, when the student model is in the structure shown in fig. 2, the training corpus is firstly input to the embedding module of the student model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then are input to the encoding module for feature encoding, then the feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that the final result output by the student model is obtained.
And S6, determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model. The specific implementation process of the step is as follows:
s61, determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT-FMT is calculated according to the following formula (1):

L_AT-FMT = Σ_{c=1}^{C} D_kl(h_t^c, h_s^c)  (1)

wherein C represents the number of decoding layers of the decoding module in the student model, D_kl represents the function computing the KL divergence, h_t^c represents the output of the c-th decoding layer of the teacher model after processing by the full connection layer, and h_s^c represents the output of the c-th decoding layer of the student model.
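The following hedged sketch shows one way formula (1) could be computed from the intermediate contents collected above. Normalizing each layer output with a softmax so that the KL divergence is taken between probability distributions, pairing the first C teacher decoding layers with the C student decoding layers, and training the full connection layer proj jointly with the student are all assumptions of this sketch, not details stated in the patent.

```python
# Hedged sketch of formula (1): sum, over the C student decoding layers, of the KL
# divergence between the teacher's c-th decoder output (after the full connection
# layer proj) and the student's c-th decoder output.
import torch.nn.functional as F

def at_fmt_loss(teacher_intermediates, student_intermediates, proj):
    loss = 0.0
    # Pair the first C teacher decoding layers with the C student decoding layers (assumption).
    for h_t, h_s in zip(teacher_intermediates, student_intermediates):
        p_t = F.softmax(proj(h_t), dim=-1)      # teacher layer output after the full connection layer
        log_p_s = F.log_softmax(h_s, dim=-1)    # student layer output as log-probabilities
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean")
    return loss
```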
S62, determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following formula (2):

L_hard = {-p′_ij·log(p_ij) - (1-p′_ij)·log(1-p_ij)}  (2)

wherein log(x) represents the logarithm function, p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p′_ij represents the labeled probability that the i-th word corresponds to the j-th translation tag (p′_ij can be obtained from the labeling translation tags corresponding to the training corpus).
And S63, determining a third loss function according to the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following formula (3):

L_soft = {-p″_ij·log(p_ij) - (1-p″_ij)·log(1-p_ij)}  (3)

wherein log(x) represents the logarithm function, p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p″_ij represents the probability, output by the teacher model, that the i-th word corresponds to the j-th translation tag.
S64, determining the model loss function according to the first loss function L_AT-FMT, the second loss function L_hard and the third loss function L_soft.

For example, the model loss function Loss_all is calculated according to the following formula (4):

Loss_all = α·L_hard + (1-α)·L_soft + β·L_AT-FMT  (4)

wherein α and β respectively represent the corresponding loss weight coefficients, α ∈ (0, 1), β ∈ R, and their specific values can be preset according to experience.
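A minimal sketch combining formulas (2)-(4) follows. Treating p_ij as element-wise probabilities fed to a binary cross-entropy with mean reduction, and the default values of alpha and beta, are assumptions of this sketch; the patent only gives the element-wise form of the losses.

```python
# Hedged sketch of formulas (2)-(4): L_hard compares the student's output probabilities
# with the labeled tags p'_ij, L_soft compares them with the teacher's output
# probabilities, and Loss_all is their weighted combination.
import torch.nn.functional as F

def hard_loss(p_student, p_labels):      # formula (2): -p'·log(p) - (1-p')·log(1-p)
    return F.binary_cross_entropy(p_student, p_labels)

def soft_loss(p_student, p_teacher):     # formula (3): same form, teacher probabilities as target
    return F.binary_cross_entropy(p_student, p_teacher)

def total_loss(p_student, p_labels, p_teacher, l_at_fmt, alpha=0.9, beta=1.0):
    # formula (4): Loss_all = alpha*L_hard + (1-alpha)*L_soft + beta*L_AT-FMT
    # (alpha and beta defaults here are illustrative; the patent leaves them to experience)
    return (alpha * hard_loss(p_student, p_labels)
            + (1 - alpha) * soft_loss(p_student, p_teacher)
            + beta * l_at_fmt)
```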
And S7, training the student model according to the model loss function, namely, updating parameters of the student model according to the loss function.
It should be noted that training the model according to the loss function is an iterative process: after each round of training, it is judged whether a preset training termination condition is met. If the training termination condition is not satisfied, training continues according to steps S4 to S7 until the training termination condition is satisfied.
In one possible implementation, meeting the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative training rounds reaches a count threshold; the count threshold may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or freely adjusted according to the application scenario, which is not limited in the embodiments of the present application. Third, the model loss function converges: as the number of training iterations increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^-3 to 10^-3 and the reference number is 10; if the fluctuation range of the model loss function over 10 iterative training results is within -10^-3 to 10^-3, the model loss function is considered to have converged. When any one of the above conditions is satisfied, the training termination condition is met and the training of the student model is completed.
In the process of updating the model parameters with the model loss function, the Adam (Adaptive Moment Estimation) optimization algorithm can be adopted for optimization. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
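Continuing the sketches above, the following illustrates one way the Adam update with lr_eb ≤ lr_db and the three termination conditions could be organized; the concrete learning rates, thresholds, counts and parameter grouping are assumptions, not values prescribed by the patent.

```python
# Hedged sketch of the parameter update in S7: Adam with separate parameter groups so
# the encoding-module learning rate lr_eb does not exceed the decoding-module learning
# rate lr_db, plus the three termination checks described above.
import torch

lr_eb, lr_db = 1e-4, 5e-4   # lr_eb <= lr_db (illustrative values)
optimizer = torch.optim.Adam([
    {"params": student.encoder.parameters(), "lr": lr_eb},
    {"params": list(student.embed.parameters()) + list(student.lang_embed.parameters()), "lr": lr_eb},
    {"params": student.decoder_layers.parameters(), "lr": lr_db},
    {"params": list(student.output.parameters()) + list(proj.parameters()), "lr": lr_db},
])

def should_stop(loss_history, max_steps=100_000, loss_threshold=0.01,
                reference_range=1e-3, reference_times=10):
    if len(loss_history) >= max_steps:                      # 1) iteration count threshold reached
        return True
    if loss_history and loss_history[-1] < loss_threshold:  # 2) loss below the loss threshold
        return True
    recent = loss_history[-reference_times:]                # 3) convergence: fluctuation of the
    if (len(recent) == reference_times and                  #    last 10 losses stays within +/-1e-3
            max(recent) - min(recent) <= 2 * reference_range):
        return True
    return False
```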
In addition, the decoding module of the student model can be reduced step by step during training in a level-by-level training mode. As shown in fig. 2, after a student model (comprising M decoding layers) is obtained by training with the teacher model (comprising K decoding layers), the trained student model is used as a new teacher model to train a student model with even fewer decoding layers, and so on, until a student model comprising only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is selected appropriately as a trade-off between the improvement in the inference performance of the translation model and its translation effect. After the training of the student model is completed, the teacher model is removed.
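The level-by-level reduction described above could be organized as in the following sketch; the train_student callable and the layer schedule are assumptions standing in for one full run of steps S4 to S7, and are not specified by the patent.

```python
# Hedged sketch of level-by-level training: once a student with M decoding layers has
# been distilled from the K-layer teacher, it serves as the new teacher for an even
# smaller student, until only N decoding layers remain (K > M > N).
def progressive_distillation(initial_teacher, train_student, vocab_size=32000,
                             layer_schedule=(3, 1)):
    """train_student is an assumed callable that runs steps S4-S7 until the
    termination condition is met (e.g. a loop over batches using total_loss above)."""
    teacher_model = initial_teacher                  # starts with K decoding layers
    for num_layers in layer_schedule:                # first M decoding layers, then N
        student_model = TranslationModel(vocab_size, num_decoder_layers=num_layers)
        train_student(teacher_model, student_model)  # distil the smaller student from the current teacher
        teacher_model = student_model                # the trained student becomes the new teacher
    return teacher_model                             # final N-layer student; the teachers are discarded
```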
The student model obtained by training in this embodiment has a simplified model structure, and it is supervised during training by the intermediate content and final results output by the teacher model, so that the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters.
It should be noted that, for the sake of simplicity of description, the foregoing embodiments are all described as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.
Example 2
The present embodiment provides a machine translation model training device based on knowledge distillation, as shown in fig. 3, the device 1 specifically includes: the model acquisition module 11, the sample acquisition module 12, the preprocessing module 13, the teacher model processing module 14, the student model processing module 15, the model loss function determination module 16 and the model training module 17.
Each module is described in detail below:
the model acquisition module 11 is configured to acquire a trained teacher model and an untrained student model, where the student model is obtained by simplifying some modules in the teacher model.
It should be explained that knowledge distillation is a network model compression method: by constructing a teacher model-student model framework, the teacher model guides the training of the student model, distilling out the "knowledge" about feature representations learned by the teacher model (which has a complex model structure and a large number of parameters) and transferring that "knowledge" to the student model (which has a simple model structure, few parameters and weaker learning ability). Knowledge distillation can thus improve the performance of the student model without increasing its complexity.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and the student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means its model parameters are frozen, i.e. the model parameters of the teacher model cannot be modified in the subsequent training process; the student model is in training mode, and its model parameters can be modified during the training process.
For example, the teacher model and the student model in this embodiment may be translation models based on the Transformer. As shown in fig. 2, the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, wherein the embedding module may include a corpus embedding layer and a language-type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference time, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model; they are not reduced, and their parameters can be shared. That is, this embodiment simplifies and compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the numbers of neurons in the embedding module and the output module of the student model are kept consistent with those in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model so as to perform loss function calculation later, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
The sample acquiring module 12 is configured to acquire a sample data set, where the sample data set includes a plurality of training corpora, and labeling translation tags corresponding to the training corpora, and the training corpora may also carry a corresponding language type.
The preprocessing module 13 is used for preprocessing the sample data set and specifically comprises: a numerical value conversion unit 131, configured to convert the characters in the training corpus into corresponding numerical values; and a length adjustment unit 132, configured to divide the training corpus into different batches and, since the training corpora differ in length, adjust the training corpora of each batch to the same length by zero-value padding. Zero-value padding takes the longest sentence in the same batch of training corpus as a reference and fills the missing positions of the other sentences with 0, so that their lengths are adjusted to be consistent with the longest sentence. In this way, input data of size [Batch_size, Sequence_length] is obtained, where Batch_size refers to the number of training corpora in the same batch and Sequence_length refers to the length of the longest corpus in that batch.
The teacher model processing module 14 is configured to input the preprocessed training corpus into the teacher model for processing, so as to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model.
For example, when the teacher model is in the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then input to the encoding module for feature encoding, then the decoding module performs feature decoding, and intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that a final result output by the teacher model is obtained.
The student model processing module 15 is configured to input the training corpus into the student model for processing, so as to obtain intermediate content output by the simplifying module in the student model and a final result output by the student model.
For example, when the student model is in the structure shown in fig. 2, the training corpus is firstly input to the embedding module of the student model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then are input to the encoding module for feature encoding, then the feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that the final result output by the student model is obtained.
The model loss function determining module 16 is configured to determine a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model. The specific implementation process of the step is as follows:
the first loss function determining unit 161 is configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT-FMT is calculated according to the following formula (1):

L_AT-FMT = Σ_{c=1}^{C} D_kl(h_t^c, h_s^c)  (1)

wherein C represents the number of decoding layers of the decoding module in the student model, D_kl represents the function computing the KL divergence, h_t^c represents the output of the c-th decoding layer of the teacher model after processing by the full connection layer, and h_s^c represents the output of the c-th decoding layer of the student model.
The second loss function determining unit 162 is configured to determine a second loss function according to the labeling translation label corresponding to the training corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following formula (2):

L_hard = {-p′_ij·log(p_ij) - (1-p′_ij)·log(1-p_ij)}  (2)
The third loss function determining unit 163 is configured to determine a third loss function based on the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following formula (3):

L_soft = {-p″_ij·log(p_ij) - (1-p″_ij)·log(1-p_ij)}  (3)

wherein p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p″_ij represents the probability, output by the teacher model, that the i-th word corresponds to the j-th translation tag.

The model loss function determining unit 164 is configured to determine the model loss function according to the first loss function L_AT-FMT, the second loss function L_hard and the third loss function L_soft.

For example, the model loss function Loss_all is calculated according to the following formula (4):

Loss_all = α·L_hard + (1-α)·L_soft + β·L_AT-FMT  (4)

wherein α and β respectively represent the corresponding loss weight coefficients, α ∈ (0, 1), β ∈ R, and their specific values can be preset according to experience.
The model training module 17 is configured to train the student model according to the model loss function, i.e. update parameters of the student model according to the loss function.
It should be noted that training the model according to the loss function is an iterative process: after each round of training, it is judged whether a preset training termination condition is met. If the training termination condition is not met, training continues until the training termination condition is met.
In one possible implementation, meeting the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative training rounds reaches a count threshold; the count threshold may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or freely adjusted according to the application scenario, which is not limited in the embodiments of the present application. Third, the model loss function converges: as the number of training iterations increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^-3 to 10^-3 and the reference number is 10; if the fluctuation range of the model loss function over 10 iterative training results is within -10^-3 to 10^-3, the model loss function is considered to have converged. When any one of the above conditions is satisfied, the training termination condition is met and the training of the student model is completed.
In the process of updating the model parameters with the model loss function, the Adam (Adaptive Moment Estimation) optimization algorithm can be adopted for optimization. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
In addition, the decoding module of the student model can be reduced step by step during training in a level-by-level training mode. As shown in fig. 2, after a student model (comprising M decoding layers) is obtained by training with the teacher model (comprising K decoding layers), the trained student model is used as a new teacher model to train a student model with even fewer decoding layers, and so on, until a student model comprising only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is selected appropriately as a trade-off between the improvement in the inference performance of the translation model and its translation effect. After the training of the student model is completed, the teacher model is removed.
The student model obtained by training in this embodiment has a simplified model structure, and it is supervised during training by the intermediate content and final results output by the teacher model, so that the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters.
Example 3
The present embodiment provides an electronic device, which may be expressed in the form of a computing device (for example, may be a server device), including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor may implement the knowledge distillation based machine translation model training method provided in embodiment 1 when executing the computer program.
Fig. 4 shows a schematic diagram of the hardware structure of the present embodiment, and as shown in fig. 4, the electronic device 9 specifically includes:
at least one processor 91, at least one memory 92, and a bus 93 for connecting the different system components (including the processor 91 and the memory 92), wherein:
the bus 93 includes a data bus, an address bus, and a control bus.
The memory 92 includes volatile memory such as Random Access Memory (RAM) 921 and/or cache memory 922, and may further include Read Only Memory (ROM) 923.
Memory 92 also includes a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 91 executes various functional applications and data processing, such as the knowledge distillation-based machine translation model training method provided in embodiment 1 of the present application, by running a computer program stored in the memory 92.
The electronic device 9 may further communicate with one or more external devices 94 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 95. Also, the electronic device 9 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 96. The network adapter 96 communicates with other modules of the electronic device 9 via the bus 93. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 9, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of an electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present application. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the knowledge distillation based machine translation model training method of embodiment 1.
More specifically, among others, readable storage media may be employed including, but not limited to: portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of the knowledge distillation-based machine translation model training method of embodiment 1 when said program product is run on the terminal device.
Wherein the program code for carrying out the invention may be written in any combination of one or more programming languages, which program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on the remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.
Claims (12)
1. A knowledge distillation-based machine translation model training method, comprising:
obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
Determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
performing iterative training on the student model according to the model loss function;
the determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model includes:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
And determining the model loss function according to the first loss function, the second loss function and the third loss function.
2. The knowledge distillation based machine translation model training method according to claim 1, wherein the teacher model and the student model comprise an embedding module, an encoding module, a decoding module, and an output module, respectively.
3. The knowledge distillation based machine translation model training method according to claim 2, wherein the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
4. The knowledge-distillation based machine translation model training method according to claim 1, wherein after obtaining the sample dataset, the method further comprises: and preprocessing the training corpus.
5. The knowledge distillation based machine translation model training method according to claim 4, wherein said preprocessing said training corpus comprises:
Converting the characters in the training corpus into corresponding numerical values;
dividing the training corpus into different batches, and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
6. A knowledge distillation based machine translation model training device, comprising:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
the teacher model processing module is used for inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
the student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
the model training module is used for carrying out iterative training on the student model according to the model loss function;
wherein the model loss function determining module comprises:
a first loss function determining unit, used for determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
a second loss function determining unit, used for determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
a third loss function determining unit, used for determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and a model loss function determining unit, used for determining the model loss function according to the first loss function, the second loss function, and the third loss function.
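For the model training module of claim 6, a hedged sketch of the iterative training loop is shown below (not part of the claims). It assumes both models are nn.Module instances whose forward pass returns the simplified module's intermediate content together with the final logits, and it reuses the illustrative model_loss helper sketched after claim 1; the Adam optimizer and the hyper-parameters are assumptions.

```python
import torch

def train_student(student, teacher, batches, labels_per_batch, model_loss,
                  epochs=10, lr=1e-4):
    """Illustrative iterative training of the student against the model loss."""
    teacher.eval()  # the teacher is already trained and stays fixed
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for src, labels in zip(batches, labels_per_batch):
            # Teacher pass: intermediate content and final result, no gradients.
            with torch.no_grad():
                teacher_hidden, teacher_logits = teacher(src)
            # Student pass: intermediate content and final result.
            student_hidden, student_logits = student(src)

            loss = model_loss(student_hidden, teacher_hidden,
                              student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```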
7. The knowledge distillation based machine translation model training device according to claim 6, wherein the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module, and an output module.
8. The knowledge distillation based machine translation model training device according to claim 7, wherein the embedding module, the encoding module and the output module of the student model are consistent in structure with those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a fully connected layer is provided between the decoding module of the student model and the decoding module of the teacher model.
9. The knowledge distillation based machine translation model training device according to claim 6, wherein the device further comprises: a preprocessing module, used for preprocessing the training corpus after the sample data set is acquired.
10. The knowledge distillation based machine translation model training device according to claim 9, wherein the preprocessing module comprises:
the numerical value conversion unit is used for converting the characters in the training corpus into corresponding numerical values;
the length adjusting unit is used for dividing the training corpus into different batches and adjusting the training corpus in each batch to the same length by zero-value padding.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
12. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843014.5A CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843014.5A CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111950302A CN111950302A (en) | 2020-11-17 |
CN111950302B (en) | 2023-11-10 |
Family
ID=73358463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010843014.5A Active CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950302B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12321846B2 (en) * | 2020-12-09 | 2025-06-03 | International Business Machines Corporation | Knowledge distillation using deep clustering |
CN112597778B (en) * | 2020-12-14 | 2023-06-13 | 华为技术有限公司 | Translation model training method, translation method and translation equipment |
CN112541122A (en) * | 2020-12-23 | 2021-03-23 | 北京百度网讯科技有限公司 | Recommendation model training method and device, electronic equipment and storage medium |
CN112784999A (en) * | 2021-01-28 | 2021-05-11 | 开放智能机器(上海)有限公司 | MobileNet-v1 knowledge distillation method based on attention mechanism, memory and terminal equipment |
CN113011202B (en) * | 2021-03-23 | 2023-07-25 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multitasking training |
CN113160041B (en) * | 2021-05-07 | 2024-02-23 | 深圳追一科技有限公司 | Model training method and model training device |
CN113435208B (en) * | 2021-06-15 | 2023-08-25 | 北京百度网讯科技有限公司 | Training method and device for student model and electronic equipment |
CN113642605A (en) * | 2021-07-09 | 2021-11-12 | 北京百度网讯科技有限公司 | Model distillation method, device, electronic device and storage medium |
CN113505615B (en) * | 2021-07-29 | 2024-11-26 | 沈阳雅译网络技术有限公司 | Decoding acceleration method for neural machine translation system on small CPU devices |
CN113505614A (en) * | 2021-07-29 | 2021-10-15 | 沈阳雅译网络技术有限公司 | Small model training method for small CPU equipment |
CN113706347A (en) * | 2021-08-31 | 2021-11-26 | 深圳壹账通智能科技有限公司 | Multitask model distillation method, multitask model distillation system, multitask model distillation medium and electronic terminal |
CN114861671B (en) * | 2022-04-11 | 2024-11-05 | 深圳追一科技有限公司 | Model training method, device, computer equipment and storage medium |
CN114936605A (en) * | 2022-06-09 | 2022-08-23 | 五邑大学 | A neural network training method, equipment and storage medium based on knowledge distillation |
WO2023212997A1 (en) * | 2022-05-05 | 2023-11-09 | 五邑大学 | Knowledge distillation based neural network training method, device, and storage medium |
CN115438678B (en) * | 2022-11-08 | 2023-03-24 | 苏州浪潮智能科技有限公司 | Machine translation method, device, electronic device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090050A (en) * | 2017-11-08 | 2018-05-29 | 江苏名通信息科技有限公司 | Game translation system based on deep neural network |
WO2018126213A1 (en) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
CN110059744A (en) * | 2019-04-16 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Method, the method for image procossing, equipment and the storage medium of training neural network |
CN110506279A (en) * | 2017-04-14 | 2019-11-26 | 易享信息技术有限公司 | Using the neural machine translation of hidden tree attention |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110765791A (en) * | 2019-11-01 | 2020-02-07 | 清华大学 | Method and device for automatic post-editing of machine translation |
CN111382582A (en) * | 2020-01-21 | 2020-07-07 | 沈阳雅译网络技术有限公司 | Neural machine translation decoding acceleration method based on non-autoregressive |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106383818A (en) * | 2015-07-30 | 2017-02-08 | 阿里巴巴集团控股有限公司 | Machine translation method and device |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018126213A1 (en) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
CN110506279A (en) * | 2017-04-14 | 2019-11-26 | 易享信息技术有限公司 | Using the neural machine translation of hidden tree attention |
CN108090050A (en) * | 2017-11-08 | 2018-05-29 | 江苏名通信息科技有限公司 | Game translation system based on deep neural network |
CN110059744A (en) * | 2019-04-16 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Method, the method for image procossing, equipment and the storage medium of training neural network |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110765791A (en) * | 2019-11-01 | 2020-02-07 | 清华大学 | Method and device for automatic post-editing of machine translation |
CN111382582A (en) * | 2020-01-21 | 2020-07-07 | 沈阳雅译网络技术有限公司 | Neural machine translation decoding acceleration method based on non-autoregressive |
Non-Patent Citations (4)
Title |
---|
Distilling the Knowledge in a Neural Network; Geoffrey Hinton et al.; published online at https://arxiv.org/abs/1503.02531; 1-9 *
MobileBERT: Task-agnostic compression of BERT by progressive knowledge transfer; Zhiqing Sun et al.; ICLR 2020 Conference; 1-26 *
Improving translation quality of neural machine translation compression models with monolingual data; Li Xiang et al.; Journal of Chinese Information Processing; Vol. 33, No. 7; 46-55 *
Intent classification method based on BERT model and knowledge distillation; Liao Shenglan et al.; Computer Engineering; Vol. 47, No. 5; 73-79 *
Also Published As
Publication number | Publication date |
---|---|
CN111950302A (en) | 2020-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111950302B (en) | Knowledge distillation-based machine translation model training method, device, equipment and medium | |
CN112257858B (en) | Model compression method and device | |
CN109992773B (en) | Word vector training method, system, device and medium based on multi-task learning | |
WO2023160472A1 (en) | Model training method and related device | |
CN113987169A (en) | Method, device, device and storage medium for generating text summaries based on semantic blocks | |
CN116109978B (en) | Unsupervised video description method based on self-constrained dynamic text features | |
WO2023137911A1 (en) | Intention classification method and apparatus based on small-sample corpus, and computer device | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank | |
Zhang et al. | A generalized language model in tensor space | |
CN115730590A (en) | Intention recognition method and related equipment | |
CN114398899A (en) | Training method and device for pre-training language model, computer equipment and medium | |
CN116992940A (en) | SAR image multi-type target detection light-weight method and device combining channel pruning and knowledge distillation | |
CN117763084A (en) | Knowledge base retrieval method based on text compression and related equipment | |
US20250190795A1 (en) | Model optimization method and apparatus, computer device, and computer storage medium | |
CN116432637A (en) | A Multi-granularity Extraction-Generation Hybrid Abstract Method Based on Reinforcement Learning | |
CN115757694A (en) | Recruitment industry text recall method, system, device and medium | |
CN118520950A (en) | Ultra-long word element model reasoning method and device, electronic equipment and storage medium | |
CN117851595A (en) | A method for sentiment analysis of social network text | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
Yang | Deep learning applications in natural language processing and optimization strategies | |
CN114090754A (en) | Question generation method, device, equipment and storage medium | |
CN112257463A (en) | Compression method of neural machine translation model for Chinese-English translation | |
CN114722845B (en) | A neural machine translation method based on source language reordering | |
Peng | Design and Construction of Machine Translation System Based on RNN Model | |
CN118279701B (en) | Continuous evolutionary learning method and system for joint optimization of model and sample storage resources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |