CN111950302B - Knowledge distillation-based machine translation model training method, device, equipment and medium - Google Patents
Knowledge distillation-based machine translation model training method, device, equipment and medium
- Publication number
- CN111950302B (application CN202010843014.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- module
- student
- loss function
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/47—Machine-assisted translation, e.g. using translation memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a knowledge distillation-based machine translation model training method, device, equipment and medium, wherein the method comprises the following steps: obtaining a teacher model and a student model; acquiring a sample data set comprising training corpus; inputting the training corpus into the teacher model to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model; inputting the training corpus into the student model to obtain intermediate content output by the simplified module in the student model and a final result output by the student model; determining a model loss function according to the labeling translation tags of the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model; and iteratively training the student model according to the model loss function. According to the invention, the teacher model is used to train the student model, and the performance of the model is preserved even though the model structure is simplified.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a knowledge distillation-based machine translation model training method, device, equipment and medium.
Background
Machine translation (machine translation), also known as automatic translation, is the process of using a computer to convert one natural source language into another natural target language, and generally refers to the translation of sentences and text between natural languages. Machine translation is a branch of natural language processing (Natural Language Processing) and is closely and inseparably related to computational linguistics (Computational Linguistics) and natural language understanding (Natural Language Understanding). The idea of using a machine for translation was first proposed by Warren Weaver in 1949. For a long time (from the 1950s to the 1980s), machine translation was accomplished by studying the linguistic information of the source language and the target language, i.e., generating translations based on dictionaries and grammars; this is called rule-based machine translation (RBMT). As statistical methods developed, researchers began to apply statistical models to machine translation, generating translation results based on the analysis of bilingual text corpora. This method is called statistical machine translation (SMT); it performs better than RBMT and dominated the field from the 1980s to the 2000s. In 1997, Ramon Neco and Mikel Forcada proposed the idea of using an Encoder-Decoder architecture for machine translation. A few years later, in 2003, a research team led by Yoshua Bengio at the University of Montreal developed a neural-network-based language model that alleviated the data sparsity problem of conventional SMT models. Their work laid the foundation for the future application of neural networks in machine translation.
In 2017, Google proposed the Transformer model in the paper Attention Is All You Need. This self-attention-based model handles sequence modeling well and, when applied to machine translation tasks, greatly improves translation quality. However, on the one hand, as the Transformer family has evolved from BERT to GPT-2 to XLNet, the capacity of translation models has kept increasing; although this can improve translation quality to a certain extent, the online inference performance (latency and throughput) of the translation model becomes worse and worse, and improving the inference performance of the online translation model is a key factor in determining whether the translation model can be deployed well and provide user-friendly service. On the other hand, with the rapid increase in the number of supported foreign languages, how to effectively compress the model without degrading its translation quality, so that the model is convenient to store and release, is an important problem facing the engineering deployment of algorithm models.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a knowledge distillation-based machine translation model training method, device, equipment and medium, so that a simplified student model can be trained from a teacher model with as little impact on model effect as possible, thereby improving the throughput of the model when deployed online, reducing model latency, and further improving user experience.
In order to achieve the above object, the present invention provides a knowledge distillation-based machine translation model training method, comprising:
obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and carrying out iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model includes:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and determining the model loss function according to the first loss function, the second loss function and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively include an embedding module, an encoding module, a decoding module, and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the invention, after acquiring the sample dataset, the method further comprises: and preprocessing the training corpus.
In a preferred embodiment of the present invention, the preprocessing the corpus includes:
converting the characters in the training corpus into corresponding numerical values;
dividing the training corpus into different batches, and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
In order to achieve the above object, the present invention further provides a machine translation model training device based on knowledge distillation, comprising:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
the teacher model processing module is used for inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
The student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and the model training module is used for carrying out iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the model loss function determination module includes:
a first loss function determining unit, configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
the second loss function determining unit is used for determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
A third loss function determining unit, configured to determine a third loss function according to a final result output by the teacher model and a final result output by the student model;
and the model loss function determining unit is used for determining the model loss function according to the first loss function, the second loss function and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively include an embedding module, an encoding module, a decoding module, and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the invention, the device further comprises: and the preprocessing module is used for preprocessing the training corpus after the sample data set is acquired.
In a preferred embodiment of the present invention, the preprocessing module includes:
the numerical value conversion unit is used for converting the characters in the training corpus into corresponding numerical values;
The length adjusting unit is used for dividing the training corpus into different batches and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
To achieve the above object, the present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the machine translation model training method described above when executing the computer program.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned machine translation model training method.
By adopting the technical scheme, the invention has the following beneficial effects:
according to the labeling translation tags corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model, a model loss function is determined, and the student model is iteratively trained according to the model loss function. Compared with the teacher model, the student model obtained through training has a simplified model structure, and the intermediate content and final results output by the teacher model are used for supervision during training, so the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters. Because of its simplified structure, the student model has higher throughput and lower latency when deployed online, thereby improving user experience.
Drawings
FIG. 1 is a flow chart of a knowledge distillation-based machine translation model training method in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a knowledge distillation-based machine translation model training method in example 1 of the present invention;
FIG. 3 is a block diagram of a knowledge distillation-based machine translation model training apparatus according to embodiment 2 of the present invention;
fig. 4 is a hardware architecture diagram of an electronic device in embodiment 3 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Example 1
The embodiment provides a machine translation model training method based on knowledge distillation, as shown in fig. 1, which specifically comprises the following steps:
s1, obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model.
It should be explained that knowledge distillation is a network model compression method: by constructing a teacher model-student model framework, the teacher model guides the training of the student model, distilling out the "knowledge" about feature representations learned by the teacher model (which has a complex model structure and a large number of parameters) and transferring that "knowledge" to the student model (which has a simple model structure, few parameters and weaker learning ability). Knowledge distillation can thus improve the performance of the student model without increasing its complexity.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and the student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means its model parameters are frozen, i.e. the model parameters of the teacher model cannot be modified in the subsequent training process; the student model is in training mode, and its model parameters can be modified during the training process.
For example, the teacher model and the student model in this embodiment may be translation models based on the Transformer. As shown in fig. 2, the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, wherein the embedding module may include a corpus embedding layer and a language-type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference time, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model; they are not reduced, and their parameters can be shared. That is, this embodiment simplifies and compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the numbers of neurons in the embedding module and the output module of the student model are kept consistent with those in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model so as to perform loss function calculation later, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
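For illustration only, a minimal PyTorch-style sketch of the model pairing described above is given below. It is not part of the patent disclosure: the class name TranslationModel, the dimensions (d_model=512, vocab_size=32000), the layer counts and all variable names are assumptions; only the overall structure (shared embedding/encoding/output module structure, a simplified decoding module, a full connection layer between the two decoding modules, and a frozen teacher) follows the description above.

```python
# Hedged sketch: a Transformer-based teacher/student pair in which only the decoding
# module is simplified, plus the full connection layer used later for the loss terms.
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, vocab_size, num_langs=8, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # corpus embedding layer
        self.lang_embed = nn.Embedding(num_langs, d_model)   # language-type embedding layer
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_encoder_layers)                               # encoding module
        self.decoder_layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_decoder_layers)])             # decoding module
        self.output = nn.Linear(d_model, vocab_size)          # output module

    def forward(self, src_ids, lang_ids, tgt_ids):
        # Combine the corpus embedding and the language-type embedding, then encode.
        src = self.embed(src_ids) + self.lang_embed(lang_ids).unsqueeze(1)
        memory = self.encoder(src)
        x = self.embed(tgt_ids)
        intermediates = []                                    # per-layer intermediate content
        for layer in self.decoder_layers:
            x = layer(x, memory)
            intermediates.append(x)
        return intermediates, self.output(x)                  # intermediate content, final result

teacher = TranslationModel(vocab_size=32000, num_decoder_layers=6)  # trained teacher (K layers)
student = TranslationModel(vocab_size=32000, num_decoder_layers=3)  # simplified student decoder
proj = nn.Linear(512, 512)   # full connection layer between the two decoding modules
for p in teacher.parameters():
    p.requires_grad = False  # teacher is in prediction mode: its parameters are frozen
```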
S2, acquiring a sample data set, wherein the sample data set comprises a plurality of training corpuses and labeling translation tags corresponding to the training corpuses, and the training corpuses can also carry corresponding language types.
S3, preprocessing the sample data set. This specifically comprises the following steps: first, the characters in the training corpus are converted into corresponding numerical values, and the training corpus is divided into different batches. Since the training corpora differ in length, the training corpora of each batch are adjusted to the same length by zero-value padding. Zero-value padding takes the longest sentence in the same batch of training corpus as a reference and fills the missing positions of the other sentences with 0, so that their lengths are adjusted to be consistent with the longest sentence. In this way, input data of size [Batch_size, Sequence_length] is obtained, where Batch_size refers to the number of training corpora in the same batch and Sequence_length refers to the length of the longest corpus in that batch.
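As a concrete illustration of step S3, the following minimal sketch shows one way the numericalization and zero-value padding could be implemented; the vocabulary, the helper names (numericalize, pad_batch) and the <unk> fallback are assumptions of this sketch, not details prescribed by the patent.

```python
# Hedged sketch of S3: convert tokens to numeric ids, then zero-pad each batch to the
# length of its longest sentence, giving a [Batch_size, Sequence_length] tensor.
import torch

def numericalize(sentence, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence]  # assumed <unk> fallback

def pad_batch(batch_sentences, vocab):
    ids = [numericalize(s, vocab) for s in batch_sentences]
    seq_len = max(len(x) for x in ids)                     # longest sentence in this batch
    padded = [x + [0] * (seq_len - len(x)) for x in ids]   # fill missing positions with 0
    return torch.tensor(padded)                            # shape: [Batch_size, Sequence_length]

vocab = {"<pad>": 0, "<unk>": 1, "machine": 2, "translation": 3, "is": 4, "fun": 5}
batch = [["machine", "translation", "is", "fun"], ["machine", "translation"]]
print(pad_batch(batch, vocab))   # second row is zero-padded to length 4
```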
S4, inputting the preprocessed training corpus into the teacher model for processing, and obtaining intermediate content output by the simplified module in the teacher model and a final result output by the teacher model.
For example, when the teacher model is in the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then input to the encoding module for feature encoding, then the decoding module performs feature decoding, and intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that a final result output by the teacher model is obtained.
S5, inputting the training corpus into the student model for processing, and obtaining intermediate content output by the simplified module in the student model and a final result output by the student model.
For example, when the student model is in the structure shown in fig. 2, the training corpus is firstly input to the embedding module of the student model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then are input to the encoding module for feature encoding, then the feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that the final result output by the student model is obtained.
And S6, determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model. The specific implementation process of the step is as follows:
s61, determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT-FMT is calculated according to the following formula (1):

L_AT-FMT = Σ_{c=1}^{C} D_kl(h_t^c, h_s^c)  (1)

wherein C represents the number of decoding layers of the decoding module in the student model, D_kl represents the function computing the KL divergence, h_t^c represents the output of the c-th decoding layer of the teacher model after processing by the full connection layer, and h_s^c represents the output of the c-th decoding layer of the student model.
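The following hedged sketch shows one way formula (1) could be computed from the intermediate contents collected above. Normalizing each layer output with a softmax so that the KL divergence is taken between probability distributions, pairing the first C teacher decoding layers with the C student decoding layers, and training the full connection layer proj jointly with the student are all assumptions of this sketch, not details stated in the patent.

```python
# Hedged sketch of formula (1): sum, over the C student decoding layers, of the KL
# divergence between the teacher's c-th decoder output (after the full connection
# layer proj) and the student's c-th decoder output.
import torch.nn.functional as F

def at_fmt_loss(teacher_intermediates, student_intermediates, proj):
    loss = 0.0
    # Pair the first C teacher decoding layers with the C student decoding layers (assumption).
    for h_t, h_s in zip(teacher_intermediates, student_intermediates):
        p_t = F.softmax(proj(h_t), dim=-1)      # teacher layer output after the full connection layer
        log_p_s = F.log_softmax(h_s, dim=-1)    # student layer output as log-probabilities
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean")
    return loss
```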
S62, determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following formula (2):

L_hard = {-p′_ij·log(p_ij) - (1-p′_ij)·log(1-p_ij)}  (2)

wherein log(x) represents the logarithm function, p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p′_ij represents the labeled probability that the i-th word corresponds to the j-th translation tag (p′_ij can be obtained from the labeling translation tags corresponding to the training corpus).
And S63, determining a third loss function according to the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following formula (3):

L_soft = {-p″_ij·log(p_ij) - (1-p″_ij)·log(1-p_ij)}  (3)

wherein log(x) represents the logarithm function, p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p″_ij represents the probability, output by the teacher model, that the i-th word corresponds to the j-th translation tag.
S64, determining the model loss function according to the first loss function L_AT-FMT, the second loss function L_hard and the third loss function L_soft.

For example, the model loss function Loss_all is calculated according to the following formula (4):

Loss_all = α·L_hard + (1-α)·L_soft + β·L_AT-FMT  (4)

wherein α and β respectively represent the corresponding loss weight coefficients, α ∈ (0, 1), β ∈ R, and their specific values can be preset according to experience.
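A minimal sketch combining formulas (2)-(4) follows. Treating p_ij as element-wise probabilities fed to a binary cross-entropy with mean reduction, and the default values of alpha and beta, are assumptions of this sketch; the patent only gives the element-wise form of the losses.

```python
# Hedged sketch of formulas (2)-(4): L_hard compares the student's output probabilities
# with the labeled tags p'_ij, L_soft compares them with the teacher's output
# probabilities, and Loss_all is their weighted combination.
import torch.nn.functional as F

def hard_loss(p_student, p_labels):      # formula (2): -p'·log(p) - (1-p')·log(1-p)
    return F.binary_cross_entropy(p_student, p_labels)

def soft_loss(p_student, p_teacher):     # formula (3): same form, teacher probabilities as target
    return F.binary_cross_entropy(p_student, p_teacher)

def total_loss(p_student, p_labels, p_teacher, l_at_fmt, alpha=0.9, beta=1.0):
    # formula (4): Loss_all = alpha*L_hard + (1-alpha)*L_soft + beta*L_AT-FMT
    # (alpha and beta defaults here are illustrative; the patent leaves them to experience)
    return (alpha * hard_loss(p_student, p_labels)
            + (1 - alpha) * soft_loss(p_student, p_teacher)
            + beta * l_at_fmt)
```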
And S7, training the student model according to the model loss function, namely, updating parameters of the student model according to the loss function.
It should be noted that training the model according to the loss function is an iterative process: after each round of training, it is judged whether a preset training termination condition is met. If the training termination condition is not satisfied, training continues according to steps S4 to S7 until the training termination condition is satisfied.
In one possible implementation, meeting the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative training rounds reaches a count threshold; the count threshold may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or freely adjusted according to the application scenario, which is not limited in the embodiments of the present application. Third, the model loss function converges: as the number of training iterations increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^-3 to 10^-3 and the reference number is 10; if the fluctuation range of the model loss function over 10 iterative training results is within -10^-3 to 10^-3, the model loss function is considered to have converged. When any one of the above conditions is satisfied, the training termination condition is met and the training of the student model is completed.
In the process of updating the model parameters with the model loss function, the Adam (Adaptive Moment Estimation) optimization algorithm can be adopted for optimization. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
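Continuing the sketches above, the following illustrates one way the Adam update with lr_eb ≤ lr_db and the three termination conditions could be organized; the concrete learning rates, thresholds, counts and parameter grouping are assumptions, not values prescribed by the patent.

```python
# Hedged sketch of the parameter update in S7: Adam with separate parameter groups so
# the encoding-module learning rate lr_eb does not exceed the decoding-module learning
# rate lr_db, plus the three termination checks described above.
import torch

lr_eb, lr_db = 1e-4, 5e-4   # lr_eb <= lr_db (illustrative values)
optimizer = torch.optim.Adam([
    {"params": student.encoder.parameters(), "lr": lr_eb},
    {"params": list(student.embed.parameters()) + list(student.lang_embed.parameters()), "lr": lr_eb},
    {"params": student.decoder_layers.parameters(), "lr": lr_db},
    {"params": list(student.output.parameters()) + list(proj.parameters()), "lr": lr_db},
])

def should_stop(loss_history, max_steps=100_000, loss_threshold=0.01,
                reference_range=1e-3, reference_times=10):
    if len(loss_history) >= max_steps:                      # 1) iteration count threshold reached
        return True
    if loss_history and loss_history[-1] < loss_threshold:  # 2) loss below the loss threshold
        return True
    recent = loss_history[-reference_times:]                # 3) convergence: fluctuation of the
    if (len(recent) == reference_times and                  #    last 10 losses stays within +/-1e-3
            max(recent) - min(recent) <= 2 * reference_range):
        return True
    return False
```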
In addition, the decoding module of the student model can be reduced step by step during training in a level-by-level training mode. As shown in fig. 2, after a student model (comprising M decoding layers) is obtained by training with the teacher model (comprising K decoding layers), the trained student model is used as a new teacher model to train a student model with even fewer decoding layers, and so on, until a student model comprising only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is selected appropriately as a trade-off between the improvement in the inference performance of the translation model and its translation effect. After the training of the student model is completed, the teacher model is removed.
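The level-by-level reduction described above could be organized as in the following sketch; the train_student callable and the layer schedule are assumptions standing in for one full run of steps S4 to S7, and are not specified by the patent.

```python
# Hedged sketch of level-by-level training: once a student with M decoding layers has
# been distilled from the K-layer teacher, it serves as the new teacher for an even
# smaller student, until only N decoding layers remain (K > M > N).
def progressive_distillation(initial_teacher, train_student, vocab_size=32000,
                             layer_schedule=(3, 1)):
    """train_student is an assumed callable that runs steps S4-S7 until the
    termination condition is met (e.g. a loop over batches using total_loss above)."""
    teacher_model = initial_teacher                  # starts with K decoding layers
    for num_layers in layer_schedule:                # first M decoding layers, then N
        student_model = TranslationModel(vocab_size, num_decoder_layers=num_layers)
        train_student(teacher_model, student_model)  # distil the smaller student from the current teacher
        teacher_model = student_model                # the trained student becomes the new teacher
    return teacher_model                             # final N-layer student; the teachers are discarded
```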
The student model obtained by training in this embodiment has a simplified model structure, and it is supervised during training by the intermediate content and final results output by the teacher model, so that the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters.
It should be noted that, for the sake of simplicity of description, the foregoing embodiments are all described as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.
Example 2
The present embodiment provides a machine translation model training device based on knowledge distillation, as shown in fig. 3, the device 1 specifically includes: the model acquisition module 11, the sample acquisition module 12, the preprocessing module 13, the teacher model processing module 14, the student model processing module 15, the model loss function determination module 16 and the model training module 17.
Each module is described in detail below:
the model acquisition module 11 is configured to acquire a trained teacher model and an untrained student model, where the student model is obtained by simplifying some modules in the teacher model.
It should be explained that knowledge distillation is a network model compression method: by constructing a teacher model-student model framework, the teacher model guides the training of the student model, distilling out the "knowledge" about feature representations learned by the teacher model (which has a complex model structure and a large number of parameters) and transferring that "knowledge" to the student model (which has a simple model structure, few parameters and weaker learning ability). Knowledge distillation can thus improve the performance of the student model without increasing its complexity.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and the student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means its model parameters are frozen, i.e. the model parameters of the teacher model cannot be modified in the subsequent training process; the student model is in training mode, and its model parameters can be modified during the training process.
For example, the teacher model and the student model in this embodiment may be translation models based on the Transformer. As shown in fig. 2, the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, wherein the embedding module may include a corpus embedding layer and a language-type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference time, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model; they are not reduced, and their parameters can be shared. That is, this embodiment simplifies and compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the numbers of neurons in the embedding module and the output module of the student model are kept consistent with those in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model so as to perform loss function calculation later, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
The sample acquiring module 12 is configured to acquire a sample data set, where the sample data set includes a plurality of training corpora, and labeling translation tags corresponding to the training corpora, and the training corpora may also carry a corresponding language type.
The preprocessing module 13 is used for preprocessing the sample data set and specifically comprises: a numerical value conversion unit 131, configured to convert the characters in the training corpus into corresponding numerical values; and a length adjustment unit 132, configured to divide the training corpus into different batches and, since the training corpora differ in length, adjust the training corpora of each batch to the same length by zero-value padding. Zero-value padding takes the longest sentence in the same batch of training corpus as a reference and fills the missing positions of the other sentences with 0, so that their lengths are adjusted to be consistent with the longest sentence. In this way, input data of size [Batch_size, Sequence_length] is obtained, where Batch_size refers to the number of training corpora in the same batch and Sequence_length refers to the length of the longest corpus in that batch.
The teacher model processing module 14 is configured to input the preprocessed training corpus into the teacher model for processing, so as to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model.
For example, when the teacher model is in the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then input to the encoding module for feature encoding, then the decoding module performs feature decoding, and intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that a final result output by the teacher model is obtained.
The student model processing module 15 is configured to input the training corpus into the student model for processing, so as to obtain intermediate content output by the simplifying module in the student model and a final result output by the student model.
For example, when the student model is in the structure shown in fig. 2, the training corpus is firstly input to the embedding module of the student model, so that the training corpus and the language type thereof are mapped respectively through the corpus layer and the language type layer of the embedding module, then the corpus embedding result and the language type embedding result are combined and then are input to the encoding module for feature encoding, then the feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module, so that the final result output by the student model is obtained.
The model loss function determining module 16 is configured to determine a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model. The specific implementation process of the step is as follows:
the first loss function determining unit 161 is configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT-FMT is calculated according to the following formula (1):

L_AT-FMT = Σ_{c=1}^{C} D_kl(h_t^c, h_s^c)  (1)

wherein C represents the number of decoding layers of the decoding module in the student model, D_kl represents the function computing the KL divergence, h_t^c represents the output of the c-th decoding layer of the teacher model after processing by the full connection layer, and h_s^c represents the output of the c-th decoding layer of the student model.
The second loss function determining unit 162 is configured to determine a second loss function according to the labeling translation label corresponding to the training corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following formula (2):

L_hard = {-p′_ij·log(p_ij) - (1-p′_ij)·log(1-p_ij)}  (2)
The third loss function determining unit 163 is configured to determine a third loss function based on the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following formula (3):

L_soft = {-p″_ij·log(p_ij) - (1-p″_ij)·log(1-p_ij)}  (3)

wherein p_ij represents the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p″_ij represents the probability, output by the teacher model, that the i-th word corresponds to the j-th translation tag.

The model loss function determining unit 164 is configured to determine the model loss function according to the first loss function L_AT-FMT, the second loss function L_hard and the third loss function L_soft.

For example, the model loss function Loss_all is calculated according to the following formula (4):

Loss_all = α·L_hard + (1-α)·L_soft + β·L_AT-FMT  (4)

wherein α and β respectively represent the corresponding loss weight coefficients, α ∈ (0, 1), β ∈ R, and their specific values can be preset according to experience.
The model training module 17 is configured to train the student model according to the model loss function, i.e. update parameters of the student model according to the loss function.
It should be noted that training the model according to the loss function is an iterative process: after each round of training, it is judged whether a preset training termination condition is met. If the training termination condition is not met, training continues until the training termination condition is met.
In one possible implementation, meeting the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative training rounds reaches a count threshold; the count threshold may be set empirically or flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or freely adjusted according to the application scenario, which is not limited in the embodiments of the present application. Third, the model loss function converges: as the number of training iterations increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^-3 to 10^-3 and the reference number is 10; if the fluctuation range of the model loss function over 10 iterative training results is within -10^-3 to 10^-3, the model loss function is considered to have converged. When any one of the above conditions is satisfied, the training termination condition is met and the training of the student model is completed.
In the process of updating the model parameters with the model loss function, the Adam (Adaptive Moment Estimation) optimization algorithm can be adopted for optimization. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
In addition, the decoding module of the student model can be reduced step by step during training in a level-by-level training mode. As shown in fig. 2, after a student model (comprising M decoding layers) is obtained by training with the teacher model (comprising K decoding layers), the trained student model is used as a new teacher model to train a student model with even fewer decoding layers, and so on, until a student model comprising only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is selected appropriately as a trade-off between the improvement in the inference performance of the translation model and its translation effect. After the training of the student model is completed, the teacher model is removed.
The student model obtained by training in this embodiment has a simplified model structure, and it is supervised during training by the intermediate content and final results output by the teacher model, so that the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters.
Example 3
The present embodiment provides an electronic device, which may be expressed in the form of a computing device (for example, may be a server device), including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor may implement the knowledge distillation based machine translation model training method provided in embodiment 1 when executing the computer program.
Fig. 4 shows a schematic diagram of the hardware structure of the present embodiment, and as shown in fig. 4, the electronic device 9 specifically includes:
at least one processor 91, at least one memory 92, and a bus 93 for connecting the different system components (including the processor 91 and the memory 92), wherein:
the bus 93 includes a data bus, an address bus, and a control bus.
The memory 92 includes volatile memory such as Random Access Memory (RAM) 921 and/or cache memory 922, and may further include Read Only Memory (ROM) 923.
Memory 92 also includes a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 91 executes various functional applications and data processing, such as the knowledge distillation-based machine translation model training method provided in embodiment 1 of the present application, by running a computer program stored in the memory 92.
The electronic device 9 may further communicate with one or more external devices 94 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 95. Also, the electronic device 9 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 96. The network adapter 96 communicates with other modules of the electronic device 9 via the bus 93. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 9, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of an electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present application. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the knowledge distillation based machine translation model training method of embodiment 1.
More specifically, among others, readable storage media may be employed including, but not limited to: portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of the knowledge distillation-based machine translation model training method of embodiment 1 when said program product is run on the terminal device.
Wherein the program code for carrying out the invention may be written in any combination of one or more programming languages, which program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on the remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.
Claims (12)
1. A knowledge distillation-based machine translation model training method, comprising:
obtaining a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by a simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
Determining a model loss function according to the labeling translation tag corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
performing iterative training on the student model according to the model loss function;
the determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model includes:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
And determining the model loss function according to the first loss function, the second loss function and the third loss function.
2. The knowledge distillation based machine translation model training method according to claim 1, wherein the teacher model and the student model comprise an embedding module, an encoding module, a decoding module, and an output module, respectively.
3. The knowledge distillation based machine translation model training method according to claim 2, wherein the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
4. The knowledge-distillation based machine translation model training method according to claim 1, wherein after obtaining the sample dataset, the method further comprises: and preprocessing the training corpus.
5. The knowledge distillation based machine translation model training method according to claim 4, wherein said preprocessing said training corpus comprises:
Converting the characters in the training corpus into corresponding numerical values;
dividing the training corpus into different batches, and adjusting the training corpus of each batch to be the same length in a zero-value filling mode.
6. A knowledge distillation based machine translation model training device, comprising:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying part of modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpus and labeling translation tags corresponding to the training corpus;
the teacher model processing module is used for inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
the student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeling translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
the model training module is used for carrying out iterative training on the student model according to the model loss function;
wherein the model loss function determining module comprises:
a first loss function determining unit, used for determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
a second loss function determining unit, used for determining a second loss function according to the labeling translation labels corresponding to the training corpus and the final result output by the student model;
a third loss function determining unit, used for determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and a model loss function determining unit, used for determining the model loss function according to the first loss function, the second loss function, and the third loss function.
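For the model training module of claim 6, a hedged sketch of the iterative training loop is shown below (not part of the claims). It assumes both models are nn.Module instances whose forward pass returns the simplified module's intermediate content together with the final logits, and it reuses the illustrative model_loss helper sketched after claim 1; the Adam optimizer and the hyper-parameters are assumptions.

```python
import torch

def train_student(student, teacher, batches, labels_per_batch, model_loss,
                  epochs=10, lr=1e-4):
    """Illustrative iterative training of the student against the model loss."""
    teacher.eval()  # the teacher is already trained and stays fixed
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for src, labels in zip(batches, labels_per_batch):
            # Teacher pass: intermediate content and final result, no gradients.
            with torch.no_grad():
                teacher_hidden, teacher_logits = teacher(src)
            # Student pass: intermediate content and final result.
            student_hidden, student_logits = student(src)

            loss = model_loss(student_hidden, teacher_hidden,
                              student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```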
7. The knowledge distillation based machine translation model training device according to claim 6, wherein the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module, and an output module.
8. The knowledge distillation based machine translation model training device according to claim 7, wherein the embedding module, the encoding module and the output module of the student model are consistent in structure with those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a fully connected layer is provided between the decoding module of the student model and the decoding module of the teacher model.
9. The knowledge distillation based machine translation model training device according to claim 6, wherein the device further comprises: a preprocessing module, used for preprocessing the training corpus after the sample data set is acquired.
10. The knowledge distillation based machine translation model training device according to claim 9, wherein the preprocessing module comprises:
the numerical value conversion unit is used for converting the characters in the training corpus into corresponding numerical values;
the length adjusting unit is used for dividing the training corpus into different batches and adjusting the training corpus in each batch to the same length by zero-value padding.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
12. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843014.5A CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010843014.5A CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111950302A CN111950302A (en) | 2020-11-17 |
CN111950302B (en) | 2023-11-10 |
Family
ID=73358463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010843014.5A Active CN111950302B (en) | 2020-08-20 | 2020-08-20 | Knowledge distillation-based machine translation model training method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950302B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12321846B2 (en) * | 2020-12-09 | 2025-06-03 | International Business Machines Corporation | Knowledge distillation using deep clustering |
CN112597778B (en) * | 2020-12-14 | 2023-06-13 | 华为技术有限公司 | Translation model training method, translation method and translation equipment |
CN112541122A (en) * | 2020-12-23 | 2021-03-23 | 北京百度网讯科技有限公司 | Recommendation model training method and device, electronic equipment and storage medium |
CN112784999A (en) * | 2021-01-28 | 2021-05-11 | 开放智能机器(上海)有限公司 | MobileNet-v1 knowledge distillation method based on attention mechanism, memory and terminal equipment |
CN113011202B (en) * | 2021-03-23 | 2023-07-25 | 中国科学院自动化研究所 | End-to-end image text translation method, system and device based on multitasking training |
CN113160041B (en) * | 2021-05-07 | 2024-02-23 | 深圳追一科技有限公司 | Model training method and model training device |
CN113435208B (en) * | 2021-06-15 | 2023-08-25 | 北京百度网讯科技有限公司 | Training method and device for student model and electronic equipment |
CN113642605A (en) * | 2021-07-09 | 2021-11-12 | 北京百度网讯科技有限公司 | Model distillation method, device, electronic device and storage medium |
CN113505615B (en) * | 2021-07-29 | 2024-11-26 | 沈阳雅译网络技术有限公司 | Decoding acceleration method for neural machine translation system on small CPU devices |
CN113505614A (en) * | 2021-07-29 | 2021-10-15 | 沈阳雅译网络技术有限公司 | Small model training method for small CPU equipment |
CN113706347A (en) * | 2021-08-31 | 2021-11-26 | 深圳壹账通智能科技有限公司 | Multitask model distillation method, multitask model distillation system, multitask model distillation medium and electronic terminal |
CN114861671B (en) * | 2022-04-11 | 2024-11-05 | 深圳追一科技有限公司 | Model training method, device, computer equipment and storage medium |
CN114936605A (en) * | 2022-06-09 | 2022-08-23 | 五邑大学 | A neural network training method, equipment and storage medium based on knowledge distillation |
WO2023212997A1 (en) * | 2022-05-05 | 2023-11-09 | 五邑大学 | Knowledge distillation based neural network training method, device, and storage medium |
CN115438678B (en) * | 2022-11-08 | 2023-03-24 | 苏州浪潮智能科技有限公司 | Machine translation method, device, electronic device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090050A (en) * | 2017-11-08 | 2018-05-29 | 江苏名通信息科技有限公司 | Game translation system based on deep neural network |
WO2018126213A1 (en) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
CN110059744A (en) * | 2019-04-16 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Method, the method for image procossing, equipment and the storage medium of training neural network |
CN110506279A (en) * | 2017-04-14 | 2019-11-26 | 易享信息技术有限公司 | Using the neural machine translation of hidden tree attention |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110765791A (en) * | 2019-11-01 | 2020-02-07 | 清华大学 | Method and device for automatic post-editing of machine translation |
CN111382582A (en) * | 2020-01-21 | 2020-07-07 | 沈阳雅译网络技术有限公司 | Neural machine translation decoding acceleration method based on non-autoregressive |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106383818A (en) * | 2015-07-30 | 2017-02-08 | 阿里巴巴集团控股有限公司 | Machine translation method and device |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018126213A1 (en) * | 2016-12-30 | 2018-07-05 | Google Llc | Multi-task learning using knowledge distillation |
CN110506279A (en) * | 2017-04-14 | 2019-11-26 | 易享信息技术有限公司 | Using the neural machine translation of hidden tree attention |
CN108090050A (en) * | 2017-11-08 | 2018-05-29 | 江苏名通信息科技有限公司 | Game translation system based on deep neural network |
CN110059744A (en) * | 2019-04-16 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Method, the method for image procossing, equipment and the storage medium of training neural network |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110765791A (en) * | 2019-11-01 | 2020-02-07 | 清华大学 | Method and device for automatic post-editing of machine translation |
CN111382582A (en) * | 2020-01-21 | 2020-07-07 | 沈阳雅译网络技术有限公司 | Neural machine translation decoding acceleration method based on non-autoregressive |
Non-Patent Citations (4)
Title |
---|
Distilling the Knowledge in a Neural Network; Geoffrey Hinton et al.; published online at https://arxiv.org/abs/1503.02531; 1-9 *
MobileBERT: Task-agnostic compression of BERT by progressive knowledge transfer; Zhiqing Sun et al.; ICLR 2020 Conference; 1-26 *
Improving translation quality of neural machine translation compression models with monolingual data; Li Xiang et al.; Journal of Chinese Information Processing; Vol. 33, No. 7; 46-55 *
Intent classification method based on BERT model and knowledge distillation; Liao Shenglan et al.; Computer Engineering; Vol. 47, No. 5; 73-79 *
Also Published As
Publication number | Publication date |
---|---|
CN111950302A (en) | 2020-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111950302B (en) | Knowledge distillation-based machine translation model training method, device, equipment and medium | |
CN112257858B (en) | Model compression method and device | |
CN109992773B (en) | Word vector training method, system, device and medium based on multi-task learning | |
WO2023160472A1 (en) | Model training method and related device | |
CN113987169A (en) | Method, device, device and storage medium for generating text summaries based on semantic blocks | |
CN116109978B (en) | Unsupervised video description method based on self-constrained dynamic text features | |
WO2023137911A1 (en) | Intention classification method and apparatus based on small-sample corpus, and computer device | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank | |
Zhang et al. | A generalized language model in tensor space | |
CN115730590A (en) | Intention recognition method and related equipment | |
CN114398899A (en) | Training method and device for pre-training language model, computer equipment and medium | |
CN116992940A (en) | SAR image multi-type target detection light-weight method and device combining channel pruning and knowledge distillation | |
CN117763084A (en) | Knowledge base retrieval method based on text compression and related equipment | |
US20250190795A1 (en) | Model optimization method and apparatus, computer device, and computer storage medium | |
CN116432637A (en) | A Multi-granularity Extraction-Generation Hybrid Abstract Method Based on Reinforcement Learning | |
CN115757694A (en) | Recruitment industry text recall method, system, device and medium | |
CN118520950A (en) | Ultra-long word element model reasoning method and device, electronic equipment and storage medium | |
CN117851595A (en) | A method for sentiment analysis of social network text | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
Yang | Deep learning applications in natural language processing and optimization strategies | |
CN114090754A (en) | Question generation method, device, equipment and storage medium | |
CN112257463A (en) | Compression method of neural machine translation model for Chinese-English translation | |
CN114722845B (en) | A neural machine translation method based on source language reordering | |
Peng | Design and Construction of Machine Translation System Based on RNN Model | |
CN118279701B (en) | Continuous evolutionary learning method and system for joint optimization of model and sample storage resources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |