
CN108363704A - A kind of neural network machine translation corpus expansion method based on statistics phrase table - Google Patents


Info

Publication number
CN108363704A
CN108363704A (application CN201810175915.4A)
Authority
CN
China
Prior art keywords
phrase
definition
translation
training set
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810175915.4A
Other languages
Chinese (zh)
Inventor
黄河燕
史学文
鉴萍
唐翼琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810175915.4A
Publication of CN108363704A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A neural network machine translation corpus expansion method based on a statistical phrase table, belonging to the technical field of machine translation. The invention proposes a corpus expansion method for neural network machine translation that can effectively enlarge the corpus on the basis of the original machine translation training set. The method comprises a training set expansion stage and a model training stage. In stage one, a phrase table is learned from the original training set by statistical machine learning and, after filtering according to certain rules, is merged with the original training set to form a new, expanded training set. In stage two, the neural machine translation model is first pre-trained on the expanded training set and then trained on the original training set for fine-tuning, yielding the final model. Experimental results show that, compared with a machine translation model that does not use the corpus expansion method, the invention significantly improves the BLEU evaluation metric.

Description

A Neural Network Machine Translation Corpus Expansion Method Based on a Statistical Phrase Table

Technical Field

The invention relates to a neural network machine translation corpus expansion method based on a statistical phrase table, and belongs to the technical fields of computer applications and machine translation.

Background

Machine translation is the technology of using computers to automatically translate one language (the source language) into another language (the target language).

With the development of artificial neural networks and deep learning, neural network machine translation based on deep learning (hereinafter, neural machine translation) has achieved important results in recent years. Neural machine translation requires little linguistic knowledge and manual intervention, its models occupy little storage space, and its output is fluent and natural. For translation tasks with abundant bilingual resources, neural machine translation is generally considered the best choice. It has received wide attention and recognition in the machine translation field and has been put into commercial use.

Neural networks for translation are trained mainly on bilingual parallel sentence pairs. The neural network models used in neural machine translation typically have a large number of free parameters, and in theory such models require large-scale bilingual parallel corpora for training. Experience shows that a neural machine translation model with tens of millions of free parameters usually needs at least a million sentence pairs of training data to achieve good results. For languages where bilingual parallel resources are scarce, it is therefore difficult to obtain satisfactory results with neural networks.

In addition, neural machine translation is usually trained on one or more complete sentence pairs at a time. When corpus resources are scarce, the model's ability to learn the low-frequency phrases contained in those sentence pairs is limited, especially when such phrases must be translated in isolation.

Summary of the Invention

Aiming at the problem of training neural machine translation models for resource-scarce languages, the invention proposes a corpus expansion method for neural network machine translation based on a statistical phrase table, which can effectively expand the training data of a neural machine translation model and mitigate the adverse effect of scarce language resources on model training.

The invention comprises a training set expansion stage and a model training stage.

A) The training set expansion stage operates as follows: a phrase table with probability scores is learned from the original training set by statistical machine learning; the learned phrase table is filtered according to rules; the filtered phrase table is extracted into a new data set of bilingual parallel phrase pairs; and the newly extracted data set is concatenated with the original training set to obtain new bilingual parallel pseudo-data, thereby expanding the training set.

B) The model training stage operates in two steps. Step one is pre-training: the bilingual parallel pseudo-data obtained in stage A) is used to pre-train the model, yielding the pre-trained model b1. Step two retrains model b1 on the original training set to obtain the final model b2; its purpose is to fine-tune the model and reduce the influence of the noise introduced by the pseudo-data.
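The two-stage schedule above (pre-train on pseudo-data, then fine-tune on the original data) can be sketched as follows. `DummyModel`, `fit`, and the epoch counts are illustrative placeholders, not the patent's actual implementation.

```python
class DummyModel:
    """Stand-in for a neural MT model; records which data set it saw."""
    def __init__(self):
        self.history = []

    def fit(self, dataset):
        self.history.append(dataset)


def two_stage_training(model, expanded_set, original_set,
                       pretrain_epochs=2, finetune_epochs=1):
    """Stage B of the method: pre-train on the expanded pseudo-data
    training set (step B1), then continue training the same model on
    the original training set (step B2) to obtain the final model."""
    for _ in range(pretrain_epochs):   # step B1: pre-training -> model b1
        model.fit(expanded_set)
    for _ in range(finetune_epochs):   # step B2: fine-tuning -> model b2
        model.fit(original_set)
    return model
```

The key design point the patent relies on is that fine-tuning on the clean original data comes last, so the noise in the phrase-pair pseudo-data does not dominate the final model.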

To achieve the above objectives, the invention adopts the following technical scheme.

First, the relevant definitions are given as follows:

Definition 1: Source language: in machine translation, the language of the content to be translated; for example, in Chinese-to-English machine translation, Chinese is the source language.

Definition 2: Source language data: data in the source language. If the source language data is a natural-language sentence, it is called a source language sentence; for example, in Chinese-to-English machine translation, the input Chinese sentence is source language data, also called a source language sentence.

A collection of source language data is called a source language data set.

Definition 3: Target language: in machine translation, the language into which the content is translated; for example, in Chinese-to-English machine translation, English is the target language.

Definition 4: Target language data: data in the target language. If the target language data is a natural-language sentence, it is called a target language sentence; for example, in Chinese-to-English machine translation, the output English sentence is target language data, also called a target language sentence.

A collection of target language data is called a target language data set.

Definition 5: Training set: specifically, the training set of a statistical machine translation model, i.e. the data set used to train the model, denoted T.

Definition 6: Original training set: the training set before expansion.

Definition 7: Word alignment information (word alignment for short): in the training set T, the alignment relation between source language words and target language words, denoted α.

If, in the training set T, the j-th word of the source language is aligned with the i-th word of the target language, the link is recorded as (j, i).

Definition 8: Phrase: a linguistic unit consisting of one or more words.

A phrase in the source language is called a source language phrase, denoted f; a phrase in the target language is called a target language phrase, denoted e.

Definition 9: Translation phrase pair: a pair consisting of a source language phrase and an aligned target language phrase, e.g. ("长城", "The Great Wall").

Definition 10: Forward phrase translation probability: the conditional probability of translating to the target language phrase e given the source language phrase f, denoted φ(e|f).

Definition 11: Reverse phrase translation probability: the conditional probability of translating back to the source language phrase f given the target language phrase e, denoted φ(f|e).

Definition 12: Bidirectional phrase translation probability: the forward phrase translation probability and the reverse phrase translation probability, taken together.
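The probabilities of Definitions 10-12 are, in standard phrase-based SMT practice, estimated by relative frequency over the extracted phrase pairs: φ(e|f) = count(f, e) / count(f) and φ(f|e) = count(f, e) / count(e). A minimal sketch, with an illustrative data layout that is not taken from the patent:

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimates of the bidirectional phrase translation
    probabilities (Definitions 10-12).  `phrase_pairs` is a list of
    (source phrase f, target phrase e) tuples, one per extracted occurrence.
    Returns two dicts keyed by (f, e): forward phi(e|f) and reverse phi(f|e)."""
    pair_count = Counter(phrase_pairs)
    f_count = Counter(f for f, e in phrase_pairs)   # count(f)
    e_count = Counter(e for f, e in phrase_pairs)   # count(e)
    forward = {(f, e): c / f_count[f] for (f, e), c in pair_count.items()}
    reverse = {(f, e): c / e_count[e] for (f, e), c in pair_count.items()}
    return forward, reverse
```

For instance, if "长城" is extracted three times, twice aligned to "The Great Wall" and once to "Great Wall", then φ("The Great Wall" | "长城") = 2/3.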

Definition 13: Forward lexicalized phrase translation probability: given the source language phrase f, the lexicalized translation probability of translating to the target language phrase e, denoted lex(e|f).

Definition 14: Reverse lexicalized phrase translation probability: given the target language phrase e, the lexicalized translation probability of translating back to the source language phrase f, denoted lex(f|e).

Definition 15: Bidirectional lexicalized phrase translation probability: the forward lexicalized translation probability and the reverse lexicalized translation probability, taken together.
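The patent does not spell out how lex(e|f) is computed; the sketch below follows the standard lexical-weighting formula of phrase-based SMT (as implemented in tools like Moses, which the embodiment uses): each target word's probability is the average of the word translation probabilities w(e_i|f_j) over its aligned source words, with unaligned words scored against NULL. The word-probability table `w` and its key layout are assumptions for illustration.

```python
def lexical_weight(src_words, tgt_words, alignment, w):
    """Lexicalized translation probability lex(e|f) in the standard
    phrase-based SMT sense (Definition 13).  `alignment` is a set of
    (j, i) links per Definition 7; `w` maps (source word, target word)
    pairs, and (None, target word) for NULL alignment, to probabilities."""
    prob = 1.0
    for i, e_word in enumerate(tgt_words):
        links = [j for (j, i2) in alignment if i2 == i]
        if links:
            # average w(e_i | f_j) over the source words aligned to e_i
            prob *= sum(w.get((src_words[j], e_word), 0.0)
                        for j in links) / len(links)
        else:
            prob *= w.get((None, e_word), 0.0)  # NULL-aligned target word
    return prob
```

lex(f|e) (Definition 14) is obtained the same way with the roles of source and target swapped.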

Definition 16: Phrase table (also called phrase translation table): a table consisting of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probabilities and bidirectional lexicalized translation probabilities.

Definition 17: Filter rules: manually formulated rules for filtering the phrase table according to the source language phrases, target language phrases, bidirectional phrase translation probabilities, and bidirectional lexicalized phrase translation probabilities it contains.

The training set expansion stage comprises the following steps:

Step A1: according to Definitions 1-5, preprocess the original training set to obtain the preprocessed original training set Tf.

The specific preprocessing procedure differs across source and target languages; its purpose is to normalize the training set, yielding the preprocessed original training set Tf.

Step A2: based on the preprocessed original training set Tf from Step A1, learn the word alignment information according to Definitions 7 and 8. This process is usually implemented with an open-source word alignment toolkit: the preprocessed original training set from Step A1 is given as input, and training the alignment tool yields the word alignment information α of the training set.

Step A3: according to Definitions 6-16, and combining the preprocessed original training set Tf from Step A1 with the word alignment information α from Step A2, extract translation phrase pairs and estimate their probabilities, obtaining the bidirectional phrase translation probability and bidirectional lexicalized translation probability of each pair. Combining the translation phrase pairs with these probabilities yields the phrase table; each record of the phrase table consists of a translation phrase pair, word alignment information, bidirectional phrase translation probabilities, and bidirectional lexicalized translation probabilities.
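In the embodiment, the phrase-pair extraction of Step A3 is delegated to Moses; as an illustration of what that extraction does, here is a simplified version of the standard alignment-consistent phrase extraction algorithm. A source span and target span form a translation phrase pair (Definition 9) if and only if they contain at least one alignment link and no link crosses the boundary of the rectangle they define. This is a sketch, not the patent's implementation, and it omits Moses refinements such as limits on unaligned-word extension.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Enumerate all phrase pairs up to `max_len` words per side that are
    consistent with the word alignment (a set of (j, i) links, Definition 7)."""
    pairs = set()
    for j1 in range(len(src)):
        for j2 in range(j1, min(j1 + max_len, len(src))):
            for i1 in range(len(tgt)):
                for i2 in range(i1, min(i1 + max_len, len(tgt))):
                    # all links that touch the candidate rows or columns
                    links = [(j, i) for (j, i) in alignment
                             if j1 <= j <= j2 or i1 <= i <= i2]
                    # consistent iff every touching link lies fully inside
                    # the rectangle, and there is at least one such link
                    inside = [(j, i) for (j, i) in links
                              if j1 <= j <= j2 and i1 <= i <= i2]
                    if links and links == inside:
                        pairs.add((" ".join(src[j1:j2 + 1]),
                                   " ".join(tgt[i1:i2 + 1])))
    return pairs
```

The extracted pairs are then scored (Definitions 10-15) to build the phrase table of Definition 16.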

Step A4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step A3 using the manually defined filter rules, removing translation phrase pairs with low probabilities, to obtain the filtered phrase table, denoted Pnew.

Step A5: according to Definitions 5 and 16, concatenate the translation-phrase-pair part of the filtered phrase table Pnew from Step A4 with the preprocessed original training set Tf from Step A1 to obtain the new training set Tnew.

Steps A1 through A5 complete the training set expansion stage of the method.

The model training stage comprises the following steps:

Step B1: pre-train the model on the new training set Tnew from Step A5 to obtain model b1.

Step B2: train model b1 from Step B1 again on the preprocessed original training set Tf from Step A1 to obtain the newly trained model b2.

Steps B1 and B2 complete the model training stage of the method.

Thus, Steps A1 through A5 and Steps B1 through B2 complete a neural network machine translation corpus expansion method based on a statistical phrase table.

Beneficial Effects

Compared with existing ways of using machine translation training sets, the neural network machine translation corpus expansion method based on a statistical phrase table of the present invention has the following beneficial effects:

1. The method can effectively expand the original training set without requiring additional bilingual or monolingual data, mitigating the adverse effect that the small training sets of resource-scarce languages have on the training of neural machine translation models.

2. With identical training, development, and test set data, the invention yields a clear improvement in the BLEU evaluation metric over a neural machine translation model trained without it.

Brief Description of the Drawings

Fig. 1 is a flow chart of the neural network machine translation corpus expansion method based on a statistical phrase table of the present invention and of the embodiments.

Detailed Description

The method of the present invention is described in detail below with reference to the accompanying drawing and the embodiments, following the two main stages of the invention: 1) the training set expansion stage and 2) the model training stage.

Embodiment 1

This embodiment describes the flow of the method of the present invention with a concrete example.

Fig. 1 is a flow chart of the method of the present invention as applied in this embodiment.

Fig. 1 shows the operational flow of the two stages of the invention: 1) the training set expansion stage and 2) the model training stage.

Take Uyghur-to-Chinese translation as an example, where Uyghur is the source language and Chinese is the target language.

1) Training set expansion stage:

Step 1: according to Definitions 1-5, preprocess the original training set. The specific procedure differs across source and target languages and aims to normalize the training set. Here, the data in both the source language (Uyghur) and the target language (Chinese) are first segmented into word pieces (word-piece segmentation) and then tokenized, yielding the preprocessed original training set Tf.

Step 2: according to Definitions 6 and 7, learn the word alignment. In this embodiment, this is done with the open-source word alignment toolkit GIZA++: the preprocessed original training set from Step 1 is given as input, and training GIZA++ on it yields the word alignment information α of the training set.

Step 3: according to Definitions 6-16, and combining the preprocessed original training set Tf from Step 1 with the word alignment information α from Step 2, extract translation phrase pairs and estimate their probabilities. In this embodiment, this is implemented with the train-model.perl script of the open-source Moses toolkit, yielding the phrase table P; each record of the phrase table consists of a translation phrase pair, word alignment information, bidirectional phrase translation probabilities, and bidirectional lexicalized translation probabilities.

Step 4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step 3 using manually defined filter rules. The manually defined rule is as follows:

Keep a translation phrase pair if and only if its phrase translation probabilities φ(e|f) and φ(f|e) each reach a preset threshold, and lex(e|f) ≥ 0.025, and lex(f|e) ≥ 0.025.

Translation phrase pairs with low probabilities are thus filtered out, yielding the filtered phrase table Pnew.
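The filter rule of Step 4 can be sketched as a simple predicate over phrase-table records. The lex thresholds (0.025) come from the embodiment; the φ threshold values are not legible in this copy of the text, so `phi_threshold` below is an assumed placeholder parameter, and the record key names are likewise illustrative.

```python
def keep_pair(record, phi_threshold=0.05, lex_threshold=0.025):
    """Filter rule of Step 4 (Definition 17): keep a translation phrase
    pair only if all four probabilities clear their thresholds.
    `phi_threshold` is an assumed placeholder; 0.025 is from the text."""
    return (record["phi_e_given_f"] >= phi_threshold
            and record["phi_f_given_e"] >= phi_threshold
            and record["lex_e_given_f"] >= lex_threshold
            and record["lex_f_given_e"] >= lex_threshold)


def filter_phrase_table(table, **kw):
    """Apply the rule to every record, producing the filtered table P_new."""
    return [rec for rec in table if keep_pair(rec, **kw)]
```

Thresholding on both directions discards pairs whose alignment evidence is weak in either direction, which is what keeps the pseudo-data reasonably clean before it is spliced into the training set.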

Step 5: according to Definitions 5 and 16, concatenate the translation-phrase-pair part of the filtered phrase table Pnew from Step 4 with the preprocessed original training set Tf from Step 1 to obtain the new training set Tnew.

2) Model training stage:

Step 6: pre-train the model. This embodiment uses the open-source neural machine translation framework tensor2tensor; the model is pre-trained on the new training set Tnew from Step 5, yielding model b1.

Step 7: train model b1 from Step 6 again on the preprocessed original training set Tf from Step 1 to obtain the newly trained model b2.

Thus, Steps 1 through 7 complete the neural network machine translation corpus expansion method based on a statistical phrase table.

Embodiment 2

The training set of the Uyghur-Chinese news translation task provided by CWMT2017 was randomly split into a training set, a development set, and test set 1; in addition, the development set of the CWMT2017 Uyghur-Chinese news translation evaluation task was used as test set 2. With the original training set, development set, test set data, and neural machine translation model held constant, and Chinese-character-based BLEU as the evaluation metric, comparing the invention with training the neural machine translation model without it gives the following experimental results.

Table 1: Comparison of BLEU scores before and after applying the training set expansion method proposed by the invention

The results in Table 1 show that, with identical training, development, and test set data, the method of the invention clearly improves the BLEU evaluation metric compared with training the neural machine translation model without it.

The above is only a preferred embodiment of the present invention, and the invention is not limited to what is disclosed in this embodiment and the drawing. Any equivalent or modification made without departing from the spirit disclosed by the present invention falls within the scope of protection of the invention.

Claims (4)

1.一种基于统计短语表的神经网络机器翻译语料扩展方法,其特征在于:包含:训练集扩展阶段和模型训练阶段;1. A neural network machine translation corpus expansion method based on a statistical phrase table, characterized in that: comprising: a training set expansion stage and a model training stage; 其中,A)训练集扩展阶段的操作如下:通过统计机器学习方法从原始训练集中学习得到带有概率得分的短语表,并根据规则对学习得到的短语表进行过滤,将过滤后的短语表抽取成新的双语平行短语对数据集,将新抽取出的数据集与原始训练集拼接得到新的双语平行伪数据,实现训练集的扩展;Among them, A) the operation of the training set expansion stage is as follows: the phrase table with probability score is learned from the original training set by statistical machine learning method, and the learned phrase table is filtered according to the rules, and the filtered phrase table is extracted Form a new bilingual parallel phrase pair data set, splice the newly extracted data set with the original training set to obtain new bilingual parallel pseudo-data, and realize the expansion of the training set; B)模型训练阶段的操作分为两个步骤,步骤一是预训练,即将阶段A)得到的双语平行伪数据对模型进行预训练,训练后得到预训练好的模型b1;步骤二利用原始训练集重新对模型b2进行训练,目的为对模型进行调优,缓解伪数据中引入的噪声对模型的影响。B) The operation of the model training stage is divided into two steps. Step one is pre-training, that is, the bilingual parallel dummy data obtained in stage A) is used to pre-train the model, and the pre-trained model b 1 is obtained after training; step two uses the original The training set retrains the model b 2 for the purpose of tuning the model and alleviating the influence of the noise introduced in the fake data on the model. 2.根据权利要求1所述的一种基于统计短语表的神经网络机器翻译语料扩展方法,其特征在于:为实现上述目的和技术,采用如下技术方案:2. 
a kind of neural network machine translation corpus expansion method based on statistical phrase table according to claim 1, is characterized in that: for realizing above-mentioned purpose and technology, adopt following technical scheme: 首先进行相关定义,具体如下:First, the relevant definitions are made, as follows: 定义1:源语言,即机器翻译中,进行翻译时将要被翻译的内容所属的语言,例如从中文翻译到英文的机器翻译中,中文为源语言;Definition 1: Source language, that is, in machine translation, the language of the content to be translated during translation, for example, in machine translation from Chinese to English, Chinese is the source language; 定义2:源语言数据,即属于源语言的数据,若源语言数据是一个自然语言句子,则该属于源语言的数据称为源语言句子,例如从中文翻译到英文的机器翻译中,输入的中文句子就是源语言数据,亦可称为源语言句子;Definition 2: Source language data, that is, data belonging to the source language. If the source language data is a natural language sentence, the data belonging to the source language is called a source language sentence. For example, in machine translation from Chinese to English, the input Chinese sentences are the source language data, which can also be called source language sentences; 由源语言数据组成的集合称为源语言数据集;A collection of source language data is called a source language dataset; 定义3:目标语言,即机器翻译中,进行翻译时被翻译成的内容所属的语言,例如从中文翻译到英文的机器翻译中,英文为目标语言;Definition 3: Target language, that is, in machine translation, the language to which the translated content belongs, for example, in machine translation from Chinese to English, English is the target language; 定义4:目标语言数据,即属于目标语言的数据,若目标语言数据是一个自然语言句子,则该属于目标语言的数据称为目标语言句子,例如从中文翻译到英文的机器翻译中,输出的英文句子就是目标语言数据,亦可称为目标语言句子;Definition 4: Target language data, that is, data belonging to the target language. If the target language data is a natural language sentence, the data belonging to the target language is called the target language sentence. 
For example, in machine translation from Chinese to English, the output English sentences are the target language data, and can also be called target language sentences; 由目标语言数据组成的集合称为目标语言数据集;A collection of target language data is called a target language dataset; 定义5:训练集,特指统计机器翻译模型的训练集,即用于训练统计机器翻译模型的数据集合,记为T;Definition 5: Training set, specifically refers to the training set of the statistical machine translation model, that is, the data set used to train the statistical machine translation model, denoted as T; 定义6:原始训练集,即经过扩展前的训练集;Definition 6: The original training set, that is, the training set before expansion; 定义7:词对齐信息,简称词对齐,即训练集T中,源语言单词和目标语言单词之间的对齐关系,记为α;Definition 7: Word alignment information, referred to as word alignment, is the alignment relationship between source language words and target language words in the training set T, denoted as α; 其中,若训练集T中,源语言第j个单词与目标语言第i个单词存在对齐关系记为(j,i);Among them, if in the training set T, there is an alignment relationship between the jth word in the source language and the ith word in the target language, it is recorded as (j,i); 定义8,短语,一个或多个单词组成的语言单位;Definition 8, a phrase, a linguistic unit consisting of one or more words; 使用的语言为源语言的短语称为源语言短语,记为f,使用的语言为目标语言的短语称为目标语言短语,记为e;Phrases in the source language are called source language phrases, denoted as f, and phrases in the target language are called target language phrases, denoted as e; 定义9,翻译短语对,源语言短语和对齐的目标语言短语组成的短语对,例如“(‘长城’,‘The Great Wall’)”;Definition 9, translation phrase pair, a phrase pair consisting of a source language phrase and an aligned target language phrase, such as "('The Great Wall', 'The Great Wall')"; 定义10,正向短语翻译概率,即给定源语言短语f时,翻译到目标语言短语e的条件概率,记为 Definition 10, the forward phrase translation probability, that is, given the source language phrase f, the conditional probability of translation to the target language phrase e, denoted as 定义11,反向短语翻译概率,即给定目标语言短语e时,翻译回源语言短语f的条件概率,记为 Definition 11, the reverse phrase translation probability, that is, given 
the target language phrase e, the conditional probability of translating back to the source language phrase f, denoted as φ(f|e). Definition 12: Bidirectional phrase translation probability, the forward and reverse phrase translation probabilities taken together. Definition 13: Forward lexicalized phrase translation probability, the lexicalized probability of translating to the target language phrase e given the source language phrase f, denoted as lex(e|f). Definition 14: Reverse lexicalized phrase translation probability, the lexicalized probability of translating back to the source language phrase f given the target language phrase e, denoted as lex(f|e). Definition 15: Bidirectional lexicalized phrase translation probability, the forward and reverse lexicalized translation probabilities taken together. Definition 16:
The phrase table, also called the phrase translation table, is composed of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probability and its bidirectional lexicalized translation probability. Definition 17: Filtering rules, that is, manually formulated rules for filtering the phrase table according to the source language phrases, target language phrases, bidirectional phrase translation probabilities, and bidirectional lexicalized phrase translation probabilities it contains. The training set expansion phase includes the following steps: Step A1: According to Definitions 1, 2, 3, 4, and 5, preprocess the original training set to obtain the preprocessed original training set Tf. Step A2: Based on the preprocessed original training set Tf obtained in Step A1, learn word alignment information according to Definitions 7 and 8; this is typically done with an open-source word alignment toolkit, which takes Tf as input and, after training, produces the word alignment information α of the training set. Step A3: According to Definition 6, Definition 7, Definition 8, Definition 9, Definition 10, Definition 11, Definition 12, Definition 13,
Definition 14, Definition 15, and Definition 16, and combining the preprocessed original training set Tf from Step A1 with the word alignment information α from Step A2, extract translation phrase pairs and estimate their probabilities, obtaining the bidirectional phrase translation probability and the bidirectional lexicalized translation probability of each pair; combining the translation phrase pairs with these probabilities yields the phrase table, in which each record consists of a translation phrase pair, its word alignment information, its bidirectional phrase translation probability, and its bidirectional lexicalized translation probability. Step A4: According to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step A3 using the manually defined filtering rules, discarding translation phrase pairs with low probability, to obtain the filtered phrase table, denoted as Pnew. Step A5: According to Definitions 5 and 16, concatenate the translation phrase pairs of the filtered phrase table Pnew obtained in Step A4 with the preprocessed original training set Tf obtained in Step A1 to obtain the new training set Tnew. 3.
A statistical-phrase-table-based neural network machine translation corpus expansion method according to claim 1, characterized in that the model training phase includes the following steps: Step B1: Pre-train the model on the new training set Tnew obtained in Step A5 to obtain model b1. Step B2: Train the model b1 obtained in Step B1 again on the preprocessed original training set Tf obtained in Step A1 to obtain the newly trained model b2. 4. A statistical-phrase-table-based neural network machine translation corpus expansion method according to claim 1, characterized in that, in Step A1, the specific preprocessing applied to the original training set varies with the source and target languages; its purpose is to normalize the training set, yielding the preprocessed original training set Tf.
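The estimation, filtering, and concatenation described in steps A3 to A5 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the relative-frequency estimator, and the 0.5 probability threshold are all hypothetical choices for demonstration (the patent leaves the filtering rules to manual definition, and this sketch omits the lexicalized probabilities and the actual phrase-pair extraction from word alignments).

```python
from collections import Counter

def estimate_phrase_table(phrase_pairs):
    """Estimate bidirectional phrase translation probabilities
    (Definitions 10-12) by relative frequency over extracted pairs."""
    pair_count = Counter(phrase_pairs)                 # count(f, e)
    src_count = Counter(f for f, e in phrase_pairs)    # count(f)
    tgt_count = Counter(e for f, e in phrase_pairs)    # count(e)
    table = []
    for (f, e), c in pair_count.items():
        phi_e_f = c / src_count[f]   # forward: phi(e|f)
        phi_f_e = c / tgt_count[e]   # reverse: phi(f|e)
        table.append((f, e, phi_e_f, phi_f_e))
    return table

def filter_table(table, min_prob=0.5):
    """Step A4: keep only pairs whose bidirectional probabilities both
    exceed a threshold (one illustrative, manually chosen filtering rule)."""
    return [(f, e) for f, e, p_ef, p_fe in table
            if p_ef >= min_prob and p_fe >= min_prob]

def expand_training_set(training_set, phrase_pairs, min_prob=0.5):
    """Step A5: append the filtered phrase pairs (Pnew) to the
    preprocessed original training set Tf to obtain Tnew."""
    table = estimate_phrase_table(phrase_pairs)
    return training_set + filter_table(table, min_prob)

# Toy demonstration with already-extracted phrase pairs:
pairs = [("长城", "The Great Wall"), ("长城", "The Great Wall"),
         ("长城", "Great Wall"), ("北京", "Beijing")]
# ('长城', 'Great Wall') is filtered out: phi(e|f) = 1/3 < 0.5.
t_new = expand_training_set([("我爱长城", "I love the Great Wall")], pairs)
```

In the full method, the Tnew produced this way is used to pre-train the neural translation model (step B1), after which the model is trained again on the original Tf alone (step B2).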
CN201810175915.4A 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table Pending CN108363704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810175915.4A CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810175915.4A CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Publications (1)

Publication Number Publication Date
CN108363704A true CN108363704A (en) 2018-08-03

Family

ID=63003675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810175915.4A Pending CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Country Status (1)

Country Link
CN (1) CN108363704A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning
CN110543645A (en) * 2019-09-04 2019-12-06 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110717341A (en) * 2019-09-11 2020-01-21 昆明理工大学 A method and device for constructing an old-Chinese bilingual corpus with Thai as the pivot
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN111160046A (en) * 2018-11-07 2020-05-15 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111368035A (en) * 2020-03-03 2020-07-03 新疆大学 Neural network-based Chinese dimension-dimension Chinese organization name dictionary mining system
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
CN113111667A (en) * 2021-04-13 2021-07-13 沈阳雅译网络技术有限公司 Method for generating pseudo data by low-resource language based on multi-language model
CN117540755A (en) * 2023-11-13 2024-02-09 北京云上曲率科技有限公司 Method and system for enhancing data by neural machine translation model
CN118095302A (en) * 2024-04-26 2024-05-28 四川交通运输职业学校 A computer-assisted translation method and system

Citations (9)

Publication number Priority date Publication date Assignee Title
CN102214166A (en) * 2010-04-06 2011-10-12 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
US20130144593A1 (en) * 2007-03-26 2013-06-06 Franz Josef Och Minimum error rate training with a large number of features for machine learning
CN104391842A (en) * 2014-12-18 2015-03-04 苏州大学 Translation model establishing method and system
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN105190609A (en) * 2013-06-03 2015-12-23 国立研究开发法人情报通信研究机构 Translation device, learning device, translation method, and recording medium
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN107092594A (en) * 2017-04-19 2017-08-25 厦门大学 Bilingual recurrence self-encoding encoder based on figure
CN107329960A (en) * 2017-06-29 2017-11-07 哈尔滨工业大学 Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
US20130144593A1 (en) * 2007-03-26 2013-06-06 Franz Josef Och Minimum error rate training with a large number of features for machine learning
CN102214166A (en) * 2010-04-06 2011-10-12 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN105190609A (en) * 2013-06-03 2015-12-23 国立研究开发法人情报通信研究机构 Translation device, learning device, translation method, and recording medium
CN104391842A (en) * 2014-12-18 2015-03-04 苏州大学 Translation model establishing method and system
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN107092594A (en) * 2017-04-19 2017-08-25 厦门大学 Bilingual recurrence self-encoding encoder based on figure
CN107329960A (en) * 2017-06-29 2017-11-07 哈尔滨工业大学 Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive

Non-Patent Citations (1)

Title
ZHANG Jinpeng et al.: "Chinese-Thai word distributed representation based on cross-lingual corpora", Computer Engineering and Science *

Cited By (20)

Publication number Priority date Publication date Assignee Title
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network
CN111160046A (en) * 2018-11-07 2020-05-15 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110046332B (en) * 2019-04-04 2024-01-23 远光软件股份有限公司 Similar text data set generation method and device
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning
CN110472252B (en) * 2019-08-15 2022-12-13 昆明理工大学 Method for translating Hanyue neural machine based on transfer learning
CN110543645A (en) * 2019-09-04 2019-12-06 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110543645B (en) * 2019-09-04 2023-04-07 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110717341A (en) * 2019-09-11 2020-01-21 昆明理工大学 A method and device for constructing an old-Chinese bilingual corpus with Thai as the pivot
CN110717341B (en) * 2019-09-11 2022-06-14 昆明理工大学 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110852117B (en) * 2019-11-08 2023-02-24 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN111368035A (en) * 2020-03-03 2020-07-03 新疆大学 Neural network-based Chinese dimension-dimension Chinese organization name dictionary mining system
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
CN112507734B (en) * 2020-11-19 2024-03-19 南京大学 A neural machine translation system based on Romanized Uyghur
CN113111667A (en) * 2021-04-13 2021-07-13 沈阳雅译网络技术有限公司 Method for generating pseudo data by low-resource language based on multi-language model
CN113111667B (en) * 2021-04-13 2023-08-22 沈阳雅译网络技术有限公司 A Method for Generating Pseudo-Data in Low-Resource Languages Based on Multilingual Model
CN117540755A (en) * 2023-11-13 2024-02-09 北京云上曲率科技有限公司 Method and system for enhancing data by neural machine translation model
CN118095302A (en) * 2024-04-26 2024-05-28 四川交通运输职业学校 A computer-assisted translation method and system

Similar Documents

Publication Publication Date Title
CN108363704A (en) A kind of neural network machine translation corpus expansion method based on statistics phrase table
CN111382580B (en) Encoder-decoder framework pre-training method for neural machine translation
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN111310470B (en) Chinese named entity recognition method fusing word and word features
CN104391842A (en) Translation model establishing method and system
CN104915337B (en) Translation chapter integrity assessment method based on bilingual structure of an article information
CN105320960A (en) Voting based classification method for cross-language subjective and objective sentiments
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN107463553A (en) For the text semantic extraction, expression and modeling method and system of elementary mathematics topic
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN117251524A (en) A short text classification method based on multi-strategy fusion
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN106610949A (en) Text feature extraction method based on semantic analysis
CN104657351A (en) Method and device for processing bilingual alignment corpora
CN118898260A (en) Chinese-Lao-Thai multilingual neural machine translation method and device based on language feature representation learning
CN115757760A (en) Text summarization extraction method and system, computing device, storage medium
CN113901205A (en) Cross-language emotion classification method based on emotion semantic confrontation
Al-Mannai et al. Unsupervised word segmentation improves dialectal Arabic to English machine translation
CN103678270B (en) Semantic primitive abstracting method and semantic primitive extracting device
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
CN101989261A (en) Method for extracting phrases of statistical machine translation
Mrinalini et al. Pause-based phrase extraction and effective OOV handling for low-resource machine translation systems
CN103092830A (en) Reordering rule acquisition method and device
CN106126501B (en) A noun word sense disambiguation method and device based on dependency constraints and knowledge
CN117540755A (en) Method and system for enhancing data by neural machine translation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20180803)