
CN108363704A - A kind of neural network machine translation corpus expansion method based on statistics phrase table - Google Patents


Info

Publication number
CN108363704A
CN108363704A (application CN201810175915.4A)
Authority
CN
China
Prior art keywords
phrase
definition
translation
training set
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810175915.4A
Other languages
Chinese (zh)
Inventor
黄河燕
史学文
鉴萍
唐翼琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810175915.4A
Publication of CN108363704A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A neural network machine translation corpus expansion method based on a statistical phrase table, belonging to the technical field of machine translation. The invention proposes a corpus expansion method for neural network machine translation that can effectively enlarge the corpus on the basis of the original machine translation training set. The method comprises a training set expansion stage and a model training stage. In stage one, a phrase table is learned from the original training set by statistical machine learning and, after filtering according to certain rules, is merged with the original training set to form a new, expanded training set. In stage two, the neural machine translation model is first pre-trained on the expanded training set and then trained on the original training set for fine-tuning, yielding the final model. Experimental results show that, compared with a machine translation model that does not use the corpus expansion method, the invention significantly improves the BLEU evaluation metric.

Description

A Neural Network Machine Translation Corpus Expansion Method Based on a Statistical Phrase Table

Technical Field

The invention relates to a neural network machine translation corpus expansion method based on a statistical phrase table, and belongs to the technical fields of computer applications and machine translation.

Background

Machine translation is the technology of using computers to automatically translate one language (the source language) into another language (the target language).

With the development of artificial neural networks and deep learning, neural network machine translation based on deep learning (hereinafter, neural machine translation) has achieved important results in recent years. Neural machine translation requires little linguistic knowledge and manual intervention, its models occupy little storage space, and its output is fluent and natural. For translation tasks with abundant bilingual resources, neural machine translation is generally considered the best choice. It has received wide attention and recognition in the machine translation field and has been put into commercial use.

Neural networks for translation are trained mainly on bilingual parallel sentence pairs. The neural network models used in neural machine translation typically have a large number of free parameters, and in theory such models require large-scale bilingual parallel corpora for training. Experience shows that a neural machine translation model with tens of millions of free parameters usually needs at least a million sentence pairs of training data to achieve good results. For languages where bilingual parallel resources are scarce, it is therefore difficult to obtain satisfactory results with neural networks.

In addition, neural machine translation is usually trained on one or more complete sentence pairs at a time. When corpus resources are scarce, the model's ability to learn the low-frequency phrases contained in those sentence pairs is limited, especially when such phrases must be translated in isolation.

Summary of the Invention

Aiming at the problem of training neural machine translation models for resource-scarce languages, the invention proposes a corpus expansion method for neural network machine translation based on a statistical phrase table, which can effectively expand the training data of a neural machine translation model and mitigate the adverse effect of scarce language resources on model training.

The invention comprises a training set expansion stage and a model training stage.

A) The training set expansion stage operates as follows: a phrase table with probability scores is learned from the original training set by statistical machine learning; the learned phrase table is filtered according to rules; the filtered phrase table is extracted into a new data set of bilingual parallel phrase pairs; and the newly extracted data set is concatenated with the original training set to obtain new bilingual parallel pseudo-data, thereby expanding the training set.

B) The model training stage operates in two steps. Step one is pre-training: the bilingual parallel pseudo-data obtained in stage A) is used to pre-train the model, yielding the pre-trained model b1. Step two retrains model b1 on the original training set to obtain the final model b2; its purpose is to fine-tune the model and reduce the influence of the noise introduced by the pseudo-data.
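The two-stage schedule above (pre-train on pseudo-data, then fine-tune on the original data) can be sketched as follows. `DummyModel`, `fit`, and the epoch counts are illustrative placeholders, not the patent's actual implementation.

```python
class DummyModel:
    """Stand-in for a neural MT model; records which data set it saw."""
    def __init__(self):
        self.history = []

    def fit(self, dataset):
        self.history.append(dataset)


def two_stage_training(model, expanded_set, original_set,
                       pretrain_epochs=2, finetune_epochs=1):
    """Stage B of the method: pre-train on the expanded pseudo-data
    training set (step B1), then continue training the same model on
    the original training set (step B2) to obtain the final model."""
    for _ in range(pretrain_epochs):   # step B1: pre-training -> model b1
        model.fit(expanded_set)
    for _ in range(finetune_epochs):   # step B2: fine-tuning -> model b2
        model.fit(original_set)
    return model
```

The key design point the patent relies on is that fine-tuning on the clean original data comes last, so the noise in the phrase-pair pseudo-data does not dominate the final model.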

To achieve the above objectives, the invention adopts the following technical scheme.

First, the relevant definitions are given as follows:

Definition 1: Source language: in machine translation, the language of the content to be translated; for example, in Chinese-to-English machine translation, Chinese is the source language.

Definition 2: Source language data: data in the source language. If the source language data is a natural-language sentence, it is called a source language sentence; for example, in Chinese-to-English machine translation, the input Chinese sentence is source language data, also called a source language sentence.

A collection of source language data is called a source language data set.

Definition 3: Target language: in machine translation, the language into which the content is translated; for example, in Chinese-to-English machine translation, English is the target language.

Definition 4: Target language data: data in the target language. If the target language data is a natural-language sentence, it is called a target language sentence; for example, in Chinese-to-English machine translation, the output English sentence is target language data, also called a target language sentence.

A collection of target language data is called a target language data set.

Definition 5: Training set: specifically, the training set of a statistical machine translation model, i.e. the data set used to train the model, denoted T.

Definition 6: Original training set: the training set before expansion.

Definition 7: Word alignment information (word alignment for short): in the training set T, the alignment relation between source language words and target language words, denoted α.

If, in the training set T, the j-th word of the source language is aligned with the i-th word of the target language, the link is recorded as (j, i).

Definition 8: Phrase: a linguistic unit consisting of one or more words.

A phrase in the source language is called a source language phrase, denoted f; a phrase in the target language is called a target language phrase, denoted e.

Definition 9: Translation phrase pair: a pair consisting of a source language phrase and an aligned target language phrase, e.g. ("长城", "The Great Wall").

Definition 10: Forward phrase translation probability: the conditional probability of translating to the target language phrase e given the source language phrase f, denoted φ(e|f).

Definition 11: Reverse phrase translation probability: the conditional probability of translating back to the source language phrase f given the target language phrase e, denoted φ(f|e).

Definition 12: Bidirectional phrase translation probability: the forward phrase translation probability and the reverse phrase translation probability, taken together.
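The probabilities of Definitions 10-12 are, in standard phrase-based SMT practice, estimated by relative frequency over the extracted phrase pairs: φ(e|f) = count(f, e) / count(f) and φ(f|e) = count(f, e) / count(e). A minimal sketch, with an illustrative data layout that is not taken from the patent:

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimates of the bidirectional phrase translation
    probabilities (Definitions 10-12).  `phrase_pairs` is a list of
    (source phrase f, target phrase e) tuples, one per extracted occurrence.
    Returns two dicts keyed by (f, e): forward phi(e|f) and reverse phi(f|e)."""
    pair_count = Counter(phrase_pairs)
    f_count = Counter(f for f, e in phrase_pairs)   # count(f)
    e_count = Counter(e for f, e in phrase_pairs)   # count(e)
    forward = {(f, e): c / f_count[f] for (f, e), c in pair_count.items()}
    reverse = {(f, e): c / e_count[e] for (f, e), c in pair_count.items()}
    return forward, reverse
```

For instance, if "长城" is extracted three times, twice aligned to "The Great Wall" and once to "Great Wall", then φ("The Great Wall" | "长城") = 2/3.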

Definition 13: Forward lexicalized phrase translation probability: given the source language phrase f, the lexicalized translation probability of translating to the target language phrase e, denoted lex(e|f).

Definition 14: Reverse lexicalized phrase translation probability: given the target language phrase e, the lexicalized translation probability of translating back to the source language phrase f, denoted lex(f|e).

Definition 15: Bidirectional lexicalized phrase translation probability: the forward lexicalized translation probability and the reverse lexicalized translation probability, taken together.
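The patent does not spell out how lex(e|f) is computed; the sketch below follows the standard lexical-weighting formula of phrase-based SMT (as implemented in tools like Moses, which the embodiment uses): each target word's probability is the average of the word translation probabilities w(e_i|f_j) over its aligned source words, with unaligned words scored against NULL. The word-probability table `w` and its key layout are assumptions for illustration.

```python
def lexical_weight(src_words, tgt_words, alignment, w):
    """Lexicalized translation probability lex(e|f) in the standard
    phrase-based SMT sense (Definition 13).  `alignment` is a set of
    (j, i) links per Definition 7; `w` maps (source word, target word)
    pairs, and (None, target word) for NULL alignment, to probabilities."""
    prob = 1.0
    for i, e_word in enumerate(tgt_words):
        links = [j for (j, i2) in alignment if i2 == i]
        if links:
            # average w(e_i | f_j) over the source words aligned to e_i
            prob *= sum(w.get((src_words[j], e_word), 0.0)
                        for j in links) / len(links)
        else:
            prob *= w.get((None, e_word), 0.0)  # NULL-aligned target word
    return prob
```

lex(f|e) (Definition 14) is obtained the same way with the roles of source and target swapped.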

Definition 16: Phrase table (also called phrase translation table): a table consisting of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probabilities and bidirectional lexicalized translation probabilities.

Definition 17: Filter rules: manually formulated rules for filtering the phrase table according to the source language phrases, target language phrases, bidirectional phrase translation probabilities, and bidirectional lexicalized phrase translation probabilities it contains.

The training set expansion stage comprises the following steps:

Step A1: according to Definitions 1-5, preprocess the original training set to obtain the preprocessed original training set Tf.

The specific preprocessing procedure differs across source and target languages; its purpose is to normalize the training set, yielding the preprocessed original training set Tf.

Step A2: based on the preprocessed original training set Tf from Step A1, learn the word alignment information according to Definitions 7 and 8. This process is usually implemented with an open-source word alignment toolkit: the preprocessed original training set from Step A1 is given as input, and training the alignment tool yields the word alignment information α of the training set.

Step A3: according to Definitions 6-16, and combining the preprocessed original training set Tf from Step A1 with the word alignment information α from Step A2, extract translation phrase pairs and estimate their probabilities, obtaining the bidirectional phrase translation probability and bidirectional lexicalized translation probability of each pair. Combining the translation phrase pairs with these probabilities yields the phrase table; each record of the phrase table consists of a translation phrase pair, word alignment information, bidirectional phrase translation probabilities, and bidirectional lexicalized translation probabilities.
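In the embodiment, the phrase-pair extraction of Step A3 is delegated to Moses; as an illustration of what that extraction does, here is a simplified version of the standard alignment-consistent phrase extraction algorithm. A source span and target span form a translation phrase pair (Definition 9) if and only if they contain at least one alignment link and no link crosses the boundary of the rectangle they define. This is a sketch, not the patent's implementation, and it omits Moses refinements such as limits on unaligned-word extension.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Enumerate all phrase pairs up to `max_len` words per side that are
    consistent with the word alignment (a set of (j, i) links, Definition 7)."""
    pairs = set()
    for j1 in range(len(src)):
        for j2 in range(j1, min(j1 + max_len, len(src))):
            for i1 in range(len(tgt)):
                for i2 in range(i1, min(i1 + max_len, len(tgt))):
                    # all links that touch the candidate rows or columns
                    links = [(j, i) for (j, i) in alignment
                             if j1 <= j <= j2 or i1 <= i <= i2]
                    # consistent iff every touching link lies fully inside
                    # the rectangle, and there is at least one such link
                    inside = [(j, i) for (j, i) in links
                              if j1 <= j <= j2 and i1 <= i <= i2]
                    if links and links == inside:
                        pairs.add((" ".join(src[j1:j2 + 1]),
                                   " ".join(tgt[i1:i2 + 1])))
    return pairs
```

The extracted pairs are then scored (Definitions 10-15) to build the phrase table of Definition 16.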

Step A4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step A3 using the manually defined filter rules, removing translation phrase pairs with low probabilities, to obtain the filtered phrase table, denoted Pnew.

Step A5: according to Definitions 5 and 16, concatenate the translation-phrase-pair part of the filtered phrase table Pnew from Step A4 with the preprocessed original training set Tf from Step A1 to obtain the new training set Tnew.

Steps A1 through A5 complete the training set expansion stage of the method.

The model training stage comprises the following steps:

Step B1: pre-train the model on the new training set Tnew from Step A5 to obtain model b1.

Step B2: train model b1 from Step B1 again on the preprocessed original training set Tf from Step A1 to obtain the newly trained model b2.

Steps B1 and B2 complete the model training stage of the method.

Thus, Steps A1 through A5 and Steps B1 through B2 complete a neural network machine translation corpus expansion method based on a statistical phrase table.

Beneficial Effects

Compared with existing ways of using machine translation training sets, the neural network machine translation corpus expansion method based on a statistical phrase table of the present invention has the following beneficial effects:

1. The method can effectively expand the original training set without requiring additional bilingual or monolingual data, mitigating the adverse effect that the small training sets of resource-scarce languages have on the training of neural machine translation models.

2. With identical training, development, and test set data, the invention yields a clear improvement in the BLEU evaluation metric over a neural machine translation model trained without it.

Brief Description of the Drawings

Fig. 1 is a flow chart of the neural network machine translation corpus expansion method based on a statistical phrase table of the present invention and of the embodiments.

Detailed Description

The method of the present invention is described in detail below with reference to the accompanying drawing and the embodiments, following the two main stages of the invention: 1) the training set expansion stage and 2) the model training stage.

Embodiment 1

This embodiment describes the flow of the method of the present invention with a concrete example.

Fig. 1 is a flow chart of the method of the present invention as applied in this embodiment.

Fig. 1 shows the operational flow of the two stages of the invention: 1) the training set expansion stage and 2) the model training stage.

Take Uyghur-to-Chinese translation as an example, where Uyghur is the source language and Chinese is the target language.

1) Training set expansion stage:

Step 1: according to Definitions 1-5, preprocess the original training set. The specific procedure differs across source and target languages and aims to normalize the training set. Here, the data in both the source language (Uyghur) and the target language (Chinese) are first segmented into word pieces (word-piece segmentation) and then tokenized, yielding the preprocessed original training set Tf.

Step 2: according to Definitions 6 and 7, learn the word alignment. In this embodiment, this is done with the open-source word alignment toolkit GIZA++: the preprocessed original training set from Step 1 is given as input, and training GIZA++ on it yields the word alignment information α of the training set.

Step 3: according to Definitions 6-16, and combining the preprocessed original training set Tf from Step 1 with the word alignment information α from Step 2, extract translation phrase pairs and estimate their probabilities. In this embodiment, this is implemented with the train-model.perl script of the open-source Moses toolkit, yielding the phrase table P; each record of the phrase table consists of a translation phrase pair, word alignment information, bidirectional phrase translation probabilities, and bidirectional lexicalized translation probabilities.

Step 4: according to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step 3 using manually defined filter rules. The manually defined rule is as follows:

Keep a translation phrase pair if and only if its phrase translation probabilities φ(e|f) and φ(f|e) each reach a preset threshold, and lex(e|f) ≥ 0.025, and lex(f|e) ≥ 0.025.

Translation phrase pairs with low probabilities are thus filtered out, yielding the filtered phrase table Pnew.
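The filter rule of Step 4 can be sketched as a simple predicate over phrase-table records. The lex thresholds (0.025) come from the embodiment; the φ threshold values are not legible in this copy of the text, so `phi_threshold` below is an assumed placeholder parameter, and the record key names are likewise illustrative.

```python
def keep_pair(record, phi_threshold=0.05, lex_threshold=0.025):
    """Filter rule of Step 4 (Definition 17): keep a translation phrase
    pair only if all four probabilities clear their thresholds.
    `phi_threshold` is an assumed placeholder; 0.025 is from the text."""
    return (record["phi_e_given_f"] >= phi_threshold
            and record["phi_f_given_e"] >= phi_threshold
            and record["lex_e_given_f"] >= lex_threshold
            and record["lex_f_given_e"] >= lex_threshold)


def filter_phrase_table(table, **kw):
    """Apply the rule to every record, producing the filtered table P_new."""
    return [rec for rec in table if keep_pair(rec, **kw)]
```

Thresholding on both directions discards pairs whose alignment evidence is weak in either direction, which is what keeps the pseudo-data reasonably clean before it is spliced into the training set.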

Step 5: according to Definitions 5 and 16, concatenate the translation-phrase-pair part of the filtered phrase table Pnew from Step 4 with the preprocessed original training set Tf from Step 1 to obtain the new training set Tnew.

2) Model training stage:

Step 6: pre-train the model. This embodiment uses the open-source neural machine translation framework tensor2tensor; the model is pre-trained on the new training set Tnew from Step 5, yielding model b1.

Step 7: train model b1 from Step 6 again on the preprocessed original training set Tf from Step 1 to obtain the newly trained model b2.

Thus, Steps 1 through 7 complete the neural network machine translation corpus expansion method based on a statistical phrase table.

Embodiment 2

The training set of the Uyghur-Chinese news translation task provided by CWMT2017 was randomly split into a training set, a development set, and test set 1; in addition, the development set of the CWMT2017 Uyghur-Chinese news translation evaluation task was used as test set 2. With the original training set, development set, test set data, and neural machine translation model held constant, and Chinese-character-based BLEU as the evaluation metric, comparing the invention with training the neural machine translation model without it gives the following experimental results.

Table 1: Comparison of BLEU scores before and after applying the training set expansion method proposed by the invention

The results in Table 1 show that, with identical training, development, and test set data, the method of the invention clearly improves the BLEU evaluation metric compared with training the neural machine translation model without it.

The above is only a preferred embodiment of the present invention, and the invention is not limited to what is disclosed in this embodiment and the drawing. Any equivalent or modification made without departing from the spirit disclosed by the present invention falls within the scope of protection of the invention.

Claims (4)

1.一种基于统计短语表的神经网络机器翻译语料扩展方法,其特征在于:包含:训练集扩展阶段和模型训练阶段;1. A neural network machine translation corpus expansion method based on a statistical phrase table, characterized in that: comprising: a training set expansion stage and a model training stage; 其中,A)训练集扩展阶段的操作如下:通过统计机器学习方法从原始训练集中学习得到带有概率得分的短语表,并根据规则对学习得到的短语表进行过滤,将过滤后的短语表抽取成新的双语平行短语对数据集,将新抽取出的数据集与原始训练集拼接得到新的双语平行伪数据,实现训练集的扩展;Among them, A) the operation of the training set expansion stage is as follows: the phrase table with probability score is learned from the original training set by statistical machine learning method, and the learned phrase table is filtered according to the rules, and the filtered phrase table is extracted Form a new bilingual parallel phrase pair data set, splice the newly extracted data set with the original training set to obtain new bilingual parallel pseudo-data, and realize the expansion of the training set; B)模型训练阶段的操作分为两个步骤,步骤一是预训练,即将阶段A)得到的双语平行伪数据对模型进行预训练,训练后得到预训练好的模型b1;步骤二利用原始训练集重新对模型b2进行训练,目的为对模型进行调优,缓解伪数据中引入的噪声对模型的影响。B) The operation of the model training stage is divided into two steps. Step one is pre-training, that is, the bilingual parallel dummy data obtained in stage A) is used to pre-train the model, and the pre-trained model b 1 is obtained after training; step two uses the original The training set retrains the model b 2 for the purpose of tuning the model and alleviating the influence of the noise introduced in the fake data on the model. 2.根据权利要求1所述的一种基于统计短语表的神经网络机器翻译语料扩展方法,其特征在于:为实现上述目的和技术,采用如下技术方案:2. 
a kind of neural network machine translation corpus expansion method based on statistical phrase table according to claim 1, is characterized in that: for realizing above-mentioned purpose and technology, adopt following technical scheme: 首先进行相关定义,具体如下:First, the relevant definitions are made, as follows: 定义1:源语言,即机器翻译中,进行翻译时将要被翻译的内容所属的语言,例如从中文翻译到英文的机器翻译中,中文为源语言;Definition 1: Source language, that is, in machine translation, the language of the content to be translated during translation, for example, in machine translation from Chinese to English, Chinese is the source language; 定义2:源语言数据,即属于源语言的数据,若源语言数据是一个自然语言句子,则该属于源语言的数据称为源语言句子,例如从中文翻译到英文的机器翻译中,输入的中文句子就是源语言数据,亦可称为源语言句子;Definition 2: Source language data, that is, data belonging to the source language. If the source language data is a natural language sentence, the data belonging to the source language is called a source language sentence. For example, in machine translation from Chinese to English, the input Chinese sentences are the source language data, which can also be called source language sentences; 由源语言数据组成的集合称为源语言数据集;A collection of source language data is called a source language dataset; 定义3:目标语言,即机器翻译中,进行翻译时被翻译成的内容所属的语言,例如从中文翻译到英文的机器翻译中,英文为目标语言;Definition 3: Target language, that is, in machine translation, the language to which the translated content belongs, for example, in machine translation from Chinese to English, English is the target language; 定义4:目标语言数据,即属于目标语言的数据,若目标语言数据是一个自然语言句子,则该属于目标语言的数据称为目标语言句子,例如从中文翻译到英文的机器翻译中,输出的英文句子就是目标语言数据,亦可称为目标语言句子;Definition 4: Target language data, that is, data belonging to the target language. If the target language data is a natural language sentence, the data belonging to the target language is called the target language sentence. 
For example, in machine translation from Chinese to English, the output English sentences are the target language data, and can also be called target language sentences; 由目标语言数据组成的集合称为目标语言数据集;A collection of target language data is called a target language dataset; 定义5:训练集,特指统计机器翻译模型的训练集,即用于训练统计机器翻译模型的数据集合,记为T;Definition 5: Training set, specifically refers to the training set of the statistical machine translation model, that is, the data set used to train the statistical machine translation model, denoted as T; 定义6:原始训练集,即经过扩展前的训练集;Definition 6: The original training set, that is, the training set before expansion; 定义7:词对齐信息,简称词对齐,即训练集T中,源语言单词和目标语言单词之间的对齐关系,记为α;Definition 7: Word alignment information, referred to as word alignment, is the alignment relationship between source language words and target language words in the training set T, denoted as α; 其中,若训练集T中,源语言第j个单词与目标语言第i个单词存在对齐关系记为(j,i);Among them, if in the training set T, there is an alignment relationship between the jth word in the source language and the ith word in the target language, it is recorded as (j,i); 定义8,短语,一个或多个单词组成的语言单位;Definition 8, a phrase, a linguistic unit consisting of one or more words; 使用的语言为源语言的短语称为源语言短语,记为f,使用的语言为目标语言的短语称为目标语言短语,记为e;Phrases in the source language are called source language phrases, denoted as f, and phrases in the target language are called target language phrases, denoted as e; 定义9,翻译短语对,源语言短语和对齐的目标语言短语组成的短语对,例如“(‘长城’,‘The Great Wall’)”;Definition 9, translation phrase pair, a phrase pair consisting of a source language phrase and an aligned target language phrase, such as "('The Great Wall', 'The Great Wall')"; 定义10,正向短语翻译概率,即给定源语言短语f时,翻译到目标语言短语e的条件概率,记为 Definition 10, the forward phrase translation probability, that is, given the source language phrase f, the conditional probability of translation to the target language phrase e, denoted as 定义11,反向短语翻译概率,即给定目标语言短语e时,翻译回源语言短语f的条件概率,记为 Definition 11, the reverse phrase translation probability, that is, given 
the target language phrase e, the conditional probability of translating back to the source language phrase f, denoted as φ(f|e). Definition 12: Bidirectional phrase translation probability, the forward and reverse phrase translation probabilities taken together. Definition 13: Forward lexicalized phrase translation probability, the lexicalized probability of translating to the target language phrase e given the source language phrase f, denoted as lex(e|f). Definition 14: Reverse lexicalized phrase translation probability, the lexicalized probability of translating back to the source language phrase f given the target language phrase e, denoted as lex(f|e). Definition 15: Bidirectional lexicalized phrase translation probability, the forward and reverse lexicalized translation probabilities taken together. Definition 16:
The phrase table, also called the phrase translation table, is composed of multiple translation phrase pairs, each annotated with its bidirectional phrase translation probability and its bidirectional lexicalized translation probability. Definition 17: Filtering rules, that is, manually formulated rules for filtering the phrase table according to the source language phrases, target language phrases, bidirectional phrase translation probabilities, and bidirectional lexicalized phrase translation probabilities it contains. The training set expansion phase includes the following steps: Step A1: According to Definitions 1, 2, 3, 4, and 5, preprocess the original training set to obtain the preprocessed original training set Tf. Step A2: Based on the preprocessed original training set Tf obtained in Step A1, learn word alignment information according to Definitions 7 and 8; this is typically done with an open-source word alignment toolkit, which takes Tf as input and, after training, produces the word alignment information α of the training set. Step A3: According to Definition 6, Definition 7, Definition 8, Definition 9, Definition 10, Definition 11, Definition 12, Definition 13,
Definition 14, Definition 15, and Definition 16, and combining the preprocessed original training set Tf from Step A1 with the word alignment information α from Step A2, extract translation phrase pairs and estimate their probabilities, obtaining the bidirectional phrase translation probability and the bidirectional lexicalized translation probability of each pair; combining the translation phrase pairs with these probabilities yields the phrase table, in which each record consists of a translation phrase pair, its word alignment information, its bidirectional phrase translation probability, and its bidirectional lexicalized translation probability. Step A4: According to Definitions 9, 12, 15, 16, and 17, filter the phrase table obtained in Step A3 using the manually defined filtering rules, discarding translation phrase pairs with low probability, to obtain the filtered phrase table, denoted as Pnew. Step A5: According to Definitions 5 and 16, concatenate the translation phrase pairs of the filtered phrase table Pnew obtained in Step A4 with the preprocessed original training set Tf obtained in Step A1 to obtain the new training set Tnew. 3.
A statistical-phrase-table-based neural network machine translation corpus expansion method according to claim 1, characterized in that the model training phase includes the following steps: Step B1: Pre-train the model on the new training set Tnew obtained in Step A5 to obtain model b1. Step B2: Train the model b1 obtained in Step B1 again on the preprocessed original training set Tf obtained in Step A1 to obtain the newly trained model b2. 4. A statistical-phrase-table-based neural network machine translation corpus expansion method according to claim 1, characterized in that, in Step A1, the specific preprocessing applied to the original training set varies with the source and target languages; its purpose is to normalize the training set, yielding the preprocessed original training set Tf.
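The estimation, filtering, and concatenation described in steps A3 to A5 can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the relative-frequency estimator, and the 0.5 probability threshold are all hypothetical choices for demonstration (the patent leaves the filtering rules to manual definition, and this sketch omits the lexicalized probabilities and the actual phrase-pair extraction from word alignments).

```python
from collections import Counter

def estimate_phrase_table(phrase_pairs):
    """Estimate bidirectional phrase translation probabilities
    (Definitions 10-12) by relative frequency over extracted pairs."""
    pair_count = Counter(phrase_pairs)                 # count(f, e)
    src_count = Counter(f for f, e in phrase_pairs)    # count(f)
    tgt_count = Counter(e for f, e in phrase_pairs)    # count(e)
    table = []
    for (f, e), c in pair_count.items():
        phi_e_f = c / src_count[f]   # forward: phi(e|f)
        phi_f_e = c / tgt_count[e]   # reverse: phi(f|e)
        table.append((f, e, phi_e_f, phi_f_e))
    return table

def filter_table(table, min_prob=0.5):
    """Step A4: keep only pairs whose bidirectional probabilities both
    exceed a threshold (one illustrative, manually chosen filtering rule)."""
    return [(f, e) for f, e, p_ef, p_fe in table
            if p_ef >= min_prob and p_fe >= min_prob]

def expand_training_set(training_set, phrase_pairs, min_prob=0.5):
    """Step A5: append the filtered phrase pairs (Pnew) to the
    preprocessed original training set Tf to obtain Tnew."""
    table = estimate_phrase_table(phrase_pairs)
    return training_set + filter_table(table, min_prob)

# Toy demonstration with already-extracted phrase pairs:
pairs = [("长城", "The Great Wall"), ("长城", "The Great Wall"),
         ("长城", "Great Wall"), ("北京", "Beijing")]
# ('长城', 'Great Wall') is filtered out: phi(e|f) = 1/3 < 0.5.
t_new = expand_training_set([("我爱长城", "I love the Great Wall")], pairs)
```

In the full method, the Tnew produced this way is used to pre-train the neural translation model (step B1), after which the model is trained again on the original Tf alone (step B2).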
CN201810175915.4A 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table Pending CN108363704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810175915.4A CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810175915.4A CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Publications (1)

Publication Number Publication Date
CN108363704A true CN108363704A (en) 2018-08-03

Family

ID=63003675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810175915.4A Pending CN108363704A (en) 2018-03-02 2018-03-02 A kind of neural network machine translation corpus expansion method based on statistics phrase table

Country Status (1)

Country Link
CN (1) CN108363704A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning
CN110543645A (en) * 2019-09-04 2019-12-06 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110717341A (en) * 2019-09-11 2020-01-21 昆明理工大学 A method and device for constructing an old-Chinese bilingual corpus with Thai as the pivot
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN111160046A (en) * 2018-11-07 2020-05-15 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111368035A (en) * 2020-03-03 2020-07-03 新疆大学 Neural network-based Chinese dimension-dimension Chinese organization name dictionary mining system
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
CN113111667A (en) * 2021-04-13 2021-07-13 沈阳雅译网络技术有限公司 Method for generating pseudo data by low-resource language based on multi-language model
CN117540755A (en) * 2023-11-13 2024-02-09 北京云上曲率科技有限公司 Method and system for enhancing data by neural machine translation model
CN118095302A (en) * 2024-04-26 2024-05-28 四川交通运输职业学校 A computer-assisted translation method and system

Citations (9)

Publication number Priority date Publication date Assignee Title
CN102214166A (en) * 2010-04-06 2011-10-12 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
US20130144593A1 (en) * 2007-03-26 2013-06-06 Franz Josef Och Minimum error rate training with a large number of features for machine learning
CN104391842A (en) * 2014-12-18 2015-03-04 苏州大学 Translation model establishing method and system
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN105190609A (en) * 2013-06-03 2015-12-23 国立研究开发法人情报通信研究机构 Translation device, learning device, translation method, and recording medium
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN107092594A (en) * 2017-04-19 2017-08-25 厦门大学 Bilingual recurrence self-encoding encoder based on figure
CN107329960A (en) * 2017-06-29 2017-11-07 哈尔滨工业大学 Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
US20130144593A1 (en) * 2007-03-26 2013-06-06 Franz Josef Och Minimum error rate training with a large number of features for machine learning
CN102214166A (en) * 2010-04-06 2011-10-12 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN105190609A (en) * 2013-06-03 2015-12-23 国立研究开发法人情报通信研究机构 Translation device, learning device, translation method, and recording medium
CN104391842A (en) * 2014-12-18 2015-03-04 苏州大学 Translation model establishing method and system
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN107092594A (en) * 2017-04-19 2017-08-25 厦门大学 Bilingual recurrence self-encoding encoder based on figure
CN107329960A (en) * 2017-06-29 2017-11-07 哈尔滨工业大学 Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive

Non-Patent Citations (1)

Title
ZHANG Jinpeng et al.: "Chinese-Thai word distributed representation based on cross-lingual corpora", Computer Engineering and Science *

Cited By (20)

Publication number Priority date Publication date Assignee Title
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network
CN111160046A (en) * 2018-11-07 2020-05-15 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
CN110046332A (en) * 2019-04-04 2019-07-23 珠海远光移动互联科技有限公司 A kind of Similar Text data set generation method and device
CN110046332B (en) * 2019-04-04 2024-01-23 远光软件股份有限公司 Similar text data set generation method and device
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning
CN110472252B (en) * 2019-08-15 2022-12-13 昆明理工大学 Method for translating Hanyue neural machine based on transfer learning
CN110543645A (en) * 2019-09-04 2019-12-06 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110543645B (en) * 2019-09-04 2023-04-07 网易有道信息技术(北京)有限公司 Machine learning model training method, medium, device and computing equipment
CN110717341A (en) * 2019-09-11 2020-01-21 昆明理工大学 A method and device for constructing an old-Chinese bilingual corpus with Thai as the pivot
CN110717341B (en) * 2019-09-11 2022-06-14 昆明理工大学 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110852117B (en) * 2019-11-08 2023-02-24 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN111368035A (en) * 2020-03-03 2020-07-03 新疆大学 Neural network-based Chinese dimension-dimension Chinese organization name dictionary mining system
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system
CN112507734B (en) * 2020-11-19 2024-03-19 南京大学 A neural machine translation system based on Romanized Uyghur
CN113111667A (en) * 2021-04-13 2021-07-13 沈阳雅译网络技术有限公司 Method for generating pseudo data by low-resource language based on multi-language model
CN113111667B (en) * 2021-04-13 2023-08-22 沈阳雅译网络技术有限公司 A Method for Generating Pseudo-Data in Low-Resource Languages Based on Multilingual Model
CN117540755A (en) * 2023-11-13 2024-02-09 北京云上曲率科技有限公司 Method and system for enhancing data by neural machine translation model
CN118095302A (en) * 2024-04-26 2024-05-28 四川交通运输职业学校 A computer-assisted translation method and system

Similar Documents

Publication Publication Date Title
CN108363704A (en) A kind of neural network machine translation corpus expansion method based on statistics phrase table
CN111382580B (en) Encoder-decoder framework pre-training method for neural machine translation
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN111310470B (en) Chinese named entity recognition method fusing word and word features
CN104391842A (en) Translation model establishing method and system
CN104915337B (en) Translation chapter integrity assessment method based on bilingual structure of an article information
CN105320960A (en) Voting based classification method for cross-language subjective and objective sentiments
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN107463553A (en) For the text semantic extraction, expression and modeling method and system of elementary mathematics topic
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN117251524A (en) A short text classification method based on multi-strategy fusion
CN106598941A (en) Algorithm for globally optimizing quality of text keywords
CN106610949A (en) Text feature extraction method based on semantic analysis
CN104657351A (en) Method and device for processing bilingual alignment corpora
CN118898260A (en) Chinese-Lao-Thai multilingual neural machine translation method and device based on language feature representation learning
CN115757760A (en) Text summarization extraction method and system, computing device, storage medium
CN113901205A (en) Cross-language emotion classification method based on emotion semantic confrontation
Al-Mannai et al. Unsupervised word segmentation improves dialectal Arabic to English machine translation
CN103678270B (en) Semantic primitive abstracting method and semantic primitive extracting device
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
CN101989261A (en) Method for extracting phrases of statistical machine translation
Mrinalini et al. Pause-based phrase extraction and effective OOV handling for low-resource machine translation systems
CN103092830A (en) Reordering rule acquisition method and device
CN106126501B (en) A noun word sense disambiguation method and device based on dependency constraints and knowledge
CN117540755A (en) Method and system for enhancing data by neural machine translation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20180803)