
CN103116578A - Translation method integrating syntactic tree and statistical machine translation technology and translation device - Google Patents

Translation method integrating syntactic tree and statistical machine translation technology and translation device

Info

Publication number
CN103116578A
Authority
CN
China
Prior art keywords
translation
phrase
target language
module
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100497397A
Other languages
Chinese (zh)
Inventor
罗文�
黄子河
刘法旺
胡小鹏
宋金平
袁琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd filed Critical BEIJING SAIDI TRANSLATION TECHNOLOGY Co Ltd
Priority to CN2013100497397A priority Critical patent/CN103116578A/en
Publication of CN103116578A publication Critical patent/CN103116578A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a translation method and a translation device integrating a syntax tree with statistical machine translation technology. The method comprises the following steps. First, a dictionary base, a grammar rule base, a phrase translation probability table and a target language model are established between different languages. Then, segmentation, part-of-speech disambiguation and grammatical analysis are performed on the original input sentence to generate a syntax tree. Next, the syntax tree is traversed with a top-down strategy: for each individual node, and for certain contiguous nodes that cross syntactic boundaries, the source text of the leaf nodes is matched against the phrase translation probability table trained by statistical machine translation. By using the translations in the phrase translation table together with the target language model, the fluency and accuracy of the output translation are improved. The method and device thus exploit both the fine-grained knowledge provided by the phrase translation table and the strengths of the syntax tree in handling deep structure and long-distance dependencies in a sentence, so that the quality of machine-translated text can be improved remarkably.

Description

Translation method and device fusing syntax tree and statistical machine translation technology
Technical Field
The invention relates to the field of statistical and rule-based machine translation, and in particular to a machine translation method and device that fuse a syntax tree with statistical machine translation technologies such as a phrase translation probability table and a language model.
Background
With the spread of the internet, natural language processing by computer has become an important means of acquiring knowledge from the internet. For example, in international communication, scientific research and education, people need to translate foreign-language text, a task that in the past was the preserve of language experts. With the rapid development of hardware, the continuous improvement of software and the deepening of linguistic research, machine translation is being applied more and more widely. Machine translation has major advantages, such as high translation speed, large memory capacity and reduced translation cost, but its translation quality still falls far short of what users need. How to develop a high-quality machine translation method has therefore become an important problem.
An international evaluation in 2011 showed that data-driven and knowledge-driven machine translation produce translations of comparable quality, and that a single method alone can hardly meet users' requirements. Analysis of the translation errors of statistical and rule-based machine translation shows that the types of errors produced by the different systems are complementary. The weaknesses of a rule-based system lie in vocabulary selection during transfer and in poor performance when analyzing ill-formed sentences; its strength is that no part of the source text is omitted during analysis, so accurate translation can be achieved. In contrast, statistical machine translation systems are highly adaptable, and their use of phrase collocations makes the translation more fluent and the word choice better. However, the biggest problem of statistical machine translation systems is that they handle poorly the cases where generating the translation requires linguistic knowledge: for example, they lack morphological and syntactic functions and word-order adjustment, and word-order adjustment at the level of the whole sentence is even harder to achieve. In addition, statistical machine translation systems cannot guarantee faithful translation, and omissions and mistranslations sometimes occur.
Disclosure of Invention
Because machine translation by a single method cannot achieve good results, and data-driven and knowledge-driven machine translation have essentially complementary strengths, combining the different methods is a reasonable way to improve machine translation quality. The machine translation method provided by the invention exploits both the fine-grained knowledge provided by the statistical translation engine and the strengths of the syntax tree in handling deep structure and long-distance dependencies in a sentence, so that the quality of machine translation can be markedly improved and the development of hybrid-engine machine translation technology strongly promoted.
The invention provides a machine translation method fusing a syntax tree and a statistical machine translation technology, which comprises the following steps:
1) establishing a dictionary library, a grammar rule library, a phrase translation probability table and a target language model between different languages, wherein the dictionary library stores words and phrases corresponding to the different languages, the grammar rule library stores grammar rules corresponding to the different languages, the phrase translation probability table stores translation fragments of the different languages obtained by training of a statistical machine translation system, and the target language model stores a language model of the target language obtained by training of the statistical machine translation system;
2) reading the dictionary library information, segmenting an input single sentence to be translated, and decomposing the sentence into words or phrases of the source language;
3) reading the grammar rule library information, and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree;
4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for single nodes and for certain contiguous cross-syntax nodes in the syntax tree, taking the source text of their leaf nodes to search the phrase translation probability table, and selecting a translation in the phrase translation table as the translation of the node; and generating a translation by a rule-based translation method for the syntax tree nodes left untranslated by this process;
5) smoothing the generated translation with the target language model to generate the target language.
Preferably, the translation fragments of different languages stored in the phrase translation probability table are obtained by GIZA++ training.
Preferably, the language model of the target language is trained on the parallel corpus with the language model training tool SRILM or an N-gram method.
The invention also provides a device adopting the above machine translation method, the device comprising:
the dictionary library module is used for storing words and phrases corresponding to different languages;
the grammar rule base module is used for storing grammar rules corresponding to different languages;
the phrase translation probability table module is used for storing translation fragments of different languages obtained by training of a statistical machine translation system;
the target language model module is used for storing a language model of a target language obtained by training of the statistical machine translation system;
the syntax analyzer is connected with the dictionary library module and the grammar rule base module and is used for performing sentence division, segmentation, part-of-speech disambiguation and grammatical analysis on the source text in sequence according to the dictionary library and the grammar rule base so as to generate a syntax tree;
and the decoder is connected with the phrase translation probability table module, the language model module and the syntactic analyzer and is used for traversing the syntactic tree according to the phrase translation probability table and the target language model, converting the original text into a translated text and generating the target language.
Further, the syntax analyzer includes:
the sentence dividing module is used for reading the original text and segmenting the original text;
the segmentation and pretreatment module is connected with the sentence division module and is used for segmenting and pretreating the divided single sentences;
the disambiguation module is connected with the segmentation and preprocessing module and is used for performing part-of-speech disambiguation on the segmented single sentences;
the grammar analysis module is connected with the disambiguation module and is used for performing grammatical analysis on the single sentence after disambiguation;
and the master control module is respectively connected with the modules and controls the operation of the modules.
The invention provides a machine translation method and device fusing a syntax tree, a phrase translation probability table and a language model, which scan the syntax tree node by node and across nodes while searching the phrase translation probability table and the language model of statistical machine translation.
Drawings
FIG. 1 is a schematic diagram of the structure of an English-Chinese machine translation device according to an embodiment;
FIG. 2 is a flow chart of an English-Chinese machine translation method according to an embodiment;
FIG. 3 is a block diagram of a syntactic parser in FIG. 1;
FIG. 4 is a diagram illustrating the translation probability table and the training of a language model in an embodiment;
FIG. 5 is a diagram of the syntax tree obtained in the embodiment.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and accompanying drawings.
Fig. 1 is a schematic diagram illustrating a structural configuration of a machine translation apparatus 100 according to the present embodiment, which incorporates a syntax tree and a statistical machine translation technique, and fig. 2 is a flowchart illustrating an implementation of machine translation using the apparatus.
Referring to FIG. 1, the apparatus 100 includes: a dictionary library module 110 for storing words and phrases corresponding to different languages; a grammar rule base module 120 for storing grammar rules corresponding to different languages; a phrase translation probability table module 130 for storing translation fragments of different languages obtained by training of the statistical machine translation system; a target language model module 140 for storing a target language model trained by the statistical machine translation system; a syntax analyzer 150, connected with the dictionary library module and the grammar rule base module, for performing sentence division, segmentation, part-of-speech disambiguation and grammatical analysis on the source text in sequence according to the dictionary library and the grammar rule base to generate a syntax tree; and a decoder 160, connected with the phrase translation probability table module, the language model module and the syntax analyzer, for traversing the syntax tree according to the phrase translation probability table and the target language model, converting the source text into a translation and generating the target language. The phrase translation probability table and the target language model are obtained by training on the parallel corpus, as shown in FIG. 2.
Referring to fig. 1 and fig. 2, a specific translation process is described by taking an example in which the source language is english and the target language is chinese, and mainly includes the following steps:
1) performing morphological analysis on the English side of an English-Chinese bidirectional parallel corpus, and performing word segmentation on the Chinese side;
2) performing word alignment and phrase alignment on the parallel corpus with the GIZA++ statistical tool, and extracting an English-Chinese phrase translation probability table;
3) filtering the extracted English-Chinese phrase translation probability table to remove inaccurate statistical entries;
4) training a language model of the target language on the parallel corpus with the language model training tool SRILM;
5) reading the dictionary base information and segmenting the input single sentence to be translated, then reading the grammar rule base information and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree; the disambiguation and parsing step also identifies and records noun or verb phrases that are not, or cannot be, fully collected in the dictionary base;
6) traversing the syntax tree with a top-down strategy, searching the phrase translation probability table for an entry matching the subtree rooted at the current node, and generating a translation;
7) when traversing the syntax tree, in addition to searching the statistical phrase table for the subtree rooted at each node, certain cross-syntax cases are also admitted, so that the phrase translation probability table can be searched and used without damaging the syntax tree, and the statistical phrase table is exploited as fully as possible to improve the quality of the translation;
contiguous nodes that cross syntactic boundaries must satisfy a certain structure before the source text of their leaf nodes may be used to search the phrase translation probability table. For example, in "V N to V", "V N to" may be searched, but "N to V" may not. For the implementation of the cross-syntax case, refer to steps 3) and 4) of the translation example below;
8) for the syntax tree nodes left untranslated by the above process, generating the target language by combining the dictionary, the rules and the language model, that is, generating a translation by the rule-based translation method and smoothing the generated translation with the target language model.
"Untranslated syntax tree nodes" exist because some segments cannot be found in the phrase translation probability table; about 29% of the segments are left untranslated in this way and are therefore translated by the rule-based translation method. Note that the smoothing in step 8) focuses on the rule-translated output, but in other embodiments all previously generated translations (including those obtained from the phrase translation probability table) may be smoothed; the invention is not limited in this respect.
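Steps 6) to 8) can be illustrated with a minimal Python sketch. It is not the patented decoder: the names `Node`, `rule_translate` and `translate` are hypothetical, the phrase table is reduced to a plain dictionary, cross-syntax lookups are omitted, and the rule-based engine is replaced by a stub that merely tags its input.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # syntactic category, e.g. "V", "N", "Conj"
    word: str = ""                  # source text, set only on leaf nodes
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the source words under this node, left to right."""
        if not self.children:
            return [self.word]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out


def rule_translate(node):
    """Hypothetical stand-in for the rule-based engine: tags the source text."""
    return "<rule:" + " ".join(node.leaves()) + ">"


def translate(node, phrase_table):
    """Top-down: try the whole subtree first, then recurse into the children."""
    source = " ".join(node.leaves())
    if source in phrase_table:
        return phrase_table[source]          # phrase-table hit for this subtree
    if not node.children:
        return rule_translate(node)          # untranslated leaf: rule fallback
    return " ".join(translate(c, phrase_table) for c in node.children)
```

Because the lookup is attempted at the root of each subtree before descending, the largest matching fragment is used and smaller fragments are consulted only when the larger one is absent from the table.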
As shown in FIG. 3, one embodiment of the syntax analyzer comprises: a master control module 151 for managing and controlling the operation of the other modules of the syntax analyzer; a sentence dividing module 152 for dividing the English text to be translated into single-sentence strings; a segmentation and preprocessing module 153 for segmenting an English sentence into a sequence of strings in units of phrases, where preprocessing includes punctuation processing, format processing and the like, which are common techniques in rule-based translation systems; a disambiguation module 154 for performing part-of-speech tagging on the segmented English sentence through part-of-speech disambiguation; and a grammar analysis module 155 for performing relatively simple syntactic analysis so that the segmented English sentence forms a syntax tree.
The entries stored in the dictionary base are labeled according to the requirements of the translation system, and the related semantic attributes are noted as follows:
afromosia \ N \ African red bean wood
&CAT[N]M_SEM[B]S_SEM[D]CLAS[plants]$
afront \ F \ preceding
&CAT[F]M_SEM[J]$
……
mountain bike
&CAT[N]M_SEM[C]S_SEM[I]$
mountain coast \ N \ steep coast
&CAT[N]M_SEM[C]S_SEM[B]CLAS[d]$
mountain chrome/N/asbestos
&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$
The requirements of the translation system refer to the dictionary specification, which is defined by the developer of the rule-based translation system and generally covers the part-of-speech, grammatical and semantic information annotated on each entry; this is common practice in rule-based translation systems.
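As an illustration of the entry layout shown above, the following Python sketch reads a headword line and its attribute line. The function names and the interpretation of the layout (backslash-separated headword fields, `NAME[value]` attributes between `&` and `$`) are assumptions inferred from the sample entries, not a specification given in the patent.

```python
import re

# One attribute looks like NAME[value], e.g. CAT[N] or M_SEM[C].
ATTR = re.compile(r"([A-Z_]+)\[([^\]]*)\]")


def parse_headword(line):
    """Split 'source \\ category \\ translation' into its stripped fields."""
    return [part.strip() for part in line.split("\\")]


def parse_attributes(line):
    """Turn '&CAT[N]M_SEM[C]S_SEM[I]$' into {'CAT': 'N', 'M_SEM': 'C', ...}."""
    body = line.strip().lstrip("&").rstrip("$")
    return dict(ATTR.findall(body))
```

For example, `parse_attributes("&CAT[F]M_SEM[J]$")` yields a dictionary with the keys `CAT` and `M_SEM`.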
The grammar rules stored in the grammar rule base specify the translation rules for words or phrases according to the requirements of the translation system, as shown below:
with links to:
[24] (1) CAT [ N ] - - > is linked to% 1
reach:
[12] (1) CHI [ level | value ] - > MEANQ [0, reached ];
[13] (0) CAT [ V ] + (1) CHI [ confinement ] - > MEAN [0, obtained ];
[14] (0) CAT [ V ] + (1) CHI [ gold ] - > MEAN [0, up to ];
[15] (0) CAT [ V ] & & IS _ CENTER [1] + (1) CAT [ N ] & & L _ CHI [ acquisition ] - - > MEAN [0, achievement ].
As shown in FIG. 4, the statistical training process that yields the phrase translation probability table and the language model comprises: training on the parallel corpus with the statistical machine translation training tool GIZA++ to obtain the phrase translation probability table, and training on the parallel corpus with the language model training tool SRILM to obtain the target language model. Besides SRILM, other language model training methods such as N-gram may also be adopted.
In the above embodiment, step 2), the extraction of the phrase translation probability table, is a key point of the present invention and is described further here. The phrase translation probability table includes four parts: a source language phrase $f_1^J$ containing $J$ words, a target language phrase $e_1^I$ containing $I$ words, the word alignment relationship $\alpha$ within the source and target language phrases, and the phrase translation score $p$. An entry may be expressed as
$$(f_1^J,\ e_1^I,\ \alpha,\ p)$$
A phrase translation score is then calculated, comprising four components: the phrase translation probabilities $p(e_1^I \mid f_1^J)$ and $p(f_1^J \mid e_1^I)$, and the lexical translation probabilities $p_w(e_1^I \mid f_1^J, \alpha)$ and $p_w(f_1^J \mid e_1^I, \alpha)$.
The phrase translation probabilities are calculated as follows:
$$p(e_1^I \mid f_1^J) = \frac{N(f_1^J \mid e_1^I)}{\sum_{\tilde{e}_1^I} N(f_1^J \mid \tilde{e}_1^I)}$$
$$p(f_1^J \mid e_1^I) = \frac{N(e_1^I \mid f_1^J)}{\sum_{\tilde{f}_1^J} N(e_1^I \mid \tilde{f}_1^J)}$$
In the above formulas, $N(f_1^J \mid e_1^I)$ denotes the number of occurrences of the phrase pair $(f_1^J, e_1^I)$ in the corpus, $\tilde{e}_1^I$ ranges over all possible target language phrases corresponding to $f_1^J$, and $N(f_1^J \mid \tilde{e}_1^I)$ denotes the number of occurrences of the phrase pair $(f_1^J, \tilde{e}_1^I)$ in the corpus. Likewise, $N(e_1^I \mid f_1^J)$ denotes the number of occurrences of the phrase pair $(e_1^I, f_1^J)$ in the corpus, $\tilde{f}_1^J$ ranges over all possible source language phrases corresponding to $e_1^I$, and $N(e_1^I \mid \tilde{f}_1^J)$ denotes the number of occurrences of the phrase pair $(e_1^I, \tilde{f}_1^J)$ in the corpus.
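The two phrase translation probabilities are relative frequencies over phrase-pair counts. The Python sketch below shows one way to compute them, assuming the phrase pairs have already been extracted from the aligned corpus; the function name `phrase_translation_probs` and the input layout are hypothetical and do not reflect how GIZA++ or any particular toolkit stores its data.

```python
from collections import Counter, defaultdict


def phrase_translation_probs(pairs):
    """pairs: iterable of (source_phrase, target_phrase) occurrences.

    Returns two dicts keyed by (source, target): p(e|f) and p(f|e).
    """
    pair_count = Counter(pairs)
    source_total = defaultdict(int)   # sum of counts over all targets for a source
    target_total = defaultdict(int)   # sum of counts over all sources for a target
    for (f, e), n in pair_count.items():
        source_total[f] += n
        target_total[e] += n
    p_e_given_f = {(f, e): n / source_total[f] for (f, e), n in pair_count.items()}
    p_f_given_e = {(f, e): n / target_total[e] for (f, e), n in pair_count.items()}
    return p_e_given_f, p_f_given_e
```

Each denominator is the sum in the corresponding formula: the count of a pair is divided by the total count of all pairs that share the conditioning phrase.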
The lexical translation probability calculation formula is as follows:
Figure BDA000028310155000616
In the above formula, $p(e_i, f_j)$ denotes the probability that the source language word $f_j$ ($j = 1, \dots, J$) translates to the target language word $e_i$ ($i = 1, \dots, I$), and $p(f_j, e_i)$ denotes the probability that the target language word $e_i$ ($i = 1, \dots, I$) translates to the source language word $f_j$ ($j = 1, \dots, J$). $\alpha$ denotes the alignment of the source and target language word pairs.
In the above embodiment, regarding step 8), the target language is generated by combining the dictionary, the rules and the language model; that is, the target language model is used to smooth the translation generated from the dictionary and rules and/or the translation obtained from the phrase translation probability table, so as to improve the fluency of the translation. A method of computing the smoothness of a target language translation relative to the target language model is as follows:
1) A target language statistical model is expressed by the conditional probability of each word given the preceding words:
$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1})$$
Here, $w_t$ represents the $t$-th word in the translation, $w_1^T$ is $w_1, \dots, w_T$, and $w_1^{t-1}$ is $w_1, \dots, w_{t-1}$.
2) Due to the fact that
Figure BDA000028310155000620
An N-gram model can be used to calculate the conditional probability of a subsequent word relative to a previous word:
$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$
3) Let $w_1 \dots w_T$ be a training set of the target language with $w_t \in V$, where $V$ is a finite set. The goal is then to design a good model:
$$f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$
The above model should give the maximum sample likelihood, whose geometric mean is taken:
$$\text{Perplexity} = 1 / \hat{P}(w_t \mid w_1^{t-1})$$
4) In the above formula, for any preceding word sequence, the following holds:
Figure BDA00002831015500074
Thus, the smoothness of the target language translation relative to the target language model can be calculated:
Score = 1 T Σ t log f ( w t , w t - 1 , . . . , w t - n + 1 ) Perplexity ,
wherein T is the number of words in the training set of the target language.
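The two quantities that enter the score, the per-word log probability and the perplexity as a geometric mean of inverse probabilities, can be computed as in the Python sketch below. This is only an illustration under stated assumptions: the function names are hypothetical, the per-word conditional probabilities are taken as given, and the sketch does not attempt to reproduce the exact Score formula of the patent.

```python
import math


def perplexity(word_probs):
    """Geometric mean of 1/p over the per-word conditional probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def mean_log_prob(word_probs):
    """Average log probability per word, (1/T) * sum_t log f(...)."""
    return sum(math.log(p) for p in word_probs) / len(word_probs)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied; the two functions are related by perplexity = exp(-mean_log_prob).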
A specific example is provided below, where the sentences to be translated are:
Select this option to postpone deleting these records until pruning is performed.
First, the input sentence is segmented by reading the dictionary base information; the grammar rule base information is then read, and part-of-speech disambiguation and grammatical analysis are performed on the segmented sentence to form a syntax tree, as shown in FIG. 5.
Then the syntax tree is decoded. The tree is traversed with a top-down strategy, that is, starting from the topmost node [V] at the upper left and proceeding toward the leaf nodes on the right. The detailed traversal steps are as follows (target-language text is glossed in English):
1) Read the leaf node string of [V]: "Select this option to postpone deleting these records until pruning is performed", and use the string to search the phrase translation probability table. No matching translation segment is found.
2) Read the structure attribute of [V]. It is a "V Conj S V" structure, which is divided into two parts for translation, namely "V || Conj S V".
3) Read the leaf node string of the first [V] of "V || Conj S V": "Select this option to postpone deleting these records", and use the string to search the phrase translation probability table. No matching translation segment is found.
4) Read the structure attribute of this [V]. It is a "V N to V" structure, which can be divided in two ways: "V N to || V" or "V N || to || V".
5) According to the maximum matching principle, the fewer the segmented blocks, the more accurate the translation result, so the first segmentation "V N to || V" is tried first. The leaf node source text "Select this option to" of "V N to" is read and used to search the phrase translation probability table, and the search succeeds:
Select this option to ||| selecting this option can ||| 0-0 1-1 2-2 3-3 ||| 10.0003327991397108 e-007;
At this point, the example uses "selecting this option can" as the translation of "Select this option to".
6) Read the leaf node string of the second [V] of "V N to || V": "postpone deleting these records", and use the string to search the phrase translation probability table. The search succeeds:
postpone deleting these records ||| postpone deletion of these records ||| 0-0 1-1 2-2 3-3 ||| 10.00056812810.125;
The example uses "postpone deletion of these records" as the translation of "postpone deleting these records".
7) Thus, the first "V" of the whole sentence "V || Conj S V" can be translated as:
Select this option to postpone deleting these records → selecting this option can postpone deletion of these records
8) Next, for "Conj S V", its leaf node string "until pruning is performed" is read and used to search the phrase translation probability table. No matching translation segment is found.
9) The structure "Conj S V" can be divided into "Conj || S V".
10) "Conj" corresponds to the common word "until", so the phrase translation probability table need not be searched.
11) Read the leaf node string of the second part "S V" of "Conj || S V": "pruning is performed", and use the string to search the phrase translation probability table. The search succeeds:
pruning is performed ||| pruning is completed ||| 0-1 1-0 2-0 ||| 15.14058e-0063.37201e-006
The example uses "pruning is completed" as the translation of "pruning is performed".
12) Thus, both parts of the whole sentence "V || Conj S V" have been translated:
Select this option to postpone deleting these records → selecting this option can postpone deletion of these records
pruning is performed → pruning is completed
13) For the structure "V || Conj S V", there are two basic translation methods:
V || Conj S V → V, Conj S V
V || Conj S V → Conj S V, V
Specifically, when "Conj" in the sentence is "until", these become:
V || until S V → V, until S V
V || until S V → before S V, V
Thus, there are two translation results:
Selecting this option can postpone deletion of these records, until pruning is completed.
Before pruning is completed, selecting this option can postpone deletion of these records.
14) To decide which translation result to adopt, the probabilities of the two sentences are calculated according to the target language model. Under the N-gram language model, let the word sequence of a translation result be $w_1, w_2, \dots, w_m$; the probability of occurrence of the translation result is then:
$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
According to the Markov assumption, the above conditional probability can be calculated from the frequency counts of N-grams:
$$P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1}) = \frac{\text{count}(w_{i-(n-1)}, \dots, w_{i-1}, w_i)}{\text{count}(w_{i-(n-1)}, \dots, w_{i-1})}$$
where $n$ is the order of the n-gram language model, which may be set to 3. The probability of occurrence of the first translation obtained by actual calculation is 5.38125e-005, and that of the second translation is 4.20337e-006. The first translation has the higher probability and is therefore selected as the final translation:
Selecting this option can postpone deletion of these records, until pruning is completed.
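Steps 13) and 14) can be sketched in Python as follows: two word orders are generated for the "until" structure, and the one with the higher count-based n-gram probability is kept. All function names are hypothetical, the `<s>` padding is an assumption, and the estimate is unsmoothed maximum likelihood; tools such as SRILM apply smoothing, which this sketch omits.

```python
from collections import Counter


def train_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories, padding with <s>."""
    ngrams, histories = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] * (n - 1) + sentence.split()
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            ngrams[history + (words[i],)] += 1
            histories[history] += 1
    return ngrams, histories


def sentence_prob(sentence, ngrams, histories, n=3):
    """Product of count(history, w) / count(history) over the sentence."""
    words = ["<s>"] * (n - 1) + sentence.split()
    prob = 1.0
    for i in range(n - 1, len(words)):
        history = tuple(words[i - n + 1:i])
        if histories[history] == 0:
            return 0.0                       # unseen history: no estimate
        prob *= ngrams[history + (words[i],)] / histories[history]
    return prob


def candidates(v, s_v):
    """The two orderings of 'V || until S V' from step 13)."""
    return [f"{v} , until {s_v}", f"before {s_v} , {v}"]


def pick_best(cands, ngrams, histories, n=3):
    """Keep the candidate the language model considers most probable."""
    return max(cands, key=lambda c: sentence_prob(c, ngrams, histories, n))
```

With `n=3` each word is conditioned on the two preceding words, matching the trigram setting mentioned in the text.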
In the above process, "Select this option to" is not a complete node in the syntax tree but a sequence of contiguous cross-syntax nodes forming part of the structure "V N to V". Such contiguous cross-syntax nodes should be allowed to search the statistical phrase table as long as they satisfy a certain pattern, so that the statistical phrase table is exploited as fully as possible and a better translation result is obtained, provided that the overall structure of the sentence is not destroyed.
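The cross-syntax lookup can be pictured with the Python sketch below. The function `segment` and its acceptance rule are assumptions made for illustration: permitted segmentations are supplied as lists of block sizes, segmentations with fewer blocks are tried first (the maximum matching principle), and a segmentation is accepted when its first block, the cross-syntax span, is found in the phrase table.

```python
def segment(children, splits, phrase_table):
    """Pick a permitted segmentation of a node's children.

    children: list of (label, source_text) pairs, left to right.
    splits:   permitted segmentations as lists of block sizes,
              e.g. [[3, 1], [2, 1, 1]] for "V N to || V" and "V N || to || V".
    Returns a list of (source_text, translation_or_None) blocks, or None
    when no permitted segmentation starts with a phrase-table hit.
    """
    for sizes in sorted(splits, key=len):        # fewest blocks first
        blocks, start = [], 0
        for size in sizes:
            text = " ".join(t for _, t in children[start:start + size])
            blocks.append((text, phrase_table.get(text)))
            start += size
        if blocks[0][1] is not None:             # cross-syntax block was found
            return blocks
    return None
```

A `None` result would send the caller back to ordinary node-by-node translation, and blocks whose translation is `None` would be handled by further traversal or by the rule-based method.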
The applicant conducted experiments with the machine translation method and device fusing the syntax tree, the phrase translation probability table and the language model on a practical English-Chinese system customized for the IT security field. A parallel English-Chinese corpus of 610,000 sentence pairs from the IT security field was selected as training data; a phrase translation probability table containing 3.49 million translation fragments was trained with the statistical alignment tool GIZA++, and a language model was trained on the 610,000 Chinese sentences. The translation method of the invention was tested on 2,468 English test sentences; the resulting BLEU values, TER values and readability are shown in Table 1.
As can be seen from Table 1, with the method of the invention, fragments from the phrase translation probability table (e.g., the translation of "Select this option to") appear 7,684 times in the translations of the 2,468 test sentences, an average of 3.11 per sentence. This frequency is very high: such fragments account for 71.7% of all Chinese characters in the output translation.
That is to say, 71.7% of the output Chinese translation comes from the statistical phrase translation probability table, which shows that the method of the invention makes full use of the fine-grained knowledge of statistical machine translation. The improvement in translation quality is very obvious, and the BLEU value, TER value and readability of the translation results are all improved to different degrees.
TABLE 1 test data for the method of the invention
Figure BDA00002831015500101
The principles and features of the present invention have been described above with reference to specific embodiments thereof. It is to be understood that the invention is not limited to the particular embodiments described above, as variations and modifications may be made and equivalents may be substituted for elements thereof. The scope of the invention is only limited by the appended claims.

Claims (10)

1. A machine translation method fusing a syntax tree and statistical machine translation technology, comprising the following steps:
1) establishing a dictionary base, a grammar rule base, a phrase translation probability table and a target language model between different languages; wherein the dictionary base stores corresponding words and phrases of the different languages, the grammar rule base stores corresponding grammar rules of the different languages, the phrase translation probability table stores translation fragments of the different languages obtained by training a statistical machine translation system, and the target language model stores a language model of the target language obtained by training the statistical machine translation system;
2) reading the dictionary base information, segmenting an input single sentence to be translated, and decomposing the single sentence into source-language words or phrases;
3) reading the grammar rule base information, and performing part-of-speech disambiguation and grammatical analysis on the segmented single sentence to form a syntax tree;
4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for each single node, and for partial runs of consecutive nodes that cross syntactic boundaries, taking the original text of the leaf nodes, looking it up in the phrase translation probability table, and selecting a translation from the phrase translation table as the translation of that node; and generating translations for the nodes left untranslated in this process according to a rule-based translation method;
5) smoothing the generated translation using the target language model to generate the target-language output.
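The top-down lookup in step 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the dictionary-shaped phrase table and the `rule_translate` callback are all hypothetical, the sketch matches only whole subtrees (it does not attempt the cross-syntax runs of consecutive nodes), and untranslated inner nodes are simply concatenated without the reordering a real rule component would apply.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A syntax-tree node: leaves carry a source word, inner nodes carry children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Source words under this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


def translate(node: Node,
              phrase_table: Dict[str, Dict[str, float]],
              rule_translate: Callable[[Node], str]) -> str:
    """Top-down: try the whole span under `node` first, then descend."""
    source = " ".join(node.leaves())
    if source in phrase_table:
        # Pick the highest-scoring candidate stored for this source phrase.
        candidates = phrase_table[source]
        return max(candidates, key=candidates.get)
    if node.word is not None:
        # Leaf not covered by the phrase table: fall back to the rule component.
        return rule_translate(node)
    return " ".join(translate(c, phrase_table, rule_translate) for c in node.children)


# Toy data (hypothetical): one phrase-table entry and a one-word rule dictionary.
tree = Node("S", children=[Node("VP", children=[
    Node("V", word="select"),
    Node("NP", children=[Node("DT", word="this"), Node("NN", word="option")]),
])])
phrase_table = {"this option": {"此选项": 0.8, "这个选项": 0.2}}
rule_translate = lambda n: {"select": "选择"}.get(n.word, n.word)

result = translate(tree, phrase_table, rule_translate)
print(result)
```

Trying the largest span first means a long phrase-table match takes precedence over word-by-word translation of the same subtree.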
2. The method of claim 1, wherein the translation fragments of different languages stored in the phrase translation probability table are obtained by GIZA++ training.
3. The method of claim 1, wherein the target language model is obtained using the language model training tool SRILM or N-gram.
4. The method of claim 1, wherein the phrase translation probability table comprises: a source language phrase $f_1^J$ containing J words, a target language phrase $e_1^I$ containing I words, a word alignment relationship α inside the source and target language phrases, and a phrase translation score p.
5. The method of claim 4, wherein the phrase translation score p comprises a phrase translation probability and a lexical translation probability; the phrase translation probability is calculated as:

$$p(e_1^I \mid f_1^J) = \frac{N(f_1^J \mid e_1^I)}{\sum_{ee_1^I} N(f_1^J \mid ee_1^I)}$$

$$p(f_1^J \mid e_1^I) = \frac{N(e_1^I \mid f_1^J)}{\sum_{ff_1^J} N(e_1^I \mid ff_1^J)}$$

wherein $N(f_1^J \mid e_1^I)$ represents the number of occurrences of the phrase pair $(f_1^J, e_1^I)$ in the corpus, $ee_1^I$ represents all possible target language phrases corresponding to $f_1^J$, $N(f_1^J \mid ee_1^I)$ represents the number of occurrences of the phrase pair $(f_1^J, ee_1^I)$ in the corpus, $ff_1^J$ represents all possible source language phrases corresponding to $e_1^I$, $N(e_1^I \mid f_1^J)$ represents the number of occurrences of the phrase pair $(e_1^I, f_1^J)$ in the corpus, and $N(e_1^I \mid ff_1^J)$ represents the number of occurrences of the phrase pair $(e_1^I, ff_1^J)$ in the corpus;
the calculation formula of the lexical translation probability is as follows:
Figure FDA00002831015400023
wherein $p(e_i, f_j)$ represents the probability that the source language word $f_j$ ($j = 1, \dots, J$) translates into the target language word $e_i$ ($i = 1, \dots, I$); $p(f_j, e_i)$ represents the probability that the target language word $e_i$ ($i = 1, \dots, I$) translates into the source language word $f_j$ ($j = 1, \dots, J$); and α represents the alignment of the source and target language word pairs.
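The phrase translation probabilities in claim 5 are relative frequencies over extracted phrase pairs. A minimal sketch follows, assuming the phrase pairs have already been extracted from word-aligned data; the function name and the toy pairs are hypothetical, and the lexical translation probability is not covered here.

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Return p(e | f) and p(f | e) as relative frequencies of (f, e) pairs."""
    pair_counts = Counter(phrase_pairs)   # N for each (source, target) pair
    source_totals = Counter()             # sum of N over all targets of a source phrase
    target_totals = Counter()             # sum of N over all sources of a target phrase
    for (f, e), n in pair_counts.items():
        source_totals[f] += n
        target_totals[e] += n
    p_e_given_f = {(f, e): n / source_totals[f] for (f, e), n in pair_counts.items()}
    p_f_given_e = {(f, e): n / target_totals[e] for (f, e), n in pair_counts.items()}
    return p_e_given_f, p_f_given_e


# Toy extracted phrase pairs (hypothetical data).
pairs = [
    ("this option", "此选项"),
    ("this option", "此选项"),
    ("this option", "这个选项"),
    ("the option", "此选项"),
]
p_e_given_f, p_f_given_e = phrase_translation_probs(pairs)
print(p_e_given_f[("this option", "此选项")])
print(p_f_given_e[("the option", "此选项")])
```

Both directions are kept because a phrase table typically stores them side by side as separate features of the phrase translation score.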
6. The method of claim 1, wherein the smoothness of the target language translation relative to the target language model is calculated as follows:
1) the target language statistical model is expressed by the conditional probability of each word given the words preceding it:
Figure FDA00002831015400025
wherein $w_t$ represents the t-th word in the translation, $w_1^T$ is $w_1, \dots, w_T$, and $w_1^{t-1}$ is $w_1, \dots, w_{t-1}$;
2) the conditional probability of each word given the preceding words is calculated with an N-gram model:

$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$

3) let $w_1 \dots w_T$ be a training set of the target language with $w_t \in V$, where V is a finite set; the maximum sample likelihood is computed:

$$f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1}),$$

and the geometric mean thereof:

$$\text{Perplexity} = 1 / \hat{P}(w_t \mid w_1^{t-1})$$

4) for arbitrary [symbol missing in the source text], the following holds:
Figure FDA000028310154000212
thereby the smoothness of the target language translation relative to the target language model is obtained as:
$$\text{Score} = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}) \; \text{Perplexity},$$

wherein T is the number of words in the training set of the target language.
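To make the N-gram quantities in claim 6 concrete, the sketch below trains a bigram model (n = 2) and reports the average log-probability and the perplexity of a sentence. Several choices are assumptions made only for illustration: add-one smoothing, the `<s>` and `</s>` boundary markers, and all function names. The patent's Score formula itself is not reproduced; the sketch only shows the log-probability and perplexity that feed into it.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts, with sentence boundary markers."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        context_counts.update(tokens[:-1])              # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])                        # every token that can be predicted
    return context_counts, bigram_counts, vocab


def bigram_prob(model, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def avg_log_prob_and_perplexity(model, sentence):
    """Average log-probability per word and exp of its negation (perplexity)."""
    tokens = ["<s>"] + list(sentence) + ["</s>"]
    log_probs = [math.log(bigram_prob(model, p, w)) for p, w in zip(tokens, tokens[1:])]
    avg = sum(log_probs) / len(log_probs)
    return avg, math.exp(-avg)


# Toy target-language corpus (hypothetical, already segmented).
corpus = [["选择", "此", "选项"], ["删除", "此", "选项"]]
model = train_bigram(corpus)

_, ppl_fluent = avg_log_prob_and_perplexity(model, ["选择", "此", "选项"])
_, ppl_scrambled = avg_log_prob_and_perplexity(model, ["选项", "此", "选择"])
print(ppl_fluent < ppl_scrambled)
```

Perplexity computed this way is the geometric mean of the reciprocal word probabilities, so a word order the model has seen yields a lower value than a scrambled one.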
7. The method of claim 1, wherein the entries stored in the dictionary base are labeled according to the requirements of the translation system and annotated with the relevant semantic attributes; and the grammar rules stored in the grammar rule base prescribe translation rules for words or phrases according to the requirements of the translation system.
8. The method of claim 1, wherein the translation probabilities of the different translation results are calculated according to the target language model, and the translation with the higher probability is taken as the final translation.
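The selection in claim 8 amounts to scoring each candidate translation with the language model and keeping the best one. A minimal sketch, in which a hypothetical table of unigram probabilities stands in for the trained target language model and all names are illustrative:

```python
import math

# Hypothetical unigram probabilities standing in for a trained target language model.
TOY_UNIGRAM_PROBS = {"选择": 0.05, "挑选": 0.001, "此": 0.1, "选项": 0.02}


def sentence_log_prob(tokens, floor=1e-6):
    """Sum of word log-probabilities; unseen words get a small floor probability."""
    return sum(math.log(TOY_UNIGRAM_PROBS.get(tok, floor)) for tok in tokens)


def pick_best(candidates):
    """Return the candidate translation that the model scores highest."""
    return max(candidates, key=sentence_log_prob)


candidates = [["挑选", "此", "选项"], ["选择", "此", "选项"]]
best = pick_best(candidates)
print(" ".join(best))
```

Summed log-probabilities favor shorter candidates, so a real system comparing candidates of different lengths would likely normalize by length; that refinement is left out here.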
9. A machine translation apparatus fusing a syntax tree and statistical machine translation technology, comprising:
a dictionary base module for storing corresponding words and phrases of different languages;
a grammar rule base module for storing corresponding grammar rules of different languages;
a phrase translation probability table module for storing translation fragments of different languages obtained by training a statistical machine translation system;
a target language model module for storing a language model of the target language obtained by training the statistical machine translation system;
a syntax analyzer, connected with the dictionary base module and the grammar rule base module, for performing sentence splitting, segmentation, part-of-speech disambiguation and syntactic analysis on the original text in sequence according to the dictionary base and the grammar rule base, so as to generate a syntax tree; and
a decoder, connected with the phrase translation probability table module, the target language model module and the syntax analyzer, for traversing the syntax tree according to the phrase translation probability table and the target language model, converting the original text into a translation, and generating the target-language output.
10. The apparatus of claim 9, wherein the syntax analyzer comprises:
a sentence splitting module for reading the source text and splitting it into sentences;
a segmentation and preprocessing module, connected with the sentence splitting module, for segmenting and preprocessing the split single sentences;
a disambiguation module, connected with the segmentation and preprocessing module, for performing part-of-speech disambiguation on the segmented single sentences;
a grammar analysis module, connected with the disambiguation module, for performing grammatical analysis on the disambiguated single sentences; and
a master control module, connected with each of the above modules, for controlling their operation.
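The module chain in claim 10 can be pictured as a small pipeline in which a master control object calls each stage in order. The sketch below is only a structural illustration: every function is a hypothetical placeholder (regex-based splitting, dummy tags, a flat tree) and none of it reflects the dictionary base or grammar rule base an actual analyzer would consult.

```python
import re


def split_sentences(text):
    """Sentence splitting module: break the source text at ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(sentence):
    """Segmentation module: separate words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def disambiguate(tokens):
    """Disambiguation module (placeholder): attach a dummy tag to each token."""
    return [(t, "X" if t[0].isalnum() else "PUNCT") for t in tokens]


def parse(tagged):
    """Grammar analysis module (placeholder): wrap the tagged tokens in a flat tree."""
    return ("S", tagged)


class MasterControl:
    """Master control module: runs the connected modules in sequence."""

    def __init__(self, splitter, segmenter, tagger, parser):
        self.splitter = splitter
        self.segmenter = segmenter
        self.tagger = tagger
        self.parser = parser

    def run(self, text):
        return [self.parser(self.tagger(self.segmenter(s))) for s in self.splitter(text)]


analyzer = MasterControl(split_sentences, segment, disambiguate, parse)
trees = analyzer.run("Select this option. Click OK!")
print(len(trees))
```

Passing the stages in as callables keeps each module replaceable, which mirrors the claim's description of separate modules connected in sequence under one controller.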
CN2013100497397A 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device Pending CN103116578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100497397A CN103116578A (en) 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100497397A CN103116578A (en) 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device

Publications (1)

Publication Number Publication Date
CN103116578A true CN103116578A (en) 2013-05-22

Family

ID=48414955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100497397A Pending CN103116578A (en) 2013-02-07 2013-02-07 Translation method integrating syntactic tree and statistical machine translation technology and translation device

Country Status (1)

Country Link
CN (1) CN103116578A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1652106A (en) * 2004-02-04 2005-08-10 北京赛迪翻译技术有限公司 Machine translation method and apparatus based on language knowledge base
US20080162111A1 (en) * 2006-12-28 2008-07-03 Srinivas Bangalore Sequence classification for machine translation
CN101482861A (en) * 2008-01-09 2009-07-15 中国科学院自动化研究所 Chinese-English words automatic alignment method
CN102662932A (en) * 2012-03-15 2012-09-12 中国科学院自动化研究所 Method for establishing tree structure and tree-structure-based machine translation system
US20120316862A1 (en) * 2011-06-10 2012-12-13 Google Inc. Augmenting statistical machine translation with linguistic knowledge

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
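徐志明, 王晓龙 et al.: "Data smoothing techniques for N-gram language models", 《计算机应用研究》 (Application Research of Computers) *
蒋宏飞, 李生 et al.: "A statistical machine translation model based on synchronous tree substitution grammar", 《软件学报》 (Journal of Software) *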

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731777A (en) * 2015-03-31 2015-06-24 网易有道信息技术(北京)有限公司 Translation evaluation method and device
US10909315B2 (en) 2015-07-22 2021-02-02 Huawei Technologies Co., Ltd. Syntax analysis method and apparatus
WO2017012327A1 (en) * 2015-07-22 2017-01-26 华为技术有限公司 Syntax analysis method and device
CN106407184A (en) * 2015-07-30 2017-02-15 阿里巴巴集团控股有限公司 Decoding method used for statistical machine translation, and statistical machine translation method and apparatus
CN106407184B (en) * 2015-07-30 2019-10-01 阿里巴巴集团控股有限公司 Coding/decoding method, statistical machine translation method and device for statistical machine translation
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN106598937B (en) * 2015-10-16 2019-10-18 阿里巴巴集团控股有限公司 Language Identification, device and electronic equipment for text
CN107436865A (en) * 2016-05-25 2017-12-05 阿里巴巴集团控股有限公司 A kind of word alignment training method, machine translation method and system
CN107436865B (en) * 2016-05-25 2020-10-16 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN106844352A (en) * 2016-12-23 2017-06-13 中国科学院自动化研究所 Word prediction method and system based on neural machine translation system
CN106844352B (en) * 2016-12-23 2019-11-08 中国科学院自动化研究所 Word prediction method and system based on neural machine translation system
CN107066455A (en) * 2017-03-30 2017-08-18 唐亮 A kind of multilingual intelligence pretreatment real-time statistics machine translation system
CN107066455B (en) * 2017-03-30 2020-07-28 唐亮 Multi-language intelligent preprocessing real-time statistics machine translation system
TWI637278B (en) * 2017-07-03 2018-10-01 雲拓科技有限公司 Computer automatically claim-translating device
CN107526726A (en) * 2017-07-27 2017-12-29 山东科技大学 A kind of method that Chinese procedural model is automatically converted to English natural language text
CN107729326B (en) * 2017-09-25 2020-12-25 沈阳航空航天大学 Multi-BiRNN coding-based neural machine translation method
CN107729326A (en) * 2017-09-25 2018-02-23 沈阳航空航天大学 Neural machine translation method based on Multi BiRNN codings
CN108829657B (en) * 2018-04-17 2022-05-03 广州视源电子科技股份有限公司 Smoothing method and system
CN108829657A (en) * 2018-04-17 2018-11-16 广州视源电子科技股份有限公司 Smoothing method and system
CN108763222B (en) * 2018-05-17 2020-08-04 腾讯科技(深圳)有限公司 Translation missing detection and translation method and device, server and storage medium
CN108763222A (en) * 2018-05-17 2018-11-06 腾讯科技(深圳)有限公司 Detection, interpretation method and device, server and storage medium are translated in a kind of leakage
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model
CN110895660A (en) * 2018-08-23 2020-03-20 澳门大学 A sentence processing method and device based on dynamic coding of syntactic dependencies
CN110895660B (en) * 2018-08-23 2024-05-17 澳门大学 Sentence processing method and device based on syntactic dependency dynamic coding
CN109448458A (en) * 2018-11-29 2019-03-08 郑昕匀 English oral training device, data processing method and storage medium
CN109978829A (en) * 2019-02-26 2019-07-05 深圳市华汉伟业科技有限公司 A kind of detection method and its system of object to be detected
CN110413963A (en) * 2019-07-03 2019-11-05 东华大学 A structured method for breast ultrasound examination report based on domain ontology
CN110413963B (en) * 2019-07-03 2022-11-25 东华大学 Breast ultrasonic examination report structuring method based on domain ontology
CN111104796A (en) * 2019-12-18 2020-05-05 北京百度网讯科技有限公司 Method and device for translation
CN111104796B (en) * 2019-12-18 2023-05-05 北京百度网讯科技有限公司 Method and device for translation
CN112766004A (en) * 2021-01-22 2021-05-07 西安文理学院 Artificial intelligence self-adaptive interactive foreign language teaching translation system
CN112800754A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Unsupervised grammar derivation method, unsupervised grammar derivation device and medium based on pre-training language model
RU2766821C1 (en) * 2021-02-10 2022-03-16 Общество с ограниченной ответственностью " МЕНТАЛОГИЧЕСКИЕ ТЕХНОЛОГИИ" Method for automated extraction of semantic components from compound sentences of natural language texts in machine translation systems and device for implementation thereof
CN113283250A (en) * 2021-05-26 2021-08-20 南京大学 Automatic machine translation test method based on syntactic component analysis
CN114330376A (en) * 2021-11-15 2022-04-12 甲骨易(北京)语言科技股份有限公司 A computer-aided translation system and method
CN114254630A (en) * 2021-11-29 2022-03-29 北京捷通华声科技股份有限公司 A translation method, apparatus, electronic device and readable storage medium
CN119849478A (en) * 2025-01-08 2025-04-18 中国科学技术信息研究所 Theme determining method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN103116578A (en) Translation method integrating syntactic tree and statistical machine translation technology and translation device
Cussens Part-of-speech tagging using Progol
CN109062892A (en) A kind of Chinese sentence similarity calculating method based on Word2Vec
CN108681529B (en) Multi-language text and voice generation method of flow model diagram
CN112183059A (en) Chinese structured event extraction method
CN110502744A (en) A Text Emotion Recognition Method and Device for Evaluation of Historical Parks
Anastasiou Idiom treatment experiments in machine translation
Dunđer Machine translation system for the industry domain and Croatian language
CN118246426A (en) Writing method, system, device and medium based on generative text big model
CN108519963B (en) Method for automatically converting process model into multi-language text
CN113343717A (en) Neural machine translation method based on translation memory library
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
CN112447172B (en) Quality improvement method and device for voice recognition text
Comas et al. Sibyl, a factoid question-answering system for spoken documents
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
Seresangtakul et al. Thai-Isarn dialect parallel corpus construction for machine translation
CN107862045A (en) A kind of across language plagiarism detection method based on multiple features
CN111259159B (en) Data mining method, device and computer readable storage medium
Romero et al. Category-based language models for handwriting recognition of marriage license books
CN118626065A (en) Web front-end style code generation method based on DOM
CN106776590A (en) A kind of method and system for obtaining entry translation
Sankaravelayuthan et al. A Comprehensive Study of Shallow Parsing and Machine Translation in Malaylam
Ducoffe et al. Machine Learning under the light of Phraseology expertise: use case of presidential speeches, De Gaulle-Hollande (1958-2016)
CN114021553A (en) A Chinese sentiment polarity detection method based on grammar dependency graph and dictionary expansion
KR100574887B1 (en) Vocabulary neutralization device and machine method in machine translation system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130522
