CN103116578A - Translation method integrating syntactic tree and statistical machine translation technology and translation device - Google Patents
- Publication number
- CN103116578A
- Authority
- CN
- China
- Prior art keywords
- translation
- phrase
- target language
- module
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a translation method and a translation device that integrate a syntax tree with statistical machine translation technology. The method comprises the following steps. First, a dictionary library, a grammar rule library, a phrase translation probability table and a target language model are established between different languages. Next, segmentation, part-of-speech disambiguation and grammatical analysis are performed on the input source sentence to generate a syntax tree. The syntax tree is then traversed top-down; for each single node, and for certain consecutive nodes that cross syntactic boundaries, the source text of the leaf nodes is matched against the phrase translation probability table trained by statistical machine translation. Using the translations in the phrase translation table together with the language model of the target language improves the fluency and accuracy of the output translation. The method and device exploit both the fine-grained knowledge provided by the phrase translation table and the advantages of the syntax tree in handling deep-structure and long-distance dependencies within a sentence, and can markedly improve the quality of machine-translated text.
Description
Technical Field
The invention relates to the field of statistical and rule-based machine translation, and in particular to a machine translation method and device that fuse a syntax tree with statistical machine translation technologies such as a phrase translation probability table and a language model.
Background
With the spread of the Internet, computer processing of natural language has become an important means of acquiring knowledge from the Internet. For example, in international communication, scientific research and education, people need to translate foreign-language text, a task that in the past was the preserve of skilled linguists. With the rapid development of hardware, the continuous improvement of software and the deepening of linguistic research, machine translation is being applied more and more widely. Machine translation has considerable advantages, such as high translation speed, large memory capacity and reduced translation cost, but its translation quality is still far from meeting users' needs. How to develop a high-quality machine translation method has therefore become an important problem.
International evaluations in 2011 showed that the translation quality of data-driven and knowledge-driven machine translation is comparable, and that a single method alone can hardly meet users' requirements. Analysis of the translation errors of statistical and rule-based machine translation shows that the types of errors produced by the different kinds of system are complementary. The weaknesses of a rule-based system are lexical selection during transfer and poor performance when analyzing ill-formed sentences; its strength is that no part of the source text is omitted during analysis, so accurate translation can be achieved. In contrast, statistical machine translation systems are highly adaptable, and their use of phrase collocations makes the translation more fluent and the word choice better. However, the biggest problem of statistical machine translation systems is that they handle poorly those cases where generating the translation requires linguistic knowledge: for example, they lack lexical and syntactic functions and word-order adjustment functions, and word-order adjustment at the level of the whole sentence is even harder to achieve. In addition, statistical machine translation systems cannot guarantee a faithful translation, and omissions and mistranslations sometimes occur.
Disclosure of Invention
Because machine translation based on a single method cannot achieve a good translation result, and data-driven and knowledge-driven machine translation have largely complementary strengths, combining different methods is a reasonable way to improve machine translation quality. The machine translation method provided by the invention uses both the fine-grained knowledge provided by the statistical translation engine and the advantages of the syntax tree in handling deep-structure and long-distance dependencies within a sentence, so that the quality of machine translation can be markedly improved and the development of hybrid-engine machine translation technology can be strongly promoted.
The invention provides a machine translation method fusing a syntax tree and a statistical machine translation technology, which comprises the following steps:
1) establishing a dictionary library, a grammar rule library, a phrase translation probability table and a target language model between different languages, wherein the dictionary library stores corresponding words and phrases of the different languages, the grammar rule library stores corresponding grammar rules of the different languages, the phrase translation probability table stores translation fragments of the different languages obtained by training a statistical machine translation system, and the target language model stores a language model of the target language obtained by training the statistical machine translation system;
2) reading the dictionary library information, segmenting the input single sentence to be translated, and decomposing the sentence into words or phrases of the source language;
3) reading the grammar rule library information, and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree;
4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for single nodes and for certain cross-syntax consecutive nodes in the syntax tree, taking the source text of their leaf nodes, searching the phrase translation probability table with it, and selecting a translation from the phrase translation table as the translation of the node; for nodes of the syntax tree left untranslated in this process, generating a translation by a rule-based translation method;
5) smoothing the generated translation with the target language model to generate the target language text.
Preferably, the translation fragments of different languages stored in the phrase translation probability table are obtained by GIZA++ training.
Preferably, the language model of the target language is trained on the parallel corpus with the language model training tool SRILM or with an N-gram training method.
The invention also provides a device using the above machine translation method, comprising:
a dictionary library module for storing corresponding words and phrases of different languages;
a grammar rule library module for storing corresponding grammar rules of different languages;
a phrase translation probability table module for storing translation fragments of different languages obtained by training a statistical machine translation system;
a target language model module for storing a language model of the target language obtained by training the statistical machine translation system;
a syntax analyzer, connected to the dictionary library module and the grammar rule library module, for performing sentence splitting, segmentation, part-of-speech disambiguation and syntactic analysis on the source text in sequence according to the dictionary library and the grammar rule library, so as to generate a syntax tree;
and a decoder, connected to the phrase translation probability table module, the target language model module and the syntax analyzer, for traversing the syntax tree according to the phrase translation probability table and the target language model, converting the source text into a translation and generating the target language text.
Further, the syntax analyzer includes:
a sentence splitting module for reading the source text and splitting it into sentences;
a segmentation and preprocessing module, connected to the sentence splitting module, for segmenting and preprocessing each single sentence;
a disambiguation module, connected to the segmentation and preprocessing module, for performing part-of-speech disambiguation on the segmented sentence;
a grammar analysis module, connected to the disambiguation module, for performing grammatical analysis on the disambiguated sentence;
and a master control module, connected to each of the above modules, for controlling their operation.
The invention provides a machine translation method and device fusing a syntax tree, a phrase translation probability table and a language model, which scan the syntax tree node by node and across nodes while searching the phrase translation probability table and the language model of statistical machine translation.
Drawings
FIG. 1 is a schematic diagram of the structure of an English-Chinese machine translation device according to an embodiment;
FIG. 2 is a flow chart of an English-Chinese machine translation method according to an embodiment;
FIG. 3 is a block diagram of a syntactic parser in FIG. 1;
FIG. 4 is a diagram illustrating the translation probability table and the training of a language model in an embodiment;
FIG. 5 is a diagram of the syntax tree obtained in the embodiment.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and accompanying drawings.
FIG. 1 is a schematic diagram of the structure of a machine translation device 100 according to the present embodiment, which fuses a syntax tree with statistical machine translation technology, and FIG. 2 is a flowchart of machine translation performed with the device.
Referring to FIG. 1, the device 100 includes: a dictionary library module 110 for storing corresponding words and phrases of different languages; a grammar rule library module 120 for storing corresponding grammar rules of different languages; a phrase translation probability table module 130 for storing translation fragments of different languages obtained by training a statistical machine translation system; a target language model module 140 for storing a target language model obtained by training the statistical machine translation system; a syntax analyzer 150, connected to the dictionary library module and the grammar rule library module, for performing sentence splitting, segmentation, part-of-speech disambiguation and grammatical analysis on the source text in sequence according to the dictionary library and the grammar rule library to generate a syntax tree; and a decoder 160, connected to the phrase translation probability table module, the target language model module and the syntax analyzer, for traversing the syntax tree according to the phrase translation probability table and the target language model, converting the source text into a translation and generating the target language text. The phrase translation probability table and the target language model are obtained by training on a parallel corpus, as shown in FIG. 2.
Referring to FIG. 1 and FIG. 2, the translation process is described taking English as the source language and Chinese as the target language. It mainly includes the following steps:
1) performing morphological analysis on the English side of an English-Chinese parallel corpus, and performing word segmentation on the Chinese side;
2) performing word alignment and phrase alignment on the parallel corpus with the statistical tool GIZA++, and extracting an English-Chinese phrase translation probability table;
3) filtering the extracted English-Chinese phrase translation probability table to remove inaccurate statistical entries;
4) training a language model of the target language on the parallel corpus with the language model training tool SRILM;
5) reading the dictionary library information and segmenting the input single sentence to be translated; reading the grammar rule library information and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree; the disambiguation and parsing step also identifies and records noun or verb phrases that are not, or cannot all be, collected in the dictionary library;
6) traversing the syntax tree with a top-down strategy, searching the phrase translation probability table for an entry matching the subtree rooted at the current node, and generating a translation;
7) when traversing the syntax tree, in addition to searching the statistical phrase table for the root node of a subtree, suitably admitting some cross-syntax cases, so that the phrase translation probability table can be searched and used without damaging the syntax tree, and the statistical phrase table is exploited as fully as possible to improve the quality of the translation;
cross-syntax consecutive nodes in the syntax tree must satisfy a certain structure before the source text of their leaf nodes may be used to search the phrase translation probability table; for example, in "V N to V", "V N to" may be searched, but "N to V" may not; for the implementation of the cross-syntax case, refer to steps 3), 4) of the translation example below;
8) for nodes of the syntax tree left untranslated in the above process, generating the target language by combining the dictionary, the rules and the language model, that is, generating a translation by a rule-based translation method and smoothing the generated translation with the target language model.
"Untranslated syntax tree nodes" exist because some fragments cannot be found in the phrase translation probability table; about 29% of the fragments remain untranslated and are therefore translated by the rule-based translation method. It should be noted that the smoothing in step 8) focuses on the translation produced by rule-based translation, but in other embodiments all translations generated so far (including translations obtained from the phrase translation probability table) may be smoothed; the present invention is not limited in this respect.
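The top-down traversal of steps 6) and 8) can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder of the embodiment: the `Node` class, the `decode` function and the `rule_translate` callback are hypothetical names, the phrase table is reduced to a plain dictionary from source string to translation, and word reordering and language model smoothing are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A syntax-tree node: leaves carry a source word, inner nodes carry children."""
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the source words under this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


def decode(node: Node, phrase_table: Dict[str, str],
           rule_translate: Callable[[Node], str]) -> str:
    """Translate top-down: try the whole subtree first, then descend into children."""
    source = " ".join(node.leaves())
    if source in phrase_table:
        return phrase_table[source]          # the subtree matched a phrase table entry
    if not node.children:
        return rule_translate(node)          # untranslated leaf: rule-based fallback
    return " ".join(decode(c, phrase_table, rule_translate) for c in node.children)


# Illustrative data only; the bracketed strings stand in for target-language text.
tree = Node("V", children=[
    Node("V", children=[Node("V", "postpone"), Node("N", "deleting")]),
    Node("Conj", "until"),
])
table = {"postpone deleting": "<postpone-deleting>"}
print(decode(tree, table, lambda n: f"[{n.word}]"))  # prints: <postpone-deleting> [until]
```

In this sketch the root string is not in the table, so the traversal descends; the first child matches as a whole, and the remaining leaf falls back to the rule callback.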
As shown in FIG. 3, one embodiment of the syntax analyzer comprises: a master control module 151 for managing and controlling the operation of the modules of the syntax analyzer; a sentence splitting module 152 for splitting the English text to be translated into single-sentence strings; a segmentation and preprocessing module 153 for segmenting an English sentence into a sequence of strings with phrases as units, the preprocessing including punctuation processing, format processing and the like, which are common techniques in rule-based translation systems; a disambiguation module 154 for part-of-speech tagging of the segmented English sentence by resolving part-of-speech ambiguity; and a grammar analysis module 155 for performing relatively simple syntactic analysis so that the segmented English sentence forms a syntax tree.
The entries stored in the dictionary library are labeled according to the requirements of the translation system and annotated with the relevant semantic attributes, as follows:
afromosia \ N \ African red bean wood
&CAT[N]M_SEM[B]S_SEM[D]CLAS[plants]$
afront \ F \ preceding
&CAT[F]M_SEM[J]$
……
mountain bike
&CAT[N]M_SEM[C]S_SEM[I]$
mountain coast \ N \ steep coast
&CAT[N]M_SEM[C]S_SEM[B]CLAS[d]$
mountain chrome \ N \ asbestos
&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$
The requirements of the translation system refer to the dictionary specification, which is defined by the developer of the rule-based translation system and generally comprises the part-of-speech, grammatical and semantic information with which an entry is labeled; this is a common technique in rule-based translation systems.
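An attribute line such as `&CAT[N]M_SEM[C]S_SEM[I]$` in the entries above can be read into a mapping from attribute name to value. The sketch below is only an illustration: the function name `parse_attributes` is hypothetical, and it assumes that every attribute has the form `NAME[value]`, that names consist of upper-case letters and underscores, and that the line is delimited by `&` and `$` as in the examples.

```python
import re
from typing import Dict

# NAME[value], where NAME is upper-case letters/underscores (assumed from the examples).
ATTR_PATTERN = re.compile(r"([A-Z_]+)\[([^\]]*)\]")


def parse_attributes(line: str) -> Dict[str, str]:
    """Parse an attribute line like '&CAT[N]M_SEM[C]S_SEM[I]$' into a dict."""
    body = line.strip()
    if not (body.startswith("&") and body.endswith("$")):
        raise ValueError(f"not an attribute line: {line!r}")
    # Drop the '&' and '$' delimiters, then collect every NAME[value] pair.
    return {name: value for name, value in ATTR_PATTERN.findall(body[1:-1])}


print(parse_attributes("&CAT[N]M_SEM[C]S_SEM[G]NUM[U]$"))
# prints: {'CAT': 'N', 'M_SEM': 'C', 'S_SEM': 'G', 'NUM': 'U'}
```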
The grammar rules stored in the grammar rule library specify translation rules for words or phrases according to the requirements of the translation system, for example:
with links to:
[24] (1)CAT[N] --> is linked to %1
reach:
[12] (1)CHI[level|value] --> MEANQ[0, reached];
[13] (0)CAT[V] + (1)CHI[confinement] --> MEAN[0, obtained];
[14] (0)CAT[V] + (1)CHI[gold] --> MEAN[0, up to];
[15] (0)CAT[V] && IS_CENTER[1] + (1)CAT[N] && L_CHI[acquisition] --> MEAN[0, achievement].
As shown in FIG. 4, the statistical training process that yields the phrase translation probability table and the language model includes training on the parallel corpus with the statistical machine translation training tool GIZA++ to obtain the phrase translation probability table, and training on the parallel corpus with the language model training tool SRILM to obtain the target language model. Besides SRILM, other language model training methods such as N-gram training can also be used.
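A phrase translation probability table of this kind is typically stored one entry per line. The sketch below reads a single line, assuming four `|||`-separated fields in the order source phrase, target phrase, word alignment, scores, which is the layout suggested by the entries shown later in this description; the names `PhraseEntry` and `parse_phrase_line` are hypothetical, and the sample line is illustrative data rather than an entry from the trained table.

```python
from typing import List, NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    source: str                        # source language phrase
    target: str                        # target language phrase
    alignment: List[Tuple[int, ...]]   # (source index, target index) pairs
    scores: List[float]                # phrase translation scores


def parse_phrase_line(line: str) -> PhraseEntry:
    """Parse 'src ||| tgt ||| 0-0 1-1 ||| s1 s2 ...' into a PhraseEntry."""
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    source, target, align_text, score_text = fields
    alignment = [tuple(int(i) for i in pair.split("-")) for pair in align_text.split()]
    scores = [float(s) for s in score_text.split()]
    return PhraseEntry(source, target, alignment, scores)


entry = parse_phrase_line("a b ||| x y ||| 0-0 1-1 ||| 0.5 0.25")
print(entry.source, entry.alignment, entry.scores)
```

Other toolkits may order the fields differently, so the field order should be checked against the table actually produced before reusing this layout.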
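The extraction of the phrase translation probability table in step 2) of the above embodiment is a key point of the present invention and is described further here. Each entry of the phrase translation probability table has four parts: a source language phrase $\bar{f} = f_1 \ldots f_J$ containing $J$ words, a target language phrase $\bar{e} = e_1 \ldots e_I$ containing $I$ words, the word alignment $\alpha$ between the words of the source and target language phrases, and a phrase translation score $p$. An entry may be expressed as $(\bar{f}, \bar{e}, \alpha, p)$. The phrase translation score is then calculated; it comprises four components: the phrase translation probabilities $p(\bar{e} \mid \bar{f})$ and $p(\bar{f} \mid \bar{e})$, and the lexical translation probabilities $lex(\bar{e} \mid \bar{f}, \alpha)$ and $lex(\bar{f} \mid \bar{e}, \alpha)$.
The phrase translation probabilities are calculated as follows:
$$p(\bar{e} \mid \bar{f}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{e}'} \mathrm{count}(\bar{f}, \bar{e}')}, \qquad p(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}'} \mathrm{count}(\bar{f}', \bar{e})}$$
In the above formulas, $\mathrm{count}(\bar{f}, \bar{e})$ is the number of occurrences of the phrase pair $(\bar{f}, \bar{e})$ in the corpus; $\bar{e}'$ ranges over all possible target language phrases corresponding to $\bar{f}$, and $\mathrm{count}(\bar{f}, \bar{e}')$ is the number of occurrences of the phrase pair $(\bar{f}, \bar{e}')$ in the corpus; $\bar{f}'$ ranges over all possible source language phrases corresponding to $\bar{e}$, and $\mathrm{count}(\bar{f}', \bar{e})$ is the number of occurrences of the phrase pair $(\bar{f}', \bar{e})$ in the corpus.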
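As an illustration of the phrase translation probability, the sketch below estimates both directions by relative frequency from a list of extracted phrase pairs, assuming that each probability is the count of the pair divided by the total count of pairs sharing the conditioning phrase. The function name `phrase_probabilities` and the toy phrase pairs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source phrase, target phrase)


def phrase_probabilities(pairs: Iterable[Pair]) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Return (p(target | source), p(source | target)) by relative frequency."""
    pair_count = Counter(pairs)
    source_total: Counter = Counter()
    target_total: Counter = Counter()
    for (src, tgt), c in pair_count.items():
        source_total[src] += c   # sum over all targets seen with this source
        target_total[tgt] += c   # sum over all sources seen with this target
    p_tgt_given_src = {(s, t): c / source_total[s] for (s, t), c in pair_count.items()}
    p_src_given_tgt = {(s, t): c / target_total[t] for (s, t), c in pair_count.items()}
    return p_tgt_given_src, p_src_given_tgt


# Toy extraction result: "a" was paired with "x" twice and "y" once; "b" with "x" once.
forward, backward = phrase_probabilities([("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")])
print(forward[("a", "x")], backward[("a", "y")])
```

With these toy counts, the pair ("a", "x") receives probability 2/3 in both directions, and ("a", "y") receives 1/3 forward and 1 backward.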
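For a phrase pair $(\bar{f}, \bar{e})$ with word alignment $\alpha$, the lexical translation probabilities are calculated as follows:
$$lex(\bar{e} \mid \bar{f}, \alpha) = \prod_{i=1}^{I} \frac{1}{|\{j : (i,j) \in \alpha\}|} \sum_{(i,j) \in \alpha} p(e_i \mid f_j), \qquad lex(\bar{f} \mid \bar{e}, \alpha) = \prod_{j=1}^{J} \frac{1}{|\{i : (i,j) \in \alpha\}|} \sum_{(i,j) \in \alpha} p(f_j \mid e_i)$$
In the above formulas, $p(e_i \mid f_j)$ is the probability that the source language word $f_j$ ($j = 1 \ldots J$) is translated into the target language word $e_i$ ($i = 1 \ldots I$), and $p(f_j \mid e_i)$ is the probability that the target language word $e_i$ ($i = 1 \ldots I$) is translated into the source language word $f_j$ ($j = 1 \ldots J$). $\alpha$ represents the alignment of the source and target language word pairs.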
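The lexical translation probability in the target-given-source direction can be sketched as below. This assumes the common lexical weighting scheme in which, for each target word, the word translation probabilities of its aligned source words are averaged and the averages are multiplied together; the function name `lexical_weight`, the `(source index, target index)` alignment convention, the `None` key for unaligned words and the toy probabilities are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def lexical_weight(src_words: Sequence[str],
                   tgt_words: Sequence[str],
                   alignment: Sequence[Tuple[int, int]],
                   word_prob: Dict[Tuple[str, Optional[str]], float]) -> float:
    """Product over target words of the mean p(target word | aligned source word)."""
    weight = 1.0
    for i, tgt in enumerate(tgt_words):
        linked = [j for (j, i2) in alignment if i2 == i]   # source positions aligned to i
        if linked:
            total = sum(word_prob.get((tgt, src_words[j]), 0.0) for j in linked)
            weight *= total / len(linked)
        else:
            # Unaligned target word: look it up against an empty source word (assumption).
            weight *= word_prob.get((tgt, None), 0.0)
    return weight


# Toy word translation table: keys are (target word, source word).
probs = {("x", "a"): 0.5, ("x", "b"): 0.25, ("y", "b"): 0.5}
print(lexical_weight(["a", "b"], ["x", "y"], [(0, 0), (1, 0), (1, 1)], probs))  # prints: 0.1875
```

Here "x" is aligned to both source words, giving (0.5 + 0.25) / 2 = 0.375, and "y" is aligned to "b" alone, giving 0.5; the product is 0.1875. The reverse direction follows by swapping the roles of source and target.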
In the above embodiment, step 8) generates the target language by combining the dictionary, the rules and the language model; that is, the target language model is used to smooth the translation generated from the dictionary and the rules and/or the translation obtained from the phrase translation probability table, so as to improve the fluency of the translation. The smoothness of a target language translation relative to the target language model is computed as follows:
1) The target language statistical model is expressed by the conditional probability of each word given the preceding words, where $w_i^j$ denotes the word sequence $w_i \ldots w_j$:
$$P(w_1^T) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1})$$
2) Since the conditional probability can be approximated using only the preceding $n-1$ words, an N-gram model can be used to calculate the conditional probability of each word given the preceding words:
$$P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-n+1}^{t-1})$$
3) Let $w_1 \ldots w_T$ be a training set of the target language with $w_t \in V$, where $V$ is a finite set. The goal is to design a good model:
$$f(w_t, \ldots, w_{t-n+1}) = P(w_t \mid w_1^{t-1})$$
in the sense that the above equation gives the maximum sample likelihood. The geometric mean of $1 / P(w_t \mid w_1^{t-1})$ is:
$$\left( \prod_{t=1}^{T} \frac{1}{P(w_t \mid w_1^{t-1})} \right)^{1/T}$$
4) In the above formula, for any $w_1^{t-1}$ we have $\sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n+1}) = 1$. Thus the smoothness of the target language translation relative to the target language model can be calculated:
$$L = \frac{1}{T} \sum_{t=1}^{T} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1})$$
where $T$ is the number of words in the training set of the target language.
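The smoothness computation can be illustrated numerically. The sketch below assumes that the measure is the average log conditional probability over the $T$ words, together with the equivalent geometric mean of the inverse probabilities (commonly called perplexity); the function names are hypothetical and the input is simply a list of conditional probabilities already produced by some language model.

```python
import math
from typing import Sequence


def average_log_likelihood(cond_probs: Sequence[float]) -> float:
    """Mean of log P(w_t | history) over the T words of a sequence."""
    if not cond_probs:
        raise ValueError("need at least one conditional probability")
    return sum(math.log(p) for p in cond_probs) / len(cond_probs)


def perplexity(cond_probs: Sequence[float]) -> float:
    """Geometric mean of 1 / P(w_t | history); lower means a more fluent sequence."""
    return math.exp(-average_log_likelihood(cond_probs))


# Illustrative conditional probabilities for a four-word sequence.
probs = [0.5, 0.25, 0.5, 0.25]
print(average_log_likelihood(probs), perplexity(probs))
```

For these four values the product of the probabilities is 1/64, so the perplexity is 64 to the power 1/4, that is 2√2 ≈ 2.83. Working with the log form avoids numerical underflow on long sentences.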
A specific example is provided below, in which the sentence to be translated is:
Select this option to postpone deleting these records until pruning is performed.
First, the input sentence is segmented by reading the dictionary library information; then the grammar rule library information is read, and part-of-speech disambiguation and grammatical analysis are performed on the segmented sentence to form a syntax tree, as shown in FIG. 5.
The syntax tree is then decoded by traversing it with a top-down strategy, that is, starting from the topmost, leftmost node [V] and proceeding toward the leaf nodes on the right. The detailed traversal steps are as follows:
1) reading the leaf node character string of [ V ]: select this choice from the poststroke deletion the phrases outstanding before the used string to search the phrase translation probability table, the result is not to search the matching translation segment.
2) Reading the structure attribute of [ V ] and finding that the structure attribute is a 'V Conj S V' structure, and dividing the structure into two parts for translation, namely, dividing the structure into 'V | | Conj S V'.
3) Reading a leaf node character string of a first [ V ] of 'V | | Conj S V': select this option to postpostpostlude the phrase records, and then use the string to search the phrase translation probability table, resulting in no search for a matching translation segment.
4) Reading the structure attribute of [ V ], finding that the structure is a 'V N to V' structure, and for the structure, there are two divisions, namely, division into 'V N to | | | V' or 'V N | | | to | | | V'.
5) According to the maximum matching principle, that is, if the number of the segmented blocks is the minimum, the translation result is more accurate, so that the first segmentation method "V N to V" should be tried first, and thus, the leaf node original text "Select this option to" of "V N to" is read to search the phrase translation probability table, and the result search is successful:
the option can be selected from the option of < 0-01-12-23-3 < 8 > | |10.0003327991397108 e-007;
at this time, the present example will use "Select this option can" as the translation of "Select this option to".
6) Reading a leaf node character string of the second [ V x ] of the 'V N to | V': posthole deletion the phrases are then used to search the phrase translation probability table, resulting in a successful search:
posthole deletion the records | l | delete the records 0-01-12-23-3| l | 10.00056812810.125;
this example will use "late delete these records" as a translation of "postbone deletion the records".
7) Thus, the first "V" of the entire sentence "V | | Conj S V" can be translated into:
selection of the option to post deletion of the records → deletion of the records can be postponed
8) Next, for "Conj S V", its leaf node string is read: the string is then used to search the phrase translation probability table, resulting in no matching translation segments being searched.
9) For the structure of "Conj S V", it can be cut into "Conj | | | S V".
10) "Conj" corresponds to the common word "unity" without searching the phrase translation probability table.
11) Reading the leaf node character string of the second part "S V" of "Conj | | S V": pruning is performed, and then the string is used to search the phrase translation probability table, resulting in a successful search:
pruning is completed by pruning is carried out in the pruning of 0-11-02-0 15.14058e-0063.37201e-006
This example will use "finish trim" as the translation of "pruning is carried".
12) Thus, the entire sentence "V | | Conj S V" can already be translated into:
selecting this option to post deletion the records without pruning the records will postpone deleting the records, with pruning done.
13) For the structure "V ||| Conj S V", there are two basic translation patterns:
V ||| Conj S V → V, Conj S V
V ||| Conj S V → Conj S V, V
Specifically, the "Conj" in this sentence is "until", so the patterns are:
V ||| until S V → V, until S V
V ||| until S V → before S V, V
Thus, there are two translation results for "Select this option to postpone deleting these records until pruning is performed":
(1) "Selecting this option can postpone deleting these records, until pruning is complete."
(2) "Before pruning is complete, selecting this option can postpone deleting these records."
14) Which translation result is adopted is decided by calculating the probabilities of the two sentences under the target language model. With the N-gram language model, let the word sequence of a translation result be $w_1, w_2, \ldots, w_m$; the probability of occurrence of the translation result is then:
$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
According to the Markov assumption, the above conditional probability can be calculated from the frequency counts of the N-grams:
$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-n+1}, \ldots, w_{i-1})}$$
where $n$ is the order of the n-gram language model, which may be set to 3. In the actual calculation, the probability of occurrence of the first translation is 5.38125e-005 and that of the second translation is 4.20337e-006. The probability of the first translation is higher, so it is selected as the final translation:
Select this option to postpone deleting these records until pruning is performed. → "Selecting this option can postpone deleting these records, until pruning is complete."
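The selection in step 14) can be sketched as follows, assuming unsmoothed maximum-likelihood n-gram estimates (an n-gram count divided by the count of its history) and sentence-boundary padding. A tool such as SRILM would additionally apply smoothing and back-off, which this sketch leaves out, so unseen n-grams here receive probability zero. All function names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(corpus: Sequence[Sequence[str]], n: int) -> Tuple[Counter, Counter]:
    """Count n-grams and their (n-1)-word histories over boundary-padded sentences."""
    grams: Counter = Counter()
    histories: Counter = Counter()
    for sentence in corpus:
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for k in range(n - 1, len(padded)):
            history = tuple(padded[k - n + 1:k])
            grams[history + (padded[k],)] += 1
            histories[history] += 1
    return grams, histories


def sentence_probability(words: Sequence[str], grams: Counter,
                         histories: Counter, n: int) -> float:
    """Product of count(history + word) / count(history) over the padded sentence."""
    padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
    prob = 1.0
    for k in range(n - 1, len(padded)):
        history = tuple(padded[k - n + 1:k])
        if histories[history] == 0:
            return 0.0                      # unseen history: no estimate without smoothing
        prob *= grams[history + (padded[k],)] / histories[history]
    return prob


def pick_best(candidates: Sequence[List[str]], grams: Counter,
              histories: Counter, n: int) -> List[str]:
    """Return the candidate translation with the highest language model probability."""
    return max(candidates, key=lambda c: sentence_probability(c, grams, histories, n))


# Toy target-language corpus and two candidate word sequences (bigram model, n = 2).
grams, histories = ngram_counts([["a", "b"], ["a", "c"], ["a", "b"]], n=2)
print(pick_best([["a", "c"], ["a", "b"]], grams, histories, n=2))  # prints: ['a', 'b']
```

In the toy corpus "a b" occurs twice and "a c" once, so the candidate "a b" scores 2/3 against 1/3 and is selected.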
In the above process, "Select this option to" is not a complete node in the syntax tree but a cross-syntax span, namely part of the structure "V N to V". Such cross-syntax consecutive nodes should be allowed to search the statistical phrase table as long as they satisfy a certain pattern, so that the statistical phrase table is used as fully as possible to obtain a better translation result, provided that the overall structure of the sentence is not destroyed.
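The maximum matching principle applied to such cross-syntax divisions can be sketched as below. The sketch assumes that the allowed divisions of a structure are supplied as lists of cut positions over its child constituents, and that a division is accepted only if every resulting block is found in the phrase table; in the embodiment a block that is not found would instead be analyzed further. The name `first_matching_split` and the placeholder translations `T1` and `T2` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def first_matching_split(constituents: Sequence[str],
                         splits: Sequence[Sequence[int]],
                         phrase_table: Dict[str, str]) -> Optional[List[str]]:
    """Try the allowed divisions from fewest to most blocks and return the block
    translations of the first division whose blocks are all in the phrase table."""
    for cut_points in sorted(splits, key=len):        # fewer cuts means fewer blocks
        bounds = [0, *cut_points, len(constituents)]
        blocks = [" ".join(constituents[a:b]) for a, b in zip(bounds, bounds[1:])]
        if all(block in phrase_table for block in blocks):
            return [phrase_table[block] for block in blocks]
    return None


# Children of the "V N to V" structure from the example sentence.
children = ["Select", "this option", "to", "postpone deleting these records"]
allowed = [[2, 3], [3]]   # "V N ||| to ||| V" and "V N to ||| V"
table = {"Select this option to": "T1", "postpone deleting these records": "T2"}
print(first_matching_split(children, allowed, table))  # prints: ['T1', 'T2']
```

Because only the listed cut positions are tried, a span such as "N to V" is never looked up, which mirrors the restriction that cross-syntax nodes must fit an allowed pattern.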
The applicant tested the machine translation method and device fusing the syntax tree, the phrase translation probability table and the language model on a practical English-Chinese system customized for the IT security field. An English-Chinese parallel corpus of 610,000 sentence pairs from the IT security field was selected as the training corpus, a phrase translation probability table containing 3.49 million translation fragments was trained with the statistical alignment tool GIZA++, and a language model was trained on the 610,000 Chinese sentences. 2,468 English test sentences were used to test the translation method of the present invention; the resulting BLEU values, TER values and readability are shown in Table 1.
As can be seen from Table 1, with the method of the present invention, fragments from the phrase translation probability table (for example, the Chinese translation of "Select this option to") appear 7,684 times in the translations of the 2,468 test sentences, an average of 3.11 times per sentence. This frequency is very high: such fragments account for 71.7% of all Chinese characters in the output translations.
That is to say, 71.7% of the output Chinese translation comes from the statistical phrase translation probability table, which shows that the method of the present invention makes full use of the fine-grained knowledge of statistical machine translation. The improvement in translation quality is clear, and the BLEU value, TER value and readability of the translation results are all improved to different degrees.
TABLE 1 test data for the method of the invention
The principles and features of the present invention have been described above with reference to specific embodiments thereof. It is to be understood that the invention is not limited to the particular embodiments described above, as variations and modifications may be made and equivalents may be substituted for elements thereof. The scope of the invention is only limited by the appended claims.
Claims (10)
1. A machine translation method fusing a syntax tree and a statistical machine translation technology comprises the following steps:
1) establishing a dictionary library, a grammar rule library, a phrase translation probability table and a target language model between different languages, wherein the dictionary library stores corresponding words and phrases of the different languages, the grammar rule library stores corresponding grammar rules of the different languages, the phrase translation probability table stores translation fragments of the different languages obtained by training a statistical machine translation system, and the target language model stores a language model of the target language obtained by training the statistical machine translation system;
2) reading the dictionary library information, segmenting the input single sentence to be translated, and decomposing the sentence into words or phrases of the source language;
3) reading the grammar rule library information, and performing part-of-speech disambiguation and grammatical analysis on the segmented sentence to form a syntax tree;
4) reading the phrase translation probability table information and traversing the syntax tree with a top-down strategy; for single nodes and for certain cross-syntax consecutive nodes in the syntax tree, taking the source text of their leaf nodes, searching the phrase translation probability table with it, and selecting a translation from the phrase translation table as the translation of the node; for nodes of the syntax tree left untranslated in this process, generating a translation by a rule-based translation method;
5) smoothing the generated translation with the target language model to generate the target language text.
2. The method of claim 1, wherein: the translation fragments of different languages stored in the phrase translation probability table are obtained by GIZA++ training.
3. The method of claim 1, wherein: the target language model is obtained with the language model training tool SRILM or with an N-gram training method.
5. The method of claim 4, wherein the phrase translation score p comprises a phrase translation probability and a lexical translation probability; the calculation formula of the phrase translation probability is as follows:
wherein [symbol] represents the number of occurrences of the phrase pair [symbol] in the corpus; [symbol] represents all possible target language phrases corresponding to [symbol]; [symbol] represents the number of occurrences of the phrase pair [symbol] in the corpus; [symbol] represents all possible source language phrases corresponding to [symbol]; [symbol] represents the number of occurrences of the phrase pair [symbol] in the corpus; and [symbol] represents the number of occurrences of the phrase pair [symbol] in the corpus;
the calculation formula of the lexical translation probability is as follows:
wherein $p(e_i, f_j)$ represents the probability that the source language word $f_j$ ($j = 1, \dots, J$) is translated into the target language word $e_i$ ($i = 1, \dots, I$), and $p(f_j, e_i)$ represents the probability that the target language word $e_i$ ($i = 1, \dots, I$) is translated into the source language word $f_j$ ($j = 1, \dots, J$); $\alpha$ represents the alignment of the source and target language word pairs.
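A common way to turn phrase-pair counts into phrase translation probabilities is relative frequency in both translation directions. The sketch below assumes that estimator, which may differ in detail from the formula of this claim; the function name and the toy counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source phrase f, target phrase e)


def phrase_probabilities(pairs: Iterable[Pair]) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Relative-frequency estimates p(e|f) and p(f|e) from extracted phrase pairs."""
    pair_count = Counter(pairs)
    source_total: Counter = Counter()   # sum of count(f, e') over all e'
    target_total: Counter = Counter()   # sum of count(f', e) over all f'
    for (f, e), c in pair_count.items():
        source_total[f] += c
        target_total[e] += c
    p_e_given_f = {(f, e): c / source_total[f] for (f, e), c in pair_count.items()}
    p_f_given_e = {(f, e): c / target_total[e] for (f, e), c in pair_count.items()}
    return p_e_given_f, p_f_given_e


# Toy extraction result: one source phrase seen with two translations.
extracted = (
    [("红 苹果", "red apples")] * 3
    + [("红 苹果", "red apple")]
    + [("苹果", "red apple")]
)
p_e_given_f, p_f_given_e = phrase_probabilities(extracted)
print(p_e_given_f[("红 苹果", "red apples")])  # 3 of 4 occurrences of the source phrase
print(p_f_given_e[("红 苹果", "red apple")])   # 1 of 2 occurrences of the target phrase
```

Keeping both directions lets a decoder penalize a pair that is frequent given the source phrase but rare given the target phrase.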
6. The method of claim 1, wherein: the method for calculating the smoothness of the target language translation relative to the target language model comprises the following steps:
1) the target language statistical model is expressed by the conditional probability of the latter word relative to the former word:
2) calculating the conditional probability of the next word relative to the preceding words with an N-gram model:
3) let $w_1 \dots w_T$ be a training set of the target language, with $w_t \in V$, where $V$ is a finite set; the maximum sample likelihood is computed:
geometric mean thereof:
4) for arbitrary [symbol], [formula] holds, thereby obtaining the smoothness of the target language translation relative to the target language model:
wherein T is the number of words in the training set of the target language.
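The steps of this claim resemble the usual perplexity computation: the geometric mean of the inverse conditional probabilities of each word given its history. The Python sketch below illustrates this with a bigram model (N = 2). The add-one smoothing, the sentence boundary markers and the function names are assumptions made for the example, not details taken from the claim.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Model = Tuple[Counter, Counter, Set[str]]  # (context counts, bigram counts, vocabulary)


def tokens_of(sentence: str) -> List[str]:
    """Whitespace tokens with sentence boundary markers added."""
    return ["<s>"] + sentence.split() + ["</s>"]


def train_bigram(sentences: List[str]) -> Model:
    """Count bigrams and their left contexts over a target-language training set."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        toks = tokens_of(sentence)
        vocab.update(toks)
        contexts.update(toks[:-1])              # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, vocab


def cond_prob(prev: str, word: str, model: Model) -> float:
    """Add-one smoothed P(word | prev); sums to 1 over the vocabulary for any prev."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(sentence: str, model: Model) -> float:
    """Geometric mean of 1 / P(w_t | w_{t-1}) over the T predicted tokens."""
    toks = tokens_of(sentence)
    log_sum = sum(math.log(cond_prob(p, w, model)) for p, w in zip(toks, toks[1:]))
    return math.exp(-log_sum / (len(toks) - 1))


model = train_bigram(["i like red apples", "i like green apples"])
fluent = perplexity("i like red apples", model)
scrambled = perplexity("apples red like i", model)
print(fluent < scrambled)  # the fluent word order gets the lower perplexity
```

A lower value means the candidate translation is closer to what the target language model expects, so it can serve directly as the smoothness score.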
7. The method of claim 1, wherein: the entries stored in the dictionary library are labeled according to the requirements of the translation system and annotated with their relevant semantic attributes; and the grammar rules stored in the grammar rule library specify the translation rules for words or phrases according to the requirements of the translation system.
8. The method of claim 1, wherein: the translation probabilities of the different candidate translations are calculated according to the target language model, and the translation with the highest probability is taken as the final translation.
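Choosing the final translation in claim 8 amounts to an argmax over candidates under the target language model. The sketch below assumes a hypothetical table of bigram log-probabilities and a fixed penalty for unseen bigrams. Both are illustrative choices, not details from the claim.

```python
import math
from typing import Dict, List, Tuple


def sentence_log_prob(sentence: str,
                      bigram_log_prob: Dict[Tuple[str, str], float],
                      unseen: float = -10.0) -> float:
    """Sum of bigram log-probabilities; unseen bigrams get a fixed penalty."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigram_log_prob.get(pair, unseen) for pair in zip(toks, toks[1:]))


def pick_best(candidates: List[str],
              bigram_log_prob: Dict[Tuple[str, str], float]) -> str:
    """Return the candidate translation with the highest language model score."""
    return max(candidates, key=lambda c: sentence_log_prob(c, bigram_log_prob))


# Hypothetical language model fragment: every listed bigram has probability 0.5.
lm = {pair: math.log(0.5) for pair in [
    ("<s>", "i"), ("i", "like"), ("like", "red"), ("red", "apples"), ("apples", "</s>"),
]}
candidates = ["i like apples red", "i like red apples"]
print(pick_best(candidates, lm))  # prints: i like red apples
```

Because the raw score is a sum over tokens, longer candidates score lower, so candidates of very different lengths would need a per-token normalization before comparison.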
9. A machine translation apparatus that merges syntax trees and statistical machine translation techniques, comprising:
the dictionary library module is used for storing words and phrases corresponding to different languages;
the grammar rule base module is used for storing grammar rules corresponding to different languages;
the phrase translation probability table module is used for storing translation fragments of different languages obtained by training of a statistical machine translation system;
the target language model module is used for storing a language model of a target language obtained by training of the statistical machine translation system;
the syntax analyzer is connected with the dictionary library module and the grammar rule base module, and is used for performing sentence splitting, word segmentation, part-of-speech disambiguation and syntactic analysis on the source text in sequence, according to the dictionary library and the grammar rule base, so as to generate a syntax tree;
and the decoder is connected with the phrase translation probability table module, the target language model module and the syntax analyzer, and is used for traversing the syntax tree according to the phrase translation probability table and the target language model, converting the source text into the translation, and generating the target-language output.
10. The apparatus of claim 9, wherein the syntax analyzer comprises:
the sentence splitting module is used for reading the source text and splitting it into single sentences;
the segmentation and preprocessing module is connected with the sentence splitting module and is used for segmenting and preprocessing the split single sentences;
the disambiguation module is connected with the segmentation and preprocessing module and is used for performing part-of-speech disambiguation on the segmented single sentences;
the grammar analysis module is connected with the disambiguation module and is used for performing grammatical analysis on the disambiguated single sentences;
and the master control module is connected with each of the above modules and controls their operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100497397A CN103116578A (en) | 2013-02-07 | 2013-02-07 | Translation method integrating syntactic tree and statistical machine translation technology and translation device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103116578A (en) | 2013-05-22 |
Family
ID=48414955
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100497397A Pending CN103116578A (en) | 2013-02-07 | 2013-02-07 | Translation method integrating syntactic tree and statistical machine translation technology and translation device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103116578A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1652106A (en) * | 2004-02-04 | 2005-08-10 | 北京赛迪翻译技术有限公司 | Machine translation method and apparatus based on language knowledge base |
US20080162111A1 (en) * | 2006-12-28 | 2008-07-03 | Srinivas Bangalore | Sequence classification for machine translation |
CN101482861A (en) * | 2008-01-09 | 2009-07-15 | 中国科学院自动化研究所 | Chinese-English words automatic alignment method |
CN102662932A (en) * | 2012-03-15 | 2012-09-12 | 中国科学院自动化研究所 | Method for establishing tree structure and tree-structure-based machine translation system |
US20120316862A1 (en) * | 2011-06-10 | 2012-12-13 | Google Inc. | Augmenting statistical machine translation with linguistic knowledge |
Non-Patent Citations (2)
Title |
---|
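Xu Zhiming, Wang Xiaolong, et al.: "Data smoothing techniques for N-gram language models", Application Research of Computers (《计算机应用研究》) * |
Jiang Hongfei, Li Sheng, et al.: "A statistical machine translation model based on synchronous tree substitution grammar", Journal of Software (《软件学报》) * |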
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104731777A (en) * | 2015-03-31 | 2015-06-24 | 网易有道信息技术(北京)有限公司 | Translation evaluation method and device |
US10909315B2 (en) | 2015-07-22 | 2021-02-02 | Huawei Technologies Co., Ltd. | Syntax analysis method and apparatus |
WO2017012327A1 (en) * | 2015-07-22 | 2017-01-26 | 华为技术有限公司 | Syntax analysis method and device |
CN106407184A (en) * | 2015-07-30 | 2017-02-15 | 阿里巴巴集团控股有限公司 | Decoding method used for statistical machine translation, and statistical machine translation method and apparatus |
CN106407184B (en) * | 2015-07-30 | 2019-10-01 | 阿里巴巴集团控股有限公司 | Coding/decoding method, statistical machine translation method and device for statistical machine translation |
CN106598937A (en) * | 2015-10-16 | 2017-04-26 | 阿里巴巴集团控股有限公司 | Language recognition method and device for text and electronic equipment |
CN106598937B (en) * | 2015-10-16 | 2019-10-18 | 阿里巴巴集团控股有限公司 | Language Identification, device and electronic equipment for text |
CN107436865A (en) * | 2016-05-25 | 2017-12-05 | 阿里巴巴集团控股有限公司 | A kind of word alignment training method, machine translation method and system |
CN107436865B (en) * | 2016-05-25 | 2020-10-16 | 阿里巴巴集团控股有限公司 | Word alignment training method, machine translation method and system |
CN106844352A (en) * | 2016-12-23 | 2017-06-13 | 中国科学院自动化研究所 | Word prediction method and system based on neural machine translation system |
CN106844352B (en) * | 2016-12-23 | 2019-11-08 | 中国科学院自动化研究所 | Word prediction method and system based on neural machine translation system |
CN107066455A (en) * | 2017-03-30 | 2017-08-18 | 唐亮 | A kind of multilingual intelligence pretreatment real-time statistics machine translation system |
CN107066455B (en) * | 2017-03-30 | 2020-07-28 | 唐亮 | Multi-language intelligent preprocessing real-time statistics machine translation system |
TWI637278B (en) * | 2017-07-03 | 2018-10-01 | 雲拓科技有限公司 | Computer automatically claim-translating device |
CN107526726A (en) * | 2017-07-27 | 2017-12-29 | 山东科技大学 | A kind of method that Chinese procedural model is automatically converted to English natural language text |
CN107729326B (en) * | 2017-09-25 | 2020-12-25 | 沈阳航空航天大学 | Multi-BiRNN coding-based neural machine translation method |
CN107729326A (en) * | 2017-09-25 | 2018-02-23 | 沈阳航空航天大学 | Neural machine translation method based on Multi BiRNN codings |
CN108829657B (en) * | 2018-04-17 | 2022-05-03 | 广州视源电子科技股份有限公司 | Smoothing method and system |
CN108829657A (en) * | 2018-04-17 | 2018-11-16 | 广州视源电子科技股份有限公司 | Smoothing method and system |
CN108763222B (en) * | 2018-05-17 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Translation missing detection and translation method and device, server and storage medium |
CN108763222A (en) * | 2018-05-17 | 2018-11-06 | 腾讯科技(深圳)有限公司 | Detection, interpretation method and device, server and storage medium are translated in a kind of leakage |
CN108874790A (en) * | 2018-06-29 | 2018-11-23 | 中译语通科技股份有限公司 | A kind of cleaning parallel corpora method and system based on language model and translation model |
CN110895660A (en) * | 2018-08-23 | 2020-03-20 | 澳门大学 | A sentence processing method and device based on dynamic coding of syntactic dependencies |
CN110895660B (en) * | 2018-08-23 | 2024-05-17 | 澳门大学 | Sentence processing method and device based on syntactic dependency dynamic coding |
CN109448458A (en) * | 2018-11-29 | 2019-03-08 | 郑昕匀 | English oral training device, data processing method and storage medium |
CN109978829A (en) * | 2019-02-26 | 2019-07-05 | 深圳市华汉伟业科技有限公司 | A kind of detection method and its system of object to be detected |
CN110413963A (en) * | 2019-07-03 | 2019-11-05 | 东华大学 | A structured method for breast ultrasound examination report based on domain ontology |
CN110413963B (en) * | 2019-07-03 | 2022-11-25 | 东华大学 | Breast ultrasonic examination report structuring method based on domain ontology |
CN111104796A (en) * | 2019-12-18 | 2020-05-05 | 北京百度网讯科技有限公司 | Method and device for translation |
CN111104796B (en) * | 2019-12-18 | 2023-05-05 | 北京百度网讯科技有限公司 | Method and device for translation |
CN112766004A (en) * | 2021-01-22 | 2021-05-07 | 西安文理学院 | Artificial intelligence self-adaptive interactive foreign language teaching translation system |
CN112800754A (en) * | 2021-01-26 | 2021-05-14 | 浙江香侬慧语科技有限责任公司 | Unsupervised grammar derivation method, unsupervised grammar derivation device and medium based on pre-training language model |
RU2766821C1 (en) * | 2021-02-10 | 2022-03-16 | Общество с ограниченной ответственностью " МЕНТАЛОГИЧЕСКИЕ ТЕХНОЛОГИИ" | Method for automated extraction of semantic components from compound sentences of natural language texts in machine translation systems and device for implementation thereof |
CN113283250A (en) * | 2021-05-26 | 2021-08-20 | 南京大学 | Automatic machine translation test method based on syntactic component analysis |
CN114330376A (en) * | 2021-11-15 | 2022-04-12 | 甲骨易(北京)语言科技股份有限公司 | A computer-aided translation system and method |
CN114254630A (en) * | 2021-11-29 | 2022-03-29 | 北京捷通华声科技股份有限公司 | A translation method, apparatus, electronic device and readable storage medium |
CN119849478A (en) * | 2025-01-08 | 2025-04-18 | 中国科学技术信息研究所 | Theme determining method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20130522 |