KR101040438B1

KR101040438B1 - Method and apparatus for correcting word alignment links for improving accuracy of automatic word alignment

Info

Publication number: KR101040438B1
Application number: KR1020090051065A
Authority: KR
Inventors: 이종훈; 이근배
Original assignee: 포항공과대학교 산학협력단
Priority date: 2009-06-09
Filing date: 2009-06-09
Publication date: 2011-06-09
Anticipated expiration: 2029-06-09
Also published as: KR20100132311A

Abstract

The present invention relates to statistical machine translation, and in particular to a method and apparatus for correcting word alignment links in order to increase the accuracy of automatic word alignment. Conventional techniques rely only on the statistical properties between the words of a parallel corpus, so words that do not correspond in meaning at all are frequently aligned with each other. When many-to-many word alignments are generated, these errors exclude correct alignment links and cause the statistical machine translation system to produce incorrect translations; the present invention aims to reduce these errors and thereby improve machine translation performance. To reduce the errors produced by existing automatic word alignment methods, the invention tests, for the word alignment links of one or each of the one-to-many and many-to-one alignments, the likelihood of error based on the distribution of the word alignment links, and corrects the links judged to be erroneous to NULL, that is, deletes them, so that the many-to-many alignment includes the correct alignments. Because the many-to-many alignment then includes the correct alignments, the resulting translation model is more robust and the machine translation output is more accurate.

Machine translation, statistical machine translation, word alignment, automatic translation, automatic interpretation

Description

Method of correcting word alignment links to enhance accuracy of aligning automatic words, and apparatus using the same

The present invention relates to statistical machine translation, and more particularly to a method and apparatus for correcting word alignment links that apply a statistical methodology to increase the accuracy of the automatic word alignment links mainly used in statistical machine translation. (The present invention was made as part of a national research and development program. Project number: C1090-0902-0045; program name: University IT Research Center Promotion and Support Program; project title: "Research on Embedded Software Technology for Convergence Terminals".)

Statistical machine translation was first studied in the late 1980s and is now actively researched worldwide as a new approach to machine translation.

In general, statistical machine translation automatically performs word alignment on a large sentence-aligned parallel corpus and learns translation pairs from the result. The IBM models, and GIZA++, a software implementation of them, are widely used for this automatic word alignment.

For a given word in one language, automatic word alignment pairs it with one corresponding word among the words of the sentence in the other language, based on how frequently the two occur together. Word alignment links produced this way generally take a one-to-many form, but it has recently become common to generate many-to-many word alignment links by combining the forward one-to-many alignment with many-to-one alignment links obtained by applying the same method in the reverse direction.
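As a rough illustration of the bidirectional combination described above, the sketch below merges a forward run and a reverse run into a many-to-many link set. The text does not say which merging heuristic is used, so the intersection and union shown here are assumptions, and `combine_directions` and the index-pair representation are hypothetical names chosen for this example.

```python
def combine_directions(forward_links, reverse_links):
    """Merge two directional alignments of one sentence pair.

    forward_links: set of (src_idx, tgt_idx) pairs from the source-to-target run.
    reverse_links: set of (tgt_idx, src_idx) pairs from the target-to-source run.
    Returns (intersection, union) as sets of (src_idx, tgt_idx) pairs.
    """
    # Flip the reverse run so both sets use (src_idx, tgt_idx) order.
    flipped = {(s, t) for (t, s) in reverse_links}
    return forward_links & flipped, forward_links | flipped


# Toy sentence pair: the two runs disagree about source position 2.
forward = {(0, 0), (1, 1), (2, 1)}
reverse = {(0, 0), (1, 1), (2, 2)}
both, either = combine_directions(forward, reverse)
```

A wrong link in either direction changes both sets, which is one way to see why an error in a directional alignment carries over into the merged alignment.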

Because the conventional technique described above uses only the statistical properties between source-language words and target-language words, words that do not correspond in meaning at all are frequently aligned with each other, depending on the differences between the languages. These errors originate in the initial one-to-many word alignment, and when the many-to-many word alignment is generated they exclude correct alignment links, which ultimately becomes a major cause of incorrect translations by the statistical machine translation system.

To solve the above problems of the prior art, an object of the present invention is to provide a method and apparatus for correcting word alignment links that improve the accuracy of automatic word alignment, reducing word alignment errors and thereby ultimately improving machine translation performance.

To reduce the errors produced by existing automatic word alignment methods, one aspect of the present invention provides a method of correcting word alignment links for improving the accuracy of automatic word alignment, in machine translation that generates many-to-many word alignment links by combining forward one-to-many alignment links with many-to-one alignment links obtained by applying the alignment in the reverse direction, the method comprising: testing, for the word alignment links of one or each of the many-to-one and one-to-many alignments, the likelihood of error based on the distribution of the word alignment links; and correcting to NULL, or deleting, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments.

Another aspect of the present invention provides an apparatus for correcting word alignment links for improving the accuracy of automatic word alignment, in machine translation that generates many-to-many word alignment links by combining forward one-to-many alignment links with many-to-one alignment links obtained by applying the alignment in the reverse direction, the apparatus comprising: a word alignment unit that aligns words using one or each of a many-to-one word alignment and a one-to-many word alignment and outputs word alignment links; and a refined word alignment unit that tests, for the word alignment links from each word alignment unit, the likelihood of error based on the distribution of the word alignment links, and corrects to NULL, or deletes, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments.

Preferably, the refined word alignment unit includes one or both of a refined many-to-one word alignment unit and a refined one-to-many word alignment unit, and comprises: an error determination unit that tests, for the word alignment links of one or each of the many-to-one and one-to-many word alignments, the likelihood of error based on the distribution of the word alignment links; and a switching unit that corrects to NULL, or deletes, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments.

Preferably, the apparatus further includes a many-to-many word alignment unit that produces a many-to-many word alignment from the output links of the refined many-to-one word alignment unit and the refined one-to-many word alignment unit and outputs a translation model. The refined many-to-one word alignment unit receives the many-to-one word alignment links from a many-to-one word alignment unit, which takes the source language and target language output by the source-language corpus and the target-language corpus and produces a many-to-one word alignment, and outputs a refined many-to-one word alignment. The refined one-to-many word alignment unit receives the one-to-many word alignment links from a one-to-many word alignment unit, which takes the source language and target language output by the source-language corpus and the target-language corpus and produces a one-to-many word alignment, and outputs a refined one-to-many word alignment.

Preferably, a method of generating a per-word alignment ambiguity score is applied to determine whether each word alignment link is erroneous.

Preferably, the relative frequency of each word alignment link is used as the criterion for determining whether each word alignment link is erroneous.

Preferably, both the per-word alignment ambiguity score and the relative frequency of each word alignment link are used as criteria for determining whether each word alignment link is erroneous.

Preferably, to correct or delete an erroneous word alignment link, the link is changed so that the source-language word is aligned to NULL.

Preferably, the alignment ambiguity score is obtained by the following equation

(where y is the source-language word being scored, n is the vocabulary size of the target language, x is each word of the target language, and p is the probability that the source-language word y and the target-language word x are aligned with each other; the probability p is estimated as the relative frequency between the frequency of a given source-language word and the frequency of the word alignments associated with that word).

The method and apparatus of the present invention for correcting word alignment links build on existing methods and add a single refinement pass over the many-to-one and/or one-to-many word alignment links. This first increases the accuracy of the word alignment links, second allows the correct alignments to be included when the many-to-many word alignment links are extracted from them, and ultimately improves translation performance.

Hereinafter, embodiments of the method and apparatus for correcting word alignment links for improving the accuracy of automatic word alignment according to the present invention are described with reference to the accompanying FIGS. 1 and 2.

The method of the present invention is a post-processing technique for word alignment links. To refine the automatic word alignment links used in the training process commonly employed by statistical machine translation systems, it determines whether each word and each word alignment link is erroneous based on the distribution of the previously generated automatic word alignment links, and corrects or deletes the erroneous word alignment links.

In the present invention, a method of generating a per-word alignment ambiguity score is applied to determine whether each word is erroneous.

In addition, the relative frequency of each word alignment link is used as the criterion for determining whether each word alignment link is erroneous.

In addition, both the per-word alignment ambiguity score and the relative frequency of each word alignment link are used as criteria for determining whether each word alignment link is erroneous.

In addition, to correct or delete an erroneous word alignment link, the link is changed so that the source-language word is aligned to NULL.

This is described in more detail below with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of an apparatus for correcting word alignment links for improving the accuracy of automatic word alignment, used to explain the corresponding method of the present invention.

The present invention relates to a link correction method and apparatus for improving the accuracy of automatic word alignment, which correct the errors in one-to-many word alignment links in order to improve the performance of a statistical machine translation system. According to one embodiment, as shown in FIG. 1, the apparatus includes a many-to-one word alignment unit (11); a one-to-many word alignment unit (12); a refined word alignment unit (100) that includes one or each of a refined many-to-one word alignment unit (13) and a refined one-to-many word alignment unit (14), corrects the word alignment links, and outputs refined word alignment links; and a many-to-many word alignment unit (15) that receives the refined word alignment links from one or each of the refined many-to-one word alignment unit (13) and the refined one-to-many word alignment unit (14) and outputs a many-to-many word alignment using a statistical translator training module. In FIG. 1, the refined word alignment unit (100) includes both the refined many-to-one word alignment unit (13) and the refined one-to-many word alignment unit (14). In other embodiments it may be designed to include only one of the two. Preferably, the refined word alignment unit (100) is designed to include at least the refined one-to-many word alignment unit (14).

More specifically, in the word alignment link correction apparatus (10) shown in FIG. 1, the many-to-one word alignment unit (11) receives the source language and target language output by the source-language corpus (S1) and the target-language corpus (S2) and produces a many-to-one word alignment, and the one-to-many word alignment unit (12) receives the same source language and target language and produces a one-to-many word alignment. The refined word alignment unit (100) includes the refined many-to-one word alignment unit (13), which receives the many-to-one word alignment links from the many-to-one word alignment unit (11), corrects or deletes word alignment links, and outputs refined many-to-one word alignment links, and the refined one-to-many word alignment unit (14), which receives the one-to-many word alignment links from the one-to-many word alignment unit (12), corrects or deletes word alignment links, and outputs refined one-to-many word alignment links. The many-to-many word alignment unit (15) produces a many-to-many word alignment from the output links of the refined many-to-one word alignment unit (13) and the refined one-to-many word alignment unit (14) and outputs a translation model.

FIG. 2 is a detailed block diagram of the refined word alignment unit (100) of FIG. 1. As described above with reference to FIG. 1, the refined word alignment unit (100) includes the refined many-to-one word alignment unit (13) and the refined one-to-many word alignment unit (14).

According to FIG. 2, the refined many-to-one word alignment unit (13) includes an error determination unit (152), which receives a many-to-one word alignment link and determines whether it is erroneous, and a switching unit (156). If the error determination unit (152) finds no error, the switching unit (156) is turned on and the refined many-to-one word alignment link is output to the many-to-many word alignment unit (15). If the error determination unit (152) finds an error, the switching unit (156) is turned off and the link is not output to the many-to-many word alignment unit (15).

Likewise, according to FIG. 2, the refined one-to-many word alignment unit (14) includes an error determination unit (150), which receives a one-to-many word alignment link and determines whether it is erroneous, and a switching unit (154). If the error determination unit (150) finds no error, the switching unit (154) is turned on and the refined one-to-many word alignment link is output to the many-to-many word alignment unit (15). If the error determination unit (150) finds an error, the switching unit (154) is turned off and the link is not output to the many-to-many word alignment unit (15).
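The pairing of an error determination unit with a switching unit can be pictured as a predicate followed by a gate. The sketch below is only a software analogy of that block diagram: `is_erroneous` stands in for whatever test the error determination unit applies, and the function name and the link representation are assumptions made for this example.

```python
from typing import Callable, Iterable, List, Tuple

Link = Tuple[str, str]  # (source word, target word); a hypothetical representation


def refine_links(links: Iterable[Link],
                 is_erroneous: Callable[[Link], bool]) -> List[Link]:
    """Pass a link on only when the error test finds no error (switch "on")."""
    passed = []
    for link in links:
        if not is_erroneous(link):  # role of the error determination unit
            passed.append(link)     # switch turned on: the link is output
        # otherwise the switch stays off and the link is withheld
    return passed


# Toy error test: treat links to the placeholder word "???" as erroneous.
kept = refine_links([("house", "maison"), ("the", "???")],
                    lambda link: link[1] == "???")
```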

To determine errors, the present invention may apply the method of generating a per-word alignment ambiguity score, or may use the relative frequency of each word alignment link as the criterion for determining whether each word alignment link is erroneous. It may also use both the per-word alignment ambiguity score and the relative frequency of each word alignment link as criteria.

In a preferred embodiment, errors are determined by examining how each word is aligned in the existing word alignment links and assigning it a score that expresses its degree of alignment ambiguity. This score and the frequency of each word alignment link are then used to decide whether a word alignment link is erroneous, and the links judged to be erroneous are corrected or deleted, so that the word alignment links are improved automatically.

As shown in FIGS. 1 and 2, when the distribution of words and word alignment links is collected from the automatic many-to-one and one-to-many word alignments, the list of target-language words aligned with each source-language word, together with the alignment frequencies, is recorded for every source-language word. From this, the alignment entropy of each word is computed and normalized to give a per-word alignment ambiguity score. More precisely, the alignment ambiguity score is computed according to Equation 1 below.

In Equation 1, which expresses the degree of alignment ambiguity, y is the source-language word being scored, n is the vocabulary size of the target language, x is each word of the target language, and p is the probability that the source-language word y and the target-language word x are aligned with each other. The probability p is estimated as the relative frequency between the frequency of a given source-language word and the frequency of the word alignments associated with that word.
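Equation 1 itself does not survive in this text. One form consistent with the description (an alignment entropy normalized so that the score lies between 0 and 1) is sketched below; the symbol $A(y)$, the counts $c(\cdot)$, and the choice of $\log n$ as the normalizer are assumptions, not the patent's stated formula.

```latex
A(y) = -\frac{1}{\log n} \sum_{x} p(x \mid y)\,\log p(x \mid y),
\qquad
p(x \mid y) = \frac{c(y, x)}{c(y)}
```

Here $c(y, x)$ would be the number of links aligning $y$ with $x$ and $c(y)$ the frequency of $y$; terms with $p(x \mid y) = 0$ contribute nothing to the sum.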

The alignment ambiguity score measures, for each word, how clear-cut the alignment links of that word are: the closer it is to 1, the more ambiguous the alignment, and the closer it is to 0, the clearer the alignment links. For each word alignment link, a relative frequency is computed over the distinct target-language words aligned with a given source-language word; this is the same quantity as the probability p in Equation 1.
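The counting and scoring steps described above might look like the following minimal sketch. It assumes the score is the entropy of a word's link distribution divided by the logarithm of the target vocabulary size, which is one reading of "normalized entropy"; the input format and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def collect_counts(aligned_corpus):
    """Record, for each source word, the target words it is linked to and how often.

    aligned_corpus: iterable of (src_tokens, tgt_tokens, links), where links
    holds (src_idx, tgt_idx) pairs (a hypothetical input format).
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_corpus:
        for i, j in links:
            counts[src_tokens[i]][tgt_tokens[j]] += 1
    return counts


def relative_frequencies(target_counts):
    """Relative frequency of each link of one source word (the p of the text)."""
    total = sum(target_counts.values())
    return {x: c / total for x, c in target_counts.items()}


def ambiguity_score(target_counts, vocab_size):
    """Entropy of the link distribution divided by log(vocab_size) (assumed normalizer)."""
    if vocab_size < 2:
        return 0.0
    probs = relative_frequencies(target_counts).values()
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(vocab_size)


# Toy corpus with a target vocabulary of four words.
corpus = [
    (["the", "cat"], ["le", "chat"], [(0, 0), (1, 1)]),
    (["the", "dog"], ["un", "chien"], [(0, 0), (1, 1)]),
]
counts = collect_counts(corpus)
score_the = ambiguity_score(counts["the"], vocab_size=4)
score_cat = ambiguity_score(counts["cat"], vocab_size=4)
```

In this toy corpus "the" is split evenly between two target words, so its score under the assumed normalizer is ln 2 / ln 4 = 0.5, while "cat" always links to the same word and scores 0.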

Whether each word alignment link is erroneous is determined from the alignment ambiguity score of each word and the relative frequency of each word alignment link obtained as described above. The exact numerical criteria may vary with the situation, but the decision follows two broad principles.

First, if a word has a high alignment ambiguity score and is aligned with many distinct words, all word alignment links associated with that word are judged to be erroneous.

Second, among the word alignment links of a word with a low alignment ambiguity score, those links whose relative alignment frequency is low are judged to be erroneous.

That is, applying the method presented here requires setting four values as error criteria: the threshold for a high alignment ambiguity score, the threshold for a low alignment ambiguity score, the threshold for a low word alignment frequency, and the threshold on the number of distinct aligned words above which a word is considered to be aligned with many kinds of words. These values must be determined experimentally according to the source and target languages adopted by the machine translation system and the characteristics of the training corpus, so presenting exact figures in this specification would not be meaningful. They can be adjusted according to the intent of the practitioner reproducing the invention. Deleting more word alignment links leaves more room for improving translation performance but increases the size of the translation model, so the four thresholds should be set at an appropriate level through experiment.
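The two principles and the four thresholds can be put together as in the sketch below. It assumes the normalized-entropy score discussed earlier; the parameter names (`high_score`, `low_score`, `min_rel_freq`, `many_types`) and the threshold values in the example are arbitrary placeholders, since the text deliberately gives no figures.

```python
import math


def find_error_links(counts, vocab_size, high_score, low_score,
                     min_rel_freq, many_types):
    """Return the set of (source word, target word) links judged erroneous.

    counts maps each source word to a dict {target word: link count}.
    """
    errors = set()
    for y, targets in counts.items():
        total = sum(targets.values())
        if total == 0:
            continue
        probs = {x: c / total for x, c in targets.items()}
        entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
        score = entropy / math.log(vocab_size) if vocab_size > 1 else 0.0
        if score >= high_score and len(targets) >= many_types:
            # Principle 1: ambiguous word with many link types, flag every link.
            errors.update((y, x) for x in targets)
        elif score <= low_score:
            # Principle 2: clear word, flag only its rare links.
            errors.update((y, x) for x, p in probs.items() if p < min_rel_freq)
    return errors


def nullify(links, src_tokens, tgt_tokens, errors):
    """Re-align the source side of each erroneous link to NULL (None here)."""
    return [(i, None) if (src_tokens[i], tgt_tokens[j]) in errors else (i, j)
            for i, j in links]


counts = {
    "the": {"a": 1, "b": 1, "c": 1, "d": 1},   # evenly spread: ambiguous
    "cat": {"neko": 9, "inu": 1},              # one dominant link
}
errors = find_error_links(counts, vocab_size=4, high_score=0.8,
                          low_score=0.3, min_rel_freq=0.2, many_types=3)
fixed = nullify([(0, 0), (1, 1)], ["cat", "cat"], ["neko", "inu"], errors)
```

With these placeholder thresholds, every link of "the" is flagged under the first principle and only the rare ("cat", "inu") link under the second; words whose score falls between the two score thresholds are left untouched in this sketch.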

The word alignment links refined by the above method replace the existing word alignment links, and the machine translation training process then continues as usual to obtain an improved model.

This embodiment describes the case in which the refined word alignment unit (100) includes both the refined many-to-one word alignment unit (13) and the refined one-to-many word alignment unit (14). However, the apparatus may also be designed to include only the refined many-to-one word alignment unit (13), which receives the alignment links of the many-to-one word alignment unit (11), determines errors, and corrects or deletes any erroneous links, or to include only the refined one-to-many word alignment unit (14), which does the same for the alignment links of the one-to-many word alignment unit (12). As noted above, word alignment links generally take a one-to-many form, so the refined word alignment unit (100) preferably includes at least the refined one-to-many word alignment unit (14).

As described above, the link correction method and apparatus of the present invention build on existing methods and add a single refinement pass over the automatic many-to-one and/or one-to-many word alignment links. This first increases the accuracy of the word alignment links, second allows the correct alignments to be included when the many-to-many word alignment links are extracted from them, and ultimately improves translation performance.

FIG. 2 is a detailed block diagram of the refined word alignment unit of FIG. 1.

<Explanation of reference numerals for the main parts of the drawings>

11...many-to-one word alignment unit    12...one-to-many word alignment unit

13...refined many-to-one word alignment unit    14...refined one-to-many word alignment unit

15...many-to-many word alignment unit    100...refined word alignment unit

150, 152...error determination unit    154, 156...switching unit

S1...source-language corpus    S2...target-language corpus

S3...translation model

Claims

A method of correcting word alignment links for improving the accuracy of automatic word alignment, which improves the performance of machine translation that generates many-to-many word alignment links by combining forward one-to-many alignment links with many-to-one alignment links obtained by applying the alignment in the reverse direction, the method comprising:

testing, for the word alignment links of one or each of the many-to-one and one-to-many alignments, the likelihood of error based on the distribution of the word alignment links; and

correcting to NULL, or deleting, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments,

wherein the relative frequency of each word alignment link is used as the criterion for determining whether each word alignment link is erroneous.

The method of claim 1, wherein a method of generating a per-word alignment ambiguity score is applied to determine whether each word is erroneous.

(Deleted)

The method of claim 2, wherein the alignment ambiguity score is obtained by the following equation

(where y is the source-language word being scored, n is the vocabulary size of the target language, x is each word of the target language, and p is the probability that the source-language word y and the target-language word x are aligned with each other; the probability p is estimated as the relative frequency between the frequency of a given source-language word and the frequency of the word alignments associated with that word).

An apparatus for correcting word alignment links for improving the accuracy of automatic word alignment, which improves the performance of machine translation that generates many-to-many word alignment links by combining forward one-to-many alignment links with many-to-one alignment links obtained by applying the alignment in the reverse direction, the apparatus comprising:

a word alignment unit that aligns words using one or each of a many-to-one word alignment and a one-to-many word alignment and outputs word alignment links; and

a refined word alignment unit that tests, for the word alignment links from each word alignment unit, the likelihood of error based on the distribution of the word alignment links, and corrects to NULL, or deletes, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments,

wherein the relative frequency of each word alignment link is used as the criterion for determining whether each word alignment link is erroneous.

The apparatus of claim 7, wherein the refined word alignment unit includes one or both of a refined many-to-one word alignment unit and a refined one-to-many word alignment unit, and comprises: an error determination unit that tests, for the word alignment links of one or each of the many-to-one and one-to-many word alignments, the likelihood of error based on the distribution of the word alignment links; and a switching unit that corrects to NULL, or deletes, the links judged to be erroneous so that the many-to-many alignment includes the correct alignments.

The apparatus of claim 7, further comprising a many-to-many word alignment unit that produces a many-to-many word alignment from the output links of the refined many-to-one word alignment unit and the refined one-to-many word alignment unit and outputs a translation model, wherein the refined many-to-one word alignment unit receives the many-to-one word alignment links from a many-to-one word alignment unit, which takes the source language and target language output by the source-language corpus and the target-language corpus and produces a many-to-one word alignment, and outputs a refined many-to-one word alignment, and wherein the refined one-to-many word alignment unit receives the one-to-many word alignment links from a one-to-many word alignment unit, which takes the source language and target language output by the source-language corpus and the target-language corpus and produces a one-to-many word alignment, and outputs a refined one-to-many word alignment.

The apparatus of claim 7, wherein a method of generating a per-word alignment ambiguity score is applied to determine whether each word is erroneous.

(Deleted)

The apparatus of claim 10, wherein the alignment ambiguity score is obtained by the following equation

(where y is the source-language word being scored, n is the vocabulary size of the target language, x is each word of the target language, and p is the probability that the source-language word y and the target-language word x are aligned with each other; the probability p is estimated as the relative frequency between the frequency of a given source-language word and the frequency of the word alignments associated with that word).