KR100509917B1

KR100509917B1 - Apparatus and method for checking word by using word n-gram model

Info

Publication number: KR100509917B1
Application number: KR20030023563A
Authority: KR
Inventors: 김정세; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 2003-04-15
Filing date: 2003-04-15
Publication date: 2005-08-25
Anticipated expiration: 2023-04-15
Also published as: KR20040089774A

Abstract

The present invention relates to an apparatus and method for correcting spacing and spelling using word n-grams. The apparatus comprises: a word n-gram DB that stores word n-grams; a word n-gram construction unit that receives refined, error-free language data, extracts word n-grams from it, and stores the extracted word n-grams in the word n-gram DB; a word n-gram search and verification unit that receives the target language data to be verified, converts it into a structure in which each word is mapped to a single symbol, and searches the word n-gram DB to determine whether an identical word exists; a spacing and joining error/typo correction unit that corrects spacing errors, joining errors, and typos in words that the word n-gram search and verification unit could not verify; a statistical part-of-speech tagging system that performs morpheme tagging on the words corrected by the spacing and joining error/typo correction unit and selects the result with the highest probability; and a corrected sentence output unit that outputs the target language data once it has been corrected by the spacing and joining error/typo correction unit and the statistical part-of-speech tagging system. Accordingly, spacing errors and spelling errors in the input target language data can be corrected automatically.

Description

Apparatus and method for correcting spacing and spelling using word n-grams {APPARATUS AND METHOD FOR CHECKING WORD BY USING WORD N-GRAM MODEL}

The present invention relates to an apparatus and method for correcting spacing and spelling using word n-grams. More particularly, it relates to an apparatus and method that build a word n-gram language model from a large refined corpus, apply that model to a corpus to be verified in order to extract a list of unverified words, and then detect and correct spacing errors and spelling errors in the extracted word list.

Many approaches to language modeling have been studied, but the most widely studied is the statistical language model, which uses the chain probabilities of words in the form of unigrams, bigrams, trigrams, and so on. Such statistical language models, however, cannot automatically correct spacing and spelling errors in an input sentence at the word or sentence level, nor can they correct typos or spacing errors between syllable pairs.

Other prior art related to language models includes "Method for recognizing and correcting errors in Hangul documents using syllable bigram characteristics," published as No. 2001-0000673 on January 5, 2001, and "Automatic spacing correction method using mutual information in a document editor," published as No. 1998-047272 on September 15, 1998.

Looking at this prior art in detail, the "Method for recognizing and correcting errors in Hangul documents using syllable bigram characteristics" differs from conventional statistical language models in that it extracts syllable bigram pairs and their frequencies from a corpus and uses the syllable bigram characteristics to recognize spacing errors in Hangul documents. Spelling correction is possible with this technique, but verification relies solely on a morpheme tagger, which distinguishes it from the verification method of the present invention, which uses actual word n-grams.

The "Automatic spacing correction method using mutual information in a document editor" performs automatic spacing and spell checking in a document editor using mutual information, so that input text is automatically corrected to conform to spacing and spelling conventions. This prior art is intended for spacing only and is not suited to spelling correction, which distinguishes it from the spacing and spelling correction method proposed in the present invention.

Consequently, the prior patents described above still cannot automatically correct spacing and spelling errors in an input sentence at the word or sentence level, and cannot correct typos or spacing errors between syllable pairs.

The present invention was devised to solve these problems. Its object is to provide an apparatus and method for correcting spacing and spelling using word n-grams that build a word n-gram language model from a large refined corpus, apply that model to a corpus to be verified in order to extract a list of unverified words, and correct errors in the extracted word list by applying grammar rules, morphological analysis, proper noun estimation, and loanword estimation.

To achieve the above object, an apparatus for correcting spacing and spelling using word n-grams according to one embodiment of the present invention comprises: a word n-gram DB that stores word n-grams; a word n-gram construction unit that receives refined, error-free language data, extracts word n-grams from it, and stores the extracted word n-grams in the word n-gram DB; a word n-gram search and verification unit that receives the target language data to be verified, converts it into a structure in which each word is mapped to a single symbol, and searches the word n-gram DB to determine whether an identical word exists; a spacing and joining error/typo correction unit that corrects spacing errors, joining errors, and typos in words that the word n-gram search and verification unit could not verify; a statistical part-of-speech tagging system that performs morpheme tagging on the words corrected by the spacing and joining error/typo correction unit and selects the result with the highest probability; and a corrected sentence output unit that outputs the target language data once it has been corrected by the spacing and joining error/typo correction unit and the statistical part-of-speech tagging system.

A method for correcting spacing and spelling using word n-grams according to another embodiment of the present invention comprises: a step of receiving refined, error-free language data, extracting word n-grams from it, and storing them in the word n-gram DB; a step of receiving the target language data to be verified, converting it into a structure in which each word is mapped to a single symbol, and searching the word n-gram DB to determine whether an identical word exists; a spacing and joining error correction step of, when no identical word is found in the search step, correcting spacing errors, joining errors, and typos in the unmatched word, performing morpheme tagging on the corrected result, and selecting the result with the highest probability; and a step of outputting the target language data once correction is complete.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the apparatus for correcting spacing and spelling using word n-grams according to the present invention. The apparatus includes a word n-gram construction unit (10), a word n-gram search and verification unit (20), a spacing and joining error/typo correction unit (30), a statistical part-of-speech tagging system (40), and a corrected sentence output unit (50).

The word n-gram construction unit (10) contains a word n-gram extractor (11). The extractor (11) receives refined, error-free text from the refined text corpus DB (S1), extracts word n-grams from it, and supplies the extracted word n-grams to the word n-gram DB (S2).

The word n-gram DB (S2) stores digits, Latin letters, symbols, other foreign words, and the like after mapping each class to a single representative symbol. One-syllable words are highly ambiguous and are therefore not included among the word 1-grams.

To describe the word n-grams stored in the word n-gram DB (S2) in more detail: given the word sequence "A B C D E", the 1-grams are A, B, C, D, and E; the 2-grams are AB, BC, CD, and DE; and the 3-grams are ABC, BCD, and CDE. Because Latin letters, digits, symbols, other foreign words, and the like are each replaced with a single symbol when the word n-grams are built, the hit ratio of the word n-gram searcher (23) increases.
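The "A B C D E" example can be reproduced with a short sketch. The function name `extract_word_ngrams` and the use of a `Counter` keyed by tuples are illustrative assumptions, not part of the patent; the exclusion of one-syllable words from the 1-grams is not applied here because the letters stand in for whole words.

```python
from collections import Counter

def extract_word_ngrams(sentence, max_n=3):
    """Count word 1-grams up to max_n-grams in one space-delimited sentence."""
    words = sentence.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the sentence.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

ngrams = extract_word_ngrams("A B C D E")
bigrams = sorted(k for k in ngrams if len(k) == 2)
trigrams = sorted(k for k in ngrams if len(k) == 3)
```

Storing counts rather than a plain set keeps the frequency information that later selection steps rely on.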

The word n-gram search and verification unit (20) includes a token separator (21), a word n-gram searcher (23), and a morpheme tagger (41). The token separator (21) receives the target language data to be verified from the target text corpus DB (S6) and separates it into Latin letters, digits, symbols, Hangul, Hanja, and so on. It replaces the original Latin letters, digits, symbols, and the like with a single symbol representing each class and passes the resulting structure to the word n-gram searcher (23).
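A minimal sketch of the class-to-symbol mapping follows. The placeholder names (`<ENG>`, `<NUM>`, `<SYM>`) are hypothetical, since the patent does not name its symbols, and leaving Hangul and Hanja untouched is an assumption based on the description of the token separator above.

```python
import re

# One alternative per character class; every non-space character matches exactly one.
TOKEN_RE = re.compile(
    r"(?P<ENG>[A-Za-z]+)"
    r"|(?P<NUM>[0-9]+)"
    r"|(?P<HANJA>[\u4E00-\u9FFF]+)"
    r"|(?P<HANGUL>[\uAC00-\uD7A3]+)"
    r"|(?P<SYM>[^\sA-Za-z0-9\u4E00-\u9FFF\uAC00-\uD7A3]+)"
)

def map_word_to_symbols(word):
    """Replace runs of letters, digits, and symbols with one class symbol each."""
    out = []
    for m in TOKEN_RE.finditer(word):
        kind = m.lastgroup
        # Assumption: Hangul and Hanja runs are kept as written.
        out.append(m.group() if kind in ("HANGUL", "HANJA") else f"<{kind}>")
    return "".join(out)

def normalize(sentence):
    """Apply the mapping to every space-delimited word of a sentence."""
    return [map_word_to_symbols(w) for w in sentence.split()]

mapped = normalize("2003년 ETRI에서 3.5% 증가")
```

With this mapping, "2003년" and "1998년" collapse to the same key, which is how the symbol substitution raises the hit ratio of the lookup.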

The word n-gram searcher (23) searches the word n-gram DB (S2) for each word output by the token separator (21) to determine whether an identical word exists.

The word n-gram searcher (23) searches in the order 3-gram → 2-gram → 1-gram. However, words with few syllables, namely one-syllable words, are highly ambiguous, so it is preferable not to extend the search to 1-grams for them.
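The search order can be sketched as a back-off lookup. The patent does not say which n-gram windows around a word are consulted, so checking every window that contains the word is an assumption here, as are the name `is_verified` and the representation of the DB as a set of tuples. Syllables are counted with `len`, which assumes precomposed (NFC) Hangul.

```python
def is_verified(words, i, ngram_db):
    """Return True if words[i] is covered by a known 3-, 2-, or 1-gram, tried in that order."""
    for n in (3, 2, 1):
        if n == 1 and len(words[i]) <= 1:
            # One-syllable words are too ambiguous, so the 1-gram fallback is skipped.
            return False
        # Try every window of n consecutive words that includes position i.
        first = max(0, i - n + 1)
        last = min(i, len(words) - n)
        for start in range(first, last + 1):
            if tuple(words[start:start + n]) in ngram_db:
                return True
    return False

db = {("나는", "학교에", "간다"), ("학교에",), ("그",)}
```

For example, "학교에" is accepted through the stored 3-gram or 1-gram, while a lone "그" is rejected even though it appears as a 1-gram in `db`.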

The spacing and joining error/typo correction unit (30) includes a rule-based corrector (31) for spacing and joining, a word n-gram-based corrector (32), a proper noun estimator (33), a loanword estimator (34), a tagger-based corrector (35), a rule-based corrector (36) for typo correction, and an algorithm-based corrector (37).

The rule-based corrector (31) corrects the spacing and joining of a sentence according to the rules stored in the spacing and joining rule DB (S7), which reflect the spacing and joining rules defined by the orthography established by the Ministry of Education. For example, criteria such as "쯤 is a suffix and must always be attached to the preceding word" or "a space must always precede 것" are structured and stored in the spacing and joining rule DB (S7).

When some of the words passed from the token separator (21) to the word n-gram searcher (23) remain unverified, the word n-gram-based corrector (32) joins each unverified word with its neighboring words or splits it to produce a variety of candidates, then checks whether each candidate exists in the word n-gram DB (S2), which may yield several candidates. One candidate is selected by the usage frequency of the words stored in the word n-gram DB (S2). If the frequencies are equal, each candidate is assembled into a sentence together with a few preceding and following words and passed to the morpheme tagger (41) for morpheme tagging, and the candidate with the highest probability is selected.
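The join-or-split candidate generation can be sketched as below. The function names are hypothetical, and ranking a candidate by the lowest frequency among its new pieces is one possible reading of "usage frequency"; the patent does not fix the formula. The tagger-based tie-break is omitted.

```python
def spacing_candidates(words, i):
    """Return (new_pieces, new_sentence) pairs made by joining or splitting words[i]."""
    out = []
    if i > 0:  # join with the preceding word
        joined = words[i - 1] + words[i]
        out.append(([joined], words[:i - 1] + [joined] + words[i + 1:]))
    if i + 1 < len(words):  # join with the following word
        joined = words[i] + words[i + 1]
        out.append(([joined], words[:i] + [joined] + words[i + 2:]))
    w = words[i]
    for k in range(1, len(w)):  # split at each internal character boundary
        out.append(([w[:k], w[k:]], words[:i] + [w[:k], w[k:]] + words[i + 1:]))
    return out

def pick_by_frequency(words, i, freq):
    """Keep candidates whose new pieces are all known words; prefer the most frequent."""
    scored = [
        (min(freq[p] for p in pieces), sentence)
        for pieces, sentence in spacing_candidates(words, i)
        if all(p in freq for p in pieces)
    ]
    if not scored:
        return None  # nothing verifiable; later correctors take over
    return max(scored, key=lambda pair: pair[0])[1]

freq = {"학교에": 50, "간다": 40, "학교": 30, "에간다": 1}
```

Here `freq` plays the role of the 1-gram counts in the word n-gram DB (S2); a fuller version would also consult 2-grams and 3-grams formed with the neighboring words.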

The proper noun estimator (33) handles words that remain unverified after the rule-based corrector (31) and the word n-gram-based corrector (32). It uses the user dictionary and the personal name/place name/business name dictionary DB (S8) to determine whether the word appears there and corrects accordingly. Alternatively, when a cue such as "씨" (Mr./Ms.), "군" (a title for a young man), "양" (Miss), or "박사" (Dr.) appears in an adjacent word, it infers that the word is a proper noun and corrects accordingly.
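The two tests described above, dictionary membership and an adjacent title cue, might look like the following sketch. The names `TITLE_CUES` and `looks_like_proper_noun` are hypothetical, and matching the cue against a whole neighboring word is a simplifying assumption; cues that carry particles would need extra handling.

```python
TITLE_CUES = {"씨", "군", "양", "박사"}

def looks_like_proper_noun(words, i, name_dict):
    """Guess whether words[i] is a proper noun from a dictionary hit or a neighboring title."""
    if words[i] in name_dict:
        return True
    # Look at the word immediately before and immediately after position i.
    neighbors = words[max(0, i - 1):i] + words[i + 1:i + 2]
    return any(n in TITLE_CUES for n in neighbors)
```

A word flagged this way would be treated as valid and excluded from further correction.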

The loanword estimator (34) handles words that remain unverified after the rule-based corrector (31), the word n-gram-based corrector (32), and the proper noun estimator (33). It extracts loanwords using the loanword dictionary DB (S9); loanwords may also be detected using a hidden Markov model (HMM).

The tagger-based corrector (35) handles words that remain unverified after the rule-based corrector (31), the word n-gram-based corrector (32), the proper noun estimator (33), and the loanword estimator (34). It splits them into syllables, recombines the syllables into various word sequences, and passes these to the morpheme tagger (41) for morpheme tagging; the word combination with the highest probability is selected.
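Splitting into syllables and recombining amounts to enumerating the ways of cutting a string into contiguous pieces. In the sketch below, `score` is a hypothetical callable standing in for the morpheme tagger's probability, and the `max_words` limit is an assumption added to keep the enumeration small.

```python
from itertools import combinations

def segmentations(text, max_words=3):
    """Yield every way to cut text into 1..max_words contiguous pieces."""
    syllables = list(text)
    n = len(syllables)
    for k in range(max_words):  # k is the number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield ["".join(syllables[a:b]) for a, b in zip(bounds, bounds[1:])]

def best_segmentation(text, score, max_words=3):
    """Pick the segmentation that the scoring function rates highest."""
    return max(segmentations(text, max_words), key=score)

# Toy scorer: counts pieces found in a tiny known-word set (not a real tagger).
known = {"가나", "다"}
toy_score = lambda seg: sum(1 for w in seg if w in known)
```

A string of n syllables has 2^(n-1) segmentations in total, so a real system would prune candidates before calling the tagger.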

The rule-based corrector (36) corrects typos using the typo correction rules stored in the typo correction rule DB (S10). Typos are also corrected by the algorithm-based corrector (37).

That is, the algorithm-based corrector (37) corrects words that are still problematic after the rule-based corrector (31), the word n-gram-based corrector (32), the proper noun estimator (33), the loanword estimator (34), and the tagger-based corrector (35). It builds a candidate word list using only the 1-grams in the word n-gram DB (S2).

In more detail, the algorithm-based corrector (37) builds the word list from words whose first few phonemes, or last few phonemes, match those of the problematic word. Words with the same number of syllables are selected first; if no word has the same number of syllables, the words whose syllable count is closest are selected.
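One way to sketch this candidate list uses Unicode NFD normalization, which decomposes each precomposed Hangul syllable into its conjoining jamo; treating jamo as phonemes is an approximation. The parameter `k` (how many leading or trailing phonemes must match) is hypothetical, since the patent says only "a few".

```python
import unicodedata

def phonemes(word):
    """Decompose Hangul syllables into conjoining jamo (used here as phonemes)."""
    return unicodedata.normalize("NFD", word)

def typo_candidates(word, unigrams, k=2):
    """Known words sharing the first k or last k phonemes, nearest syllable count only."""
    p = phonemes(word)
    hits = [
        u for u in unigrams
        if phonemes(u)[:k] == p[:k] or phonemes(u)[-k:] == p[-k:]
    ]
    if not hits:
        return []
    # len() counts syllables, assuming precomposed (NFC) input strings.
    best_gap = min(abs(len(u) - len(word)) for u in hits)
    return [u for u in hits if abs(len(u) - len(word)) == best_gap]

candidates = typo_candidates("학교애", ["학교에", "학생", "간다", "학교에서는"])
```

In this toy run, "학생" and "학교에서는" also share the leading phonemes but are dropped because "학교에" matches the syllable count exactly.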

The calculation is based on syllables because typos consist mainly of inserted or deleted phonemes, and because the main source of keyboard input errors is pressing a neighboring key by mistake. Several words with the closest syllable count may exist at this point.

Next, to narrow the list produced by the procedure above, the words whose phoneme count is closest are selected from among the words with the same or a nearby syllable count. Words that share more phonemes with the problematic word take priority. Several candidates may still remain after this step.
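The narrowing step can be sketched as a two-part ranking: closest phoneme count first, then the largest number of shared phonemes. As before, NFD jamo stand in for phonemes, and counting shared phonemes as a multiset intersection is an assumption; the patent does not define the overlap measure.

```python
import unicodedata
from collections import Counter

def phonemes(word):
    """Decompose Hangul syllables into conjoining jamo (used here as phonemes)."""
    return unicodedata.normalize("NFD", word)

def narrow_by_phonemes(word, candidates):
    """Keep candidates with the closest phoneme count, then the most shared phonemes."""
    if not candidates:
        return []
    target = Counter(phonemes(word))

    def rank(cand):
        cp = Counter(phonemes(cand))
        count_gap = abs(sum(cp.values()) - sum(target.values()))
        shared = sum((cp & target).values())  # multiset overlap
        return (count_gap, -shared)

    best = min(rank(c) for c in candidates)
    return [c for c in candidates if rank(c) == best]
```

Because NFD distinguishes initial and final consonant jamo, a syllable-initial ㄱ and a syllable-final ㄱ count as different phonemes in this sketch.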

When several candidates remain, each is concatenated with one preceding and one following word, and the word n-gram searcher (23) selects the candidate with the best value among the word 3-grams or 2-grams.

As a final step, if no candidate appears among the word 3-grams or 2-grams, or if the frequencies are equal, each candidate is concatenated with the two or three preceding and following words and passed to the morpheme tagger (41) for morpheme tagging, and the one with the highest probability is selected.

The statistical part-of-speech tagging system (40) contains the morpheme tagger (41). The morpheme tagger (41) is the block that performs morpheme tagging on an input sentence using the part-of-speech connectivity information DB (S3), the morpheme dictionary DB (S4), and the part-of-speech 3-gram DB (S5). It performs morpheme tagging on the results produced by the word n-gram-based corrector (32), the tagger-based corrector (35), the rule-based corrector (36), and the algorithm-based corrector (37) within the spacing and joining error/typo correction unit (30) and selects the result with the highest probability. The statistical part-of-speech tagging system (40) analyzes raw or tagged corpora containing real-world natural language usage and ancillary information, extracts statistical information about natural language, and uses the resulting probabilities to resolve ambiguity in natural language processing probabilistically.

The corrected sentence output unit (50) outputs the sentences of the target language data in the target text corpus DB (S6) once the spacing and joining error/typo correction unit (30) has finished correcting them.

The method for correcting spacing and spelling using word n-grams according to the present invention, based on the configuration described above, is now described in more detail with reference to the flowchart of FIG. 2.

First, the word n-gram extractor (11) in the word n-gram construction unit (10) receives refined, error-free text from the refined text corpus DB (S1) and extracts word n-grams from it (step 201), then stores the extracted word n-grams in the word n-gram DB (S2) (step 202).

In this state, the token separator (21) in the word n-gram search and verification unit (20) receives the target language data to be verified from the target text corpus DB (S6), converts it into a structure in which Latin letters, digits, symbols, Hangul, Hanja, and so on are mapped to a single symbol representing each class, and passes it to the word n-gram searcher (23) (step 203).

The word n-gram searcher (23) searches the word n-gram DB (S2) for each word output by the token separator (21), proceeding in the order 3-gram → 2-gram → 1-gram, to determine whether an identical word exists (step 204).

If an identical word is found in the search step (204), the target language data is already refined, so the target language data from the target text corpus DB (S6) is passed to the corrected sentence output unit (50) and output (step 205).

If a word with no match is found in the search step (204), the unmatched target language data is passed to the spacing and joining error/typo correction unit (30) (step 206).

The rule-based corrector (31) in the spacing and joining error/typo correction unit (30) corrects the unmatched target language data according to the rules stored in the spacing and joining rule DB (S7), which are defined according to the orthography established by the Ministry of Education. For example, criteria such as "쯤 is a suffix and must always be attached to the preceding word" or "a space must always precede 것" are structured and stored in the spacing and joining rule DB (S7) (step 207).

If correction is still incomplete after the rule-based corrector (31), the word n-gram-based corrector (32) takes over. It joins each unverified word with its neighboring words or splits it to produce a variety of candidates, then checks whether each candidate exists in the word n-gram DB (S2) to generate several candidates (step 208).

One candidate is selected by the frequency stored in the word n-gram DB (S2) (for example, the most frequently occurring word). If the frequencies are equal, each candidate is assembled into a sentence together with a few preceding and following words and passed to the morpheme tagger (41) for morpheme tagging, and the candidate with the highest probability is selected (step 209).

Next, the proper noun estimator (33) is applied to the words that remain unverified after the rule-based corrector (31) and the word n-gram-based corrector (32). It uses the user dictionary and the personal name/place name/business name dictionary DB (S8) to determine whether the word appears there and corrects accordingly. For example, when "씨" (Mr./Ms.), "군" (a title for a young man), "양" (Miss), "박사" (Dr.), or a similar cue appears in an adjacent word, it infers that the word is a proper noun and corrects accordingly (step 210).

Next, the loanword estimator (34) corrects the words that remain unverified after the rule-based corrector (31), the word n-gram-based corrector (32), and the proper noun estimator (33). It extracts loanwords using the loanword dictionary DB (S9), and detects loanwords with an HMM (step 211).

Next, the tagger-based corrector (35) corrects the words that remain unverified after the rule-based corrector (31), the word n-gram-based corrector (32), the proper noun estimator (33), and the loanword estimator (34) (step 212). It splits each word into syllables, recombines the syllables into various word sequences, and passes these to the morpheme tagger (41) for morpheme tagging; the word combination with the highest probability is selected (step 213).

Turning to typo correction, the rule-based corrector (36) applies the typo correction rules stored in the typo correction rule DB (S10) to the words that remain unverified after spacing and joining error correction. Typos are also corrected by the algorithm-based corrector (37) (step 214).

That is, the algorithm-based corrector (37) corrects the words that remain unverified after the rule-based corrector (31), the word n-gram-based corrector (32), the proper noun estimator (33), the loanword estimator (34), and the tagger-based corrector (35). It builds a candidate word list using only the 1-grams in the word n-gram DB (S2) (step 215).

The algorithm-based corrector (37) builds the word list from words whose first few phonemes, or last few phonemes, match those of the unverified word. Words with the same number of syllables are selected first; if no word has the same number of syllables, the words whose syllable count is closest are selected.

Next, to narrow the list produced by the procedure above, the words whose phoneme count is closest are selected from among the words with the same or a nearby syllable count. Words that share more phonemes with the unverified word take priority. Several candidates may still remain after this step (step 216).

When several candidates remain, each is concatenated with one preceding and one following word, and the word n-gram searcher (23) selects the candidate with the best value among the word 3-grams or 2-grams (step 217).

As the final step, if the candidate is not found in the word 3-grams or 2-grams, or if the frequencies are equal, the two to three preceding and following words are joined and passed to the morphological tagger (41), which performs morphological tagging, and the candidate with the highest probability is selected (step 218).

The target language data in the target language data DB (S6), once its correction by the spacing and joining error/typo correction unit (30) is complete, is output through the corrected sentence output unit (50) (step 219).

Meanwhile, the components of the spacing and joining error/typo correction unit (30), namely the rule-based corrector (31) for spacing and joining, the word n-gram corrector (32), the proper noun estimator (33), the loanword estimator (34), the tagger-based corrector (35), the rule-based corrector (36) for typo correction, and the algorithm-based corrector (37), may be applied in numerical order, but the result is not greatly affected by the order.

As described above, the present invention builds a word n-gram language model from a large refined corpus, applies that model to the corpus to be verified to extract a list of unverified words, and corrects errors in the extracted word list by means of grammar rules, morphological analysis, proper noun estimation, loanword estimation, and the like. Spacing errors and spelling errors in the input target language data can thereby be corrected automatically. In addition, when applied to a document editor, the invention can provide sentence-level automatic spacing and typo correction, and it can detect and more accurately correct spacing or spelling errors in syllable pairs recognized by a character recognizer or a speech recognizer.

FIG. 1 is a block diagram of a spacing and spelling correction device using word n-grams according to the present invention, and

FIG. 2 is a detailed flowchart of a spacing and spelling correction method using word n-grams according to the present invention.

<Description of reference numerals for the main parts of the drawings>

10: word n-gram construction unit
11: word n-gram extractor
20: word n-gram search and verification unit
21: token separator
23: word n-gram searcher
30: spacing and joining error/typo correction unit
31, 36: rule-based corrector
32: word n-gram corrector
33: proper noun estimator
34: loanword estimator
35: tagger-based corrector
37: algorithm-based corrector
40: statistics-based part-of-speech tagging system
41: morphological tagger
50: corrected sentence output unit

Claims

1. A spacing and spelling correction device using word n-grams, comprising:

a word n-gram DB that stores word n-grams;

a word n-gram construction unit that receives refined, error-free language data, extracts word n-grams, and stores the extracted word n-grams in the word n-gram DB;

a word n-gram search and verification unit that receives target language data to be verified, converts it into a structure in which each word in the language data is mapped to a single symbol, and searches the word n-gram DB to determine whether an identical word exists for each converted word;

a spacing and joining error/typo correction unit that corrects spacing and joining errors and typos in words not verified by the word n-gram search and verification unit;

a statistics-based part-of-speech tagging system that performs morphological tagging on the words corrected by the spacing and joining error/typo correction unit and selects and applies the result with the highest probability; and

a corrected sentence output unit that outputs the target language data whose correction has been completed through processing by the spacing and joining error/typo correction unit and the statistics-based part-of-speech tagging system.

2. The device of claim 1, further comprising a rule-based corrector that corrects words not verified by the word n-gram search and verification unit according to defined spacing and joining rules.

3. The device of claim 1, further comprising a word n-gram corrector that, for a word not verified by the word n-gram search and verification unit, creates a plurality of word lists by joining the word with the preceding and following words or by splitting the word, and generates a candidate word list by determining whether the words in the plurality of word lists are contained in the word n-gram DB.

4. The device of claim 2 or 3, wherein, for the words in word lists that remain unverified after the rule-based corrector and the word n-gram corrector, each word is split into syllables, the syllables are combined into multiple word combinations, the statistics-based part-of-speech tagging system performs morphological tagging on them, and the word combination with the highest probability is selected.

5. The device of claim 1, further comprising an algorithm-based corrector that, for the words in a word list not verified by the spacing and joining error correction, uses only the 1-grams in the word n-gram DB to create a list of words whose first few phonemes are the same or whose last few phonemes are the same, giving priority to words with the same number of syllables and, if the number of syllables is not the same, selecting the words with the closest number of syllables to create the word list.

6. A spacing and spelling correction method using word n-grams, comprising:

receiving refined, error-free language data, extracting word n-grams, and storing them in a word n-gram DB;

a search step of receiving target language data to be verified, converting it into a structure in which each word in the language data is mapped to a single symbol, and searching the word n-gram DB to determine whether an identical word exists for each converted word;

a spacing and joining error correction step of, if no identical word exists in the search step, correcting spacing and joining errors and typos in the word, performing morphological tagging on the corrected result, and selecting and applying the result with the highest probability; and

outputting the corrected target language data for the target language data to be verified.

7. The method of claim 6, further comprising outputting the input target language data when the same word is present in the search step.

8. The method of claim 6, further comprising a typo correction step of correcting typos, using typo correction rules, in the word lists not verified by the spacing and joining error correction step.

9. The method of claim 8, wherein, when unverified word lists arise, the typo correction step corrects them by using only the 1-grams in the word n-gram DB to create a list of words whose first few phonemes are the same or whose last few phonemes are the same, giving priority to words in the created list that have the same number of syllables.

10. The method of claim 9, wherein, from the created list, the words whose number of phonemes is closest are selected from among the words with the same or a nearby number of syllables.

11. The spacing and spelling correction method using word n-grams, wherein, for the plurality of selected lists, a word n-gram searcher joins the one word before and the one word after and selects the best value from the word 3-grams or 2-grams, and, if the word is not found in the word 3-grams or 2-grams or the frequencies are equal, the two to three preceding and following words are joined, morphological tagging is performed, and the one with the highest probability is selected.