KR20240119459A

KR20240119459A - A voice data processing device and operation method for automatic speech recognition model

Info

Publication number: KR20240119459A
Application number: KR1020230011531A
Authority: KR
Inventors: 정한별; 김경훈
Original assignee: 주식회사 아이에스피디
Priority date: 2023-01-30
Filing date: 2023-01-30
Publication date: 2024-08-06

Abstract

A voice data processing device according to an embodiment of the present invention includes: a recognition model building unit that builds in advance a multi-speech-recognition model comprising a plurality of speech recognition models; a multi-recognition processing unit that inputs speech recognition target data into the multi-speech-recognition model and obtains a processing result from each of the speech recognition models; a jamo string similarity measurement unit that derives a jamo string recognition result (a sequence of separated Korean consonant and vowel letters) for each speech recognition model from the models' recognition results, compares the jamo string recognition results, and calculates the jamo string similarity between speech recognition models; and a data labeling unit that, based on the jamo string similarity, automatically labels data with an output value that uses the recognition result of one or more of the plurality of speech recognition models.

Description

{A voice data processing device and operation method for automatic speech recognition model}

The present invention relates to a voice data processing device and a method of operating the same. More specifically, it relates to a voice data processing device for an automatic speech recognition model and a method of operating the same.

In general, automatic speech recognition (ASR), also known as speech-to-text (STT), is the process of automatically recognizing recorded speech and converting it into text.

Data for STT can be viewed broadly from an acoustic perspective and a linguistic perspective. From the acoustic perspective, the data mainly concerns environmental factors such as the speaker, the space, and noise; from the linguistic perspective, it is mainly language data for modeling vocabulary, context, and grammar.

STT is also broadly divided into an offline training stage, which generates a recognition network model from speech and language data, and an online search stage, which recognizes speech uttered by a user. The STT engine uses prior knowledge of speech and language data to output text from the speech signal; because this amounts to interpreting the signal, the STT algorithm is also called a decoder. In the decoding stage, the input feature vectors are compared with and scored against the models using the acoustic model, language model, and pronunciation lexicon produced by the training stage, and the final word sequence is determined.

Acoustic modeling is the process of generating representative patterns, as a probabilistic model, of the acoustic characteristics of pronunciation in each phonological environment of the language. Language modeling is the process of statistically learning the grammar system of the language with respect to usage, such as vocabulary choice and sentence-level syntactic structure. Building a pronunciation lexicon requires implementing grapheme-to-phoneme (G2P) conversion, which converts text into its pronounced form. Because pronunciation conversion rules that target standard pronunciation alone often cannot reflect the varied patterns arising from dialects or from a user's speaking habits and tone, a separate lexicon must be built.

Most acoustic models built this way use probabilistic-statistical methods or neural-network-based deep learning, and the STT performance of individual models is being improved through research on each model and according to each model's characteristics.

Meanwhile, STT is used as one of the natural language processing technologies behind chatbot functions, which have recently been implemented through various APIs. It is widely used in car navigation, kiosks, interactive IVR, virtual assistants, smart homes, and the like, and such chatbot technology is implemented by combining TTS, STT, and AI processing technologies.

However, the recognition accuracy of individual automatic speech recognition models is still limited to about 90%, and recognition errors continue to occur. In particular, recognition accuracy varies widely because of everyday noise and the imprecise pronunciation of the elderly and children.

In addition, in an automatic speech recognition model whose recognizable words are predefined, the grammar correction procedure converts recognized words into commonly used words, so the speaker's intended meaning can be distorted, particularly when recognizing words with similar pronunciations.

In particular, when an automatic speech recognition model with a limited recognizable vocabulary is used, misrecognized words or characters cannot be corrected at all, so the resulting loss of recognition accuracy in real-world use cannot be overcome. Improving recognition accuracy through a fundamental change to this automatic speech recognition approach is therefore required.

The present invention was devised to solve the problems described above. Its object is to provide a voice data processing device for an automatic speech recognition model, and a method of operating the same, that separates the per-model speech recognition predictions of multiple artificial intelligence speech recognition models into jamo (individual Korean consonant and vowel letters) and automatically labels data according to the similarity of the resulting jamo strings, so that recognition accuracy can be corrected even when the recognizable vocabulary is limited and, in particular, variation in recognition accuracy caused by everyday noise and the imprecise pronunciation of the elderly and children can be minimized.

A device according to an embodiment of the present invention for solving the above problems includes: a recognition model building unit that builds in advance a multi-speech-recognition model comprising a plurality of speech recognition models; a multi-recognition processing unit that inputs speech recognition target data into the multi-speech-recognition model and obtains a processing result from each of the speech recognition models; a jamo string similarity measurement unit that derives a jamo string recognition result for each speech recognition model from the models' recognition results, compares the jamo string recognition results, and calculates the jamo string similarity between speech recognition models; and a data labeling unit that, based on the jamo string similarity, automatically labels data with an output value that uses the recognition result of one or more of the plurality of speech recognition models.

A method according to an embodiment of the present invention for solving the above problems includes: building in advance a multi-speech-recognition model comprising a plurality of speech recognition models; inputting speech recognition target data into the multi-speech-recognition model and obtaining a processing result from each of the speech recognition models; deriving a jamo string recognition result for each speech recognition model from the models' recognition results, comparing the jamo string recognition results, and calculating the jamo string similarity between speech recognition models; correcting, based on the jamo string similarity and by means of a chatbot query, an output value that uses the recognition result of one or more of the plurality of speech recognition models; and outputting recognition result information corresponding to the speech recognition target data using the correction result and the jamo string similarity.

According to an embodiment of the present invention, the per-model speech recognition predictions of multiple artificial intelligence speech recognition models are separated into jamo, and the voice data is labeled according to the similarity of the resulting jamo strings. Data for speech recognition can thus be labeled automatically, and the recognition accuracy of each labeled voice data item can be greatly improved over existing individual models.

Accordingly, an embodiment of the present invention can provide a voice data processing device, and a method of operating the same, that can correct recognition accuracy even when the recognizable vocabulary is limited and, in particular, can minimize variation in recognition accuracy caused by everyday noise and the imprecise pronunciation of the elderly and children.

FIG. 1 is a diagram schematically showing the entire system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the voice data processing device according to an embodiment of the present invention in more detail.
FIG. 3 is a flowchart for explaining the operation of the voice data processing device according to an embodiment of the present invention.
FIG. 4 shows a jamo-string-similarity-based automatic data labeling and correction table for explaining the speech recognition process according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram for explaining the automatic voice labeling process according to an embodiment of the present invention.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various devices and methods that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. In addition, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not being limited to the specifically listed embodiments and conditions.

It should also be understood that every detailed description reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, is intended to encompass their structural and functional equivalents. These equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Accordingly, for example, the block diagrams herein should be understood as representing a conceptual view of exemplary circuitry embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood as representing various processes that can be substantially represented on a computer-readable medium and performed by a computer or processor, whether or not the computer or processor is explicitly shown.

Furthermore, the explicit use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random-access memory (RAM), and non-volatile memory. Other well-known, commonly used hardware may also be included.

The above objects, features, and advantages will become clearer through the following detailed description taken in conjunction with the accompanying drawings, so that those having ordinary skill in the art to which the invention pertains will be able to readily practice its technical idea. In describing the invention, a detailed description of known techniques related to the invention is omitted where it is judged that it would unnecessarily obscure the gist of the invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a diagram schematically showing the entire system according to an embodiment of the present invention, and FIG. 2 is a block diagram illustrating the voice data processing device according to an embodiment of the present invention in more detail.

Referring to FIG. 2, a system according to an embodiment of the present invention may include a voice data processing device 100 and a user terminal 200.

More specifically, the voice data processing device 100 and the user terminal 200 may be connected by wire, wirelessly, or both through a public network and may transmit and receive data. The public network is a communication network built and managed by the state or a telecommunications carrier. It generally includes telephone networks, data networks, CATV networks, and mobile communication networks, and provides connection services so that the general public can access other communication networks or the Internet.

Here, the public network may be implemented as any type of wired or wireless network, such as a local area network (LAN), a wide area network (WAN), a value added network (VAN), a personal area network (PAN), a mobile radio communication network, or a satellite communication network.

The user terminal 200 described in this specification may include a personal computer (PC), a laptop computer, a mobile phone, a smartphone, a tablet PC, a personal digital assistant (PDA), a portable multimedia player (PMP), and the like.

In addition, the voice data processing device 100 and the user terminal 200 are not limited to the device categories above and may include server-system-related devices whose data processing, storage, and management functions can be enhanced and extended.

The voice data processing device 100 and the user terminal 200 may also each include a communication module for communicating using the protocol corresponding to each communication network.

The voice data processing device 100 may be a service providing device that provides a speech recognition service at the request of the user terminal 200.

The voice data processing device 100 may be a service providing device that performs speech recognition and automatic labeling on speech recognition target data received from the user terminal 200, using the jamo string similarity of the multiple artificial intelligence speech recognition models according to an embodiment of the present invention, and provides the automatically labeled speech recognition data to the user terminal 200.

The voice data processing device 100 may also be a service providing device that provides the user terminal 200 with a speech recognition application that performs speech recognition, voice data correction, and automatic labeling using the jamo string similarity of the multiple artificial intelligence speech recognition models according to an embodiment of the present invention.

In addition, the user terminal 200 and the voice data processing device 100 may operate as a single speech recognition system, and parts of the speech recognition function that uses the jamo string similarity of the multiple artificial intelligence speech recognition models according to an embodiment of the present invention may be distributed between the user terminal 200 and the voice data processing device 100.

Accordingly, although this specification focuses on the configuration and operation of the voice data processing device 100, it will be apparent that the user terminal 200 may also include the components of the voice data processing device 100 or perform its operations.

More specifically, the voice data processing device 100 according to an embodiment of the present invention includes a recognition model building unit 110, a multi-recognition processing unit 120, a jamo string similarity measurement unit 130, and a data labeling unit 140.

First, the recognition model building unit 110 builds in advance a multi-speech-recognition model comprising a plurality of speech recognition models.

Here, the multi-speech-recognition model may include a plurality of speech recognition models. The speech recognition models may each be classified and built according to the various speech recognition methods currently known, and may be classified and built as different models depending on, for example, the classification of the training target, the category of speech learned, and the algorithm used.

More specifically, as shown in FIG. 2, the recognition model building unit 110 may build the multi-speech-recognition model in advance by configuring a speech recognition service process that runs multiple recognitions in parallel, using, for example, the APIs of a plurality of automatic speech recognition (ASR) models offered per service by the vendors that operate them.

For example, referring to FIG. 2, the recognition model building unit 110 may build the multi-speech-recognition model in advance by configuring it to use several services together, such as Google's Conformer service, Google's speech recognition API service, Facebook's wav2vec 2.0 service, Microsoft's Azure API service, Amazon's Transcribe API service, IBM's Watson Services, NAVER's HyperCLOVA service, SK Telecom's A. (Adot) service, and various other open-source Korean speech recognition model services based on the KoSpeech model. The output of this multi-speech-recognition model may be a recognition result for each speech recognition model, and each recognition result may include a word recognition result or a character recognition result consisting of text data.

Each speech recognition model may also be configured individually according to its own training model, and may include one or more neural network models that perform deep learning on a data set corresponding to that training model. For example, a neural network model may be a deep learning neural network model trained on voice data and text recognition results for a specific category such as age, gender, occupation, or health. By learning the speech of each category and its recognition results using a known CNN, RNN, or CRNN method, a trained model that outputs each speech recognition result can be built.

The multi-recognition processing unit 120 inputs speech recognition target data into the multi-speech-recognition model and obtains a processing result from each of the plurality of speech recognition models. As described above, the models built by the recognition model building unit 110 according to each classification may output differing recognition results. Each recognition result may include a word recognition result or a character recognition result consisting of text data.
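The fan-out performed by the multi-recognition processing unit can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Recognizer` type, the `recognize_with_all` function, and the two stand-in models are hypothetical, and real models would wrap vendor ASR APIs or locally trained networks.

```python
from typing import Callable, Dict

# Hypothetical recognizer type: takes raw audio bytes, returns recognized text.
Recognizer = Callable[[bytes], str]


def recognize_with_all(audio: bytes, models: Dict[str, Recognizer]) -> Dict[str, str]:
    """Feed the same audio to every model and collect one text result per model."""
    return {name: model(audio) for name, model in models.items()}


# Stand-in models that ignore the audio and return fixed text (assumption for
# illustration only; they are not real ASR engines).
models: Dict[str, Recognizer] = {
    "model_1": lambda audio: "신라면",
    "model_2": lambda audio: "진라면",
}

results = recognize_with_all(b"\x00\x01", models)
```

Keeping the per-model results keyed by model name makes it straightforward to compare them pairwise in a later step.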

The jamo string similarity measurement unit 130 derives a jamo string recognition result for each speech recognition model from the recognition results of the speech recognition models, compares the jamo string recognition results, and calculates the jamo string similarity between speech recognition models.

Here, the jamo string similarity measurement unit 130 may derive the jamo string recognition result by separating the word recognition result or character recognition result of a speech recognition model so that its consonants and vowels are listed as independent character codes.

For example, the jamo string similarity measurement unit 130 may derive the jamo string recognition result by converting first-language recognition result data, mapped to the character code of the character recognition result or to the one or more character codes constituting the word recognition result, into relative distance position values corresponding to the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and then combining them.
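The conversion of a character code into initial, medial, and final position values can be illustrated with Unicode arithmetic. The sketch below assumes that the "relative distance position values" correspond to the standard index of each letter within the Unicode Hangul Syllables block (starting at U+AC00), which the source does not state explicitly; the function name `syllable_indices` is hypothetical.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable, '가'
NUM_INITIALS = 19
NUM_MEDIALS = 21
NUM_FINALS = 28        # includes index 0, meaning "no final consonant"


def syllable_indices(ch: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) indices of one precomposed Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIALS * NUM_MEDIALS * NUM_FINALS:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    # code = base + (initial * 21 + medial) * 28 + final
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return initial, medial, final


shin = syllable_indices("신")
jin = syllable_indices("진")
```

Under this assumption, '신' and '진' share the same medial and final indices and differ only in the initial index.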

Accordingly, the jamo string similarity measurement unit 130 may calculate the mutual similarity between the jamo string recognition results of the plurality of speech recognition models, and may calculate the jamo string similarity by combining the calculated mutual similarities.
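One way to picture the pairwise comparison and its combination is sketched below. The source does not name a similarity metric or a combining rule, so both choices here are assumptions: the standard library's `difflib.SequenceMatcher` ratio is used as the pairwise measure, and the pairwise values are combined by a simple mean. The function names and the three sample jamo strings are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, Tuple


def pairwise_similarity(jamo_by_model: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Similarity of every unordered pair of per-model jamo strings (assumed metric)."""
    return {
        (a, b): SequenceMatcher(None, jamo_by_model[a], jamo_by_model[b]).ratio()
        for a, b in combinations(sorted(jamo_by_model), 2)
    }


def combined_similarity(pairs: Dict[Tuple[str, str], float]) -> float:
    """Combine pairwise similarities by averaging (assumed combining rule)."""
    return sum(pairs.values()) / len(pairs)


# Jamo strings already separated per model (sample data for illustration).
jamo_by_model = {
    "model_1": "ㅅㅣㄴㄹㅏㅁㅕㄴ",
    "model_2": "ㅈㅣㄴㄹㅏㅁㅕㄴ",
    "model_3": "ㅅㅣㄴㄹㅏㅁㅕㄴ",
}
pairs = pairwise_similarity(jamo_by_model)
overall = combined_similarity(pairs)
```

Other metrics, such as a normalized edit distance, or other combining rules, such as a weighted vote, would fit the same structure.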

More specifically, the jamo string similarity measurement unit 130 may obtain the jamo string recognition result by separating the consonant and vowel letters of each word-level recognition result (words being delimited by spaces) or each character-level recognition result in pronunciation order, that is, initial, medial, and final, converting each into its own character code, and concatenating the separated character codes. For example, the speech recognition result for the Hangul syllable '챗' is separated and converted by the jamo string similarity measurement unit 130, on a Unicode basis, into its initial, medial, and final letters, yielding 'ㅊㅐㅅ' as the jamo string recognition result.
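The separation of '챗' into 'ㅊㅐㅅ' can be reproduced with a short sketch. It assumes the output uses Unicode Hangul Compatibility Jamo letters, as the example in the text appears to; the function name `to_jamo_string` and the lookup tables are illustrative and not taken from the source.

```python
HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = 19 * 21 * 28

# Letters in Unicode syllable-composition order (compatibility jamo forms).
INITIALS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
# Index 0 means the syllable has no final consonant.
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_jamo_string(text: str) -> str:
    """Split each precomposed Hangul syllable into initial, medial, final letters."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)  # leave spaces, digits, Latin letters, etc. unchanged
    return "".join(out)


chat = to_jamo_string("챗")
shin_ramyun = to_jamo_string("신라면")
```

Characters outside the Hangul Syllables block are passed through unchanged, so mixed-script recognition results still produce a comparable string.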

그리고, 자모열 유사도 측정부(130)는, 분리 구성된 자모열 인식 결과를 비교함에 따라 각 음성 인식 모델 간 자모열 유사도를 측정할 수 있다. 이러한 자모열 분리 및 비교방식은, 기존의 기존의 단어유사도 또는 글자유사도 비교에서의 유사발음 단어들에 대한 부적절한 비교과정에서의 오류를 방지할 수 있게 한다.Additionally, the letter string similarity measuring unit 130 may measure the letter string similarity between each speech recognition model by comparing the separately constructed letter string recognition results. This method of separating and comparing letter sequences makes it possible to prevent errors in the inappropriate comparison process for words with similar pronunciations in existing word similarity or letter similarity comparisons.

For example, when the first recognition model recognizes a given utterance as '신라면' (Shin Ramyun) and the second recognition model recognizes it as '진라면' (Jin Ramyun), a multiple-recognition scheme based on word similarity treats '진라면' and '신라면' as entirely dissimilar words. It therefore cannot determine whether the first or the second recognition model made the error, and can only output a recognition failure.

Likewise, a recognition process based on character similarity treats '진' and '신' as entirely different characters, so it too yields only a recognition failure, and the system cannot compensate for or correct the error on its own.

In contrast, the jamo string similarity measurement unit 130 according to an embodiment of the present invention measures jamo string similarity. The first model's result '신라면' is separated into the jamo string 'ㅅㅣㄴㄹㅏㅁㅕㄴ' and the second model's result '진라면' into 'ㅈㅣㄴㄹㅏㅁㅕㄴ', and the jamo string similarity between the two results comes out at 70% or more. It can thus be confirmed that the two recognitions differ only in the first consonant, and the position of the character code to be corrected or deleted can be derived.

In addition, to perform this jamo separation efficiently, the jamo string similarity measurement unit 130 may calculate the jamo string recognition result by converting the first-language recognition result data, mapped to the character code of the character recognition result or to the one or more character codes constituting the word recognition result, into relative distance position values corresponding to the initial consonant, medial vowel, or final consonant, and combining those values.

For example, Hangul in Unicode uses 19 initial consonants, 21 medial vowels, and 27 final consonants, so the number of composable syllables is 19 × 21 × 28 (the 28 including the no-final-consonant case) = 11,172. Within the range of Hangul character codes, the distance between consecutive initial consonants, between consecutive medial vowels, and between consecutive final consonants is always constant.
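The arithmetic above can be checked directly against Unicode code points. The following is a minimal sketch; the constant names (`N_INITIAL`, `final_step`, and so on) are illustrative and do not appear in the original text.

```python
# Layout of the Unicode Hangul syllable block, which starts at '가' (U+AC00).
BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # 28 = 27 final consonants + "no final"

total = N_INITIAL * N_MEDIAL * N_FINAL
print(total)  # 11172

# Code point step between syllables that differ in exactly one position.
final_step = 1
medial_step = N_FINAL               # 28
initial_step = N_MEDIAL * N_FINAL   # 588

print(ord('각') - ord('가') == final_step)    # final: none -> ㄱ
print(ord('개') - ord('가') == medial_step)   # medial: ㅏ -> ㅐ
print(ord('까') - ord('가') == initial_step)  # initial: ㄱ -> ㄲ

# The last syllable of the block sits total - 1 positions after '가'.
print(chr(BASE + total - 1))
```

Because each step is constant, integer division and remainder by 28 and 21 are enough to recover the three jamo positions from a syllable's offset.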

Accordingly, the jamo string similarity measurement unit 130 may convert the speech recognition result into Unicode and, for each character of the converted string, calculate the relative distances of the initial consonant, medial vowel, and final consonant from the start of the Unicode Hangul block ('가'), concatenating them to construct the jamo string recognition result.

For example, for '췟', measured from '가', the first syllable in the Unicode table, the final-consonant distance to 'ㅅ' is 19, the medial-vowel distance to 'ㅞ' is 15, and the initial-consonant distance to 'ㅊ' is 14. Taking these relative distance position values as the jamo string recognition result and converting and concatenating the character codes they map to yields the separated jamo recognition result 'ㅊㅞㅅ'.

For Hangul character recognition, the construction of the jamo string can be expressed in code as follows.

******************************************************************************************

# Lookup string for initial consonants
chut = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ#'

# Lookup string for medial vowels
ga = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ#'

# Lookup string for final consonants; index 0 (a space) means "no final consonant"
ggut = ' ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ#'

# Start of the Unicode Hangul syllable block ('가')
BASE = 0xAC00

# Speech recognition result (a single Hangul syllable)
query = '췟'

# Offset of the syllable from BASE; ord() returns the Unicode code point of a character
code = ord(query) - BASE

# Relative position of the final consonant
jongsung = code % 28

# Relative position of the medial vowel
jungsung = ((code - jongsung) // 28) % 21

# Relative position of the initial consonant
chosung = ((code - jongsung) // 28) // 21

# Print the jamo mapped to each relative position
print(chut[chosung], ga[jungsung], ggut[jongsung])

Running the above code produces the following result.

>>> ㅊ ㅞ ㅅ

Here, the variables chut (from 첫소리, initial sound), ga (from 가운뎃소리, medial sound), and ggut (from 끝소리, final sound) are names commonly used in jamo composition.
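The listing above handles a single syllable. As a hedged sketch of how the same arithmetic could be applied to a whole word or sentence, the helper below (the name `to_jamo_string` and the choice to pass non-Hangul characters through unchanged are assumptions, not part of the original text) decomposes every Hangul syllable in a string and omits the final consonant when there is none.

```python
CHOSEONG = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'
JUNGSEONG = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'
JONGSEONG = ' ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ'
BASE, LAST = 0xAC00, 0xD7A3  # '가' through '힣'


def to_jamo_string(text: str) -> str:
    """Split each Hangul syllable into its jamo; leave other characters as-is."""
    out = []
    for ch in text:
        cp = ord(ch)
        if BASE <= cp <= LAST:
            initial, rest = divmod(cp - BASE, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(CHOSEONG[initial])
            out.append(JUNGSEONG[medial])
            if final:  # index 0 means the syllable has no final consonant
                out.append(JONGSEONG[final])
        else:
            out.append(ch)
    return ''.join(out)


print(to_jamo_string('챗'))      # ㅊㅐㅅ
print(to_jamo_string('신라면'))  # ㅅㅣㄴㄹㅏㅁㅕㄴ
```

Dropping the empty final consonant keeps the output in line with the jamo strings quoted in the description, such as 'ㅅㅣㄴㄹㅏㅁㅕㄴ' for '신라면'.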

Meanwhile, the jamo string similarity measurement unit 130 may measure the jamo string similarity between speech recognition models by comparing the jamo string recognition results obtained from each of the multiple speech recognition models. For this measurement, the ratio method of the well-known SequenceMatcher algorithm, for example, can be used, so that a similarity-computing function yields the jamo string similarity between speech recognition models. This function compares two strings and outputs their similarity; for example, for 'banana' and the similar 'manana' it reports about 83% similarity.
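A minimal sketch of this measurement, assuming Python's standard `difflib.SequenceMatcher` is the implementation meant by the text. The wrapper name `jamo_similarity` is hypothetical; the jamo strings are the ones given in the description for '신라면' and '진라면'.

```python
from difflib import SequenceMatcher


def jamo_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, via SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a, b).ratio()


print(round(jamo_similarity('banana', 'manana'), 2))  # 0.83

first = 'ㅅㅣㄴㄹㅏㅁㅕㄴ'   # jamo string for '신라면'
second = 'ㅈㅣㄴㄹㅏㅁㅕㄴ'  # jamo string for '진라면'
print(jamo_similarity(first, second))       # 0.875
print(jamo_similarity('신라면', '진라면'))  # syllable level, roughly 0.67

# Locate where the two jamo strings disagree.
matcher = SequenceMatcher(None, first, second)
diffs = [op for op in matcher.get_opcodes() if op[0] != 'equal']
print(diffs)  # [('replace', 0, 1, 0, 1)]
```

The jamo-level score is higher than the syllable-level score for the same pair, and the opcode list points at index 0, which is the position information the description says can be used for correction or deletion.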

Here, the jamo string similarity may be calculated between each pair of speech recognition models, and the pairwise similarities may be combined to determine the final jamo string similarity. The final value may be the mean or median of the pairwise jamo string similarity values, or any of various statistics calculated by combining them.
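One way to combine the pairwise values can be sketched as follows. The function name `combined_similarity`, the `reducer` parameter, and the three sample results are illustrative assumptions; the text only states that a mean, a median, or another statistic may be used.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, median


def combined_similarity(jamo_results, reducer=mean):
    """Reduce all pairwise similarities to one value (needs at least two results)."""
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(jamo_results, 2)
    ]
    return reducer(scores)


# Hypothetical output of three models for the same word.
results = ['ㅈㅣㄴㄹㅏㅁㅕㄴ', 'ㅈㅣㄴㄹㅏㅁㅕㄴ', 'ㅅㅣㄴㄹㅏㅁㅕㄴ']

print(combined_similarity(results))          # mean of 1.0, 0.875 and 0.875
print(combined_similarity(results, median))  # 0.875
```

Passing `min` as the reducer would give the strictest reading, where a single disagreeing model lowers the final value the most.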

The data labeling unit 140 identifies the jamo string similarity for each word of the speech recognition target data and, when one or more words whose per-word jamo string similarity falls within the threshold condition are identified, may correct the output value based on the speech recognition target data, perform automatic data labeling, and store the result or transmit it to the user terminal 200.

Here, the data labeling unit 140 may correct the automatic data labeling information for a word that meets the threshold condition, using each speech recognition model's recognition result for that word. For example, the data labeling unit 140 may amplify the speech recognition target data and then delete the portion of the speech waveform that corresponds to the word whose jamo string similarity meets the threshold condition.

Accordingly, the processing result of the data labeling unit 140, namely speech data automatically labeled with the recognized text, may be stored and managed in a separate database (not shown) or provided to the user terminal 200.

Here, the data labeling and the correction of the speech data may be performed selectively by applying, to a pre-configured jamo-string-similarity-based automatic data labeling and correction table, whether the recognition results of the speech recognition models agree for each word identified as meeting the threshold condition on per-word jamo string similarity. This table and the corresponding operation of the data labeling unit 140 are described in more detail below with reference to FIGS. 4 and 5.

In addition, the automatically labeled speech data output by the data labeling unit 140 may include model identification information for each of the one or more speech recognition models, so that it can be determined which of the multiple recognition models produced a given recognition result. This information is fed back to the recognition model building unit 110, where it can be used to update the training data and improve reliability.

FIG. 3 is a flowchart illustrating the operation of a voice data processing device according to an embodiment of the present invention.

Referring to FIG. 3, the voice data processing device 100 according to an embodiment of the present invention first builds in advance a multiple speech recognition model including a plurality of speech recognition models for speech recognition (S101).

The voice data processing device 100 then inputs the speech recognition target data into the multiple speech recognition model and obtains the processing result of each of the speech recognition models (S103).

Next, the voice data processing device 100 separates the recognition results of the speech recognition models into jamo word by word, compares the separated results, and measures the jamo string similarity for each word (S105).

The voice data processing device 100 then performs automatic data labeling of the speech recognition target data based on the per-word jamo string similarity (S107).

FIG. 4 shows the jamo-string-similarity-based automatic data labeling and correction table used to explain the speech recognition process according to an embodiment of the present invention, and FIG. 5 is an example diagram illustrating the automatic speech labeling process according to an embodiment of the present invention.

Referring to FIGS. 4 and 5, the voice data processing device 100 according to an embodiment of the present invention may selectively correct the recognition results by applying, to the pre-configured jamo-string-similarity-based automatic data labeling and correction table, whether the recognition results of the speech recognition models agree for each word identified as meeting the threshold condition on per-word jamo string similarity.

FIG. 4 illustrates such a table. The data labeling unit 140 may determine which words to label automatically according to the jamo string similarity measured across the recognition results of the speech recognition models (ASR).

Referring to FIG. 4, when the recognition results of all models agree with one another in jamo string similarity, the recognition result may be labeled automatically without further correction. However, when one or more words exist whose jamo string similarity is measured to fall within the threshold condition (for example, below 100%, as when two of three results agree and one does not), exception handling may be applied, such as excluding those words from recognition.

Thus, referring to FIG. 4, multiple speech recognition models are used to recognize a user's speech, and comparing their recognition results by jamo string similarity makes it possible to check whether the recognition results contain errors. The voice data processing device 100 can also determine in advance where such errors occur and how many there are, and perform the appropriate correction or exception handling, so that accurate recognition results are obtained. Furthermore, because the recognition results are reused as training data for each model, accuracy can be improved continuously through feedback.

Referring to FIG. 5, when one or more words exist whose jamo string similarity meets the threshold condition, such as words with a similarity below 100%, exception handling may be applied, such as excluding those words from recognition. For speech data subject to this exception handling, the data labeling unit 140 may amplify the speech waveform and then, using the portions with no waveform (the gaps between words) as boundaries, delete the corresponding portion of the speech. Amplification makes the waveform-free gaps easier to distinguish, and the waveform portion that was not recognized properly and could cause errors is deleted, which improves the labeling accuracy of the recognized data.
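The amplify-then-delete step can be sketched on a toy signal as follows. This is only an illustration under strong assumptions: samples are floats in [-1, 1], a "gap" is any sample whose magnitude stays at or below a threshold, and the names `amplify`, `voiced_segments`, `delete_word`, `gain`, and `silence_threshold` are all hypothetical. Real audio would need frame-level energy rather than a per-sample test.

```python
def amplify(samples, gain):
    """Scale the waveform and clip it to the range [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def voiced_segments(samples, silence_threshold):
    """Return (start, end) index pairs of runs louder than the threshold."""
    segments, start = [], None
    for i, s in enumerate(samples):
        if abs(s) > silence_threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def delete_word(samples, word_index, gain=4.0, silence_threshold=0.05):
    """Amplify, split at silent gaps, and cut out the word_index-th segment."""
    boosted = amplify(samples, gain)
    start, end = voiced_segments(boosted, silence_threshold)[word_index]
    return boosted[:start] + boosted[end:]


# Toy waveform with three "words" separated by silent samples.
toy = [0.0, 0.1, 0.2, 0.0, 0.0, 0.05, 0.1, 0.0, 0.2, 0.2]

print(voiced_segments(toy, 0.05))                # before amplification
print(voiced_segments(amplify(toy, 4.0), 0.05))  # after amplification
print(len(delete_word(toy, 0)))                  # 8 samples remain
```

In this toy case the quiet onset of the second word only clears the threshold after amplification, which is one way to read the statement that amplification sharpens the separation between speech and gaps.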

As a result, referring to FIGS. 4 and 5, for the utterance '진라면에 참치를 넣으면 맛있다' ("Jin Ramyun tastes good with tuna added"), when the jamo-string-similarity-based recognition results of all the speech recognition models agree 100%, the text '진라면에 참치를 넣으면 맛있다' is mapped as is. When a model recognizes '진라면' as '신라면' and a word with a jamo string similarity below 100% therefore occurs, that word is removed from the labeling target and '참치를 넣으면 맛있다' is labeled automatically. In the speech data, only the portion corresponding to that word's speech interval is deleted after amplification, so that the automatically labeled text is mapped to the corrected speech data, which can then be stored and managed.
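The text side of this example can be sketched as below. Several things here are assumptions rather than the patent's method: `unicodedata.normalize('NFD', ...)` is used as a compact stand-in for the table-based jamo separation, the lowest pairwise similarity is used as the combined value, every transcript is assumed to contain the same number of space-delimited words, and the names `word_similarity` and `auto_label` are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def word_similarity(candidates):
    """Lowest pairwise jamo similarity among the models' versions of one word."""
    jamo = [unicodedata.normalize('NFD', w) for w in candidates]
    return min(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(jamo, 2)
    )


def auto_label(transcripts, threshold=1.0):
    """Keep words on which all models agree; report positions of dropped words.

    Needs at least two transcripts with the same number of words.
    """
    per_model_words = [t.split() for t in transcripts]
    kept, dropped = [], []
    for position, candidates in enumerate(zip(*per_model_words)):
        if word_similarity(candidates) >= threshold:
            kept.append(candidates[0])
        else:
            dropped.append(position)
    return ' '.join(kept), dropped


# Hypothetical outputs of three models for the same utterance.
transcripts = [
    '진라면에 참치를 넣으면 맛있다',
    '신라면에 참치를 넣으면 맛있다',
    '진라면에 참치를 넣으면 맛있다',
]

label, dropped = auto_label(transcripts)
print(label)    # 참치를 넣으면 맛있다
print(dropped)  # [0]
```

The dropped positions identify which word intervals would also be removed from the audio, so that the remaining label and the corrected speech data stay aligned.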

The method according to the present invention described above may be implemented as a program to be executed on a computer and stored in a computer-readable recording medium. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of carrier waves (for example, transmission over the Internet).

The computer-readable recording medium may be distributed across computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method can be readily inferred by programmers in the technical field to which the present invention pertains.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood in isolation from the technical idea or outlook of the present invention.

Claims

A voice data processing device comprising:
a recognition model building unit that builds in advance a multiple speech recognition model including a plurality of speech recognition models for speech recognition;
a multi-recognition processing unit that inputs speech recognition target data into the multiple speech recognition model and obtains a processing result from each of the speech recognition models;
a jamo string similarity measurement unit that calculates a jamo string recognition result for each speech recognition model from the recognition results of the multiple speech recognition model, compares the jamo string recognition results, and calculates a jamo string similarity between the speech recognition models; and
a data labeling unit that, based on the jamo string similarity, performs automatic data labeling of an output value using the recognition result of one or more of the plurality of speech recognition models.

The voice data processing device of claim 1,
wherein the jamo string similarity measurement unit
separates the word recognition results or character recognition results of the speech recognition models so that consonants and vowels are listed as independent character codes, thereby calculating the jamo string recognition results.

The voice data processing device of claim 2,
wherein the jamo string similarity measurement unit
converts first-language recognition result data, mapped to the character code of the character recognition result or to one or more character codes constituting the word recognition result, into relative distance position values corresponding to the initial consonant, medial vowel, or final consonant, and combines the values to calculate the jamo string recognition result.

The voice data processing device of claim 3,
wherein the jamo string similarity measurement unit
calculates the pairwise similarity between the jamo string recognition results of the plurality of speech recognition models and combines the calculated pairwise similarities to calculate the jamo string similarity.

The voice data processing device of claim 1,
wherein the data labeling unit
identifies the jamo string similarity for each word of the speech recognition target data and, when one or more words whose per-word jamo string similarity meets a threshold condition are identified, corrects the automatic data labeling information of the output value.

The voice data processing device of claim 5,
wherein the data labeling unit
corrects the automatic data labeling information of the word that meets the threshold condition, using each speech recognition model's recognition result for the word whose jamo string similarity meets the threshold condition.

The voice data processing device of claim 6,
wherein the data labeling unit
amplifies the speech recognition target data and then deletes the portion of the speech waveform corresponding to the word whose jamo string similarity meets the threshold condition.

A method of operating a voice data processing device, the method comprising:
building in advance a multiple speech recognition model including a plurality of speech recognition models for speech recognition;
inputting speech recognition target data into the multiple speech recognition model and obtaining a processing result from each of the speech recognition models;
calculating a jamo string recognition result for each speech recognition model from the recognition results of the multiple speech recognition model, comparing the jamo string recognition results, and calculating a jamo string similarity between the speech recognition models; and
a data labeling step of performing, based on the jamo string similarity, automatic data labeling of an output value using the recognition result of one or more of the plurality of speech recognition models.

The method of claim 8,
wherein the calculating of the jamo string similarity comprises
separating the word recognition results or character recognition results of the speech recognition models so that consonants and vowels are listed as independent character codes, thereby calculating the jamo string recognition results.

The method of claim 9,
wherein the calculating of the jamo string similarity comprises
converting first-language recognition result data, mapped to the character code of the character recognition result or to one or more character codes constituting the word recognition result, into relative distance position values corresponding to the initial consonant, medial vowel, or final consonant, and combining the values to calculate the jamo string recognition result.

The method of claim 10,
wherein the calculating of the jamo string similarity comprises
calculating the pairwise similarity between the jamo string recognition results of the plurality of speech recognition models and combining the calculated pairwise similarities to calculate the jamo string similarity.

The method of claim 8,
wherein the data labeling step comprises
identifying the jamo string similarity for each word of the speech recognition target data and, when one or more words whose per-word jamo string similarity meets a threshold condition are identified, correcting the automatic data labeling information of the output value.

The method of claim 12,
wherein the data labeling step comprises
correcting the automatic data labeling information of the word that meets the threshold condition, using each speech recognition model's recognition result for the word whose jamo string similarity meets the threshold condition.

The method of claim 13,
wherein the data labeling step comprises
amplifying the speech recognition target data and then deleting the portion of the speech waveform corresponding to the word whose jamo string similarity meets the threshold condition.