KR20200141126A

KR20200141126A - Device and method for preventing misperception of wake word

Info

Publication number: KR20200141126A
Application number: KR1020190067704A
Authority: KR
Inventors: 정재훈; 김종엽; 류창선; 조경일
Original assignee: 주식회사 케이티
Priority date: 2019-06-10
Filing date: 2019-06-10
Publication date: 2020-12-18
Anticipated expiration: 2039-06-10
Also published as: KR102402465B1

Abstract

A voice control device for preventing misrecognition of a wake-up word may comprise: a reception unit receiving a broadcast stream from a broadcast server; a reproducing unit reproducing sound data of the received broadcast stream inside the voice control device; an output unit outputting the sound data of the broadcast stream through an external device connected to the voice control device; an input unit receiving, through a microphone of the voice control device, the sound data output through the external device; a wake-up word detection unit detecting a first wake-up word from the sound data input through the microphone, and detecting a second wake-up word from the sound data reproduced inside the voice control device; and a determination unit determining whether a wake-up word is misrecognized based on a difference between a first recognition time for the first wake-up word and a second recognition time for the second wake-up word.

Description

Device and method for preventing misrecognition of a wake-up word {DEVICE AND METHOD FOR PREVENTING MISPERCEPTION OF WAKE WORD}

The present invention relates to a device and method for preventing misrecognition of a wake-up word.

In recent years, as demand for artificial intelligence services has grown, wake-up word recognition technology has become increasingly important. A wake-up word is a command that activates an artificial intelligence device.

To support a voice-based user interface and to keep sound information secure, an artificial intelligence device includes a built-in wake-up word recognizer capable of recognizing the wake-up word.

One cause of problems in a wake-up word recognizer is misrecognition due to echo. Echo is the phenomenon in which sound played by the artificial intelligence device itself is reproduced through its built-in speaker or an externally connected speaker and re-enters the microphone built into the device. An acoustic echo canceller (AEC) is used to remove this echo. However, even after the AEC removes the echo, a residual echo above a certain level remains.

When a wake-up word is present in a sound source played inside the artificial intelligence device (for example, a broadcast sound source), the residual echo of that wake-up word causes the wake-up word recognizer to enter the command recognition state. In this case, the artificial intelligence device outputs an unnecessary voice-recognition-ready notification to a user who is listening to music or watching a video, and stops content playback because of the misrecognized wake-up word.

Existing wake-up word recognition technology cannot know in advance that internally played sound contains the wake-up word or a similar-sounding word. Consequently, when a broadcast sound source or an arbitrary playback source contains the wake-up word, the artificial intelligence device may respond to the wake-up word when the user does not want it to.

Korean Patent Publication No. 2018-0107003 (published October 1, 2018)

The present invention is intended to solve the problems of the prior art described above. It determines whether a wake-up word has been misrecognized by comparing the recognition time of a first wake-up word, detected from sound data of a broadcast stream played inside the voice control device, with the recognition time of a second wake-up word, detected from sound data of the broadcast stream that passes through an external device connected to the voice control device and is input through the microphone of the voice control device. However, the technical problems to be addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a voice control device for preventing misrecognition of a wake-up word according to a first aspect of the present invention may include: a reception unit that receives a broadcast stream from a broadcast server; a reproducing unit that reproduces sound data of the received broadcast stream inside the voice control device; an output unit that outputs the sound data of the broadcast stream through an external device connected to the voice control device; an input unit that receives, through a microphone of the voice control device, the sound data output through the external device; a wake-up word detection unit that detects a first wake-up word from the sound data received through the microphone and detects a second wake-up word from the sound data reproduced inside the voice control device; and a determination unit that determines whether the wake-up word recognition is a misrecognition based on the difference between a first recognition time for the first wake-up word and a second recognition time for the second wake-up word.

A method for preventing misrecognition of a wake-up word in a voice control device according to a second aspect of the present invention may include: receiving a broadcast stream from a broadcast server; reproducing sound data of the received broadcast stream inside the voice control device; outputting the sound data of the broadcast stream through an external device connected to the voice control device; receiving, through a microphone of the voice control device, the sound data output through the external device; detecting a first wake-up word from the sound data received through the microphone; detecting a second wake-up word from the sound data reproduced inside the voice control device; and determining whether the wake-up word recognition is a misrecognition based on the difference between a first recognition time for the first wake-up word and a second recognition time for the second wake-up word.

The above-described means for solving the problem are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to any one of the above-described means for solving the problem, the present invention can determine whether a wake-up word has been misrecognized by comparing the recognition time of the first wake-up word, detected from sound data of a broadcast stream played inside the voice control device, with the recognition time of the second wake-up word, detected from sound data of the broadcast stream that passes through an external device connected to the voice control device and is input through the microphone of the voice control device.

In this way, the present invention minimizes the false recognition rate and, for a wake-up word that originates from the device's own playback, suppresses the response to that wake-up word, which yields a more stable voice recognition environment. For example, the present invention can eliminate the situation in which playback of a broadcast stream stops because of a wake-up word or similar-sounding word contained in the playback while the voice control device is playing the stream.

FIG. 1 is a block diagram of a voice control device according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the problem of wake-up word misrecognition.
FIG. 3 is a flowchart illustrating a method of preventing misrecognition of a wake-up word in a voice control device according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily implement the invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar reference numerals are assigned to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for implementing the present invention are described with reference to the accompanying configuration diagram and processing flowchart.

When a broadcast sound source containing the wake-up word is played inside a conventional artificial intelligence device, the phenomenon shown in FIG. 2 occurs. Referring to FIG. 2, suppose the wake-up word that activates the artificial intelligence device is "지니야" ("Jiniya"). If a character in the broadcast sound source 200 is addressed as "진희야" ("Jinhee-ya") or "진이야" ("Jini-ya") 210, or a line of dialogue includes a phrase such as "우리 가족 사-진이야" ("it's our family photo", whose ending sounds like "-jin-iya"), the playback of the broadcast sound source 200 can sound similar to the wake-up word of the artificial intelligence device.

To handle such cases, a conventional artificial intelligence device uses acoustic echo cancellation (AEC) to remove the echo of the wake-up word contained in the playback of the broadcast sound source 200 (the attenuated "진희야" 220 after AEC). However, even when AEC removes the echo, a certain level of residual echo always remains, and it causes misrecognition in the wake-up word recognizer, which takes the microphone of the artificial intelligence device as its input.

Existing attempts to modify the AEC and the wake-up word recognizer in order to reduce echo-induced false detections all lead to a lower wake-up word recognition rate, and reducing false detections by modifying the speech processing cannot achieve both the desired recognition rate and the desired reduction in false detections. In addition, when the sound is decoded by an external player (for example, a TV or an AV receiver), such speech processing is very difficult or impossible.

FIG. 1 is a block diagram of a voice control device 100 according to an embodiment of the present invention.

Referring to FIG. 1, the voice control device 100 may include a reception unit 110, a reproducing unit 120, an output unit 130, an input unit 140, a sound channel conversion unit 150, a wake-up word detection unit 160, a derivation unit 170, a determination unit 180, and an adjustment unit 190. However, the voice control device 100 shown in FIG. 1 is only one implementation example of the present invention, and various modifications are possible based on the components shown in FIG. 1.

The reception unit 110 may receive a broadcast stream from a broadcast server.

The reproducing unit 120 may reproduce sound data of the received broadcast stream inside the voice control device 100.

The output unit 130 may output the sound data of the broadcast stream through an external device (for example, a TV or an AV receiver) connected to the voice control device 100.

The input unit 140 may receive, through a microphone of the voice control device 100, the sound data output through the external device.

The sound channel conversion unit 150 may separate the sound data of the broadcast stream into multiple channels, decode them, and convert the decoded multiple channels into one sound channel.

For example, the sound channel conversion unit 150 may decode the sound data of the broadcast stream that is output through the speaker of an external device such as a TV or an AV receiver.

For example, the sound channel conversion unit 150 may extract sound data from the broadcast stream (audio/video stream) output inside the voice control device 100, separate the extracted sound data into multiple channels according to the audio format, and then decode it.

Because it is difficult to predict which channel of the decoded sound data contains playback similar to the wake-up word of the voice control device 100, the sound channel conversion unit 150 may down-mix the decoded multiple channels into one sound channel. In this case, combining the phonemes present in the multiple channels into one channel makes it possible to guard against cases in which a similar-sounding wake-up word arises.
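The down-mixing step can be sketched as follows. This is a minimal illustration rather than the method the patent prescribes: the function name `downmix_to_mono` is hypothetical, and equal-weight averaging is an assumption, since the text does not state which down-mix coefficients are used.

```python
from typing import List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long decoded channels, sample by sample, into one channel.

    Assumption: every channel gets the same weight. The source text only says
    that the decoded multiple channels are down-mixed into one sound channel.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    n = len(channels)
    return [sum(ch[i] for ch in channels) / n for i in range(length)]
```

With this sketch, a word whose phonemes are split across the left and right channels ends up in the single mixed channel that the wake-up word detector inspects.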

The wake-up word detection unit 160 may detect the first wake-up word from the sound data input through the microphone. Specifically, when the sound data of the broadcast stream is output through the external device and that output is picked up by the microphone of the voice control device 100, the wake-up word detection unit 160 may detect the first wake-up word from the sound data input through the microphone.

The wake-up word detection unit 160 may detect the second wake-up word from the sound data reproduced inside the voice control device 100. Here, the time at which sound data is reproduced and recognized inside the voice control device 100 precedes the time at which the same sound data is input and recognized through the microphone of the voice control device 100.

In this case, the second wake-up word may be detected in one of two ways, depending on the amount of computation required by the wake-up word recognizer of the voice control device 100 and the amount of spare computation available on the voice control device 100 itself.

When the spare computation of the voice control device 100 is not large compared with the computation required by the wake-up word recognizer, the wake-up word detection unit 160 may detect the second wake-up word from the internally reproduced sound data using only the single down-mixed sound channel.

When the spare computation of the voice control device 100 greatly exceeds the computation required by the wake-up word recognizer, so that the recognizer can be run several times, the wake-up word detection unit 160 may detect the second wake-up word from the internally reproduced sound data using each of the decoded multiple channels as well as the single down-mixed sound channel.
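The choice between the two detection modes can be sketched as below. The names `select_detector_inputs`, `spare_ops`, and `required_ops` are hypothetical, and the decision rule (enough spare computation for one recognizer run per decoded channel plus one for the down-mixed channel) is an assumption; the text only distinguishes "not large" from "greatly exceeds".

```python
from typing import List, Sequence, TypeVar

Channel = TypeVar("Channel")


def select_detector_inputs(
    decoded_channels: Sequence[Channel],
    downmixed_channel: Channel,
    spare_ops: float,
    required_ops: float,
) -> List[Channel]:
    """Return the audio inputs on which the wake-up word recognizer should run.

    Assumed rule: run on every decoded channel plus the down-mixed channel only
    if the spare computation covers one recognizer run per input; otherwise
    fall back to the down-mixed channel alone.
    """
    runs_needed = len(decoded_channels) + 1
    if spare_ops >= required_ops * runs_needed:
        return list(decoded_channels) + [downmixed_channel]
    return [downmixed_channel]
```

For a stereo stream, this sketch needs three recognizer runs' worth of spare computation before it inspects the left and right channels separately.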

The derivation unit 170 may derive a path delay time based on the time at which the sound data is output through the external device and the time at which that sound data is input through the microphone, and may derive a processing delay time for the sound data output through the external device. Here, the path delay time may be calculated as the difference (r1time - p1time) between the time at which the sound data output through the external device is input through the microphone (r1time) and the time at which the sound data is output through the external device (p1time). The processing delay time may be calculated as the sum (delay_d + delay_m) of the time taken inside the voice control device 100 to separate the sound data of the broadcast stream into multiple channels and decode it (delay_d) and the time taken by the voice control device 100 to down-mix the decoded multiple channels (delay_m).

The processing delay time must not exceed the path delay time, as expressed in [Equation 1].

[Equation 1]

r1time - p1time >= delay_d + delay_m
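As a small worked sketch of [Equation 1], the check below uses the symbols from the text (r1time, p1time, delay_d, delay_m). The function names are hypothetical, and the choice of milliseconds as the unit in the usage note is an assumption, since the text does not specify one.

```python
def path_delay(r1time: float, p1time: float) -> float:
    """Time from output through the external device to input at the microphone."""
    return r1time - p1time


def processing_delay(delay_d: float, delay_m: float) -> float:
    """Decoding time plus down-mixing time inside the voice control device."""
    return delay_d + delay_m


def satisfies_equation_1(
    r1time: float, p1time: float, delay_d: float, delay_m: float
) -> bool:
    """True when r1time - p1time >= delay_d + delay_m."""
    return path_delay(r1time, p1time) >= processing_delay(delay_d, delay_m)
```

For instance, with an output at 0 ms, microphone input at 300 ms, 100 ms of decoding, and 150 ms of down-mixing, the condition holds because 300 >= 250. If the microphone input arrived at 200 ms instead, it would not.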

The adjustment unit 190 may use the output delay time of the sound data output through the external device to keep the processing delay time smaller than the path delay time. Here, the output delay time may be calculated as the sum (delay_o + delay_o') of the basic output delay time (delay_o) with which sound data is output from the voice control device 100 to the external device and an additional output delay time (delay_o').

For example, when the processing delay time exceeds the path delay time, the adjustment unit 190 may increase the output delay time of the sound data output through the external device, which lengthens the path delay time until it covers the processing delay time.

With the additional output delay time (delay_o') applied to the sound data output from the voice control device 100 to the external device, the new path delay time (r1time - p1time + delay_o') and the processing delay time (delay_d + delay_m) must satisfy the condition in [Equation 2], and the minimum additional output delay time (delay_o') is given by [Equation 3].

[Equation 2]

r1time - p1time + delay_o' >= delay_d + delay_m

[Equation 3]

delay_o' = delay_d + delay_m - (r1time - p1time)

In this case, if the minimum additional output delay time (delay_o') is negative, no additional delay is needed. If it is positive, an additional delay of at least that amount is needed.
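[Equation 3] and the sign rule that follows it can be sketched in a few lines. The function names are hypothetical, and clamping a negative result to zero is simply a direct reading of "no additional delay is needed".

```python
def minimum_additional_output_delay(
    r1time: float, p1time: float, delay_d: float, delay_m: float
) -> float:
    """[Equation 3]: delay_o' = delay_d + delay_m - (r1time - p1time)."""
    return delay_d + delay_m - (r1time - p1time)


def additional_delay_to_apply(
    r1time: float, p1time: float, delay_d: float, delay_m: float
) -> float:
    """Extra output delay to add: zero when [Equation 3] yields a negative value."""
    return max(0.0, minimum_additional_output_delay(r1time, p1time, delay_d, delay_m))
```

With a path delay of 200 and a processing delay of 250 (same time unit), the sketch returns 50, and adding that amount makes [Equation 2] hold with equality. With a path delay of 300 it returns 0.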

The determination unit 180 may determine whether the wake-up word recognition is a misrecognition based on the difference between the first recognition time for the first wake-up word and the second recognition time for the second wake-up word. For example, suppose that the second recognition time for the second wake-up word, detected from the sound data of each of the multiple channels reproduced inside the voice control device 100, is recorded per channel as Time1[channel number], and that the first recognition time for the first wake-up word, detected from the sound data input through the microphone, is recorded as Time2. The determination unit 180 may then compare the second recognition time for each channel with the first recognition time to determine whether the wake-up word recognition is a misrecognition.

For example, echo-induced misrecognition can be identified by comparing the first recognition time for the first wake-up word with the second recognition time for the second wake-up word as follows.

Time_Def[channel 1] = Time2 - Time1[channel 1]

Time_Def[channel 2] = Time2 - Time1[channel 2]

......

Time_Def[channel x] = Time2 - Time1[channel x]

Result = Delay_threshold > MAX( Time_Def[1 : x] )

When the difference between the first recognition time and the second recognition time is less than a predefined threshold, the determination unit 180 may determine that the wake-up word recognition is a misrecognition caused by echo. For example, when the maximum, over all channels, of the difference between the first recognition time and the second recognition time is less than the predefined threshold, the determination unit 180 may determine that the wake-up word recognition is a misrecognition caused by echo.
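The decision rule above can be sketched as follows, where the microphone recognition time is compared with the per-channel internal recognition times. The function and parameter names are hypothetical, and returning False when no internal detection exists is an assumption that the text does not address.

```python
from typing import Mapping


def is_echo_misrecognition(
    mic_time: float,
    internal_times: Mapping[int, float],
    delay_threshold: float,
) -> bool:
    """Apply: Result = delay_threshold > max(mic_time - internal_time per channel).

    mic_time:        recognition time of the wake-up word heard through the microphone.
    internal_times:  per-channel recognition time of the wake-up word detected in
                     the internally reproduced sound data.
    Assumption: with no internal detection, the recognition is not attributed
    to echo, so the function returns False.
    """
    if not internal_times:
        return False
    time_def = [mic_time - t for t in internal_times.values()]
    return delay_threshold > max(time_def)
```

For example, with a microphone recognition at 1200, internal recognitions at 1000 and 1050, and a threshold of 300, the largest difference is 200, so the sketch reports an echo-induced misrecognition. With a threshold of 150 it does not.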

Meanwhile, those skilled in the art will readily understand that the reception unit 110, the reproducing unit 120, the output unit 130, the input unit 140, the sound channel conversion unit 150, the wake-up word detection unit 160, the derivation unit 170, the determination unit 180, and the adjustment unit 190 may each be implemented separately, or that one or more of them may be implemented in integrated form.

FIG. 3 is a flowchart illustrating a method of preventing misrecognition of a wake-up word in the voice control device 100 according to an embodiment of the present invention.

Referring to FIG. 3, in step S301 the voice control device 100 may receive a broadcast stream from a broadcast server.

In step S303, the voice control device 100 may reproduce sound data of the received broadcast stream inside the voice control device 100.

In step S305, the voice control device 100 may output the sound data of the broadcast stream through an external device connected to the voice control device 100.

In step S307, the voice control device 100 may receive, through the microphone of the voice control device 100, the sound data output through the external device.

In step S309, the voice control device 100 may detect the first wake-up word from the sound data input through the microphone of the voice control device 100.

In step S311, the voice control device 100 may detect the second wake-up word from the sound data reproduced inside the voice control device 100.

In step S315, the voice control device 100 may determine whether the wake-up word recognition is a misrecognition based on the difference between the first recognition time for the first wake-up word and the second recognition time for the second wake-up word.

In the above description, steps S301 to S315 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

본 발명의 일 실시예는 컴퓨터에 의해 실행되는 프로그램 모듈과 같은 컴퓨터에 의해 실행 가능한 명령어를 포함하는 기록 매체의 형태로도 구현될 수 있다. 컴퓨터 판독 가능 매체는 컴퓨터에 의해 액세스될 수 있는 임의의 가용 매체일 수 있고, 휘발성 및 비휘발성 매체, 분리형 및 비분리형 매체를 모두 포함한다. 또한, 컴퓨터 판독가능 매체는 컴퓨터 저장 매체를 모두 포함할 수 있다. 컴퓨터 저장 매체는 컴퓨터 판독가능 명령어, 데이터 구조, 프로그램 모듈 또는 기타 데이터와 같은 정보의 저장을 위한 임의의 방법 또는 기술로 구현된 휘발성 및 비휘발성, 분리형 및 비분리형 매체를 모두 포함한다. An embodiment of the present invention may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and includes both volatile and nonvolatile media, removable and non-removable media. Further, the computer-readable medium may include all computer storage media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.

전술한 본 발명의 설명은 예시를 위한 것이며, 본 발명이 속하는 기술분야의 통상의 지식을 가진 자는 본 발명의 기술적 사상이나 필수적인 특징을 변경하지 않고서 다른 구체적인 형태로 쉽게 변형이 가능하다는 것을 이해할 수 있을 것이다. 그러므로 이상에서 기술한 실시예들은 모든 면에서 예시적인 것이며 한정적이 아닌 것으로 이해해야만 한다. 예를 들어, 단일형으로 설명되어 있는 각 구성 요소는 분산되어 실시될 수도 있으며, 마찬가지로 분산된 것으로 설명되어 있는 구성 요소들도 결합된 형태로 실시될 수 있다. The above description of the present invention is for illustrative purposes only, and those of ordinary skill in the art to which the present invention pertains will be able to understand that other specific forms can be easily modified without changing the technical spirit or essential features of the present invention will be. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not limiting. For example, each component described as a single type may be implemented in a distributed manner, and similarly, components described as being distributed may also be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be interpreted as falling within the scope of the present invention.

100: voice control device
110: reception unit
120: reproducing unit
130: output unit
140: input unit
150: sound channel conversion unit
160: wake-up word detection unit
170: derivation unit
180: determination unit
190: adjustment unit

Claims

A voice control device for preventing misrecognition of a wake-up word, comprising:
a reception unit configured to receive a broadcast stream from a broadcast server;
a reproducing unit configured to reproduce sound data of the received broadcast stream inside the voice control device;
an output unit configured to output the sound data of the broadcast stream through an external device connected to the voice control device;
an input unit configured to receive, through a microphone of the voice control device, the sound data output through the external device;
a wake-up word detection unit configured to detect a first wake-up word from the sound data received through the microphone and to detect a second wake-up word from the sound data reproduced inside the voice control device; and
a determination unit configured to determine whether a wake-up word has been misrecognized based on a difference between a first recognition time for the first wake-up word and a second recognition time for the second wake-up word.

The voice control device of claim 1, further comprising:
a sound channel conversion unit configured to decode the sound data of the broadcast stream into multiple channels and to convert the decoded multiple channels into one sound channel.

The voice control device of claim 2,
wherein the sound channel conversion unit converts the decoded multiple channels into one sound channel by down-mixing, and
wherein the wake-up word detection unit detects the second wake-up word from the sound data reproduced inside the voice control device through the one sound channel.
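
The down-mixing step recited above can be illustrated with a minimal sketch. It assumes an equal-weight average of the decoded channels, which is only one possible down-mix; the claims do not specify the mixing coefficients, and the function name `downmix_to_mono` is hypothetical.

```python
from typing import List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels, sample by sample, into one channel.

    Assumption: equal weights for every channel. A real decoder may apply
    channel-specific coefficients instead.
    """
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per sample position across all channels.
    return [sum(samples) / len(channels) for samples in zip(*channels)]
```

With a single sound channel, the internally reproduced sound data can be fed to the same kind of detector that processes the microphone input, so the two recognition times are comparable.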

The voice control device of claim 1, further comprising:
a derivation unit configured to derive a path delay time based on a time at which the sound data is output through the external device and a time at which the sound data output through the external device is input through the microphone, and to derive a processing delay time based on the sound data output through the external device.

The voice control device of claim 4, further comprising:
an adjustment unit configured to adjust the processing delay time to be smaller than the path delay time by means of an output delay time for the sound data output through the external device.
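
One possible reading of the delay relationship in the two claims above is sketched below. It assumes that the path delay is the microphone input time minus the output time, and that delaying the external output lengthens the effective path until it exceeds the processing delay. The function names and the `margin` parameter are hypothetical and do not come from the specification.

```python
def path_delay(output_time: float, mic_input_time: float) -> float:
    """Path delay in seconds: time from external output to microphone input."""
    return mic_input_time - output_time


def required_output_delay(processing_delay: float,
                          path_delay_s: float,
                          margin: float = 0.01) -> float:
    """Smallest extra output delay (seconds) so that
    processing_delay < path_delay_s + output_delay.

    Assumption: 'margin' is an illustrative safety gap, not a value from
    the source. Returns 0.0 when the path is already long enough.
    """
    return max(0.0, processing_delay - path_delay_s + margin)
```

For example, under these assumptions a processing delay of 0.5 s and a measured path delay of 0.25 s would call for an output delay of 0.375 s when the margin is 0.125 s.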

The voice control device of claim 1,
wherein the determination unit determines that the wake-up word recognition is a misrecognition caused by an echo when the difference between the first recognition time and the second recognition time is less than a predefined threshold value.
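
The threshold comparison recited above can be sketched as follows. The sketch assumes that recognition times are expressed in seconds on a common clock and that the absolute difference is compared; the function name, the `None` convention for "no internal detection", and the example threshold are illustrative assumptions.

```python
from typing import Optional


def is_echo_misrecognition(first_recognition_time: float,
                           second_recognition_time: Optional[float],
                           threshold: float) -> bool:
    """Return True when the microphone detection looks like an echo.

    first_recognition_time: when the wake-up word was detected in the
        microphone input.
    second_recognition_time: when it was detected in the internally
        reproduced sound data, or None if it was not detected there
        (an assumed convention).
    """
    if second_recognition_time is None:
        # No wake-up word in the broadcast audio, so nothing to attribute
        # the microphone detection to.
        return False
    return abs(first_recognition_time - second_recognition_time) < threshold


# Usage: the broadcast audio contained the wake-up word 0.3 s before the
# microphone picked it up, so the detection is treated as an echo.
print(is_echo_misrecognition(10.3, 10.0, threshold=0.5))
```

A detection flagged this way would simply be discarded rather than activating the device, while a detection with no nearby internal counterpart would be handled as a genuine user utterance.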

A method for preventing misrecognition of a wake-up word in a voice control device, the method comprising:
receiving a broadcast stream from a broadcast server;
reproducing sound data of the received broadcast stream inside the voice control device;
outputting the sound data of the broadcast stream through an external device connected to the voice control device;
receiving, through a microphone of the voice control device, the sound data output through the external device;
detecting a first wake-up word from the sound data received through the microphone;
detecting a second wake-up word from the sound data reproduced inside the voice control device; and
determining whether a wake-up word has been misrecognized based on a difference between a first recognition time for the first wake-up word and a second recognition time for the second wake-up word.