JP2022033258A

JP2022033258A - Speech control apparatus, operation method and computer program

Info

Publication number: JP2022033258A
Application number: JP2022000145A
Authority: JP
Inventors: Byeong Yeol Kim; Ick Sang Han; Oh Hyeok Kwon; Bong Jin Lee; Myung Woo Oh; Min Seok Choi; Chan Kyu Lee; Jung Hui Im; Ji Su Choi; Han Yong Kang
Original assignee: Line Corp; Naver Corp
Current assignee: Z Intermediate Global Corp; Naver Corp
Priority date: 2017-05-19
Filing date: 2022-01-04
Publication date: 2022-02-28
Also published as: KR101986354B1; JP2018194844A; KR20180127065A; JP6510117B2; JP2019133182A

Abstract

PROBLEM TO BE SOLVED: To provide a speech control apparatus for preventing false detection of a keyword and a method of operating the same.

SOLUTION: The method according to the present disclosure includes the steps of: receiving an audio signal corresponding to a surrounding sound; generating audio stream data; determining a first section in which a candidate keyword corresponding to a predetermined keyword is detected, from the audio stream data; extracting a first speaker feature vector for identifying a speaker in the first section and a second speaker feature vector for identifying a speaker in a second section adjacent to the first section from the audio stream data; and determining whether or not the predetermined keyword is included in the first section based on similarity between the first speaker feature vector and the second speaker feature vector.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice control device and, more particularly, to a voice control device capable of preventing keyword misrecognition, a method of operating the voice control device, a computer program, a recording medium, and the like.

As the performance of computer devices such as portable communication devices, desktop personal computers (PCs), tablet PCs, and entertainment systems has advanced, electronic devices that are equipped with a voice recognition function and controlled by voice in order to improve operability have come onto the market. The voice recognition function has the advantage that a device can be controlled easily by recognizing the user's voice, without operating a separate button or touching a touch module.

With such a voice recognition function, a portable communication device such as a smartphone can, for example, place a call or compose a text message without a separate button being pressed, and various functions such as route guidance, Internet search, and alarm setting can be set easily. However, if such a voice control device misrecognizes the user's voice, it may perform an unintended operation.

Korean Patent Publication No. 10-2017-0028628

The problem to be solved by the present invention is to provide a voice control device capable of preventing keyword misrecognition, a method of operating the voice control device, a computer program, a recording medium, and the like.

As a technical means for solving the above technical problem, a first aspect of the present disclosure provides a voice control device including: an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data; a keyword detection unit that detects, from the audio stream data, a candidate keyword corresponding to a predetermined keyword and determines the start point and end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword was detected; a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data and extracts a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and a wake-up determination unit that determines, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the keyword was included in the first audio data.
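The relationship between the two sections can be sketched as follows. This is only an illustration of the boundary rule stated above (the second section ends where the first section starts). The names `Section`, `adjacent_sections`, and `slice_audio`, the use of sample indices, and the `context_len` parameter that fixes the length of the second section are all assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    """A span of the audio stream, in sample indices (hypothetical type)."""
    start: int  # index of the first sample in the section (inclusive)
    end: int    # index one past the last sample (exclusive)


def adjacent_sections(keyword_start: int, keyword_end: int,
                      context_len: int) -> Tuple[Section, Section]:
    """Return (first_section, second_section).

    The first section covers the detected candidate keyword. The second
    section ends exactly at the start point of the first section.
    context_len, the length of the second section in samples, is an
    assumed parameter; the section is clipped at the start of the stream.
    """
    if not 0 <= keyword_start < keyword_end:
        raise ValueError("invalid keyword section")
    first = Section(keyword_start, keyword_end)
    second = Section(max(0, keyword_start - context_len), keyword_start)
    return first, second


def slice_audio(stream: List[float], section: Section) -> List[float]:
    """Extract the samples that belong to a section."""
    return stream[section.start:section.end]
```

For instance, with a candidate keyword detected at samples 32000 to 40000 and an assumed context length of 16000 samples, the second section would span samples 16000 to 32000, and each section would then be passed to whatever speaker feature extractor the device uses.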

A second aspect of the present disclosure provides a method of operating a voice control device, the method including: receiving an audio signal corresponding to ambient sound and generating audio stream data; detecting, from the audio stream data, a candidate keyword corresponding to a predetermined keyword and determining the start point and end point of a first section of the audio stream data corresponding to first audio data in which the candidate keyword was detected; extracting a first speaker feature vector for the first audio data; extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and determining, based on the similarity between the first speaker feature vector and the second speaker feature vector, whether the keyword was included in the first audio data, and thereby deciding whether to wake up.
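The final determination step can be sketched as below. The passage above states only that the decision is based on the similarity between the two speaker feature vectors. The use of cosine similarity, the threshold value of 0.7, and the direction of the rule (a high similarity is taken to mean that the same speaker was already talking just before the candidate keyword, so the candidate is treated as part of ordinary conversation and rejected) are assumptions made for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_included(first_vec: Sequence[float],
                     second_vec: Sequence[float],
                     threshold: float = 0.7) -> bool:
    """Decide whether the first section really contains the keyword.

    ASSUMED rule, not stated in the passage above: if the speaker in the
    section just before the candidate keyword matches the speaker of the
    candidate keyword (similarity at or above the threshold), the candidate
    is rejected as part of ongoing speech; otherwise it is accepted.
    """
    return cosine_similarity(first_vec, second_vec) < threshold
```

Under these assumptions, `keyword_included([1.0, 0.0], [0.0, 1.0])` accepts the candidate because the preceding audio does not resemble the keyword speaker, while two identical vectors lead to rejection.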

A third aspect of the present disclosure provides a computer program including instructions that cause a processor of a voice control device to execute the operating method according to the second aspect.

A fourth aspect of the present disclosure provides a computer-readable recording medium on which the computer program according to the third aspect is recorded.

According to various embodiments of the present invention, the likelihood of misrecognizing a keyword is reduced, so malfunction of the voice control device is prevented.

A drawing illustrating an example of a network environment according to an embodiment.

A block diagram for explaining the internal configuration of an electronic device and a server according to an embodiment.

A drawing illustrating an example of functional blocks that a processor of a voice control device according to an embodiment may include.

A flowchart illustrating an example of an operating method that a voice control device may perform according to an embodiment.

A flowchart illustrating an example of an operating method that a voice control device may perform according to another embodiment.

A drawing illustrating an example in which a single command keyword is uttered when a voice control device according to an embodiment executes the operating method of FIG. 5.

A drawing illustrating an example in which general conversational speech is uttered when a voice control device according to an embodiment executes the operating method of FIG. 6.

A flowchart illustrating an example of an operating method that a voice control device may perform according to still another embodiment.

A drawing illustrating an example in which a wake-up keyword and a natural-language voice command are uttered when a voice control device according to an embodiment executes the operating method of FIG. 7.

A drawing illustrating an example in which general conversational speech is uttered when a voice control device according to an embodiment executes the operating method of FIG. 7.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily carry them out. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. Also, when a part is said to "include" a component, this means that, unless otherwise stated, it may further include other components rather than excluding them.

Phrases such as "in some embodiments" or "in one embodiment" that appear in various places in this specification do not necessarily all refer to the same embodiment.

Some embodiments may be described in terms of functional block configurations and various processing steps. Some or all of such functional blocks may be implemented by any number of hardware and/or software configurations that perform particular functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors or by circuit configurations for a given function. The functional blocks may also be implemented in various programming or scripting languages, and as algorithms executed on one or more processors. The present disclosure may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as "module" and "configuration" are used broadly and are not limited to mechanical and physical configurations.

The connecting lines or connecting members between components shown in the drawings merely illustrate functional connections and/or physical or circuit connections. In an actual device, connections between components may be realized by various replaceable or additional functional, physical, or circuit connections.

In the present disclosure, a keyword refers to voice information that can wake up a specific function of the voice control device. Based on the user's voice signal, the keyword may be a single command keyword or a wake-up keyword. A wake-up keyword is a voice-based keyword that can switch a voice control device from sleep mode to wake-up mode, for example "Clova" or "Hi Computer". After uttering the wake-up keyword, the user can utter, in natural language, a command indicating the function or operation that the user wants the voice control device to perform. In that case, the voice control device can recognize the natural-language voice command and perform the function or operation corresponding to the recognition result. Note that "Clova", which appears in the following description merely as one example of a wake-up keyword, is a registered trademark and is different from the "clover" in "four-leaf clover". A single command keyword is a voice keyword that can directly control the operation of the voice control device, such as "stop" while music is playing. The wake-up keyword referred to in the present disclosure may also be called a wake-up word, hot word, or trigger word.
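The difference between the two keyword types can be sketched as a small state machine. This is a simplified illustration that works on already-recognized text rather than audio. The class `VoiceController`, the `log` list, the lowercase matching, and the choice to return to sleep mode after one command are assumptions made for the example; only the example keywords ("Clova", "Hi Computer", "stop") come from the text above.

```python
from enum import Enum, auto
from typing import List


class Mode(Enum):
    SLEEP = auto()
    AWAKE = auto()


# Example keywords taken from the description; matching on lowercase text
# is a simplification for this sketch.
WAKE_UP_KEYWORDS = {"clova", "hi computer"}
SINGLE_COMMAND_KEYWORDS = {"stop"}


class VoiceController:
    """Hypothetical controller distinguishing the two keyword types."""

    def __init__(self) -> None:
        self.mode = Mode.SLEEP
        self.log: List[str] = []

    def on_utterance(self, text: str) -> None:
        phrase = text.strip().lower()
        if phrase in SINGLE_COMMAND_KEYWORDS:
            # A single command keyword controls the device directly.
            self.log.append(f"execute:{phrase}")
        elif phrase in WAKE_UP_KEYWORDS:
            # A wake-up keyword only switches the device to wake-up mode.
            self.mode = Mode.AWAKE
        elif self.mode is Mode.AWAKE:
            # Speech after the wake-up keyword is handled as a natural-language
            # command; returning to sleep afterwards is an assumption.
            self.log.append(f"recognize:{phrase}")
            self.mode = Mode.SLEEP
        # Any other speech in sleep mode is ignored.
```

In this sketch, "play some jazz" is ignored while the controller is asleep, is handled as a command when it follows "Clova", and "stop" takes effect in either mode.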

In the present disclosure, a candidate keyword includes a word whose pronunciation is similar to that of the keyword. For example, when the keyword is "Clova", candidate keywords may be "clover", "global", "club", and the like. A candidate keyword is defined as what the keyword detection unit of the voice control device detects as the keyword from audio data. A candidate keyword may be identical to the keyword, or may be another word with a similar pronunciation. In general, a voice control device may mistake such a term for the keyword and wake up even when the user merely utters a sentence containing a term that corresponds to a candidate keyword. The voice control device according to the present disclosure also reacts when such a candidate keyword is detected in the voice signal, but it can prevent a false wake-up caused by the candidate keyword.

In the present disclosure, the voice recognition function refers to converting a user's voice signal into a character string (or text). The user's voice signal may include a voice command. The voice command may cause a specific function of the voice control device to be performed.

In the present disclosure, a voice control device refers to an electronic device equipped with a voice control function. Such an electronic device may be a standalone device such as a smart speaker or an artificial intelligence speaker. It may also be a computer device equipped with a voice control function, for example a desktop personal computer (PC) or a notebook computer, or a portable computer device such as a smartphone. In that case, a program or application for performing the voice control function may be installed on the computer device. The electronic device equipped with the voice control function may also be an electronic product that mainly performs a specific function, for example a smart TV, a smart refrigerator, a smart air conditioner, or a smart navigation device, or an automotive infotainment system. Internet of Things (IoT) devices controlled by voice are also included.

In the present disclosure, a specific function of the voice control device may include, for example, executing an application installed on the voice control device, but is not limited to this. For example, when the voice control device is a smart speaker, its specific functions may include music playback, Internet shopping, provision of voice information, and control of electronic or mechanical devices connected to the smart speaker. When the voice control device is a smartphone, application execution may include making a phone call, route finding, Internet search, or alarm setting. When the voice control device is a smart TV, application execution may include program search or channel search. When the voice control device is a smart oven, application execution may include recipe search. When the voice control device is a smart refrigerator, application execution may include checking the refrigeration and freezing status or setting the temperature. When the voice control device is a smart car, application execution may include automatic start, autonomous driving, and automatic parking. Application execution in the present disclosure is not limited to these examples.

In the present disclosure, a keyword may take the form of a word or of a phrase. A voice command uttered after the wake-up keyword may take the form of a natural-language sentence, a word, or a phrase.

Hereinafter, the present disclosure is described in detail with reference to the accompanying drawings.

FIG. 1 is a drawing illustrating an example of a network environment according to an embodiment. The network environment of FIG. 1 is shown, by way of example, as including a plurality of electronic devices 100a to 100f, a server 200, and a network 300.

The electronic devices 100a to 100f are exemplary electronic devices controlled by voice. Each of the electronic devices 100a to 100f may perform specific functions in addition to the voice recognition function. Examples of the electronic devices 100a to 100f include smart speakers or artificial intelligence speakers, smartphones, mobile phones, navigation devices, computers, notebook computers, digital broadcasting terminals, personal digital assistants (PDAs), portable multimedia players (PMPs), tablet PCs, and smart electronic products. The electronic devices 100a to 100f may communicate with the server 200 and/or other electronic devices 100a to 100f over the network 300 using a wireless or wired communication method. However, this is not a limitation, and each of the electronic devices 100a to 100f may also operate independently without being connected to the network 300. The electronic devices 100a to 100f are also referred to collectively as the electronic device 100.

The communication method of the network 300 is not limited and may include not only communication methods that use the communication networks the network 300 may include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcasting network) but also short-range wireless communication between the electronic devices 100a to 100f. For example, the network 300 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. The network 300 may also include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited to these.

The server 200 may be implemented as a computer device, or a plurality of computer devices, that communicates with the electronic devices 100a to 100f over the network 300 and performs the voice recognition function. The server 200 may be distributed in a cloud form and may provide instructions, code, files, content, and the like.

For example, the server 200 may receive an audio file provided by the electronic devices 100a to 100f, convert the voice signal in the audio file into a character string (or text), and provide the converted character string (or text) to the electronic devices 100a to 100f. The server 200 may also provide the electronic devices 100a to 100f connected over the network 300 with a file for installing an application that performs the voice control function. For example, the second electronic device 100b may install the application using the file provided by the server 200. Under the control of its installed operating system (OS) and/or at least one program (for example, the installed voice control application), the second electronic device 100b may connect to the server 200 and receive the voice recognition service that the server 200 provides.

FIG. 2 is a block diagram for explaining the internal configuration of an electronic device and a server according to an embodiment.

The electronic device 100 is one of the electronic devices 100a to 100f of FIG. 1, and each of the electronic devices 100a to 100f may have at least the internal configuration shown in FIG. 2. Although the electronic device 100 is shown as connected over the network 300 to the server 200 that performs the voice recognition function, this is only an example, and the electronic device 100 may also perform the voice recognition function on its own. The electronic device 100 is an electronic device controlled by voice and is also called the voice control device 100. The voice control device 100 may be implemented by being included in, or connected by wire and/or wirelessly to, a smart speaker or artificial intelligence speaker, a computer device, a portable computer device, a smart home appliance, or the like.

The electronic device 100 and the server 200 may include memories 110 and 210, processors 120 and 220, communication modules 130 and 230, and input/output interfaces 140 and 240, respectively. The memories 110 and 210 are computer-readable recording media and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. The memories 110 and 210 may store an operating system and at least one program code (for example, code for a voice control application or a voice recognition application installed and run on the electronic device 100). Such software components may be loaded into the memories 110 and 210 through the communication modules 130 and 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memories 110 and 210 based on a program installed from files that a developer, or a file distribution system that distributes application installation files, provides over the network 300.

The processors 120 and 220 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processors 120 and 220 by the memories 110 and 210 or by the communication modules 130 and 230. For example, the processors 120 and 220 may be configured to execute instructions received according to program code stored in a recording device such as the memories 110 and 210.

The communication modules 130 and 230 may provide a function that allows the electronic device 100 and the server 200 to communicate with each other over the network 300, and may provide a function for communicating with the other electronic devices 100b to 100f. As an example, a request (for example, a voice recognition service request) that the processor 120 of the electronic device 100 generates according to program code stored in a recording device such as the memory 110 may be transmitted to the server 200 over the network 300 under the control of the communication module 130. Conversely, a character string (text) or the like that is the voice recognition result provided under the control of the processor 220 of the server 200 may pass through the communication module 230 and the network 300 and be received by the electronic device 100 through its communication module 130. For example, the voice recognition result of the server 200 received through the communication module 130 may be passed to the processor 120 or the memory 110. The server 200 may transmit control signals, instructions, content, files, and the like to the electronic device 100. Control signals and instructions received through the communication module 130 may be passed to the processor 120 or the memory 110, and content and files may be stored in a separate recording medium that the electronic device 100 may further include.

The input/output interfaces 140 and 240 may be means for interfacing with the input/output device 150. For example, input devices may include not only the microphone 151 but also devices such as a keyboard or mouse, and output devices may include not only the speaker 152 but also devices such as a status LED (light-emitting diode) that indicates a state and a display for showing a communication session of an application. As another example, the input/output device 150 may include a device in which input and output functions are integrated, such as a touch screen.

The microphone 151 may convert ambient sound into an electrical audio signal. The microphone 151 may be mounted not directly in the electronic device 100 but in a communicably connected external device (for example, a smartwatch), in which case the external signal so generated is transmitted to the electronic device 100 by communication. Although FIG. 2 shows the microphone 151 as included inside the electronic device 100, according to another embodiment the microphone 151 may be included in a separate device and connected to the electronic device 100 by wired or wireless communication.

In other embodiments, the electronic device 100 and the server 200 may include more components than those shown in FIG. 2. For example, the electronic device 100 may be configured to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database.

FIG. 3 illustrates an example of the functional blocks that the processor of a voice control device according to one embodiment may include, and FIG. 4 is a flowchart illustrating an example of an operation method that the voice control device may perform according to one embodiment.

As illustrated in FIG. 3, the processor 120 of the voice control device 100 may include an audio processing unit 121, a keyword detection unit 122, a speaker feature vector extraction unit 123, a wake-up determination unit 124, a voice recognition unit 125, and a function unit 126. The processor 120 and at least some of the functional blocks 121 to 126 may control the voice control device 100 so that it performs the steps (S110 to S190) of the operation method illustrated in FIG. 4. For example, the processor 120 and at least some of its functional blocks 121 to 126 may be implemented to execute instructions according to operating system code and at least one program code included in the memory 110 of the voice control device 100.

Some or all of the functional blocks 121 to 126 illustrated in FIG. 3 may be implemented as hardware and/or software configurations that perform specific functions. The functions performed by the functional blocks 121 to 126 may be implemented by one or more microprocessors or by circuit configurations dedicated to those functions. Some or all of the functional blocks 121 to 126 may also be software modules written in various programming or scripting languages and executed by the processor 120. For example, the audio processing unit 121 and the keyword detection unit 122 may be implemented by a digital signal processor (DSP), while the speaker feature vector extraction unit 123, the wake-up determination unit 124, and the voice recognition unit 125 may be implemented as software modules.

The audio processing unit 121 receives an audio signal corresponding to ambient sound and generates audio stream data. The audio processing unit 121 may receive the audio signal from an input device such as the microphone 151. The microphone 151 may be included in a peripheral device communicably connected to the voice control device 100, in which case the audio processing unit 121 receives the audio signal generated by the microphone 151 by communication. The ambient sound includes not only speech uttered by the user but also background sound. The audio signal therefore contains a background sound signal as well as a voice signal. The background sound signal counts as noise for keyword detection and voice recognition.

The audio processing unit 121 may generate audio stream data corresponding to the continuously received audio signal. It may filter and digitize the audio signal to generate the audio stream data. It may filter the audio signal to remove noise and amplify the voice signal relative to the background sound signal. It may also remove echo of the voice signal from the audio signal.

The audio processing unit 121 may operate at all times so that it can receive the audio signal even while the voice control device 100 is in sleep mode. It may operate at a low operating frequency while the voice control device 100 is in sleep mode and at a high operating frequency while the voice control device 100 is in normal mode.

The memory 110 may temporarily store the audio stream data generated by the audio processing unit 121. The audio processing unit 121 may use the memory 110 to buffer the audio stream data. The memory 110 stores not only the audio data that contains a keyword but also the audio data from before the keyword was detected. To store the most recent audio data, the oldest audio data in the memory 110 is deleted. As long as the size allocated in the memory 110 stays the same, audio data covering the same length of time is always stored. The period covered by the audio data stored in the memory 110 is preferably longer than the time needed to utter the keyword.
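The buffering behavior described above (a fixed allocation in which the oldest audio data is dropped to make room for the newest) can be sketched as a ring buffer. This is a minimal illustration only: the class name `AudioRingBuffer`, the 10 ms frame length, and the 3 s capacity are assumptions chosen for the example and do not come from the specification.

```python
from collections import deque


class AudioRingBuffer:
    """Keeps only the most recent `max_frames` frames of audio stream data."""

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen discards its oldest item when a new one
        # is appended to a full buffer.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def snapshot(self) -> list:
        """Return the buffered frames, oldest first."""
        return list(self._frames)


# Hypothetical sizing: 10 ms frames and 3 s of history give 300 frames.
buffer = AudioRingBuffer(max_frames=300)
for i in range(450):
    buffer.push(i.to_bytes(2, "little"))
```

Because the capacity is fixed, the buffer always spans the same length of time, which is why it should be sized to cover more than the duration of the keyword.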

According to another embodiment of the present invention, speaker feature vectors relating to the audio stream generated by the audio processing unit 121 may be extracted and stored in the memory 110. In that case, a speaker feature vector is extracted and stored for each audio stream segment of a specific length. As described above, the oldest stored speaker feature vector is deleted in order to store the speaker feature vector of the most recently generated audio stream.

The keyword detection unit 122 detects, from the audio stream data generated by the audio processing unit 121, a candidate keyword corresponding to a predefined (that is, predetermined) keyword. The keyword detection unit 122 may detect the candidate keyword from the audio stream data temporarily stored in the memory 110. There may be a plurality of predefined keywords, which are stored in a keyword storage 110a. The keyword storage 110a may be included in the memory 110.

A candidate keyword is a portion of the audio stream data that the keyword detection unit 122 has detected as a keyword. The candidate keyword may be the keyword itself or another word that is pronounced similarly to it. For example, if the keyword is "クローバ" (kurōba), the candidate keyword may be "グローバル" (gurōbaru, "global"). That is, when a user utters a sentence containing "グローバル", the keyword detection unit 122 may mistake "グローバル" for "クローバ" and detect it in the audio stream data. The "グローバル" detected in this way is a candidate keyword.

The keyword detection unit 122 may compare the audio stream data with known keyword data and compute the likelihood that the audio stream data contains speech corresponding to the keyword. The keyword detection unit 122 may extract audio features such as filter bank energies or Mel-frequency cepstral coefficients (MFCC) from the audio stream data. It may process these audio features using a classifying window, for example by means of a support vector machine or a neural network. Based on this processing, the keyword detection unit 122 may compute the likelihood that the keyword is contained in the audio stream data. When that likelihood is higher than a preset (that is, predetermined) reference value, the keyword detection unit 122 determines that the keyword is contained in the audio stream data and thereby detects a candidate keyword.

The keyword detection unit 122 may generate an artificial neural network using voice samples corresponding to the keyword data, and may be trained to detect the keyword from the audio stream data using the generated network. For each frame in the audio stream data, the keyword detection unit 122 may compute the probability of each phoneme constituting the keyword, or the overall probability of the keyword. It may output, from the audio stream data, a sequence of probabilities for the phonemes or the probability of the keyword itself. Based on that sequence or probability, the keyword detection unit 122 may compute the likelihood that the keyword is contained in the audio stream data, and when that likelihood is equal to or higher than a preset reference value, it may determine that a candidate keyword has been detected. The scheme described above is exemplary, and the operation of the keyword detection unit 122 may be implemented in various other ways.
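As a rough illustration of turning per-frame keyword probabilities into a detection decision, the sketch below averages the probabilities over a short sliding window and compares the best window against a reference value. The moving-average smoothing, the window length of 3, and the 0.8 threshold are assumptions made for this example; the specification only states that a likelihood is computed and compared with a preset reference value.

```python
def keyword_confidence(frame_probs, window=3):
    """Highest mean keyword probability over any `window` consecutive frames."""
    n = len(frame_probs)
    if n == 0:
        return 0.0
    window = min(window, n)
    return max(
        sum(frame_probs[i:i + window]) / window
        for i in range(n - window + 1)
    )


def candidate_detected(frame_probs, threshold=0.8, window=3):
    """True when the smoothed likelihood is at or above the reference value."""
    return keyword_confidence(frame_probs, window) >= threshold


sustained = [0.1, 0.2, 0.9, 0.95, 0.9, 0.1]   # keyword-like run of frames
spike = [0.1, 0.1, 0.99, 0.1, 0.1]            # single-frame outlier
```

Averaging over a window keeps a single noisy frame such as the one in `spike` from triggering a detection, while a sustained run of high probabilities still does.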

The keyword detection unit 122 may also extract audio features for each frame in the audio stream data and thereby compute the likelihood that the audio data of the frame is human speech and the likelihood that it is background sound. By comparing the two likelihoods, the keyword detection unit 122 may determine whether the audio data of the frame is human speech. For example, when the likelihood that the audio data of the frame is human speech exceeds the likelihood that it is background sound by more than a preset reference value, the keyword detection unit 122 may determine that the audio data of the frame corresponds to human speech.
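The per-frame comparison described above can be sketched as follows. The function name `flag_voice_frames` and the margin of 0.2 are hypothetical; how the two likelihoods are obtained is outside the scope of this example.

```python
def flag_voice_frames(voice_probs, background_probs, margin=0.2):
    """Return one boolean per frame: True when the frame is judged to be speech.

    A frame counts as speech only when its speech likelihood exceeds its
    background likelihood by more than `margin` (the preset reference value).
    """
    if len(voice_probs) != len(background_probs):
        raise ValueError("probability lists must have the same length")
    return [
        (p_voice - p_background) > margin
        for p_voice, p_background in zip(voice_probs, background_probs)
    ]


flags = flag_voice_frames([0.9, 0.55, 0.2], [0.1, 0.45, 0.8])
```

Requiring a margin rather than a simple "greater than" comparison means that frames where the two likelihoods are nearly equal are not treated as speech.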

The keyword detection unit 122 may identify the section of the audio stream data in which the candidate keyword was detected and determine the start point and end point of that section. This section is referred to as the keyword detection section, the current section, or the first section. The audio data corresponding to the first section is referred to as the first audio data. The keyword detection unit 122 may take the end of the section in which the candidate keyword was detected as the end point. According to another example, after the candidate keyword is detected, the keyword detection unit 122 may wait until silence lasting a preset time (for example, 0.5 seconds) occurs, and then determine the end point of the first section either so that the silent period is included in the first section or so that it is excluded.
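The second way of fixing the end point (waiting for a preset duration of silence and then either including or excluding it) might look like the sketch below. The per-frame silence flags, the 10 ms frame length, and the function name `first_section_end` are assumptions introduced for illustration.

```python
def first_section_end(is_silent, keyword_end, frame_ms=10,
                      silence_ms=500, include_silence=False):
    """Return the exclusive end index of the first section, or None.

    is_silent   -- one boolean per frame, True for a silent frame
    keyword_end -- index of the first frame after the detected candidate keyword
    None is returned while the required run of silence has not been seen yet.
    """
    needed = max(1, silence_ms // frame_ms)  # consecutive silent frames required
    run = 0
    for idx in range(keyword_end, len(is_silent)):
        run = run + 1 if is_silent[idx] else 0
        if run == needed:
            silence_start = idx - needed + 1
            return idx + 1 if include_silence else silence_start
    return None


# 20 frames of speech followed by 0.5 s (50 frames) of silence.
silence_flags = [False] * 20 + [True] * 50
```

With `silence_flags`, the first section ends at frame 20 when the silence is excluded and at frame 70 when it is included.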

The speaker feature vector extraction unit 123 reads from the memory 110 the second audio data, which is the part of the temporarily stored audio stream data that corresponds to a second section. The second section is the section preceding the first section, and the end point of the second section may coincide with the start point of the first section. The second section is also referred to as the previous section. The length of the second section may be set variably depending on the keyword corresponding to the detected candidate keyword. According to another example, the length of the second section may be fixed. According to yet another example, the length of the second section may be varied adaptively so that keyword detection performance is optimized. For example, when the audio signal output by the microphone 151 is "四葉のクローバー" ("four-leaf clover") and the candidate keyword is "クローバー" ("clover"), the second audio data corresponds to the speech "四葉の" ("four-leaf").
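The relationship between the two sections can be shown with a short slicing example. It assumes the buffered audio is available as a Python list of frames and that the second section has a fixed length; the function name and the syllable-like frame labels are purely illustrative.

```python
def previous_section(frames, first_start, length):
    """Return the frames of the second (previous) section.

    The second section ends exactly where the first section starts and
    reaches back at most `length` frames, clamped to the buffer start.
    """
    second_start = max(0, first_start - length)
    return frames[second_start:first_start]


# Illustrative frames for "yotsuba no kurōbā"; the candidate starts at index 4.
frames = ["yo", "tsu", "ba", "no", "ku", "ro", "ba"]
second_audio = previous_section(frames, first_start=4, length=4)
```

Here `second_audio` holds the four frames spoken just before the candidate keyword, corresponding to "四葉の" in the example above.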

The speaker feature vector extraction unit 123 extracts a first speaker feature vector from the first audio data corresponding to the first section and a second speaker feature vector from the second audio data corresponding to the second section. The speaker feature vector extraction unit 123 may extract from the audio data a speaker feature vector that is robust for speaker recognition. It may convert a time-domain voice signal into a frequency-domain signal and extract the speaker feature vector by transforming the frequency energies of the converted signal in mutually different ways. For example, the speaker feature vector may be extracted based on Mel-frequency cepstral coefficients (MFCC) or filter bank energies, but the extraction is not limited to these, and the speaker feature vector may be extracted from the audio data in various ways.
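To make the time-domain to frequency-domain step concrete, the sketch below computes log band energies for one frame using a naive discrete Fourier transform. It is a simplification under stated assumptions: the bands are of equal width rather than mel-spaced, four bands are used, and the function names are hypothetical. It is not MFCC extraction and is not presented as the method of the specification.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. n/2 of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def log_band_energies(frame, num_bands=4):
    """Sum the power spectrum into equal-width bands and take the log of each."""
    spectrum = power_spectrum(frame)
    size = len(spectrum)
    energies = []
    for b in range(num_bands):
        lo = b * size // num_bands
        hi = (b + 1) * size // num_bands
        # A small constant avoids log(0) for empty or silent bands.
        energies.append(math.log(sum(spectrum[lo:hi]) + 1e-10))
    return energies


# A 16-sample frame holding a pure tone in DFT bin 2.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
features = log_band_energies(tone)
```

For the pure tone, nearly all of the energy falls into the second band (index 1), which covers DFT bins 2 and 3. A practical system would use an FFT and a mel-spaced filter bank instead of this direct transform.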

The speaker feature vector extraction unit 123 may normally be in sleep mode. When the keyword detection unit 122 detects a candidate keyword in the audio stream data, it may wake up the speaker feature vector extraction unit 123 by sending it a wake-up signal. The speaker feature vector extraction unit 123 wakes up in response to this wake-up signal, which indicates that the keyword detection unit 122 has detected a candidate keyword.

According to one embodiment, the speaker feature vector extraction unit 123 may extract a frame feature vector for each frame of the audio data and then normalize and average the extracted frame feature vectors to obtain a speaker feature vector representing the audio data. L2 normalization may be used to normalize the extracted frame feature vectors. The averaging is performed by normalizing the frame feature vector extracted for each frame in the audio data and computing the mean of the resulting normalized frame feature vectors.
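The normalize-then-average step can be written out directly. In this sketch the frame feature vectors are plain Python lists of floats; the function names are hypothetical, and leaving an all-zero vector unchanged is a choice made for the example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]


def speaker_feature_vector(frame_vectors):
    """Mean of the L2-normalized frame feature vectors."""
    if not frame_vectors:
        raise ValueError("at least one frame feature vector is required")
    normalized = [l2_normalize(v) for v in frame_vectors]
    dim = len(normalized[0])
    count = len(normalized)
    return [sum(v[d] for v in normalized) / count for d in range(dim)]


vec = speaker_feature_vector([[3.0, 4.0], [0.0, 2.0]])
```

The two frames normalize to [0.6, 0.8] and [0.0, 1.0], so `vec` is approximately [0.3, 0.9]. Normalizing before averaging keeps loud frames from dominating the result.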

For example, the speaker feature vector extraction unit 123 may extract a first frame feature vector for each frame of the first audio data and then normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, it may extract a second frame feature vector for each frame of the second audio data and then normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.

According to another embodiment, the speaker feature vector extraction unit 123 may extract frame feature vectors for only some of the frames in the audio data rather than for all of them. These frames are the ones selected as voice frames because their audio data is likely to be the user's voice data. The selection of voice frames may be made by the keyword detection unit 122. For each frame of the audio stream data, the keyword detection unit 122 may compute a first probability that the frame is human speech and a second probability that it is background sound. It may determine as a voice frame any frame for which the first probability exceeds the second probability by more than a preset reference value. The keyword detection unit 122 may store in the memory 110, in association with each frame of the audio stream data, a flag or bit indicating whether that frame is a voice frame.

When the speaker feature vector extraction unit 123 reads the first audio data and the second audio data from the memory 110, it may read the flags or bits along with them and thereby know whether each frame is a voice frame.

The speaker feature vector extraction unit 123 may extract a frame feature vector for each frame in the audio data that has been determined to be a voice frame, and then normalize and average the extracted frame feature vectors to obtain a speaker feature vector representing the audio data. For example, it may extract a first frame feature vector for each voice frame in the first audio data and then normalize and average the extracted first frame feature vectors to obtain the first speaker feature vector representing the first audio data. Likewise, it may extract a second frame feature vector for each voice frame in the second audio data and then normalize and average the extracted second frame feature vectors to obtain the second speaker feature vector representing the second audio data.

Based on the similarity between the first speaker feature vector and the second speaker feature vector extracted by the speaker feature vector extraction unit 123, the wake-up determination unit 124 determines whether the keyword was contained in the first audio data, that is, whether the keyword was contained in the audio signal of the first section. The wake-up determination unit 124 may compare the similarity between the first and second speaker feature vectors with a preset reference value and, when the similarity is equal to or lower than the reference value, determine that the keyword is contained in the first audio data of the first section.

A typical case in which the voice control device 100 misrecognizes a keyword is when a word pronounced similarly to the keyword occurs in the middle of the user's speech. For example, if the keyword is "クローバ" (kurōba), the voice control device 100 may react to "クローバー" ("clover") and wake up even when the user says to someone else, "How can I find a four-leaf clover (四葉のクローバー)?", and may then perform an operation the user did not intend. Likewise, when a television news announcer says "The market capitalization of JN Global is …", the voice control device 100 may react to "グローバル" ("global") and wake up. To prevent such misrecognition, according to one embodiment the voice control device 100 reacts to a word pronounced similarly to the keyword only when that word occurs at the very beginning of an utterance. In an environment with a lot of background noise, or where other people are talking, the fact that the user uttered the keyword at the very beginning of an utterance may go undetected because of the background noise or the other people's conversation. According to one embodiment, the voice control device 100 extracts the first speaker feature vector of the section in which the candidate keyword was detected and the second speaker feature vector of the previous section, and when the two vectors differ from each other, it may determine that the user uttered the speech corresponding to the keyword at the very beginning of an utterance.

To make this determination, the wake-up determination unit 124 may determine that the user uttered the speech corresponding to the keyword at the very beginning of an utterance when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or lower than a preset reference value. In that case, the wake-up determination unit 124 may determine that the keyword is contained in the first audio data of the first section and may wake up some functions of the voice control device 100. Conversely, a high similarity between the first and second speaker feature vectors means that the person who produced the speech corresponding to the first audio data and the person who produced the speech corresponding to the second audio data are likely to be the same, which suggests that the candidate keyword was spoken in the middle of that person's utterance.
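The decision rule above reduces to a single comparison once a similarity measure is chosen. The specification does not name the measure, so cosine similarity is used here as an assumption, along with a hypothetical reference value of 0.5 and hypothetical function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, threshold=0.5):
    """Wake up only when the two sections do NOT sound like the same speaker.

    A similarity at or below the reference value indicates that the
    candidate keyword began a new utterance.
    """
    return cosine_similarity(first_vec, second_vec) <= threshold


different_speakers = should_wake_up([1.0, 0.0], [0.0, 1.0])
same_speaker = should_wake_up([1.0, 0.2], [0.9, 0.25])
```

Note the direction of the comparison: low similarity triggers the wake-up, and high similarity suppresses it. With these inputs, `different_speakers` is `True` and `same_speaker` is `False`.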

When the second audio data is silence, the speaker feature vector extraction unit 123 may extract from the second audio data a second speaker feature vector corresponding to silence. Because the speaker feature vector extraction unit 123 extracts from the first audio data a first speaker feature vector corresponding to the user's voice, the similarity between the first and second speaker feature vectors is low.

The voice recognition unit 125 may receive third audio data, which is the part of the audio stream data generated by the audio processing unit 121 that corresponds to a third section, and perform voice recognition on it. According to another example, the voice recognition unit 125 may transmit the third audio data to an external device (for example, the server 200) so that voice recognition is performed externally, and then receive the voice recognition result.

The function unit 126 may perform a function corresponding to the keyword. For example, when the voice control device 100 is a smart speaker, the function unit 126 may include a music playback unit, a voice information providing unit, a peripheral device control unit, and the like, and may perform the function corresponding to the detected keyword. When the voice control device 100 is a smartphone, the function unit 126 may include a call connection unit, a text message transmission/reception unit, an Internet search unit, and the like, and may perform the function corresponding to the detected keyword. The function unit 126 may be configured in various ways depending on the type of the voice control device 100. The function unit 126 collectively represents the functional blocks for performing the various functions that the voice control device 100 is capable of.

Although the voice control device 100 illustrated in FIG. 3 is shown as including the voice recognition unit 125, this is exemplary; the voice control device 100 may omit the voice recognition unit 125, and the server 200 illustrated in FIG. 2 may perform the voice recognition function instead. In that case, as illustrated in FIG. 1, the voice control device 100 is connected via the network 300 to the server 200 that performs the voice recognition function. The voice control device 100 may provide the server 200 with a voice file containing a voice signal that requires voice recognition, and the server 200 may perform voice recognition on the voice signal in the voice file and generate a character string corresponding to the voice signal. The server 200 may transmit the generated character string to the voice control device 100 via the network 300. In the following description, however, the voice control device 100 is assumed to include the voice recognition unit 125 that performs the voice recognition function.

The processor 120 may load into the memory 110 the program code stored in a program file for the operation method. For example, a program may be installed on the voice control device 100 from the program file. When the program installed on the voice control device 100 is executed, the processor 120 may load the program code into the memory 110. Each of at least some of the audio processing unit 121, the keyword detection unit 122, the speaker feature vector extraction unit 123, the wake-up determination unit 124, the voice recognition unit 125, and the function unit 126 included in the processor 120 may then be implemented to execute instructions according to the corresponding part of the program code loaded into the memory 110 and thereby carry out the steps (S110 to S190) of FIG. 4.

Hereinafter, a statement that the functional blocks 121 to 126 of the processor 120 control the voice control device 100 is to be understood as meaning that the processor 120 controls other components of the voice control device 100. For example, the processor 120 may control the communication module 130 included in the voice control device 100 so that the voice control device 100 communicates with, for example, the server 200.

In step S110, the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. The audio processing unit 121 may receive this audio signal continuously. The audio signal may be an electrical signal generated by an input device such as the microphone 151 in response to the ambient sound.

In step S120, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. The audio stream data corresponds to the continuously received audio signal and may be data generated by filtering and digitizing the audio signal.

In step S130, the processor 120, for example the audio processing unit 121, temporarily stores in the memory 110 the audio stream data generated in step S120. Because the memory 110 has a limited size, only the portion of the audio stream data that corresponds to the audio signal of the most recent fixed period is held in the memory 110. When new audio stream data is generated, the oldest audio stream data stored in the memory 110 is deleted, and the new audio stream data is stored in the space freed by the deletion.
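The temporary storage described for step S130 behaves like a fixed-capacity ring buffer. The following is a minimal sketch under that assumption; the class name `AudioRingBuffer`, its methods, and the integer samples are hypothetical illustrations and are not taken from the disclosure.

```python
from collections import deque


class AudioRingBuffer:
    """Holds only the most recent `capacity` samples (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards the oldest items when it is full.
        self._samples = deque(maxlen=capacity)
        self.total_written = 0  # number of samples ever written

    def write(self, chunk) -> None:
        """Append new audio stream data, evicting the oldest data if needed."""
        self._samples.extend(chunk)
        self.total_written += len(chunk)

    def snapshot(self) -> list:
        """Return the buffered samples, oldest first."""
        return list(self._samples)


buf = AudioRingBuffer(capacity=5)
buf.write([1, 2, 3])
buf.write([4, 5, 6, 7])  # samples 1 and 2 are evicted
```

A real device would size the buffer so that it covers at least the longest previous section plus the longest keyword detection section.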

In step S140, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined keyword from the audio stream data generated in step S120. A candidate keyword is a word whose pronunciation is similar to that of a predefined keyword; it is the word that the keyword detection unit 122 detects as a keyword in step S140.

In step S150, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In step S160, the processor 120, for example the speaker feature vector extraction unit 123, reads from the memory 110 second audio data corresponding to a previous section. The previous section is the section immediately preceding the current section, and its end point may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.
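The relationship between the current section and the previous section can be sketched as a pair of slices over the buffered stream. This is only an illustration: the function name `split_sections`, the sample-index representation of start and end points, and the `prev_len` parameter are assumptions made for the example.

```python
def split_sections(stream, start, end, prev_len):
    """Return (first_audio, second_audio) from buffered stream data.

    first_audio  : current section [start, end), where the candidate was detected
    second_audio : previous section ending exactly where the current one starts
    All arguments are sample indices or counts (hypothetical representation).
    """
    if not 0 <= start < end <= len(stream):
        raise ValueError("current section lies outside the buffered data")
    # Clamp so that a keyword near the start of the buffer still works.
    prev_start = max(0, start - prev_len)
    return stream[start:end], stream[prev_start:start]


stream = list(range(10))
first_audio, second_audio = split_sections(stream, start=6, end=9, prev_len=4)
```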

In step S170, the processor 120, for example the speaker feature vector extraction unit 123, extracts a first speaker feature vector from the first audio data and a second speaker feature vector from the second audio data. The first speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the first audio data, and the second speaker feature vector is an indicator for identifying the speaker of the voice corresponding to the second audio data. The processor 120, for example the wake-up determination unit 124, may determine whether the first audio data contains the keyword based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determination unit 124 determines that the first audio data contains the keyword, it may wake up some components of the voice control device 100.

In step S180, the processor 120, for example the wake-up determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value.

When the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than the preset reference value, the speaker of the first audio data in the current section differs from the speaker of the second audio data in the previous section, so the wake-up determination unit 124 may determine that the first audio data contains the keyword. In that case, as in step S190, the processor 120, for example the wake-up determination unit 124, may wake up some components of the voice control device 100.

When the similarity between the first speaker feature vector and the second speaker feature vector is higher than the preset reference value, however, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section, so the wake-up determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with the wake-up. In that case, the process returns to step S110 and the audio signal corresponding to the ambient sound is received. Reception of the audio signal in step S110 continues while steps S120 to S190 are performed.
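The comparison in steps S180 and S190 can be sketched as follows. The disclosure does not state which similarity measure is used; cosine similarity is assumed here purely for illustration, and the function names and the threshold value 0.7 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def should_wake_up(first_vec, second_vec, threshold=0.7):
    """True when the similarity is at or below the reference value.

    Low similarity means the speaker changed at the section boundary,
    so the candidate is accepted as the keyword.
    """
    return cosine_similarity(first_vec, second_vec) <= threshold


speaker_changed = should_wake_up([1.0, 0.0], [0.0, 1.0])
same_speaker = should_wake_up([1.0, 0.0], [1.0, 0.1])
```

Note that the test is "less than or equal to", matching the text: a similarity exactly equal to the reference value still triggers the wake-up.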

A plurality of predefined keywords are stored in the keyword storage 110a of FIG. 3. Each such keyword may be a wake-up keyword or a single command keyword. A wake-up keyword is used to wake up some functions of the voice control device 100. Typically, a user utters the wake-up keyword and then utters a desired natural language voice command. The voice control device 100 may recognize the natural language voice command and perform the operation or function corresponding to it.

A single command keyword causes the voice control device 100 to perform a specific operation or function directly, and may be a simple predefined word such as "再生" (saisei, "play") or "中止" (chūshi, "stop"). When a single command keyword is received, the voice control device 100 may wake up the function corresponding to that keyword and perform the function.

The following describes two cases in turn: the case in which a candidate keyword corresponding to a single command keyword is detected from the audio stream data, and the case in which a candidate keyword corresponding to a wake-up keyword is detected from the audio stream data.

FIG. 5 is a flowchart illustrating an example of an operation method that the voice control device can perform according to another embodiment.

FIG. 6A illustrates an example in which a single command keyword is uttered while the voice control device according to one embodiment executes the operation method of FIG. 5, and FIG. 6B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to one embodiment executes the operation method of FIG. 5.

The operation method of FIG. 5 includes steps that are substantially the same as those of the operation method of FIG. 4. Those steps are not described again in detail. FIGS. 6A and 6B each show an audio signal corresponding to the audio stream data together with the user's speech corresponding to that audio signal. FIG. 6A shows the audio signal corresponding to the utterance "中止" (chūshi, "stop"), and FIG. 6B shows the audio signal corresponding to the utterance "ここで停止して" (koko de teishi shite, "halt here").

Referring to FIG. 5 together with FIGS. 6A and 6B, in step S210 the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound.

In step S220, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151.

In step S230, the processor 120, for example the audio processing unit 121, temporarily stores in the memory 110 the audio stream data generated in step S220.

In step S240, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined single command keyword from the audio stream data generated in step S220. A single command keyword may be a voice keyword that directly controls the operation of the voice control device 100. For example, as illustrated in FIG. 6A, the single command keyword may be a word such as "中止" (chūshi, "stop"). In this example, the voice control device 100 may be playing music or video.

In the example of FIG. 6A, the keyword detection unit 122 may detect the candidate keyword "中止" (chūshi, "stop") from the audio signal. In the example of FIG. 6B, the keyword detection unit 122 may detect from the audio signal the candidate keyword "停止" (teishi, "halt"), a word whose pronunciation is similar to that of the keyword "中止".

In step S250, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In the example of FIG. 6A, the keyword detection unit 122 may identify the section in which the candidate keyword "中止" (chūshi, "stop") was detected as the current section and determine its start point and end point. The audio data corresponding to the current section is referred to as first audio data AD1.

In the example of FIG. 6B, the keyword detection unit 122 may identify the section in which the candidate keyword "停止" (teishi, "halt") was detected as the current section and determine its start point and end point. The audio data corresponding to the current section is referred to as first audio data AD1.

Also in step S250, the processor 120, for example the keyword detection unit 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single command keyword. In the examples of FIGS. 6A and 6B, the keyword detection unit 122 may determine that the detected candidate keywords, namely "中止" (chūshi, "stop") and "停止" (teishi, "halt"), are candidate keywords corresponding to a single command keyword.
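One simple way to picture this classification is a lookup from each predefined keyword to its kind. The sketch below is an assumption-laden illustration: the names `KeywordType`, `KEYWORD_STORE`, and `keyword_type` are hypothetical, and the store contents are just the example keywords mentioned in this description. The function takes the predefined keyword that the candidate was matched against (for instance "中止" for the candidate "停止").

```python
from enum import Enum
from typing import Optional


class KeywordType(Enum):
    WAKE_UP = "wake-up"
    SINGLE_COMMAND = "single command"


# Hypothetical contents of a keyword store such as 110a.
KEYWORD_STORE = {
    "クローバ": KeywordType.WAKE_UP,
    "中止": KeywordType.SINGLE_COMMAND,
    "再生": KeywordType.SINGLE_COMMAND,
}


def keyword_type(matched_keyword: str) -> Optional[KeywordType]:
    """Return the kind of the predefined keyword, or None if it is unknown."""
    return KEYWORD_STORE.get(matched_keyword)
```

The result decides which flow follows: the single command flow of FIG. 5 or the wake-up flow of FIG. 7.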

In step S260, the processor 120, for example the speaker feature vector extraction unit 123, reads from the memory 110 second audio data corresponding to a previous section. The previous section is the section immediately preceding the current section, and its end point may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.

In the examples of FIGS. 6A and 6B alike, the speaker feature vector extraction unit 123 may read from the memory 110 second audio data AD2 corresponding to the previous section, that is, the section immediately preceding the current section. In the example of FIG. 6B, the second audio data AD2 corresponds to the speech "こで" ("kode"). The length of the previous section may be set variably according to the detected candidate keyword.

In step S270, the processor 120, for example the speaker feature vector extraction unit 123, receives from the audio processing unit 121 third audio data corresponding to a next section that follows the current section. The next section is the section immediately following the current section, and its start point may coincide with the end point of the current section.

In the examples of FIGS. 6A and 6B alike, the speaker feature vector extraction unit 123 may receive from the audio processing unit 121 third audio data AD3 corresponding to the next section, that is, the section immediately following the current section. In the example of FIG. 6B, the third audio data AD3 corresponds to the speech "して" ("shite"). The length of the next section may be set variably according to the detected candidate keyword.

In step S280, the processor 120, for example the speaker feature vector extraction unit 123, extracts first to third speaker feature vectors from the first to third audio data, respectively. Each of the first to third speaker feature vectors is an indicator for identifying the speaker of the voice corresponding to the respective one of the first to third audio data. The processor 120, for example the wake-up determination unit 124, may determine whether the first audio data contains a single command keyword based on the similarity between the first and second speaker feature vectors and the similarity between the first and third speaker feature vectors. When the wake-up determination unit 124 determines that the first audio data contains a single command keyword, it may wake up some components of the voice control device 100.

In the example of FIG. 6A, the first speaker feature vector corresponding to the first audio data AD1 is an indicator for identifying the speaker who uttered "中止" (chūshi, "stop"). Because the second audio data AD2 and the third audio data AD3 are substantially silent, the second and third speaker feature vectors may take values corresponding to silence. The similarity between the first speaker feature vector and each of the second and third speaker feature vectors is therefore low.

As another example, when a person other than the speaker who uttered "中止" (chūshi, "stop") speaks during the previous section and the next section, the second and third speaker feature vectors take values corresponding to that other person, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors is again low.

In the example of FIG. 6B, one person uttered "ここで停止して" (koko de teishi shite, "halt here"). The first speaker feature vector extracted from the first audio data AD1 corresponding to "停止" (teishi), the second speaker feature vector extracted from the second audio data AD2 corresponding to "こで" ("kode"), and the third speaker feature vector extracted from the third audio data AD3 corresponding to "して" ("shite") therefore all identify substantially the same speaker, and the similarity among the first to third speaker feature vectors is high.

In step S290, the processor 120, for example the wake-up determination unit 124, compares the similarity between the first and second speaker feature vectors with a preset reference value, and compares the similarity between the first and third speaker feature vectors with the preset reference value. When both similarities are equal to or less than the preset reference value, the speaker of the first audio data in the current section differs from the speaker of the second audio data in the previous section and from the speaker of the third audio data in the next section, so the wake-up determination unit 124 may determine that the first audio data contains a single command keyword. In that case, as in step S300, the processor 120, for example the wake-up determination unit 124, provides the single command keyword to the function unit 126, and the function unit 126, in response to the determination by the wake-up determination unit 124 that the first audio data contains the single command keyword, may perform the function corresponding to that keyword.

In the example of FIG. 6A, the first speaker feature vector corresponds to the speaker who uttered "中止" (chūshi, "stop"), while the second and third speaker feature vectors correspond to silence, so the similarity between the first speaker feature vector and each of the second and third speaker feature vectors is lower than the preset reference value. The wake-up determination unit 124 may then determine that the first audio data AD1 contains the single command keyword "中止". The function unit 126 is woken up in response to this determination and may perform the operation or function corresponding to the single command keyword "中止". For example, if the voice control device 100 is playing music, the function unit 126 may stop the music playback in response to the single command keyword "中止".

When the similarity between the first and second speaker feature vectors is higher than the preset reference value, or the similarity between the first and third speaker feature vectors is higher than the preset reference value, however, the speaker of the first audio data in the current section is the same as the speaker of the second audio data in the previous section or the speaker of the third audio data in the next section, so the wake-up determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with the wake-up. In that case, the process returns to step S210 and the audio signal corresponding to the ambient sound is received.
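The two-sided test of steps S290 and S300 can be summarized in a few lines. This is a sketch only: the function name `contains_single_command`, the reference value 0.7, and the similarity scores used for the FIG. 6A and FIG. 6B cases are hypothetical numbers chosen for illustration, not values from the disclosure.

```python
def contains_single_command(sim_prev: float, sim_next: float,
                            threshold: float) -> bool:
    """Accept the candidate only if BOTH neighbouring sections sound like a
    different speaker (or silence) than the current section.

    sim_prev : similarity between first and second speaker feature vectors
    sim_next : similarity between first and third speaker feature vectors
    """
    return sim_prev <= threshold and sim_next <= threshold


THRESHOLD = 0.7  # hypothetical preset reference value

# FIG. 6A: keyword surrounded by silence, so both similarities are low.
fig_6a = contains_single_command(0.1, 0.2, THRESHOLD)
# FIG. 6B: one person speaks the whole sentence, so both similarities are high.
fig_6b = contains_single_command(0.9, 0.95, THRESHOLD)
# One high similarity is enough to reject the candidate.
mixed = contains_single_command(0.1, 0.9, THRESHOLD)
```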

In the example of FIG. 6B, one person uttered "ここで停止して" (koko de teishi shite, "halt here"), so the similarity among the first to third speaker feature vectors is high. Because this utterance does not contain a keyword for controlling or waking up the voice control device, the wake-up determination unit 124 determines that the first audio data AD1 does not contain a single command keyword and prevents the function unit 126 from performing the function or operation corresponding to "停止" (teishi, "halt") or "中止" (chūshi, "stop").

With conventional technology, a voice control device may detect the word "停止" (teishi, "halt") within the utterance "ここで停止して" (koko de teishi shite, "halt here") and perform the function or operation corresponding to it. The user did not intend that function or operation and finds the voice control device inconvenient to use as a result. According to one embodiment, however, the voice control device 100 can accurately recognize a single command keyword in the user's speech and, unlike conventional technology, does not perform such an erroneous operation.

FIG. 7 is a flowchart illustrating an example of an operation method that the voice control device can perform according to still another embodiment.

FIG. 8A illustrates an example in which a wake-up keyword and a natural language voice command are uttered while the voice control device according to one embodiment executes the operation method of FIG. 7, and FIG. 8B illustrates an example in which ordinary conversational speech is uttered while the voice control device according to one embodiment executes the operation method of FIG. 7.

The operation method of FIG. 7 includes steps that are substantially the same as those of the operation method of FIG. 4. Those steps are not described again in detail. FIGS. 8A and 8B each show an audio signal corresponding to the audio stream data together with the user's speech corresponding to that audio signal. FIG. 8A shows the audio signal corresponding to the wake-up keyword "クローバ" ("Clova") followed by the natural language voice command "明日の天気を教えて" ("Tell me tomorrow's weather"), and FIG. 8B shows the audio signal corresponding to the conversational utterance "四葉のクローバーをどうやって見つけられるの" ("How can I find a four-leaf clover?").

Referring to FIG. 7 together with FIGS. 8A and 8B, in step S410 the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound. In step S420, the processor 120, for example the audio processing unit 121, generates audio stream data based on the audio signal from the microphone 151. In step S430, the processor 120, for example the audio processing unit 121, temporarily stores in the memory 110 the audio stream data generated in step S420.

In step S440, the processor 120, for example the keyword detection unit 122, detects a candidate keyword corresponding to a predefined wake-up keyword from the audio stream data generated in step S420. A wake-up keyword is a voice-based keyword that can switch a voice control device in sleep mode to wake-up mode. For example, the wake-up keyword may be a voice keyword such as "クローバ" ("Clova") or "ハイコンピュータ" ("Hi Computer").

In the example of FIG. 8A, the keyword detection unit 122 may detect the candidate keyword "クローバ" ("Clova") from the audio signal. In the example of FIG. 8B, the keyword detection unit 122 may detect from the audio signal the candidate keyword "クローバー" ("clover"), a word whose pronunciation is similar to that of the keyword "クローバ".

In step S450, the processor 120, for example the keyword detection unit 122, identifies the keyword detection section of the audio stream data in which the candidate keyword was detected, and determines the start point and end point of that section. The keyword detection section is referred to as the current section, and the portion of the audio stream data corresponding to the current section is referred to as the first audio data.

In the example of FIG. 8A, the keyword detection unit 122 may identify the section in which the candidate keyword "クローバ" ("Clova") was detected as the current section and determine its start point and end point. In the example of FIG. 8B, the keyword detection unit 122 may likewise identify the section in which the candidate keyword "クローバー" ("clover") was detected as the current section and determine its start point and end point. In both examples, the audio data corresponding to the current section is referred to as first audio data AD1.

Also in step (S450), the processor 120, for example the keyword detection unit 122, may determine whether the detected candidate keyword corresponds to a wake-up keyword or to a single command keyword. In the examples of FIGS. 8A and 8B, the keyword detection unit 122 may determine that the detected candidate keywords, "Clova" and "clover", are candidate keywords corresponding to the wake-up keyword.

In step (S460), the processor 120, for example the speaker feature vector extraction unit 123, reads from the memory 110 the second audio data corresponding to the previous section. The previous section is the section immediately preceding the current section, and the end point of the previous section may coincide with the start point of the current section. The speaker feature vector extraction unit 123 may also read the first audio data from the memory 110.

In the example of FIG. 8A, the speaker feature vector extraction unit 123 may read from the memory 110 the second audio data AD2 corresponding to the previous section, which immediately precedes the current section. The same applies in the example of FIG. 8B, where the second audio data AD2 corresponds to the utterance "four-leaf". The length of the previous section may be set variably depending on the detected candidate keyword.
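As a rough sketch of how the previous section could be located in a buffered stream, the following assumes that sections are expressed in frame indices and that each keyword has its own previous-section length. The table `PREVIOUS_LENGTH_FRAMES`, its values, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-keyword lengths of the previous section, in frames.
PREVIOUS_LENGTH_FRAMES: Dict[str, int] = {"Clova": 50, "Hi computer": 80}
DEFAULT_PREVIOUS_LENGTH = 50  # assumed fallback for keywords not in the table


def previous_section(current_start: int, keyword: str) -> Tuple[int, int]:
    """Return (start, end) of the previous section in frame indices."""
    length = PREVIOUS_LENGTH_FRAMES.get(keyword, DEFAULT_PREVIOUS_LENGTH)
    start = max(0, current_start - length)  # clamp at the beginning of the buffer
    return start, current_start             # end of previous == start of current


def read_second_audio_data(buffer: Sequence[float], current_start: int, keyword: str) -> List[float]:
    """Read the second audio data (AD2) for the previous section from the buffer."""
    start, end = previous_section(current_start, keyword)
    return list(buffer[start:end])
```

The end of the returned range always equals the start point of the current section, which matches the adjacency described above.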

In step (S470), the processor 120, for example the speaker feature vector extraction unit 123, extracts a first speaker feature vector from the first audio data and a second speaker feature vector from the second audio data. The processor 120, for example the wake-up determination unit 124, may determine whether the first audio data contains the wake-up keyword based on the similarity between the first speaker feature vector and the second speaker feature vector. When the wake-up determination unit 124 determines that the first audio data contains the wake-up keyword, it may wake up some components of the voice control device 100.
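Appendix 8 describes obtaining each speaker feature vector by normalizing and averaging per-frame feature vectors. A minimal sketch of that aggregation follows. It assumes the per-frame feature vectors are already available (how they are computed is outside this sketch) and that "normalizing" means L2 normalization, which is an assumption.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return [0.0] * len(vector)
    return [x / norm for x in vector]


def speaker_feature_vector(frame_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each frame feature vector, then average them component-wise."""
    if not frame_vectors:
        raise ValueError("at least one frame feature vector is required")
    normalized = [l2_normalize(v) for v in frame_vectors]
    dim = len(normalized[0])
    count = len(normalized)
    return [sum(v[i] for v in normalized) / count for i in range(dim)]
```

Applying `speaker_feature_vector` once to the frames of the first audio data and once to the frames of the second audio data yields the two vectors that are compared.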

In the example of FIG. 8A, the first speaker feature vector corresponding to the first audio data AD1 is an index for identifying the speaker who uttered "Clova". Because the second audio data AD2 is substantially silent, the second speaker feature vector may be a vector corresponding to silence. The similarity between the first speaker feature vector and the second speaker feature vector is therefore low.

As another example, when a person other than the speaker who uttered "Clova" speaks during the previous section, the second speaker feature vector corresponds to that other person, so the similarity between the first speaker feature vector and the second speaker feature vector is low.

In the example of FIG. 8B, one person uttered "How can I find a four-leaf clover?" The first speaker feature vector, extracted from the first audio data AD1 corresponding to "clover", and the second speaker feature vector, extracted from the second audio data AD2 corresponding to "four-leaf", are therefore both vectors identifying substantially the same speaker, so the similarity between the first speaker feature vector and the second speaker feature vector is high.
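The disclosure does not name a particular similarity measure. As one possible illustration, the sketch below uses cosine similarity; this choice and the function name `cosine_similarity` are assumptions. Treating a zero vector as having similarity 0 is also an assumption of this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this measure, two vectors from the same speaker would be expected to score close to 1, while a speaker vector compared with a silence vector or with another speaker's vector would score lower.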

In step (S480), the processor 120, for example the wake-up determination unit 124, compares the similarity between the first speaker feature vector and the second speaker feature vector with a preset reference value. When the similarity is higher than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section are the same, so the wake-up determination unit 124 determines that the first audio data does not contain the keyword and does not proceed with wake-up. In that case, the process returns to step (S410), and the processor 120, for example the audio processing unit 121, receives an audio signal corresponding to ambient sound.

In the example of FIG. 8B, one person uttered "four-leaf clover ...", so the similarity between the first speaker feature vector and the second speaker feature vector is high. The wake-up determination unit 124 judges that the person who uttered "four-leaf clover" did not intend to wake up the voice control device 100, determines that the first audio data AD1 does not contain the wake-up keyword, and does not wake up the voice control device 100.

When the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than the preset reference value, the speaker of the first audio data in the current section and the speaker of the second audio data in the previous section differ from each other, so the wake-up determination unit 124 may determine that the first audio data contains the keyword. In that case, the wake-up determination unit 124 may wake up some components of the voice control device 100. For example, the wake-up determination unit 124 may wake up the voice recognition unit 125.

In the example of FIG. 8A, the first speaker feature vector corresponds to the speaker who uttered "Clova" and the second speaker feature vector corresponds to silence, so the similarity between the first speaker feature vector and the second speaker feature vector is lower than the preset reference value. In that case, the wake-up determination unit 124 may determine that the first audio data AD1 contains the wake-up keyword "Clova", and the voice recognition unit 125 is woken up to recognize a natural-language voice command.
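The comparison in step (S480) reduces to a single threshold test. The sketch below assumes the similarity has already been computed as a number; the function name and the example reference value of 0.5 are hypothetical.

```python
def contains_wake_up_keyword(similarity: float, reference_value: float) -> bool:
    """Return True when the wake-up keyword is judged to be present.

    A similarity at or below the reference value means the speakers of the
    current and previous sections differ (or the previous section is silent),
    so the candidate keyword is accepted. A higher similarity means the same
    speaker was already talking, so the candidate is rejected.
    """
    return similarity <= reference_value


# Example with a hypothetical reference value of 0.5:
wake_fig_8a = contains_wake_up_keyword(0.1, 0.5)  # silence before the keyword
wake_fig_8b = contains_wake_up_keyword(0.9, 0.5)  # same speaker before the keyword
```

Note that a similarity exactly equal to the reference value counts as "equal to or less than" and therefore leads to wake-up.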

In step (S490), the processor 120, for example the voice recognition unit 125, receives from the audio processing unit 121 third audio data corresponding to the next section following the current section. The next section is the section immediately following the current section, and the start point of the next section may coincide with the end point of the current section.

When silence of a preset length is detected in the third audio data, the voice recognition unit 125 may determine the end point of the next section. The voice recognition unit 125 may perform voice recognition on the third audio data, using any of various methods. According to another example, to obtain a voice recognition result for the third audio data, the voice recognition unit 125 may transmit the third audio data to an external device, for example the server 200 having a voice recognition function shown in FIG. 2. The server 200 may receive the third audio data, perform voice recognition on it to generate a character string (text) corresponding to the third audio data, and transmit the generated character string (text) to the voice recognition unit 125 as the voice recognition result.
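Determining the end point of the next section from silence of a preset length can be sketched as follows. The sketch assumes a per-frame silence flag is available from some upstream detector and places the end point at the first frame of the silent run; both points, and the name `find_section_end`, are assumptions.

```python
from typing import Optional, Sequence


def find_section_end(is_silent: Sequence[bool], min_silent_frames: int) -> Optional[int]:
    """Return the frame index at which the section ends, or None if no
    silent run of the required length has been seen yet."""
    if min_silent_frames < 1:
        raise ValueError("min_silent_frames must be at least 1")
    run = 0  # length of the current run of consecutive silent frames
    for i, silent in enumerate(is_silent):
        run = run + 1 if silent else 0
        if run >= min_silent_frames:
            return i - run + 1  # the section ends where the silent run began
    return None
```

A short pause inside the command does not end the section, because the run counter resets whenever a non-silent frame arrives.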

In the example of FIG. 8A, the third audio data of the next section is a natural-language voice command such as "Tell me tomorrow's weather". The voice recognition unit 125 may perform voice recognition on the third audio data directly to generate a voice recognition result, or may transmit the third audio data to the outside (for example, to the server 200) so that it is voice-recognized.

In step (S500), the processor 120, for example the functional unit 126, may perform a function corresponding to the voice recognition result of the third audio data. In the example of FIG. 8A, the functional unit 126 may be a voice information providing unit that searches for tomorrow's weather and provides the result; it may search for tomorrow's weather over the Internet and provide the result to the user. The functional unit 126 may also provide the search result for tomorrow's weather as speech through the speaker 152. The functional unit 126 may be woken up in response to the voice recognition result of the third audio data.

The embodiments of the present invention described above may be embodied in the form of a computer program executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may store a computer-executable program permanently, or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs (compact disc read-only memory) and DVDs (digital versatile discs); magneto-optical media such as floptical disks; and ROM (read-only memory), RAM (random access memory), flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, and by sites, servers, and the like that supply or distribute various other software.

In this specification, a "unit", "module", or the like may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor. For example, a "unit", "module", or the like may be embodied by components such as software components, object-oriented software components, class components, and task components, as well as by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The foregoing description of the present invention is for illustration only, and a person skilled in the art to which the present invention belongs will understand that it can easily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents must be construed as falling within the scope of the present invention.

The voice control device that prevents keyword misrecognition and its operating method according to the present invention can be applied effectively to, for example, technical fields related to voice recognition.

(Appendix 1)
A voice control device comprising:
an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data;
a keyword detection unit that detects, from the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determines a start point and an end point of a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
a speaker feature vector extraction unit that extracts a first speaker feature vector for the first audio data, and extracts a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and
a wake-up determination unit that determines, based on a similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data contains the keyword.
(Appendix 2)
The voice control device according to Appendix 1, wherein the wake-up determination unit determines that the first audio data contains the keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value.
(Appendix 3)
The voice control device according to Appendix 1, further comprising a keyword storage that stores a plurality of keywords including the predetermined keyword,
wherein each of the keywords is a wake-up keyword or a single command keyword.
(Appendix 4)
The voice control device according to Appendix 3, wherein, when the keyword detection unit detects from the audio stream data the candidate keyword corresponding to the single command keyword,
the speaker feature vector extraction unit receives third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section, and extracts a third speaker feature vector of the third audio data, and
the wake-up determination unit determines whether the first audio data contains the single command keyword based on the similarity between the first speaker feature vector and the second speaker feature vector and on a similarity between the first speaker feature vector and the third speaker feature vector.
(Appendix 5)
The voice control device according to Appendix 4, wherein the wake-up determination unit determines that the first audio data contains the single command keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a predetermined reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a predetermined reference value.
(Appendix 6)
The voice control device according to Appendix 3, further comprising a voice recognition unit that, when the keyword detection unit detects from the audio stream data the candidate keyword corresponding to the wake-up keyword,
is woken up in response to a determination by the wake-up determination unit that the first audio data contains the wake-up keyword, receives third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section, and either performs voice recognition on the third audio data or transmits the third audio data to the outside so that it is voice-recognized.
(Appendix 7)
The voice control device according to Appendix 6, wherein the second section is variably determined by the wake-up keyword.
(Appendix 8)
The voice control device according to Appendix 1, wherein the speaker feature vector extraction unit
extracts a first frame feature vector for each frame of the first audio data, and normalizes and averages the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data, and
extracts a second frame feature vector for each frame of the second audio data, and normalizes and averages the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 9)
The voice control device according to Appendix 1, wherein the keyword detection unit calculates, for each frame of the audio stream data, a first probability that the frame is human voice and a second probability that the frame is background sound, and determines as a voice frame any frame whose first probability exceeds its second probability by more than a predetermined reference value, and
the speaker feature vector extraction unit
extracts a first frame feature vector for each frame in the first audio data that is determined to be a voice frame, and normalizes and averages the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data, and
extracts a second frame feature vector for each frame in the second audio data that is determined to be a voice frame, and normalizes and averages the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 10)
The voice control device according to Appendix 1, wherein the speaker feature vector extraction unit is woken up in response to detection of the candidate keyword by the keyword detection unit.
(Appendix 11)
A method of operating a voice control device, the method comprising:
a step of receiving an audio signal corresponding to ambient sound and generating audio stream data;
a step of detecting, from the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determining a start point and an end point of a first section corresponding to first audio data in which the candidate keyword is detected in the audio stream data;
a step of extracting a first speaker feature vector for the first audio data;
a step of extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data whose end point is the start point of the first section; and
a step of determining, based on a similarity between the first speaker feature vector and the second speaker feature vector, whether the first audio data contains the keyword, and deciding whether to perform wake-up.
(Appendix 12)
The method of operating a voice control device according to Appendix 11, wherein the step of deciding whether to perform wake-up includes:
a step of comparing the similarity between the first speaker feature vector and the second speaker feature vector with a predetermined reference value;
a step of determining that the first audio data contains the keyword and performing wake-up when the similarity is equal to or less than the predetermined reference value; and
a step of determining that the first audio data does not contain the keyword and not performing wake-up when the similarity exceeds the predetermined reference value.
(Appendix 13)
The method of operating a voice control device according to Appendix 11, further comprising, when the detected candidate keyword is a candidate keyword corresponding to a single command keyword:
a step of receiving third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section;
a step of extracting a third speaker feature vector of the third audio data; and
a step of determining that the first audio data contains the single command keyword when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a predetermined reference value and a similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than a predetermined reference value.
(Appendix 14)
The method of operating a voice control device according to Appendix 13, further comprising a step of performing a function corresponding to the single command keyword in response to the determination that the first audio data contains the single command keyword.
(Appendix 15)
The method of operating a voice control device according to Appendix 11, further comprising, when the detected candidate keyword is a candidate keyword corresponding to a wake-up keyword:
a step of receiving, in response to a determination that the first audio data contains the wake-up keyword, third audio data corresponding to a third section of the audio stream data whose start point is the end point of the first section; and
a step of performing voice recognition on the third audio data, or transmitting the third audio data to the outside so that it is voice-recognized.
(Appendix 16)
The method of operating a voice control device according to Appendix 11, wherein the steps of extracting the first speaker feature vector and the second speaker feature vector include:
a step of extracting a first frame feature vector for each frame of the first audio data;
a step of normalizing and averaging the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data;
a step of extracting a second frame feature vector for each frame of the second audio data; and
a step of normalizing and averaging the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 17)
The method of operating a voice control device according to Appendix 11, further comprising a step of calculating, for each frame of the audio stream data, a first probability that the frame is human voice and a second probability that the frame is background sound, and determining as a voice frame any frame whose first probability exceeds its second probability by more than a predetermined reference value,
wherein the steps of extracting the first speaker feature vector and the second speaker feature vector include:
a step of extracting a first frame feature vector for each frame in the first audio data that is determined to be a voice frame;
a step of normalizing and averaging the extracted first frame feature vectors to extract the first speaker feature vector representing the first audio data;
a step of extracting a second frame feature vector for each frame in the second audio data that is determined to be a voice frame; and
a step of normalizing and averaging the extracted second frame feature vectors to extract the second speaker feature vector representing the second audio data.
(Appendix 18)
A computer program including instructions that cause a processor of a voice control device to execute the operating method according to any one of Appendices 11 to 17.
(Appendix 19)
A recording medium on which the computer program described in Appendix 18 is recorded.

100 Voice control device (electronic device)
110 Memory
120 Processor
121 Audio processing unit
122 Keyword detection unit
123 Speaker feature vector extraction unit
124 Wake-up determination unit
125 Voice recognition unit
126 Function unit

Claims

A voice control device comprising:
an audio processing unit that receives an audio signal corresponding to ambient sound and generates audio stream data;
a keyword detection unit that detects, from the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determines a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected;
a speaker feature vector extraction unit that determines second audio data corresponding to a second section of the audio stream data, which is the section preceding the first section, and extracts a first speaker feature vector for the first audio data and a second speaker feature vector for the second audio data; and
a wake-up determination unit that determines whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector.

The voice control device according to claim 1, wherein the wake-up determination unit determines that the keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value.
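The threshold test in this claim can be sketched as follows. The claim does not name a similarity measure, so cosine similarity is an assumption, and `cosine_similarity`, `keyword_confirmed`, and `reference_value` are hypothetical names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_confirmed(
    first_vec: Sequence[float],
    second_vec: Sequence[float],
    reference_value: float,
) -> bool:
    """True when the keyword section and the preceding section are dissimilar
    enough (similarity <= reference value) to accept the keyword."""
    return cosine_similarity(first_vec, second_vec) <= reference_value


# Different speakers (orthogonal toy vectors): the keyword is accepted.
accepted = keyword_confirmed([1.0, 0.0], [0.0, 1.0], reference_value=0.5)
# Same speaker before and during the keyword: the keyword is rejected.
rejected = keyword_confirmed([1.0, 0.0], [1.0, 0.0], reference_value=0.5)
```

The direction of the comparison is the notable point: a high similarity means the same voice was already speaking just before the candidate keyword, which suggests the keyword was part of ongoing speech rather than a deliberate command.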

The voice control device according to claim 1, wherein, when the keyword detection unit detects, from the audio stream data, a candidate keyword corresponding to a single command keyword,
the speaker feature vector extraction unit extracts, from the audio stream data, a third speaker feature vector for third audio data corresponding to a third section, which is the section following the first section, and
the wake-up determination unit determines whether the single command keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector and the similarity between the first speaker feature vector and the third speaker feature vector.

The voice control device according to claim 3, wherein the wake-up determination unit determines that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than the preset reference value.
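As a minimal sketch of the two-sided check in this claim, the function below takes the two similarity scores as already-computed inputs. The name `single_command_confirmed` and the example values are hypothetical.

```python
def single_command_confirmed(
    sim_first_second: float,
    sim_first_third: float,
    reference_value: float,
) -> bool:
    """Accept a single command keyword only when the keyword section is
    dissimilar to both the preceding and the following section."""
    return (
        sim_first_second <= reference_value
        and sim_first_third <= reference_value
    )


# Keyword spoken in isolation: dissimilar to both neighbouring sections.
isolated = single_command_confirmed(0.1, 0.2, reference_value=0.5)
# Same speaker keeps talking after the keyword: the command is rejected.
continued = single_command_confirmed(0.1, 0.9, reference_value=0.5)
```

Checking the following section as well reflects that a single command keyword is not followed by further speech from the same speaker, unlike a wake-up keyword, which is followed by a spoken request.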

The voice control device according to claim 1, further comprising a voice recognition unit that, when the keyword detection unit detects, from the audio stream data, a candidate keyword corresponding to a wake-up keyword,
is woken up in response to the wake-up determination unit determining that the wake-up keyword is included in the first audio data, and either performs voice recognition on third audio data corresponding to a third section of the audio stream data, which is the section following the first section, or transmits the third audio data to an external device so that voice recognition is performed on the third audio data.

A method of operating a voice control device, the method comprising:
a step of receiving an audio signal corresponding to ambient sound and generating audio stream data;
a step of detecting, from the audio stream data, a candidate keyword corresponding to a predetermined keyword, and determining a first section of the audio stream data corresponding to first audio data in which the candidate keyword is detected;
a step of extracting a first speaker feature vector for the first audio data;
a step of extracting a second speaker feature vector for second audio data corresponding to a second section of the audio stream data, which is the section preceding the first section; and
a step of determining whether the keyword is included in the first audio data based on the similarity between the first speaker feature vector and the second speaker feature vector, and thereby determining whether to wake up.
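The relationship between the first section and the preceding second section can be sketched as a simple slicing step. The claim does not fix the length of the second section, so the fixed-length `context_len` window is an assumption, as are the function name `split_sections` and the use of sample indices.

```python
from typing import List, Sequence, Tuple


def split_sections(
    stream: Sequence[float],
    kw_start: int,
    kw_end: int,
    context_len: int,
) -> Tuple[List[float], List[float]]:
    """Return (first_audio, second_audio): the samples where the candidate
    keyword was detected, and up to `context_len` samples just before it."""
    if not 0 <= kw_start < kw_end <= len(stream):
        raise ValueError("keyword boundaries must lie inside the stream")
    if context_len < 0:
        raise ValueError("context_len must be non-negative")
    # Clamp at the stream start in case the keyword occurs very early.
    second_start = max(0, kw_start - context_len)
    first_audio = list(stream[kw_start:kw_end])
    second_audio = list(stream[second_start:kw_start])
    return first_audio, second_audio


# Toy stream of ten samples with the candidate keyword at indices 4..6.
first, second = split_sections(list(range(10)), kw_start=4, kw_end=7, context_len=3)
```

Speaker feature vectors would then be extracted from each returned section and compared to decide whether to wake up.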

The method of operating the voice control device according to claim 6, further comprising, when the detected candidate keyword corresponds to a single command keyword:
a step of extracting a third speaker feature vector for third audio data corresponding to a third section of the audio stream data, which is the section following the first section; and
a step of determining that the single command keyword is included in the first audio data when the similarity between the first speaker feature vector and the second speaker feature vector is equal to or less than a preset reference value and the similarity between the first speaker feature vector and the third speaker feature vector is equal to or less than the preset reference value.

The method of operating the voice control device according to claim 6, further comprising, when the detected candidate keyword corresponds to a wake-up keyword:
a step of receiving, in response to the determination that the wake-up keyword is included in the first audio data, third audio data corresponding to a third section of the audio stream data, which is the section following the first section; and
a step of performing voice recognition on the third audio data, or transmitting the third audio data to an external device so that voice recognition is performed on the third audio data.

A computer program including instructions that cause a processor of a voice control device to execute the operation method according to any one of claims 6 to 8.