JP2000155600A

JP2000155600A - Speech recognition system and input voice level alarming method

Info

Publication number: JP2000155600A
Application number: JP10332115A
Authority: JP
Inventors: Toru Uchimura; 徹内村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-11-24
Filing date: 1998-11-24
Publication date: 2000-06-06

Abstract

PROBLEM TO BE SOLVED: To interactively give a speaker a spoken warning to talk at a proper voice level when the input voice level is outside the proper range, without requiring additional circuits. SOLUTION: In a speech recognition system that recognizes speech and responds by voice, an input voice level judging part 10, which judges the level of the input voice data, is provided in the CPU on which a speech recognizing part 20 is mounted, as a stage preceding the part 20. When the level of the input voice data is judged to be outside a reference range, the system interactively warns the speaker of the result of the judgment. When the level is judged to be within the reference range, the recognizing part 20 registers the input voice data if the system is in the registration mode, and recognizes the speech if the system is in the recognition mode.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition system and an input voice level warning method, and more particularly to a speech recognition system that recognizes speech input from a microphone and responds by voice from a loudspeaker, and to an input voice level warning method for that speech recognition system.

[0002]

2. Description of the Related Art

In speech recognition systems, the input voice level (meaning the loudness of the input; the same applies hereinafter) is known to have a large effect on the recognition rate. If the input voice level is too high, the input voice signal is clipped and no longer matches the original voice waveform. Conversely, if the input voice level is too low, the input voice signal is buried in ambient noise. In either case the speech recognition rate drops.

[0003]

As a countermeasure against this problem, some conventional speech recognition systems include an AGC (Automatic Gain Control) circuit that detects the input voice level and keeps it within an appropriate range.

[0004]

FIG. 8 is a block diagram showing an example of a conventional speech recognition system. Its main part consists of a microphone 81 that converts voice into an analog voice signal, an amplifier 82 that amplifies the analog voice signal output from the microphone 81, an AGC circuit 83 that detects the level of the analog voice signal output from the amplifier 82 and controls the gain of the amplifier 82, an A/D converter 84 that converts the analog voice signal output from the AGC circuit 83 into digital voice data, a CPU (Central Processing Unit) 85 that receives the digital voice data from the A/D converter 84 and performs speech recognition, and a memory 86 connected to the CPU 85.

[0005]

FIG. 9 is a block diagram showing another example of a conventional speech recognition system. Its main part consists of a microphone 91 that converts voice into an analog voice signal, an amplifier 92 that amplifies the analog voice signal output from the microphone 91, an A/D converter 93 that converts the analog voice signal output from the amplifier 92 into digital voice data, an AGC circuit 94 that detects the level of the digital voice signal output from the A/D converter 93 and controls the gain of the amplifier 92, a CPU 95 that receives the digital voice data from the AGC circuit 94 and performs speech recognition, and a memory 96 connected to the CPU 95.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

The first problem is that conventional speech recognition systems suffer a reduced recognition rate. This is because, if the AGC circuit automatically changes the gain of the amplifier partway through a voice input, the waveform of the input voice signal is deformed, which adversely affects the recognition rate.

[0007]

The second problem is that even when the AGC circuit controls the gain of the amplifier, the gain may not be appropriate for the next voice input, so speech recognition may fail without the speaker being aware of the cause. This is because, if the AGC circuit has changed the gain of the amplifier between registration and recognition, an utterance that the speaker believes was spoken in the same way may be recognized as a different word because of the different gain.

[0008]

The third problem is cost. This is because providing an AGC circuit adds that much extra circuitry.

[0009]

An object of the present invention is to provide a speech recognition system that requires no additional circuit and that, when the input voice level is not within an appropriate range, interactively warns the speaker by voice to speak at an appropriate input voice level.

[0010]

Another object of the present invention is to provide an input voice level warning method for a speech recognition system that, when the input voice level is not within an appropriate range, interactively warns the speaker by voice to speak at an appropriate input voice level.

[0011]

As a prior art document, there is Japanese Patent Laid-Open No. Hei 2-10500. The "voice recognition device" disclosed in that publication compares the input level with a threshold when voice is input, emits a short warning tone meaning that the input voice is quiet if the input level is below the threshold, and emits a short warning tone meaning that the input voice is loud if the input level is above the threshold. It therefore differs from the present invention, which gives the warning interactively in spoken words.

[0012]

MEANS FOR SOLVING THE PROBLEMS

The speech recognition system of the present invention is a speech recognition system that recognizes speech and responds by voice, characterized in that: an input voice level determination unit that determines the level of input voice data is provided in the digital arithmetic processing unit on which a speech recognition unit is mounted, as a stage preceding the speech recognition unit; when the input voice level determination unit determines that the level of the input voice data is outside a reference range, a warning to that effect is given interactively by voice; and when the input voice level determination unit determines that the level of the input voice data is within the reference range, the speech recognition unit performs speech recognition.

[0013]

The speech recognition system of the present invention is also a speech recognition system that recognizes speech and responds by voice, characterized in that: an input voice level determination unit that determines the level of input voice data is provided in the digital arithmetic processing unit on which a speech recognition unit is mounted, as a stage preceding the speech recognition unit; when the input voice level determination unit determines that the level of the input voice data is outside a reference range, a warning to that effect is given interactively by voice; and when the input voice level determination unit determines that the level of the input voice data is within the reference range, the speech recognition unit registers the input voice data if the system is in the registration mode for input voice data, and performs speech recognition if the system is in the recognition mode for input voice data.
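The control flow described in this paragraph can be sketched as follows. This is only an illustrative outline: the names `Mode`, `handle_utterance`, and the four callables passed in are hypothetical and do not appear in the publication, and the verdict strings are an assumed encoding of the level judgment.

```python
from enum import Enum


class Mode(Enum):
    REGISTER = "register"
    RECOGNIZE = "recognize"


def handle_utterance(samples, mode, judge_level, warn, register, recognize):
    """Run the level check first and reach the recognizer only if it passes."""
    verdict = judge_level(samples)   # assumed to return "ok", "too_loud" or "too_quiet"
    if verdict != "ok":
        warn(verdict)                # spoken warning; the recognizer is not run
        return "warned"
    if mode is Mode.REGISTER:
        register(samples)            # registration mode: store the utterance
        return "registered"
    recognize(samples)               # recognition mode: match against stored data
    return "recognized"
```

Because the level check sits in front of both registration and recognition, an out-of-range utterance never reaches either path in this sketch.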

[0014]

Further, the speech recognition system of the present invention is a speech recognition system in which an analog voice signal input from a microphone is A/D converted by an A/D converter and then input as input voice data to a digital arithmetic processing unit for speech recognition, and which responds by voice from a loudspeaker, characterized in that the digital arithmetic processing unit is provided, as a stage preceding a speech recognition unit, with an input voice level determination unit comprising: reference data comparing means for comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; and warning utterance means for causing the loudspeaker to utter a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range.

[0015]

Still further, the speech recognition system of the present invention is a speech recognition system in which an analog voice signal input from a microphone is A/D converted by an A/D converter and then input as input voice data to a digital arithmetic processing unit for speech recognition, and which responds by voice from a loudspeaker, characterized by comprising: reference data comparing means for comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; warning utterance means for causing the loudspeaker to utter a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range; frequency spectrum converting means for, when the reference data comparing means determines that the level of the input voice data is within the reference range, applying a fast Fourier transform to the input voice data frame by frame to decompose it into frequency components and obtain a frequency spectrum; change amount calculating means for calculating a per-frame change amount for the frequency components obtained by the frequency spectrum converting means; registration/recognition mode determining means for determining whether the system is in the registration mode or the recognition mode for input voice data; time direction compression means for, when the registration/recognition mode determining means determines that the system is in the registration mode, compressing in the time direction the frequency components obtained by the frequency spectrum converting means and the per-frame change amounts calculated by the change amount calculating means, and registering the result as registered voice data; collating means for, when the registration/recognition mode determining means determines that the system is in the recognition mode, collating the input voice data with the registered voice data frame by frame by the DP matching method to calculate a distance value; collation result determination means for comparing the distance value calculated by the collating means with a threshold registered in advance to determine whether the distance value is equal to or less than the threshold; and response utterance means for taking as the recognition result the registered voice data whose distance value the collation result determination means has determined to be equal to or less than the threshold, and causing the loudspeaker to utter response voice data that responds to that registered voice data.

[0016]

On the other hand, the input voice level warning method of the present invention is an input voice level warning method for a speech recognition system in which an analog voice signal input from a microphone is A/D converted by an A/D converter and then input as input voice data to a digital arithmetic processing unit for speech recognition, and which responds by voice from a loudspeaker, characterized by including: a reference data comparing step of comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; and a warning utterance step of causing the loudspeaker to utter a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range.

[0017]

The input voice level warning method of the present invention is also characterized by including: a reference data comparing step of comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; a warning utterance step of causing the loudspeaker to utter a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range; a frequency spectrum converting step of, when the reference data comparing step determines that the level of the input voice data is within the reference range, applying a fast Fourier transform to the input voice data frame by frame to decompose it into frequency components and obtain a frequency spectrum; a change amount calculating step of calculating a per-frame change amount for the frequency components obtained in the frequency spectrum converting step; a registration/recognition mode determining step of determining whether the system is in the registration mode or the recognition mode for input voice data; a time direction compression step of, when the registration/recognition mode determining step determines that the system is in the registration mode, compressing in the time direction the frequency components obtained in the frequency spectrum converting step and the per-frame change amounts calculated in the change amount calculating step, and registering the result as registered voice data; a collating step of, when the registration/recognition mode determining step determines that the system is in the recognition mode, collating the input voice data with the registered voice data frame by frame by the DP matching method to calculate a distance value; a collation result determination step of comparing the distance value calculated in the collating step with a threshold registered in advance to determine whether the distance value is equal to or less than the threshold; and a response utterance step of taking as the recognition result the registered voice data whose distance value the collation result determination step has determined to be equal to or less than the threshold, and causing the loudspeaker to utter response voice data that responds to that registered voice data.

[0018]

On the other hand, the recording medium of the present invention records a program for causing a computer to function as: reference data comparing means for comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; warning utterance means for causing the loudspeaker to utter a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range; frequency spectrum converting means for, when the reference data comparing means determines that the level of the input voice data is within the reference range, applying a fast Fourier transform to the input voice data frame by frame to decompose it into frequency components and obtain a frequency spectrum; change amount calculating means for calculating a per-frame change amount for the frequency components obtained by the frequency spectrum converting means; registration/recognition mode determining means for determining whether the system is in the registration mode or the recognition mode for input voice data; time direction compression means for, when the registration/recognition mode determining means determines that the system is in the registration mode, compressing in the time direction the frequency components obtained by the frequency spectrum converting means and the per-frame change amounts calculated by the change amount calculating means, and registering the result as registered voice data; collating means for, when the registration/recognition mode determining means determines that the system is in the recognition mode, collating the input voice data with the registered voice data frame by frame by the DP matching method to calculate a distance value; collation result determination means for comparing the distance value calculated by the collating means with a threshold registered in advance to determine whether the distance value is equal to or less than the threshold; and response utterance means for taking as the recognition result the registered voice data whose distance value the collation result determination means has determined to be equal to or less than the threshold, and causing the loudspeaker to utter response voice data that responds to that registered voice data.

[0019]

The recording medium of the present invention also records a program for causing a computer to execute: a reference data comparing step of comparing the input voice data with upper-limit reference data and lower-limit reference data to determine whether the level of the input voice data is within a reference range; a warning utterance step of causing the loudspeaker to utter a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range; a frequency spectrum converting step of, when the reference data comparing step determines that the level of the input voice data is within the reference range, applying a fast Fourier transform to the input voice data frame by frame to decompose it into frequency components and obtain a frequency spectrum; a change amount calculating step of calculating a per-frame change amount for the frequency components obtained in the frequency spectrum converting step; a registration/recognition mode determining step of determining whether the system is in the registration mode or the recognition mode for input voice data; a time direction compression step of, when the registration/recognition mode determining step determines that the system is in the registration mode, compressing in the time direction the frequency components obtained in the frequency spectrum converting step and the per-frame change amounts calculated in the change amount calculating step, and registering the result as registered voice data; a collating step of, when the registration/recognition mode determining step determines that the system is in the recognition mode, collating the input voice data with the registered voice data frame by frame by the DP matching method to calculate a distance value; a collation result determination step of comparing the distance value calculated in the collating step with a threshold registered in advance to determine whether the distance value is equal to or less than the threshold; and a response utterance step of taking as the recognition result the registered voice data whose distance value the collation result determination step has determined to be equal to or less than the threshold, and causing the loudspeaker to utter response voice data that responds to that registered voice data.

[0020]

DESCRIPTION OF THE EMBODIMENTS

Embodiments of the present invention will be described below in detail with reference to the drawings.

[0021]

FIG. 1 is a block diagram showing the configuration of the speech recognition system according to the first embodiment of the present invention. The main part of the speech recognition system according to this embodiment consists of a CPU 1, a memory 3, a microphone 4, an amplifier 5, an A/D converter 6, a D/A converter 7, an amplifier 8, and a loudspeaker 9.

[0022]

The CPU 1 is provided with an input voice level determination unit 10 and a speech recognition unit 20. As the CPU 1, for example, the V831 (μPD705101) manufactured by NEC Corporation, a 32-bit CPU with an external operating frequency of 33 MHz and an internal operating frequency of 100 MHz, can be used.

[0023]

The input voice level determination unit 10 consists of reference data comparing means 11 and warning utterance means 12.

[0024]

The speech recognition unit 20 includes frequency spectrum converting means 21, change amount calculating means 22, registration/recognition mode determining means 23, time direction compression means 24, collating means 25, collation result determination means 26, and response utterance means 27.

[0025]

The memory 3 is provided with a reference data area 31, a warning voice data area 32, a registered voice data area 33, a threshold area 34, and a response voice data area 35.

[0026]

The reference data comparing means 11 compares the input voice data with the reference data in the reference data area 31 to determine whether the input voice data is within the reference range. The reference data are stored in advance in the reference data area 31 of the memory 3 as 16-bit data: the reference data for detecting the upper limit level (hereinafter, upper-limit reference data) are 7FFFH (the trailing H denotes hexadecimal; the same applies hereinafter) or 8000H (that is, when the signal exceeds, for example, 2 V peak-to-peak), and the reference data for detecting the lower limit level (hereinafter, lower-limit reference data) are 1666H or E999H (when the signal is, for example, 350 mV peak-to-peak or less). The input voice data are compared with the reference data in raw form. That is, if during one sampling the signal exceeds 2 V even momentarily (that is, when clipping occurs), or never once exceeds 350 mV, the reference data comparing means 11 passes control to the warning utterance means 12.
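As an illustration of this comparison, the sketch below classifies one utterance of signed 16-bit samples using the four hexadecimal reference values quoted above, with 8000H and E999H read as two's-complement negatives. The function name `judge_level` and the verdict strings are hypothetical. The final line checks the quoted voltages under the assumption that the full 16-bit range corresponds to 2 V peak-to-peak; under that assumption the band between E999H and 1666H comes to about 350 mV.

```python
UPPER_POS = 0x7FFF             # +32767, positive full scale
UPPER_NEG = -0x8000            # 8000H read as signed 16-bit: -32768
LOWER_POS = 0x1666             # +5734
LOWER_NEG = 0xE999 - 0x10000   # E999H read as signed 16-bit: -5735


def judge_level(samples):
    """Classify one utterance of signed 16-bit samples against the reference data."""
    if any(s >= UPPER_POS or s <= UPPER_NEG for s in samples):
        return "too_loud"      # reached full scale at least once, i.e. clipped
    if all(LOWER_NEG <= s <= LOWER_POS for s in samples):
        return "too_quiet"     # never left the lower band
    return "ok"


# Width of the lower band in volts, assuming 65536 codes span 2 V peak-to-peak.
lower_band_volts = (LOWER_POS - LOWER_NEG) / 65536 * 2.0
```

A single full-scale sample is enough to trigger the loud verdict, whereas the quiet verdict requires every sample to stay inside the lower band, which mirrors the "even momentarily" and "never once" wording of the paragraph.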

[0027]

The warning utterance means 12 outputs the applicable warning voice data, stored in advance in the warning voice data area 32, to the D/A converter 7 when the reference data comparing means 11 determines that the input voice data is not within the reference range between the upper-limit reference data and the lower-limit reference data. Two items of warning voice data are stored in advance in the warning voice data area 32 of the memory 3: "Your voice is too loud" for the case where the input exceeds the upper-limit reference data, and "Your voice is too quiet" for the case where it is below the lower-limit reference data.
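The selection of the warning can be sketched as a simple lookup. Here plain text stands in for the stored warning voice data, the English phrases are translations of the two stored messages, and the names `WARNINGS` and `select_warning` as well as the verdict keys are hypothetical.

```python
# Text stand-ins for the two items held in the warning voice data area.
WARNINGS = {
    "too_loud": "Your voice is too loud.",
    "too_quiet": "Your voice is too quiet.",
}


def select_warning(verdict):
    """Return the stored warning for an out-of-range verdict, or None if in range."""
    return WARNINGS.get(verdict)
```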

[0028]

The frequency spectrum converting means 21 applies a fast Fourier transform to the input voice data frame by frame, decomposing it into frequency components to obtain a frequency spectrum, when the reference data comparing means 11 determines that the input voice data is within the reference range between the upper-limit reference data and the lower-limit reference data. The input voice data is quantized (sampled) at 11 kHz and is fast-Fourier-transformed into a frequency spectrum using, for example, 16 ms of data as one frame.
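With 11 kHz sampling and 16 ms frames, each frame holds 176 samples. The sketch below splits a signal into such frames and computes a magnitude spectrum per frame with a textbook recursive radix-2 FFT. Zero-padding each frame to 256 points, using non-overlapping frames, and omitting a window function are assumptions made for this illustration; the publication does not state these details.

```python
import cmath

SAMPLE_RATE_HZ = 11_000
FRAME_MS = 16
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 176 samples per frame
FFT_SIZE = 256                                   # next power of two (assumption)


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_spectra(samples):
    """Split samples into 16 ms frames and return one magnitude spectrum per frame."""
    spectra = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = list(samples[start:start + FRAME_LEN])
        frame += [0] * (FFT_SIZE - FRAME_LEN)    # zero-pad to the FFT size
        bins = fft(frame)
        spectra.append([abs(b) for b in bins[:FFT_SIZE // 2]])
    return spectra
```

Each spectrum keeps only the first 128 bins, since the upper half of the FFT of a real signal mirrors the lower half.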

[0029]

The change amount calculating means 22 calculates a per-frame change amount (cepstrum) for the frequency components obtained by the frequency spectrum converting means 21.
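The publication does not give the formula for the per-frame change amount. Purely as an illustration of what a frame-to-frame change feature can look like, the sketch below takes the element-wise difference between each frame's feature vector and the previous one. Both the function name `frame_deltas` and the choice of a first-order difference are assumptions, and the sketch does not compute a cepstrum.

```python
def frame_deltas(frames):
    """Return, for each frame, the element-wise change from the previous frame."""
    deltas = []
    previous = None
    for frame in frames:
        if previous is None:
            deltas.append([0.0] * len(frame))   # the first frame has no predecessor
        else:
            deltas.append([cur - prev for cur, prev in zip(frame, previous)])
        previous = frame
    return deltas
```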

[0030]

The registration/recognition mode determining means 23 determines, based on the operation of a switch (not shown) or the like, whether the system is in the registration mode or the recognition mode for input voice data.

[0031]

The time direction compression means 24 compresses the frequency components and the per-frame change amounts to 1/2 in the time direction and registers the result in the registered voice data area 33 of the memory 3 as registered voice data (speaker, word, and so on) when the registration/recognition mode determining means 23 determines that the system is in the registration mode for input voice data.
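The publication states only that the features are compressed to 1/2 in the time direction, not how. One simple way to halve a frame sequence, shown here as an assumption, is to average each pair of adjacent frames and keep a trailing unpaired frame unchanged. The function name `compress_time_by_half` is hypothetical.

```python
def compress_time_by_half(frames):
    """Halve the number of frames by averaging each adjacent pair."""
    out = []
    for i in range(0, len(frames) - 1, 2):
        a, b = frames[i], frames[i + 1]
        out.append([(x + y) / 2 for x, y in zip(a, b)])
    if len(frames) % 2 == 1:
        out.append(list(frames[-1]))   # odd count: keep the last frame unchanged
    return out
```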

[0032] The collating means 25 is a means that, when the registration/recognition mode determining means 23 determines that the system is in the recognition mode for input voice data, collates the input voice data with the registered voice data in the registered voice data area 33 frame by frame using the DP (Dynamic Programming) matching method. The DP matching method selects the most similar registered voice data by matching (performing a distance calculation between) the feature pattern obtained by analyzing the input voice data and the feature pattern of the registered voice data while aligning the two along the time axis.
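DP matching of this kind is commonly implemented as dynamic time warping. The following is a minimal sketch under that assumption, using the sum of absolute differences as the per-frame distance; the local distance, the allowed path steps, and the names `frame_distance`, `dp_matching_distance`, and `best_match` are illustrative choices, not details taken from the specification.

```python
def frame_distance(a, b):
    """Per-frame distance: sum of absolute differences (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_matching_distance(input_frames, registered_frames):
    """Accumulated distance along the cheapest monotonic time alignment."""
    n, m = len(input_frames), len(registered_frames)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_frames[i - 1], registered_frames[j - 1])
            # Allowed moves: repeat an input frame, repeat a registered frame, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(input_frames, registry):
    """Return (distance, label) of the most similar entry in a non-empty registry."""
    return min((dp_matching_distance(input_frames, frames), label)
               for label, frames in registry.items())
```

Because the alignment may stretch or shrink the time axis, an utterance spoken more slowly than the registered one can still reach a small distance.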

[0033] The collation result determination means 26 is a means that compares the distance value obtained by the collating means 25 with a threshold value stored in advance in the threshold area 34 of the memory 3 and determines whether the distance value is equal to or less than the threshold value.

[0034] The response utterance means 27 is a means that, if there is registered voice data for which the collation result determination means 26 has determined that the distance value is equal to or less than the threshold value, takes that data as the recognition result, selectively reads response voice data that responds to this registered voice data from the response voice data area 35 of the memory 3, and outputs it to the D/A converter 7. The response utterance means 27 also has a learning function (a function that makes the responses more appropriate the more the system is spoken to), a clock function (a function that replies with the time when asked for it), and the like.

[0035] Referring to FIG. 2, the processing of the voice recognition system according to the first embodiment consists of an input voice data input step S101, a reference data comparison step S102, a warning voice data output step S103, an input voice data writing step S104, a frequency spectrum conversion step S105, a change amount calculation step S106, a registration mode determination step S107, a time direction compression and registration step S108, a data collation step S109, a collation result determination step S110, and a response voice data output step S111.

[0036] Referring to FIG. 3, the more detailed processing of the reference data comparison step S102 and the warning voice data output step S103 consists of a step S201 of determining whether the data is larger than the upper limit reference data, a warning voice data output step S202, a step S203 of determining whether the data is smaller than the lower limit reference data, and a warning voice data output step S204.

[0037] FIGS. 4(a) to 4(d) are diagrams explaining the sequential steps performed by the voice recognition unit 20 when registering input voice data.

[0038] FIG. 5 is a diagram explaining the DP matching method applied by the collating means 25 to the input voice data and the registered voice data.

[0039] Next, the operation of the voice recognition system according to the first embodiment configured as described above will be described with reference to FIGS. 1 to 5.

[0040] (1) Operation when registering input voice data

[0041] When the speaker speaks while the registration mode is selected by operating a switch or the like (not shown), the microphone 4 picks up the speech and converts it into an analog voice signal, and the amplifier 5 amplifies the analog voice signal.

[0042] The A/D converter 6 samples and digitizes the analog voice signal amplified by the amplifier 5 using, for example, 11 kHz PCM (Pulse Code Modulation), and converts it into 16-bit input voice data. The format of this input voice data is, for example, two's complement, in which an input voltage at or above positive full scale is represented by 7FFFH, an input voltage at or below negative full scale by 8000H, and no input by 0000H. The input voltage range is, for example, 2 V peak to peak.
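The sample format described above can be illustrated with a short sketch. It assumes that the 2 V peak-to-peak range corresponds to a full scale of plus or minus 1 V and that the voltage is scaled linearly by 32768 before rounding and saturation; `quantize_16bit` and `as_hex_word` are hypothetical helper names.

```python
def quantize_16bit(voltage, full_scale=1.0):
    """Map a voltage to a saturating 16-bit two's-complement code (assumed linear scaling)."""
    code = round(voltage / full_scale * 32768)
    return max(-32768, min(32767, code))


def as_hex_word(code):
    """Show a signed 16-bit code as the 4-digit hexadecimal word used in the text."""
    return f"{code & 0xFFFF:04X}H"
```

For instance, `as_hex_word(quantize_16bit(1.5))` gives `7FFFH`, `as_hex_word(quantize_16bit(-1.5))` gives `8000H`, and `as_hex_word(quantize_16bit(0.0))` gives `0000H`, matching the three cases listed in the paragraph.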

[0043] When the CPU 1 reads the input voice data (step S101), the reference data comparing means 11 compares the digital voice data input from the A/D converter 6 with the upper limit reference data and the lower limit reference data stored in the reference data area 31 of the memory 3 (step S102). If the data is outside the reference range, the warning utterance means 12 outputs the corresponding warning voice data in the warning voice data area 32 of the memory 3 to the D/A converter 7 (step S103).

[0044] More specifically, the reference data comparing means 11 determines whether the input voice data is larger than the upper limit reference data stored in the reference data area 31 (step S201). If it is larger, the warning utterance means 12 reads the corresponding warning voice data, "Your voice is too loud", from the warning voice data area 32 of the memory 3 and outputs it to the D/A converter 7 (step S202).

[0045] As a result, the warning voice data is converted into an analog warning voice signal by the D/A converter 7, amplified by the amplifier 8, and emitted from the speaker 9 as the spoken message "Your voice is too loud". On hearing this warning, the speaker learns that the uttered voice exceeds the appropriate input voice level and can therefore speak more quietly and re-input the voice at an appropriate input voice level.

[0046] Next, the reference data comparing means 11 determines whether the input voice data is smaller than the lower limit reference data stored in the reference data area 31 (step S203). If it is smaller, the warning utterance means 12 reads the corresponding warning voice data, "Your voice is too quiet", from the warning voice data area 32 of the memory 3 and outputs it to the D/A converter 7 (step S204).

[0047] As a result, the warning voice data is converted into an analog warning voice signal by the D/A converter 7, amplified by the amplifier 8, and emitted from the speaker 9 as the spoken message "Your voice is too quiet". On hearing this warning, the speaker learns that the uttered voice is below the appropriate input voice level and can therefore speak more loudly and re-input the voice at an appropriate input voice level.
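Steps S201 to S204 can be summarized in a small sketch. The specification does not say how a single "level" is derived from the input voice data, so the peak absolute sample value is used here as an assumption; the limit values passed in are placeholders, and `check_input_level` is a hypothetical name.

```python
TOO_LOUD = "Your voice is too loud"
TOO_QUIET = "Your voice is too quiet"


def check_input_level(samples, lower_limit, upper_limit):
    """Return the warning to utter, or None when the level is within the reference range."""
    level = max((abs(s) for s in samples), default=0)  # assumed level measure: peak amplitude
    if level > upper_limit:   # S201 -> S202
        return TOO_LOUD
    if level < lower_limit:   # S203 -> S204
        return TOO_QUIET
    return None               # within range: processing continues with S104
```

A value exactly equal to either limit counts as within range here, following the "larger than" and "smaller than" wording of steps S201 and S203.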

[0048] On the other hand, when the reference data comparing means 11 determines that the input voice data is neither larger than the upper limit reference data nor smaller than the lower limit reference data, that is, that the input voice data is within the reference range between the upper limit reference data and the lower limit reference data, the CPU 1 writes the 16-bit input voice data to a work area (not shown) of the memory 3 at every sampling (step S104).

[0049] Next, the CPU 1 uses the frequency spectrum converting means 21 to perform a fast Fourier transform on the input voice data, decompose it into frequency components, and convert it into a frequency spectrum, as shown in FIGS. 4(a) and 4(b) (step S105). For example, the 16-bit input voice data is grouped into 16 ms units (called frames), and all of the input voice data is decomposed into frequency components.

[0050] Subsequently, the CPU 1 uses the change amount calculating means 22 to calculate the amount of change (cepstrum) for each frame, as shown in FIGS. 4(b) and 4(c) (step S106).

[0051] Next, the CPU 1 uses the registration/recognition mode determining means 23 to determine whether the system is in the registration mode or the recognition mode for input voice data (step S107).

[0052] Since the system is now in the registration mode for input voice data, the CPU 1 uses the time direction compression means 24 to compress the frequency components and the per-frame amounts of change to 1/2 in the time direction, as shown in FIGS. 4(c) and 4(d), and registers them as registered voice data in the registered voice data area 33 of the memory 3 (step S108). This completes the registration of the input voice data (registration of the speaker, word, and the like).

[0053] (2) Operation when collating input voice data

[0054] When the speaker speaks while the recognition mode is selected by operating a switch or the like (not shown), then, as in the registration mode, the microphone 4 picks up the speech and converts it into an analog voice signal, the amplifier 5 amplifies the analog voice signal, and the A/D converter 6 converts it into 16-bit input voice data.

[0055] When the CPU 1 receives the input voice data (step S101), the reference data comparing means 11 compares the input voice data with the upper limit reference data and the lower limit reference data stored in the reference data area 31 of the memory 3 (step S102).

[0056] The processing for comparing the input voice data with the upper limit reference data and the lower limit reference data is exactly the same as in the registration mode. When the input voice data is determined not to be within the reference range between the upper limit reference data and the lower limit reference data, the speaker 9 emits "Your voice is too loud" or "Your voice is too quiet" as a warning. A detailed description of this operation is omitted here to avoid repetition.

[0057] When the reference data comparing means 11 determines that the input voice data is within the reference range between the upper limit reference data and the lower limit reference data, the CPU 1, as in the registration mode, writes the 16-bit input voice data to the work area (not shown) at every sampling (step S104), uses the frequency spectrum converting means 21 to perform a fast Fourier transform on the input voice data, decompose it into frequency components, and convert it into a frequency spectrum (step S105), and uses the change amount calculating means 22 to calculate the amount of change (cepstrum) for each frame (step S106).

[0058] Next, the CPU 1 uses the registration/recognition mode determining means 23 to determine whether the system is in the registration mode or the recognition mode for input voice data (step S107).

[0059] Since the system is now in the recognition mode for input voice data, the CPU 1 uses the collating means 25 to check, by the DP matching method as shown in FIG. 5, which frame of which registered voice data in the registered voice data area 33 each frame of the input voice data is closest to, and takes the difference from the closest one. This difference is taken for all frames of the input voice data, and the sum of all the differences is used as the distance value (step S109).
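Read literally, this paragraph sums, over all input frames, the difference to the closest registered frame. A minimal sketch of that literal reading follows, without the time-axis alignment constraint that DP matching normally imposes; the absolute-difference metric and the name `nearest_frame_distance` are assumptions.

```python
def nearest_frame_distance(input_frames, registered_frames):
    """Sum, over the input frames, of the distance to the closest registered frame.

    registered_frames must be non-empty; otherwise min() raises ValueError.
    """
    total = 0.0
    for frame in input_frames:
        total += min(
            sum(abs(x - y) for x, y in zip(frame, ref))  # assumed per-frame difference
            for ref in registered_frames
        )
    return total
```

This value would be computed once per registered entry, and the entry with the smallest value becomes the candidate passed on to the threshold comparison.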

[0060] Next, the CPU 1 uses the collation result determination means 26 to determine a mismatch when the distance value is larger than the threshold value stored in advance in the threshold area 34 of the memory 3, and a match when the distance value is equal to or less than the threshold value (step S110).

[0061] When the collation result determination means 26 determines a match, the CPU 1 uses the response utterance means 27 to read the appropriate response voice data from the response voice data area 35 of the memory 3 and output it to the D/A converter 7 (step S111).

[0062] As a result, the response voice data is converted into an analog voice signal by the D/A converter 7, amplified by the amplifier 8, and uttered from the speaker 9 as a response voice. For example, in response to the input voice "Hello", a response voice such as "How are you?" is uttered. On hearing this response voice, the speaker learns that the voice recognition system is responding interactively to the uttered voice.

[0063] When the collation result determination means 26 determines a mismatch, this means that registered voice data (speaker, word, and the like) corresponding to the input voice data has not been registered, so the CPU 1 ignores the input and does nothing.
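The decision in steps S110 and S111, together with the mismatch case, can be sketched as follows. The dictionaries stand in for the registered voice data and the response voice data area 35; the data layout and the name `respond` are assumptions made for illustration.

```python
def respond(distances, threshold, responses):
    """Pick the closest registered entry and return its response, or None on a mismatch.

    distances: label -> distance value from the collation step
    responses: label -> response to utter (stand-in for response voice data)
    """
    if not distances:
        return None
    label = min(distances, key=distances.get)
    if distances[label] <= threshold:  # match: distance at or below the threshold (S110)
        return responses.get(label)    # response to output (S111)
    return None                        # mismatch: the input is ignored
```

For example, `respond({"hello": 3.0, "bye": 9.0}, 5.0, {"hello": "How are you?"})` returns `"How are you?"`, while the same call with a threshold of 2.0 returns `None`.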

[0064] FIG. 6 is a flowchart showing a modification of the more detailed processing of the reference data comparison step S102 and the warning voice data output step S103 shown in FIG. 3. In the processing shown in FIG. 3, the input voice data is compared with the upper limit reference data and the lower limit reference data, and warning voice data is output when the data is not within the reference range. In the processing shown in FIG. 6, by contrast, three items of reference data (first, second, and third reference data) are provided, the level of the input voice data is divided into four ranges, and a different item of warning voice data (first, second, third, or fourth warning voice data) is output for each range. That is, the more detailed processing of the reference data comparison step S102 and the warning voice data output step S103 shown in FIG. 6 consists of a first reference data comparison step S301, a first warning voice data output step S302, a second reference data comparison step S303, a second warning voice data output step S304, a third reference data comparison step S305, a third warning voice data output step S306, and a fourth warning voice data output step S307.

[0065] When the more detailed processing of the reference data comparison step S102 and the warning voice data output step S103 shown in FIG. 3 is replaced by the processing shown in FIG. 6, control passes to step S104 in FIG. 2 after the processing of FIG. 6 ends. In this case, voice recognition by the voice recognition unit 20 is performed even if the level of the input voice data is outside the appropriate range.
In such a case, the voice recognition by the voice recognition unit 20 is performed even if the level of the input voice data is out of the appropriate range.

[0066] For example, when the input voice data takes values from 0000H to FFFFH, the first reference data can be set to 4000H, the second reference data to 8000H, and the third reference data to C000H, so that the level of the input voice data is divided into the four ranges 0000H to 3FFFH, 4000H to 7FFFH, 8000H to BFFFH, and C000H to FFFFH, with a different warning voice for each range.
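The four-range division in this example can be sketched with a sorted-threshold lookup. The level is treated as an unsigned 16-bit value, as the example implies, and the assignment of the first to fourth warnings to the ranges in ascending order is an assumption, since the text does not state which warning belongs to which range; `select_warning` is a hypothetical name.

```python
from bisect import bisect_right

REFERENCE_DATA = (0x4000, 0x8000, 0xC000)  # first, second, third reference data from the example
WARNINGS = ("first warning", "second warning", "third warning", "fourth warning")


def select_warning(level, references=REFERENCE_DATA, warnings=WARNINGS):
    """Map an unsigned 16-bit level to one of four warnings (assumed ascending order)."""
    return warnings[bisect_right(references, level)]
```

With these values, 3FFFH falls in the first range and 4000H in the second, which matches the range boundaries given in the paragraph.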

[0067] Next, a second embodiment of the present invention will be described with reference to the drawings.

[0068] Referring to FIG. 7, the voice recognition system according to the second embodiment of the present invention differs from the voice recognition system according to the first embodiment shown in FIG. 1 in that it includes a recording medium 40 on which an input voice level determination program and a voice recognition program are recorded. The recording medium 40 may be a magnetic disk, a semiconductor memory, or another recording medium.

[0069] The input voice level determination program and the voice recognition program are read from the recording medium 40 into the CPU 1 and control the operation of the CPU 1 as the input voice level determination unit 10 and the voice recognition unit 20. The operation of the CPU 1 and the other components under the control of the input voice level determination unit 10 and the voice recognition unit 20 is exactly the same as in the voice recognition system according to the first embodiment, so a detailed description is omitted.

[0070] In each of the above embodiments, the microphone 4 and the amplifier 5 have been described as separate components, but it goes without saying that the microphone 4 and the amplifier 5 may be integrated.

[0071] Likewise, the amplifier 8 and the speaker 9 have been described as separate components, but it goes without saying that the amplifier 8 and the speaker 9 may be integrated.

[0072] Furthermore, the amplifier 5, the A/D converter 6, the CPU 1, the D/A converter 7, and the amplifier 8 have been described as separate components, but it goes without saying that they may be integrated into a single LSI (Large Scale Integrated circuit).

[0073] Furthermore, the amplifier 5, the A/D converter 6, the CPU 1, the memory 3, the D/A converter 7, and the amplifier 8 have been described as separate components, but it goes without saying that they may be integrated as a so-called system LSI.

[0074] In each of the above embodiments, the case where the voice recognition system is a specific-speaker recognition system has been described as an example, but it goes without saying that the present invention can be applied in the same way to a general-purpose speaker recognition system. That is, the feature of the present invention is that an input voice level determination unit is provided in the arithmetic processing device equipped with the voice recognition unit so as to precede the voice recognition unit, and this holds regardless of the voice recognition method used in the voice recognition unit.

[0075]

[Effects of the Invention] The first effect is that the recognition rate does not fall, because the waveform of the input voice signal is not distorted partway through voice input. The reason is that no AGC circuit is used, so the level of the input voice signal does not change during voice input and adversely affect the recognition rate.

[0076] The second effect is that voice input within the reference range becomes easy. The reason is that interactively warning the speaker by voice that the input voice level is too low or too high prompts the speaker to re-input the voice within the reference range.

[0077] The third effect is that no additional circuit is required and no extra cost is incurred. The reason is that the arithmetic processing device used for voice recognition in a typical voice recognition system also detects the input voice level and issues the warning.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a voice recognition system according to a first embodiment of the present invention.

FIG. 2 is a flowchart showing the processing of the voice recognition system according to the first embodiment.

FIG. 3 is a flowchart showing the more detailed processing of the reference data comparison step and the warning voice data output step in FIG. 2.

FIGS. 4(a) to 4(d) are diagrams explaining the sequential steps performed by the voice recognition unit in FIG. 1 when registering input voice data.

FIG. 5 is a diagram explaining collation by the DP matching method performed by the collating means in FIG. 1.

FIG. 6 is a flowchart showing a modification of the more detailed processing of the reference data comparison step and the warning voice data output step shown in FIG. 2.

FIG. 7 is a block diagram showing the configuration of a voice recognition system according to a second embodiment of the present invention.

FIG. 8 is a block diagram showing an example of a conventional voice recognition system.

FIG. 9 is a block diagram showing another example of a conventional voice recognition system.

[Explanation of symbols]

1 CPU
3 Memory
4 Microphone
5 Amplifier
6 A/D converter
7 D/A converter
8 Amplifier
9 Speaker
10 Input voice level determination unit
11 Reference data comparing means
12 Warning utterance means
20 Voice recognition unit
21 Frequency spectrum converting means
22 Change amount calculating means
23 Registration/recognition mode determining means
24 Time direction compression means
25 Collating means
26 Collation result determination means
27 Response utterance means
31 Reference data area
32 Warning voice data area
33 Registered voice data area
34 Threshold area
35 Response voice data area
40 Recording medium
S101 Input voice data input step
S102 Reference data comparison step
S103 Warning voice data output step
S104 Input voice data writing step
S105 Frequency spectrum conversion step
S106 Change amount calculation step
S107 Registration mode determination step
S108 Time direction compression and registration step
S109 Data collation step
S110 Collation result determination step
S111 Response voice data output step
S201 Step of determining whether the data is larger than the upper limit reference data
S202 Warning voice data output step
S203 Step of determining whether the data is smaller than the lower limit reference data
S204 Warning voice data output step
S301 First reference data comparison step
S302 First warning voice data output step
S303 Second reference data comparison step
S304 Second warning voice data output step
S305 Third reference data comparison step
S306 Third warning voice data output step
S307 Fourth warning voice data output step

Claims

[Claims]

1. A voice recognition system that recognizes voice and responds by voice, wherein an input voice level determination unit for determining the level of input voice data is provided in a digital arithmetic processing device equipped with a voice recognition unit, so as to precede the voice recognition unit; when the input voice level determination unit determines that the level of the input voice data is out of a reference range, a warning to that effect is given interactively by voice; and when the input voice level determination unit determines that the level of the input voice data is within the reference range, voice recognition is performed by the voice recognition unit.

2. A voice recognition system that recognizes voice and responds by voice, wherein an input voice level determination unit for determining the level of input voice data is provided in a digital arithmetic processing device equipped with a voice recognition unit, so as to precede the voice recognition unit; when the input voice level determination unit determines that the level of the input voice data is out of a reference range, a warning to that effect is given interactively by voice; and when the input voice level determination unit determines that the level of the input voice data is within the reference range, the voice recognition unit registers the input voice data if the system is in a registration mode for input voice data and performs voice recognition if the system is in a recognition mode for input voice data.

3. A voice recognition system in which an analog voice signal input from a microphone is A/D-converted by an A/D converter and then input as input voice data to a digital arithmetic processing device, which performs voice recognition and responds by voice from a speaker, wherein the digital arithmetic processing device has, preceding a voice recognition unit, an input voice level determination unit comprising: reference data comparing means for comparing the input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range; and warning utterance means for uttering, from the speaker, a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range.

4. A voice recognition system in which an analog voice signal input from a microphone is A/D-converted by an A/D converter and then input as input voice data to a digital processing unit that performs voice recognition, and which responds by voice from a speaker, the system comprising:
reference data comparing means for comparing the input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range;
warning voice generating means for uttering, from the speaker, a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range;
frequency spectrum converting means for, when the reference data comparing means determines that the level of the input voice data is within the reference range, performing a fast Fourier transform on the input voice data frame by frame to decompose it into frequency components and convert it into a frequency spectrum;
change amount calculating means for calculating, for each frame, a change amount of the frequency components converted into the frequency spectrum by the frequency spectrum converting means;
registration/recognition mode determining means for determining whether the mode is a registration mode or a recognition mode of the input voice data;
time direction compression means for, when the registration/recognition mode determining means determines that the mode is the registration mode, compressing in the time direction the frequency components decomposed by the frequency spectrum converting means and the per-frame change amounts calculated by the change amount calculating means, and registering them as registered voice data;
matching means for, when the registration/recognition mode determining means determines that the mode is the recognition mode, matching the input voice data against the registered voice data frame by frame using the DP matching method to calculate a distance value;
comparison result determining means for comparing the distance value calculated by the matching means with a pre-registered threshold value to determine whether the distance value is equal to or less than the threshold value; and
response speech means for taking, as the recognition result, the registered voice data whose distance value is determined to be equal to or less than the threshold value by the comparison result determining means, and uttering from the speaker response voice data that responds to that registered voice data.
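
As a non-limiting illustration of the frame-by-frame fast Fourier transform and the per-frame change amount recited above, the following is a minimal Python sketch. The function names, the frame length of 8 samples, the use of magnitude spectra, and the definition of the change amount as the difference between consecutive frames are all assumptions made for illustration; the claim does not specify them.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_spectra(samples, frame_len=8):
    """Split samples into non-overlapping frames (assumed framing) and
    return the magnitude of bins 0..frame_len/2 for each frame."""
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        spectra.append([abs(c) for c in fft(frame)[:frame_len // 2 + 1]])
    return spectra


def change_amounts(spectra):
    """Assumed definition of the change amount: per-bin difference
    between each frame and the frame before it."""
    return [[cur - prev for prev, cur in zip(p, c)]
            for p, c in zip(spectra, spectra[1:])]
```

For a constant frame followed by a frame alternating between +1 and -1, the energy moves from the lowest bin to the highest bin, and `change_amounts` reports that shift as a negative value in bin 0 and a positive value in bin 4.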

5. An input voice level warning method for a voice recognition system in which an analog voice signal input from a microphone is A/D-converted by an A/D converter and then input as input voice data to a digital processing unit that performs voice recognition, and which responds by voice from a speaker, the method comprising:
a reference data comparing step of comparing the input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range; and
a warning uttering step of uttering, from the speaker, a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range.
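
As a non-limiting illustration, the reference data comparing step and the warning uttering step might be sketched in Python as follows. The claim does not state how the level is measured or what the warning says, so the peak-amplitude measure, the function names, and the warning texts below are assumptions for illustration only.

```python
# Hypothetical warning texts; the claim only requires that a warning be uttered.
WARNINGS = {
    "too_loud": "Please speak more quietly.",
    "too_quiet": "Please speak louder.",
}


def check_input_level(samples, lower_ref, upper_ref):
    """Compare the input level with the lower and upper reference data.

    The level is assumed here to be the peak absolute sample value.
    """
    level = max((abs(s) for s in samples), default=0)
    if level > upper_ref:
        return "too_loud"
    if level < lower_ref:
        return "too_quiet"
    return "ok"


def warning_for(samples, lower_ref, upper_ref):
    """Return the warning text to utter, or None when the level is in range."""
    return WARNINGS.get(check_input_level(samples, lower_ref, upper_ref))
```

With reference data of 100 and 1000, an input whose peak is 20 yields the "speak louder" warning, while an input whose peak is 500 yields no warning and processing would continue.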

6. An input voice level warning method comprising:
a reference data comparing step of comparing input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range;
a warning uttering step of uttering, from a speaker, a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range;
a frequency spectrum conversion step of, when the reference data comparing step determines that the level of the input voice data is within the reference range, performing a fast Fourier transform on the input voice data frame by frame to decompose it into frequency components and convert it into a frequency spectrum;
a change amount calculating step of calculating, for each frame, a change amount of the frequency components converted into the frequency spectrum by the frequency spectrum conversion step;
a registration/recognition mode determining step of determining whether the mode is a registration mode or a recognition mode of the input voice data;
a time direction compression step of, when the registration/recognition mode determining step determines that the mode is the registration mode, compressing in the time direction the frequency components decomposed by the frequency spectrum conversion step and the per-frame change amounts calculated in the change amount calculating step, and registering them as registered voice data;
a matching step of, when the registration/recognition mode determining step determines that the mode is the recognition mode, matching the input voice data against the registered voice data frame by frame using the DP matching method to calculate a distance value;
a comparison result determination step of comparing the distance value calculated in the matching step with a pre-registered threshold value to determine whether the distance value is equal to or less than the threshold value; and
a response uttering step of taking, as the recognition result, the registered voice data whose distance value is determined to be equal to or less than the threshold value in the comparison result determination step, and uttering from the speaker response voice data that responds to that registered voice data.
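
As a non-limiting illustration of the matching step and the comparison result determination step, the following Python sketch computes a DP matching (dynamic time warping) distance between two sequences of per-frame feature vectors and accepts a registered entry only when the distance is equal to or less than a threshold. The Euclidean frame distance, the unconstrained warping path, the dictionary of registered data, and the choice of the smallest distance when several entries pass the threshold are assumptions for illustration; the claim does not specify them.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_distance(inp, ref):
    """Accumulated distance along the cheapest frame-by-frame alignment."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(inp[i - 1], ref[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(inp, registered, threshold):
    """Return the label of the closest registered entry whose distance is
    equal to or less than the threshold, or None if no entry qualifies."""
    best_label, best_dist = None, float("inf")
    for label, ref in registered.items():
        dist = dp_distance(inp, ref)
        if dist <= threshold and dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Because the alignment may repeat frames, an input that is a slower version of a registered utterance can still reach a distance of zero, which is the reason for using DP matching rather than a fixed frame-to-frame comparison.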

7. A recording medium on which is recorded a program for functioning as:
reference data comparing means for comparing input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range;
warning uttering means for uttering, from a speaker, a voice warning to that effect when the reference data comparing means determines that the level of the input voice data is not within the reference range;
frequency spectrum converting means for, when the reference data comparing means determines that the level of the input voice data is within the reference range, performing a fast Fourier transform on the input voice data frame by frame to decompose it into frequency components and convert it into a frequency spectrum;
change amount calculating means for calculating, for each frame, a change amount of the frequency components converted into the frequency spectrum by the frequency spectrum converting means;
registration/recognition mode determining means for determining whether the mode is a registration mode or a recognition mode of the input voice data;
time direction compression means for, when the registration/recognition mode determining means determines that the mode is the registration mode, compressing in the time direction the frequency components decomposed by the frequency spectrum converting means and the per-frame change amounts calculated by the change amount calculating means, and registering them as registered voice data;
matching means for, when the registration/recognition mode determining means determines that the mode is the recognition mode, matching the input voice data against the registered voice data frame by frame using the DP matching method to calculate a distance value;
comparison result determining means for comparing the distance value calculated by the matching means with a pre-registered threshold value to determine whether the distance value is equal to or less than the threshold value; and
response uttering means for taking, as the recognition result, the registered voice data whose distance value is determined to be equal to or less than the threshold value by the comparison result determining means, and uttering from the speaker response voice data that responds to that registered voice data.

8. A recording medium on which is recorded a program for executing:
a reference data comparing step of comparing input voice data with upper limit reference data and lower limit reference data to determine whether the level of the input voice data is within a reference range;
a warning uttering step of uttering, from a speaker, a voice warning to that effect when the reference data comparing step determines that the level of the input voice data is not within the reference range;
a frequency spectrum conversion step of, when the reference data comparing step determines that the level of the input voice data is within the reference range, performing a fast Fourier transform on the input voice data frame by frame to decompose it into frequency components and convert it into a frequency spectrum;
a change amount calculating step of calculating, for each frame, a change amount of the frequency components converted into the frequency spectrum by the frequency spectrum conversion step;
a registration/recognition mode determining step of determining whether the mode is a registration mode or a recognition mode of the input voice data;
a time direction compression step of, when the registration/recognition mode determining step determines that the mode is the registration mode, compressing in the time direction the frequency components decomposed by the frequency spectrum conversion step and the per-frame change amounts calculated in the change amount calculating step, and registering them as registered voice data;
a matching step of, when the registration/recognition mode determining step determines that the mode is the recognition mode, matching the input voice data against the registered voice data frame by frame using the DP matching method to calculate a distance value;
a comparison result determination step of comparing the distance value calculated in the matching step with a pre-registered threshold value to determine whether the distance value is equal to or less than the threshold value; and
a response uttering step of taking, as the recognition result, the registered voice data whose distance value is determined to be equal to or less than the threshold value in the comparison result determination step, and uttering from the speaker response voice data that responds to that registered voice data.