JPS6342278B2

Info

Publication number: JPS6342278B2
Application number: JP57078942A
Authority: JP
Inventors: Hiroshi Saito; Hideki Fuje
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-05-10
Filing date: 1982-05-10
Publication date: 1988-08-22
Also published as: JPS58194097A

Description

[Detailed description of the invention]

The present invention relates to a triggering method for a speech recognition device, and in particular aims to provide a triggering method that reduces malfunctions caused by noise and the like.

In a typical speech recognition device, the speech signal picked up by a microphone passes through a microphone amplifier and then through a feature extraction section built from band-pass filters and full-wave rectifier circuits. A multiplexer switches sequentially among the outputs of the feature extraction section and supplies them to an A/D converter, which converts them into digital signals that are stored in a storage unit as speech data. Speech is then recognized by collating speech data obtained again through the same circuit against the speech data held in the storage unit. In storage mode, the digital conversion of the band-pass filter outputs is performed at high speed, while the interval until the same outputs are converted again (hereinafter the frame interval; the data captured in one pass is called frame data) is several milliseconds to several tens of milliseconds. Speech is recognized by processing this data.

Such a device has various problems. The following are cited as causes of malfunction during recognition:

(1) Variation in the speaker's speaking rate.
(2) Activation by words other than the words to be recognized.
(3) Activation by environmental noise (especially impulsive noise).

Items (1) and (2) can be mitigated by more advanced analysis of the collected data. Item (3) has been hard to mitigate, because speech that arrives while the device is malfunctioning on noise is either misrecognized or rejected. The conventional trigger method is considered to be the cause of noise-induced malfunction. In the commonly used method, data capture starts when the energy of the input signal rises to a preset threshold level or above, and ends once the energy has fallen below the threshold level and a fixed time has elapsed. Implementations differ in whether the energy is evaluated over the full band or restricted to a specific band, but all of them basically consider only the magnitude of the input signal energy. Consequently, data capture starts whenever noise is input above a certain level, so malfunctions inevitably become frequent when noise is superimposed on speech or when recognition is performed where impulsive noise occurs often.

The present invention eliminates this conventional drawback by introducing into the decision criterion, in addition to the comparison of input signal energy, the difference in frequency characteristics that distinguishes speech.

In general, when a recognition device is operated in a noisy environment and is initially set up so that the background noise input lies below the trigger level, malfunctions due to stationary noise are unlikely. Non-stationary noise such as impact sounds and periodic noise, however, is hard to predict and readily triggers frame data capture. Even then, using the duration of the signal as one criterion for deciding between speech and noise makes spurious capture less likely. But when noise occurs immediately before speech, the noise triggers first, noise frame data is collected, and speech frame data is then collected after it, so the processing is prone to misrecognition. Because this erroneous operation continues until processing finishes, capture of a new speech signal is rejected in the meantime.

The spectrum of noise, and of impact sound in particular, is flat over a wide frequency band.
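The conventional energy trigger described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `energy_trigger`, the hangover length in frames, and the sample energies are hypothetical and do not come from the patent. It shows how an impulsive noise shortly before speech starts the capture, so that noise and speech end up in the same captured segment.

```python
def energy_trigger(energies, threshold, hangover_frames):
    """Return (start, end) frame indices of captured segments.

    Capture starts when the energy reaches the threshold and ends once the
    energy has stayed below the threshold for `hangover_frames` frames in a
    row (the "fixed time" of the conventional method, counted in frames).
    """
    segments = []
    start = None
    below = 0
    for idx, energy in enumerate(energies):
        if start is None:
            if energy >= threshold:
                start = idx
                below = 0
        elif energy < threshold:
            below += 1
            if below >= hangover_frames:
                segments.append((start, idx))
                start = None
        else:
            below = 0
    if start is not None:
        # Input ended while a capture was still open.
        segments.append((start, len(energies) - 1))
    return segments


# Hypothetical per-frame energies: an impulsive noise at frame 2,
# then speech from frame 4 to frame 8.
energies = [0.1, 0.1, 5.0, 0.2, 3.0, 3.5, 3.2, 2.8, 3.1, 0.2, 0.1, 0.1, 0.1]
segments = energy_trigger(energies, threshold=1.0, hangover_frames=3)
print(segments)
```

With these values the capture opens on the noise frame and closes only after the speech, which is the situation the description identifies as prone to misrecognition.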
The spectrum of speech, by contrast, varies with the word uttered, but because the vocal tract acts as a tube of constant length it exhibits multiple peaks (hereinafter formants) on the frequency axis. In the vowel portions of speech these formants show little frequency variation and are stationary over time.

The present invention therefore extracts the frame data of the initial steady portion of the input signal and obtains the city-block distance with respect to this extracted frame data. This makes it possible to determine almost immediately whether the input signal is noise or speech to be recognized and, in the case of noise, to return to a state ready for new input. Moreover, even when the input signal is speech, the city-block distance becomes large for words whose initial steady portion differs, so malfunctions caused by such non-target speech are also reduced.

Speech recognition to which the present invention applies takes two forms. One is registration-based, speaker-dependent, limited-vocabulary recognition; the other is speaker-independent, limited-vocabulary recognition. In the former, the city-block distance is taken to the frame data of the steady frames extracted from the beginning of each registered word at registration time. In the latter, the city-block distance is taken to phoneme data for the beginnings of the words in the vocabulary.

The invention is described below with reference to the drawings of an embodiment. FIGS. 1 and 2 show one embodiment of the present invention. In FIG. 1, the speech signal picked up by microphone 1 passes through microphone amplifier 2 and then through band-pass filters 3-1, 3-2, ..., 3-n. Full-wave rectifier circuits 4-1, 4-2, ..., 4-n convert the filtered signals to DC, multiplexer 5 switches sequentially among the outputs of 4-1, 4-2, ..., 4-n, and A/D converter 6 digitizes them. The result is stored as speech data in a storage unit (not shown).
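The role of the city-block distance can be illustrated with a short sketch. The function name `city_block_distance` and all frame values are hypothetical, chosen only to contrast a formant-like spectrum with a flat one; the patent does not give numerical data.

```python
def city_block_distance(frame_a, frame_b):
    """Sum of absolute per-channel differences between two frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of channels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))


# Hypothetical 8-channel filter-bank frames (arbitrary units).
registered = [1, 8, 3, 1, 6, 2, 1, 1]   # word-initial steady frame with two peaks
speech_in = [1, 7, 3, 2, 6, 2, 1, 1]    # same word spoken again
impact = [4, 4, 4, 4, 4, 4, 4, 4]       # flat spectrum, as for an impact sound

d_speech = city_block_distance(registered, speech_in)
d_noise = city_block_distance(registered, impact)
print(d_speech, d_noise)
```

In this made-up example the flat frame lies much farther from the registered frame than the repeated word does, which is the property the trigger relies on.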
In this type of device, the digital conversion of the band-pass filter group is performed at high speed, while the interval until the filter group is converted again (hereinafter the frame interval; the group of data captured in one pass is called frame data) is several milliseconds to several tens of milliseconds. Speech is recognized by processing this data. Both in registration mode, in which data from A/D converter 6 is taken into the storage unit, and in recognition mode, in which data from A/D converter 6 is collated with the data registered in the storage unit, the speech trigger is executed according to the flowchart shown in FIG. 2.
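The frame capture described above can be mimicked in a few lines. This is a sketch under assumptions that are not in the patent: each channel is modeled as a hypothetical function from time in milliseconds to a rectified level, the A/D converter is taken to be 8-bit, and the multiplexer scan is treated as instantaneous relative to the frame interval.

```python
def capture_frames(channel_outputs, frame_interval_ms, duration_ms, adc_max=255):
    """Scan all channels once per frame interval and return the frames.

    channel_outputs: list of callables, one per band, mapping a time in
    milliseconds to the rectified level of that band (hypothetical model).
    """
    frames = []
    t = 0
    while t < duration_ms:
        # Multiplexer scan: read every channel in turn and quantize it.
        frame = [max(0, min(adc_max, int(ch(t)))) for ch in channel_outputs]
        frames.append(frame)
        t += frame_interval_ms
    return frames


# Three hypothetical channels: constant, rising, and saturating.
channels = [lambda t: 10, lambda t: t, lambda t: 300]
frames = capture_frames(channels, frame_interval_ms=10, duration_ms=30)
print(frames)
```

Each inner list is one frame of data, and consecutive frames are one frame interval apart.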
In FIG. 2, 11 is a step that captures one frame of data, 12 is a step that compares the level of the input signal with the threshold level, and 13 is the decision step for this level trigger. If decision step 13 finds the level at or above the threshold level, processing proceeds to frequency trigger step 14; otherwise it returns to step 11. Frequency trigger step 14 finds the steady portion of the input signal and takes it as the trigger frame. It then obtains the city-block distance to the trigger frames of the limited vocabulary, which have been obtained by separate analysis, and tests by the size of the distance whether the input is a phoneme of the limited vocabulary; the decision is made in decision step 15. If no steady portion can be extracted from the input signal even after several tens of milliseconds, the input is regarded as noise and processing returns to step 11. If decision step 15 finds the city-block distance small, processing proceeds to step 16, the known data collection and recognition processing, and the result is output when processing completes.

Next, the method of detecting the steady portion in frequency trigger step 14 is described.

A data frame captured after passing decision step 13 of the level trigger is written as the J-th frame $X_J$. When the frequency analysis uses $I$ channels, $X_J$ is an $I$-dimensional vector; $i$ denotes the channel number and $L_J$ the maximum value in $X_J$. The distance between the J-th frame and the K-th frame is defined as $D_{JK} =$ [Formula], or alternatively as

$$D_{JK} = \sum_{i} \left| X_{J,i} - X_{K,i} \right| \cdot \frac{1}{L_J}.$$

Next, the method of obtaining the actual potential is described. Focusing on, for example, the J-th frame, the potential $P_J$ of the J-th frame is the average of its distances to the N frames before and after it:

$$P_J = \frac{1}{N} \sum_{K=J-N}^{J} D_{JK} \cdot \frac{d}{L_J},$$

where $N < J$ and $d$ is a gain coefficient.

When the potential $P_J$ obtained from this formula falls to or below the threshold level set for the city-block distance, frame J is taken as the trigger frame. If the potential has not fallen to or below that threshold level after 50 ms of capture, processing is aborted and returns to step 11.

Next, a simplified method is described. Because it involves no complex computation, it allows high-speed processing. It extracts the steady portion less accurately, however, so the city-block distance threshold level should be raised when it is used. In this simplified method, one frame of data is captured several tens of milliseconds after the level-trigger decision step 13 has passed, and that frame is taken as the trigger frame.

As is clear from the above, the triggering method of the present invention judges the validity of data capture, as pre-processing for speech recognition, by comparing the energy of the registered speech data and the input speech data and, in addition, by extracting the steady-portion frame data at the beginning of the word and incorporating the difference in frequency characteristics of that portion into the decision criterion. It can therefore accurately determine whether the input is noise or speech to be recognized, not only under stationary noise but also under non-stationary noise such as impact sounds.
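The steady-portion search and the frequency trigger decision can be sketched as below. This is a minimal illustration, not the patented implementation: the function names, the two thresholds, the 10 ms frame interval, and all frame values are assumptions, and the sum is taken over frames J-N to J as in the reconstructed potential formula. Frames are assumed to have already passed the level trigger.

```python
def distance(x_j, x_k):
    """D_JK = sum_i |X_J,i - X_K,i| / L_J, where L_J is the maximum of X_J."""
    return sum(abs(a - b) for a, b in zip(x_j, x_k)) / max(x_j)


def potential(frames, j, n, d=1.0):
    """P_J = (1/N) * sum over K = J-N..J of D_JK * d / L_J."""
    total = sum(distance(frames[j], frames[k]) for k in range(j - n, j + 1))
    return total * d / (n * max(frames[j]))


def find_trigger_frame(frames, n, steady_threshold, frame_interval_ms,
                       timeout_ms=50):
    """Return the index of the first steady frame, or None on timeout."""
    last = min(len(frames) - 1, timeout_ms // frame_interval_ms)
    for j in range(n, last + 1):
        if max(frames[j]) == 0:
            continue  # assumption: skip silent frames to avoid dividing by zero
        if potential(frames, j, n) <= steady_threshold:
            return j
    return None


def frequency_trigger(frames, registered_triggers, n, steady_threshold,
                      match_threshold, frame_interval_ms):
    """True if data collection should proceed (step 16), False for step 11."""
    j = find_trigger_frame(frames, n, steady_threshold, frame_interval_ms)
    if j is None:
        return False  # no steady portion found in time: treated as noise
    best = min(distance(frames[j], ref) for ref in registered_triggers)
    return best <= match_threshold


# Hypothetical 4-channel frames at a 10 ms frame interval.
speech = [[1, 2, 1, 1], [2, 6, 2, 3], [2, 8, 3, 5], [2, 8, 3, 5], [2, 8, 3, 5]]
impact = [[9, 9, 9, 9], [5, 5, 5, 5], [3, 3, 3, 3],
          [2, 2, 2, 2], [1, 1, 1, 1], [1, 1, 1, 1]]
registered = [[2, 7, 3, 5]]

speech_ok = frequency_trigger(speech, registered, n=2, steady_threshold=0.05,
                              match_threshold=0.5, frame_interval_ms=10)
noise_ok = frequency_trigger(impact, registered, n=2, steady_threshold=0.05,
                             match_threshold=0.5, frame_interval_ms=10)
print(speech_ok, noise_ok)
```

The simplified method in the description would replace `find_trigger_frame` with a fixed delay after the level trigger, at the cost of needing a looser match threshold.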

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing an embodiment of the speech recognition device of the present invention, and FIG. 2 is a flowchart for explaining the triggering method of the same device.

1: microphone, 2: microphone amplifier, 3: band-pass filter, 4: full-wave rectifier circuit, 5: multiplexer, 6: A/D converter.

Claims

[Scope of Claims]

1. A triggering method for a speech recognition device in which speech is recognized by collating input speech data obtained from a microphone with registered speech data registered in advance in a storage unit, wherein, as pre-processing for the speech recognition, the energy of the input speech data is determined and compared with a predetermined level, the steady-portion frame data at the beginning of the word is determined by obtaining the city-block distance between preceding and following frames of the input speech data, and the validity of data capture is judged by obtaining the city-block distance between the word-initial steady-portion frame data of the input speech data and the word-initial steady-portion frame data of the registered speech data.

2. The triggering method for a speech recognition device according to claim 1, wherein the steady-portion frame data is a formant spectrum or phoneme data.