JPS63298298A

JPS63298298A - Voice section detecting system for voice recognition equipment

Info

Publication number: JPS63298298A
Application number: JP62131679A
Authority: JP
Inventors: 松下　満次; 勝美高橋
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1987-05-29
Filing date: 1987-05-29
Publication date: 1988-12-06

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a voice section detection method for a voice recognition device.

(Prior Art) In recent years, with the development of voice recognition technology for unspecified speakers, voice recognition devices that recognize telephone speech to handle catalog shopping, bank balance inquiries, and the like have come into widespread use.

Voice section detection for telephone voice input differs from voice section detection for microphone input, as used in voice-input word processors. The difference stems from the kinds of noise, other than the uttered voice, that each must handle.

In voice-input word processors and similar equipment, voice is input through a microphone, which is usually wrapped in a windproof sponge or mesh to block wind noise and the noise of breath from the mouth. A telephone has no such protection, and its signal also contains noise such as that caused by holding the handset directly in the hand, so voice section detection for telephone input requires more sophisticated techniques than for microphone input.

Furthermore, in telephone voice recognition the speaker cannot check the recognition result visually, so confirming the result by synthesized speech or similar means is essential. However, it would be unreasonable to confirm, in turn, the recognition result of the voice input given during confirmation, so it goes without saying that voice input for confirmation must be recognized with high reliability.

Therefore, the control words used for confirmation and the like are generally chosen from words that are unlikely to be misrecognized, such as "hai" (yes) and "iie" (no).

(Problems to be Solved by the Invention) However, the voice section detection method of conventional telephone voice recognition devices performs the same voice section detection as in ordinary recognition even when a high recognition rate is required, for example when confirming a recognition result. False detections therefore occur just as often as for ordinary recognition target words, and highly reliable recognition cannot be achieved.

An object of the present invention is to eliminate the above problem, namely that voice section detection cannot be adapted to the recognition target word, and to provide a voice section detection method that improves the recognition rate for recognition target words that particularly require highly reliable recognition.

(Means for Solving the Problems) To solve the above problem, the present invention provides a voice section detection method for a voice recognition device that receives, in accordance with a predetermined order designation, voice signals of a plurality of recognition target words prepared for acceptance in advance, detects the voice section of each voice signal, and recognizes the voice by comparing the signal of the voice section with standard pattern signals. The recognition target words are divided into groups according to features they have in common, including features arising from differences in the number of voice blocks in their voice signals. The method comprises a voice section detection unit for each group, which receives the voice signals of the recognition target words in that group and detects the voice section, and a voice section detection selection unit that selects among the voice section detection units and distributes the input voice signal to them. The voice section detection selection unit is selectively designated based on the predetermined order designation.

(Function) According to the present invention, the voice recognition device receives each voice signal in accordance with the predetermined order designation. Each time a voice signal is received, the voice section detection selection unit selects a voice section detection unit according to the selection designation based on that order designation, and the voice signal is sent to the selected voice section detection unit, which detects its voice section.

(Embodiment) FIG. 1 is a block diagram of the circuit of a telephone voice recognition device showing an embodiment of the present invention.

In the figure, reference numeral 1 denotes the telephone voice recognition device.

Reference numeral 2 denotes a voice input unit, which receives the speaker's voice signal from a telephone line 3.

Although not shown, a host device of the telephone voice recognition device 1 gives the speaker guidance in synthesized speech, step by step, for voice input. Following the guidance, the speaker inputs by voice a recognition target word prepared for acceptance in advance, such as the digits of a personal identification number. The digits or other input are then read back to the speaker in synthesized speech for confirmation, and the speaker inputs a confirming utterance such as "hai" (yes) or "iie" (no).

Reference numerals 4, 5, and 6 denote first, second, and third voice section detection units, respectively. For each voice signal received by the voice input unit 2, one or more of these units, chosen by the selection described later, receive the signal and detect its voice section.

FIG. 2 is an explanatory diagram of voice section detection in the first, second, and third voice section detection units 4, 5, and 6. To detect the voice section of a voice signal A, a threshold LS for the start of a voice block and a threshold LE for the end of a voice block are set in advance for each of the voice section detection units 4, 5, and 6. If the signal stays above the threshold LS continuously for at least a start determination time TS, the point at which it first exceeded LS is taken as the start of the voice block. If the signal stays below the threshold LE continuously for at least an end determination time TE, the point at which it first fell below LE is taken as the end of the voice block. The voice section is then determined from the length of the voice block and other factors.
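The start and end rules of FIG. 2 can be sketched as follows. This is a minimal illustration, not the circuit of the embodiment: it assumes the voice signal A is available as a list of per-frame levels, that TS and TE are counted in frames, and that a block still open when the input ends is closed at the last frame. The function name `detect_blocks` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def detect_blocks(levels: List[float], ls: float, le: float,
                  ts: int, te: int) -> List[Tuple[int, int]]:
    """Return voice blocks as (start, end) frame indices, end exclusive.

    ls / le are the start and end thresholds (LS, LE); ts / te are the
    start and end determination times (TS, TE), assumed here to be
    expressed as frame counts.
    """
    blocks: List[Tuple[int, int]] = []
    in_block = False
    block_start = 0
    run_start: Optional[int] = None  # first frame of the current qualifying run

    for i, level in enumerate(levels):
        if not in_block:
            if level > ls:
                if run_start is None:
                    run_start = i
                # Above LS for at least TS frames: the block starts where
                # LS was first exceeded.
                if i - run_start + 1 >= ts:
                    in_block, block_start, run_start = True, run_start, None
            else:
                run_start = None  # run interrupted, e.g. a short noise spike
        else:
            if level < le:
                if run_start is None:
                    run_start = i
                # Below LE for at least TE frames: the block ends where
                # the level first fell below LE.
                if i - run_start + 1 >= te:
                    blocks.append((block_start, run_start))
                    in_block, run_start = False, None
            else:
                run_start = None  # short dip inside the block

    if in_block:
        # Assumption: close an unfinished block at the end of the input.
        blocks.append((block_start, len(levels)))
    return blocks
```

With `ts=2`, a single frame above LS is ignored as noise, while a pause of at least `te` frames splits an utterance into two blocks.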

A voice signal may consist of a single voice block or of several voice blocks, and is generally accompanied by noise.

FIG. 3 shows voice waveforms for several utterances: (1) is the waveform of the utterance "hai" (yes), (2) is the waveform of "iie" (no), (3) is the waveform of "iie" accompanied by noise, and (4) is the waveform of "ichi" (one). For voice recognition, waveforms (1) and (2) must each be detected as a voice section consisting of a single voice block. Waveform (3) must likewise be detected as a voice section consisting of a single voice block once the noise block has been removed, and waveform (4) must be detected as a voice section consisting of two voice blocks.

The first voice section detection unit 4 detects, for example, the voice sections of voice signals in the group of recognition target words that consist of exactly one voice block. The second voice section detection unit 5 handles the group with one or two voice blocks, and the third voice section detection unit 6 handles the group with no limit on the number of voice blocks.
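The grouping by block count can be expressed as a small lookup table. The sketch below is only an illustration of that grouping: the table `MAX_BLOCKS` and the function `unit_accepts` are hypothetical names, and treating a result with no voice block as unacceptable is an assumption.

```python
from typing import Dict, Optional

# Hypothetical table: detection unit number -> maximum number of voice
# blocks it accepts (None means no limit), following the grouping in the text.
MAX_BLOCKS: Dict[int, Optional[int]] = {4: 1, 5: 2, 6: None}


def unit_accepts(unit: int, block_count: int) -> bool:
    """Return True if a signal with block_count voice blocks fits the unit's group."""
    if block_count < 1:
        return False  # assumption: no voice block means nothing to recognize
    limit = MAX_BLOCKS[unit]
    return limit is None or block_count <= limit
```

Under this table, "ichi" with its two voice blocks fits units 5 and 6 but not unit 4.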

Reference numeral 7 denotes an analysis unit, which receives the voice signal from the voice input unit 2, analyzes it, and parameterizes its features. Reference numeral 8 denotes a standard pattern unit, which stores the patterns of the voice signals prepared for acceptance in advance. Reference numeral 9 denotes a recognition matching unit, which receives the voice section detection signals from the voice section detection units 4, 5, and 6 and the parameterized signal from the analysis unit 7, compares the parameterized signal for the detected voice section against the patterns in the standard pattern unit 8, and outputs the pattern signal when they match. Reference numeral 10 denotes a control unit, which receives the matching result from the recognition matching unit 9, makes the voice recognition decision, and sends the recognition result to the host device.

Reference numeral 11 denotes a selection designation unit. As described above, the host device gives the speaker guidance for voice input step by step. The voice inputs that follow can be classified, guidance step by guidance step, into groups of recognition target words with a characteristic property or groups in which several properties are mixed. The selection designation unit 11 therefore outputs a selection designation signal so that the voice signal input to the voice input unit 2 is distributed to whichever of the voice section detection units 4, 5, and 6 suit the current guidance step.

Reference numeral 12 denotes a voice section detection selection unit, which selects one or more of the voice section detection units 4, 5, and 6 based on the selection designation signal from the selection designation unit 11 and sends an enable signal to each selected voice section detection unit.
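The interplay of the selection designation unit 11 and the voice section detection selection unit 12 might be modeled as below. The guidance step names and most of the table entries are assumptions made for illustration. The description only states that confirmation words use the first unit 4.

```python
from typing import Dict, FrozenSet

UNITS = (4, 5, 6)  # first, second, and third voice section detection units

# Hypothetical guidance steps and unit assignments.
SELECTION_TABLE: Dict[str, FrozenSet[int]] = {
    "confirm": frozenset({4}),   # "hai" / "iie": single voice block
    "digits": frozenset({5}),    # assumption: digits have one or two blocks
    "mixed": frozenset({5, 6}),  # assumption: step with mixed word groups
}


def selection_designation(step: str) -> FrozenSet[int]:
    """Role of unit 11: map the current guidance step to the units to use."""
    return SELECTION_TABLE[step]


def enable_signals(selected: FrozenSet[int]) -> Dict[int, bool]:
    """Role of unit 12: raise the enable signal only for the selected units."""
    return {unit: unit in selected for unit in UNITS}
```

For example, `enable_signals(selection_designation("confirm"))` enables unit 4 only, so only its detection result would reach the recognition matching unit.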

When a recognition result is being confirmed, for example, the recognition target word is normally "hai" (yes) or "iie" (no). As shown in FIG. 3, once noise and the like are removed, each consists of a single voice block, so only the first voice section detection unit 4 needs to be selected for voice section detection.

Because the first voice section detection unit 4 limits the number of voice blocks to one, when two or more voice blocks are detected because of ambient noise, breath, or the like, it can either invalidate the voice signal or select the one block that most closely resembles a voice waveform.
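The two options for the first unit can be sketched as a single function. The source does not say how "closest to a voice waveform" is judged, so the sketch uses the longest block as a stand-in criterion. That choice, the function name `resolve_single_block`, and the `pick_best` flag are all assumptions.

```python
from typing import List, Optional, Tuple

Block = Tuple[int, int]  # (start, end) frame indices, end exclusive


def resolve_single_block(blocks: List[Block],
                         pick_best: bool = False) -> Optional[Block]:
    """Reduce a detection result to one voice block, or None if invalid.

    pick_best=False invalidates any result that is not exactly one block;
    pick_best=True keeps the longest block instead (a hypothetical proxy
    for the block most like a voice waveform).
    """
    if len(blocks) == 1:
        return blocks[0]
    if not blocks or not pick_best:
        return None  # nothing detected, or extra blocks with the reject policy
    return max(blocks, key=lambda b: b[1] - b[0])
```

Rejecting is the more conservative policy for confirmation words, since the host device can simply prompt the speaker again.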

Next, the operation of the circuit shown in FIG. 1 is described.

Each time the host device (not shown) gives the speaker guidance prompting a voice input, the selection designation unit 11 sends the voice section detection selection unit 12 a selection designation signal specifying which of the voice section detection units 4, 5, and 6 to use, before the recognition operation for that voice input begins. The voice section detection selection unit 12 sends a voice section detection enable signal to one or more of the voice section detection units 4, 5, and 6. The voice signal input from the telephone line 3 passes through the voice input unit 2 and enters the voice section detection units 4, 5, and 6. If the first voice section detection unit 4 has been selected, only its voice section detection result is input to the recognition matching unit 9. Meanwhile, the voice input from the voice input unit 2 is analyzed by the analysis unit 7, and its features are parameterized and input to the recognition matching unit 9. Based on the voice section detection result obtained earlier, the recognition matching unit 9 performs matching against the patterns of the standard pattern unit 8 and sends the matching result to the control unit 10. The control unit 10 makes the recognition decision from the matching result and sends the recognition result to the host device.

(Effects of the Invention) As described above, according to the present invention, recognition target words are grouped by their features and the voice section detection corresponding to each group is selected. By selecting appropriate voice section detection for recognition target words that demand particularly high reliability, such as those used for confirmation, an improvement in voice recognition accuracy can be expected.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the circuit of a telephone voice recognition device showing an embodiment of the present invention, FIG. 2 is an explanatory diagram of voice section detection, and FIG. 3 is a waveform diagram of each utterance.
1: telephone voice recognition device; 4, 5, 6: voice section detection units; 11: selection designation unit; 12: voice section detection selection unit

Claims

[Claims] A voice section detection method for a voice recognition device that receives, in accordance with a predetermined order designation, voice signals of a plurality of recognition target words prepared for acceptance in advance, detects the voice section of each voice signal, and recognizes the voice by comparing the signal of the voice section with standard pattern signals, the method being characterized by comprising: a voice section detection unit for each group, which receives the voice signals of recognition target words that have been divided into groups according to features common among the words, including features arising from differences in the number of voice blocks in the voice signal of each word, and detects the voice section; and a voice section detection selection unit that selects among the voice section detection units and distributes the input voice signal to them, wherein the voice section detection selection unit is selectively designated based on the predetermined order designation.