JPH02289899A - Voice detection system

Info

Publication number: JPH02289899A
Application number: JP1250057A
Authority: JP
Inventors: Masayuki Unno; 海野　雅幸; Shingo Nishimura; 新吾西村; Kiyoyuki Iwai; 岩井　清行; Tomoko Ichimura; 市村　智子
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1989-01-24
Filing date: 1989-09-26
Publication date: 1990-11-29

Abstract

PURPOSE: To detect the presence of voice at a high detection rate even when the noise amplitude is large, by frequency-analyzing the input signal in a preprocessing section, inputting the analysis result to a neural network, and deciding from the output of the neural network whether the input signal contains a voiced sound. CONSTITUTION: The preprocessing section 10 frequency-analyzes the input signal received from a signal input section, the analysis result of the preprocessing section 10 is input to the neural network 20, and a decision circuit 30 decides from the output of the neural network 20 whether the input signal contains a voiced sound. Consequently, even when the noise amplitude is large and strongly affects voice detection, the presence of voice in a noisy environment can be detected at a high detection rate, and the processing can easily be completed in a short time.

Description

[Detailed Description of the Invention]

[Field of Industrial Application] The present invention relates to a voice detection method.

[Prior Art] Many methods have been used to detect the presence of voice in a noisy environment. They are used to detect voice segments in communications, as described in Japanese Examined Patent Publication No. S57-12999, and as preprocessing for recognizing the content of spoken language. It has been difficult, however, to extend them to general use under high noise. For example, a hands-free telephone could not start a voice-activated response while the incoming-call bell was ringing. As a simple method of detecting the presence of voice in a noisy environment, there has been a method that detects the number of times the input signal crosses a reference axis within a fixed time interval.
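For illustration only, the reference-axis crossing count mentioned above can be sketched in Python as follows. The function name `count_axis_crossings`, the default reference level of 0, and the choice to ignore samples lying exactly on the axis are assumptions made for this sketch; the cited prior art does not specify them.

```python
def count_axis_crossings(samples, reference=0.0):
    """Count how many times the signal moves from one side of the reference axis to the other."""
    crossings = 0
    prev_above = None  # side of the axis of the last sample that was off the axis
    for s in samples:
        if s == reference:
            continue  # assumption: a sample exactly on the axis does not decide a side
        above = s > reference
        if prev_above is not None and above != prev_above:
            crossings += 1
        prev_above = above
    return crossings
```

Calling the function on the samples of one fixed time interval gives the count for that interval.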

However, the conventional voice detection methods described above generally assume that the noise amplitude is small compared with the voice amplitude, so they cannot detect the presence of voice when the noise amplitude is comparable to the voice amplitude. The present applicant has therefore proposed, as a voice detection method that can easily detect the presence of voice in a noisy environment, a method that calculates the number of reference-axis crossings and the peak value (a non-dimensionalized measure of the waveform amplitude level) of the input signal as feature quantities, compares the calculated result with dictionary data defined in advance for voiced sounds and a specific noise, and determines by pattern recognition whether the input signal contains a voiced sound.

[Problems to Be Solved by the Invention] However, the conventional voice detection method described above has the following problems ① and ②.

① Pattern recognition cannot anticipate every noise other than the specific noise (for example, a bell sound) used when the dictionary data was created, so the detection rate is low in noise environments that were not foreseen.

② To secure a detection rate above a certain level, the conventional method must use complex feature quantities such as those described above, which requires a complex processing device and a relatively long processing time.

An object of the present invention is to provide a voice detection method that can detect the presence of voice in a noisy environment at a high detection rate, even when the noise amplitude is large and strongly affects voice detection, and that can easily be processed in a short time.

[Means for Solving the Problems] In the invention according to claim 1, an input signal received from a signal input section is frequency-analyzed in a preprocessing section, the analysis result of the preprocessing section is input to a neural network, and whether the input signal contains a voiced sound is determined from the output of the neural network.

In the invention according to claim 2, the analysis result of the preprocessing section is the frequency characteristic of the input signal.

In the invention according to claim 3, the analysis result of the preprocessing section is the temporal change of the frequency characteristic of the input signal within a fixed time.

In the invention according to claim 4, the analysis result of the preprocessing section is the average frequency characteristic of the input signal.

In the invention according to claim 5, the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the input signal within a fixed time.

In the invention according to claim 6, the analysis result of the preprocessing section is the average frequency characteristic and the average pitch frequency of the input signal.

In the invention according to claim 7, the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the input signal within a fixed time, together with the temporal change of the average pitch frequency.

In the invention according to claim 8, the analysis result of the preprocessing section is the average frequency characteristic of the high-frequency-emphasized input signal.

In the invention according to claim 9, the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the high-frequency-emphasized input signal within a fixed time.

In the invention according to claim 10, the neural network is a hierarchical neural network.

[Operation] The invention according to each of claims 1 to 9 provides the following effects ① to ⑤. In the present invention, voiced sounds (sounds accompanied by vibration of the vocal cords, such as vowels, semivowels, and nasals; almost all speech uttered by humans contains voiced sounds) are treated as voice.

① A neural network can undergo additional learning as needed during system operation, after the network has been constructed by the learning described later. Therefore, even for voice detection in a noise environment that was not anticipated in the learning stage for network construction, a high detection rate can be achieved by additionally learning that environment at any time during operation.

② Because the "frequency analysis result of the input signal" is used as the input to the neural network, the preprocessing needed to obtain the input is simpler than conventional complex feature extraction, and the time it requires is short.

③ In principle, the arithmetic processing of a neural network as a whole is simple and fast.

④ In principle, each unit constituting a neural network operates independently, so parallel arithmetic processing is possible. The arithmetic processing is therefore fast.

⑤ Owing to ① to ④ above, voice detection can easily be performed in a short time without a complex processing device.

Further, the invention according to claim 10 provides the following effect ⑥ in addition to effects ① to ⑤ above.

⑥ For hierarchical neural networks, a simple learning algorithm (backpropagation), described later, is already established, so a neural network that achieves a high detection rate can easily be formed.

[Embodiments] FIG. 1 is a schematic diagram showing an example of a voice detection system to which the present invention is applied, FIG. 2 is a schematic diagram showing neural networks, FIG. 3 is a schematic diagram showing a hierarchical neural network, and FIG. 4 is a schematic diagram showing the structure of a unit.

Before the specific embodiments of the present invention are described, the configuration of the neural network and the learning algorithm are explained.

(1) Neural networks are broadly divided by structure into two types: the hierarchical network shown in FIG. 2(A) and the interconnected network shown in FIG. 2(B). The present invention may be configured with either type, but the hierarchical network is more useful because a simple learning algorithm, described later, has been established for it.

(2) Network structure. As shown in FIG. 3, a hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer consists of one or more units. The only connections are forward connections from the input layer to the intermediate layer to the output layer; there are no connections within a layer.

(3) Unit structure. As shown in FIG. 4, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, sums them, transforms the sum by a fixed rule (a transformation function), and outputs the result. Each connection to another unit carries a variable weight representing the strength of the connection.

(4) Learning (backpropagation). Network learning means bringing the actual output closer to the target value (the desired output). In general, learning is performed by changing the transformation function and the weights of each unit shown in FIG. 4. Specifically, the target value is set to "1" for voiced sounds and "0" for noise, and learning follows ① to ③ below.

① The preprocessing of the present invention is applied to voiced sound alone, and the result of the preprocessing is input to the neural network. The transformation function and weights of each unit are then corrected so that the output of the neural network approaches the target value.

② The preprocessing of the present invention is applied to noise alone, and the result of the preprocessing is input to the neural network. The transformation function and weights of each unit are then corrected so that the output of the neural network approaches the target value.

③ Learning may also be performed with an input signal containing both voiced sound and noise. The target value in this case is "1", as for voiced sound.
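The unit of item (3) and the learning of item (4) can be sketched as follows, as an illustration only and not as the circuit of the embodiment. The sketch assumes a sigmoid transformation function, a squared-error measure, per-unit bias terms (one way of "changing the transformation function"), a learning rate of 0.5, and made-up four-element input patterns; none of these values is given in the specification, and the names `ThreeLayerNet`, `train_step`, and `train` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Assumed transformation function of a unit."""
    return 1.0 / (1.0 + math.exp(-x))


class ThreeLayerNet:
    """Input layer -> intermediate layer -> single output unit, forward connections only."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        # Each unit sums its weighted inputs and applies the transformation function.
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update; returns the squared error before the update."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights as they were before this update.
        delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(self.w2, hidden)]
        for j, h in enumerate(hidden):
            self.w2[j] -= lr * delta_out * h
        self.b2 -= lr * delta_out
        for j, dh in enumerate(delta_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
            self.b1[j] -= lr * dh
        return 0.5 * (out - target) ** 2


def train(net, examples, epochs=2000, lr=0.5):
    """Present every (input, target) pair once per epoch."""
    for _ in range(epochs):
        for x, target in examples:
            net.train_step(x, target, lr)


# Hypothetical preprocessed patterns: target 1 for voiced sound, 0 for noise.
voiced = [0.9, 0.7, 0.2, 0.1]
noise = [0.1, 0.2, 0.8, 0.9]
net = ThreeLayerNet(n_in=4, n_hidden=3)
train(net, [(voiced, 1.0), (noise, 0.0)])
```

Additional learning during operation corresponds, in this sketch, to calling `train` again on the already trained `net` with patterns taken from the new noise environment.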

As the learning algorithm, the backpropagation described in, for example, Rumelhart, D. E., McClelland, J. L. and the PDP Research Group, PARALLEL DISTRIBUTED PROCESSING, the MIT Press, 1988, can be used.

(5) Evaluation. After a network that can secure a certain detection rate has been constructed by the above learning, a preprocessed unknown input signal is input to the neural network. If the output of the neural network is close to "1", the signal is judged to be voiced sound; if it is close to "0", it is judged to be noise.

Specific embodiments of the present invention are described below. The detection system 1 of this embodiment is composed of an n-channel band-pass filter 10, a neural network 20, and a decision circuit 30 connected together (see FIG. 1).

(A) The input signals in the learning stage for network construction are, for example, ① the steady part of the voiced sound "a" (the part excluding the rising and falling portions of the signal) and ② a bell sound (the specific noise). The specific noise used in this learning stage is not limited to a bell sound; any noise expected to occur in the environment where the system will be used may be employed.

(B) Preprocessing. As shown in FIG. 1, the input signal waveform is passed through the band-pass filter 10 having multiple (n) channels, which yields the frequency characteristic of the input signal.
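As a rough software stand-in for the n-channel band-pass filter of step (B), the energy in n frequency bands can be computed from a discrete Fourier transform. This is an assumption made for illustration: the specification does not state how the filter is realized, its band edges, or its sampling rate, and the name `band_energies` and the equal-width bands are hypothetical.

```python
import cmath
import math


def band_energies(samples, n_channels=8):
    """Sum DFT power in n_channels equal-width bands between DC and the Nyquist frequency."""
    n = len(samples)
    half = n // 2
    if half < n_channels:
        raise ValueError("need at least 2 * n_channels samples")
    # Power of DFT bins k = 1 .. n/2 (the DC bin is skipped).
    power = []
    for k in range(1, half + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        power.append(abs(acc) ** 2 / n)
    # Group consecutive bins into channels; each result is one channel output.
    bands = []
    for c in range(n_channels):
        lo = c * len(power) // n_channels
        hi = (c + 1) * len(power) // n_channels
        bands.append(sum(power[lo:hi]))
    return bands
```

The returned list of n values plays the role of the "frequency characteristic" that is fed to the n input units.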

(C) Processing and decision by the neural network.

① The result of the preprocessing (the output of the band-pass filter 10) is input to the three-layer hierarchical neural network 20, as shown in FIG. 1. The input layer 21 consists of n units corresponding to the n channels of the preprocessing. The output layer 22 consists of one unit, and the target value is "1" for voiced sound and "0" for noise.

② The output of the neural network 20 is input to the decision circuit 30, which determines from the output value of the output layer 22 whether the input signal contains a voiced sound. In practicing the present invention, however, the output of the neural network 20 need not be judged mechanically by something like the decision circuit 30; it may instead be judged by the intellect of a person who receives the output of the neural network 20.

③ Using the backpropagation learning algorithm described above, the network is trained until the error of the output with respect to the input converges to a certain level, which constructs a network that can guarantee a certain detection rate.

④ Using the neural network 20 constructed in ③ above, the presence of voice is detected under any noise environment. If, at the actual site of system operation, background noise that was not anticipated in the learning stage for network construction is considered to have a large influence, that noise can be additionally learned in the actual usage environment, and the neural network 20 can thereby be improved to better suit the usage environment.

(D) Experiment. A voice detection experiment was conducted using the detection system 1 described above. As a result, a detection rate of 99% was observed.
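One possible reading of the decision step and of the detection rate is sketched below. The threshold of 0.5 (the midpoint between the target values "1" and "0") and the definition of the detection rate as the fraction of trials decided correctly are assumptions; the specification gives neither, and the names `decide` and `detection_rate` are hypothetical.

```python
def decide(output, threshold=0.5):
    """Return True (voiced sound present) when the network output is nearer to 1 than to 0."""
    return output >= threshold


def detection_rate(outputs, labels, threshold=0.5):
    """Fraction of trials in which the decision agrees with the true label."""
    if len(outputs) != len(labels) or not outputs:
        raise ValueError("outputs and labels must be non-empty and of equal length")
    correct = sum(1 for o, label in zip(outputs, labels) if decide(o, threshold) == label)
    return correct / len(outputs)
```

For example, `detection_rate([0.9, 0.1, 0.8, 0.4], [True, False, True, True])` counts three correct decisions out of four trials.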

Next, the operation of the above embodiment is described. The detection system 1 provides the following effects ① to ⑥.

① As described above, the neural network 20 can undergo additional learning as needed during system operation, after the network has been constructed by the initial learning. Therefore, even for voice detection in a noise environment that was not anticipated in the learning stage for network construction, a high detection rate can be achieved by additionally learning that environment at any time during operation.

② Because the "frequency characteristic of the input signal" is used as the input to the neural network 20, the preprocessing needed to obtain the input is simpler than conventional complex feature extraction, and the time it requires is short.

③ In principle, the arithmetic processing of the neural network 20 as a whole is simple and fast.

④ In principle, each unit constituting the neural network 20 operates independently, so parallel arithmetic processing is possible. The arithmetic processing is therefore fast.

⑤ Owing to ① to ④ above, voice detection can easily be performed in a short time without a complex processing device.

⑥ Because the hierarchical neural network 20 is used, the simple learning algorithm that is already established (backpropagation) can be applied, and a neural network that achieves a high detection rate can easily be formed.

In practicing the present invention, the analysis result of the preprocessing section may be the temporal change of the frequency characteristic of the input signal within a fixed time.

Further, in practicing the present invention, the analysis result of the preprocessing section may be the average frequency characteristic of the input signal. It may also be the temporal change of the average frequency characteristic of the input signal within a fixed time. A detection system with a preprocessing section for this case is shown in FIG. 5. That is, the detection system 1 is composed of an n-channel band-pass filter 10, an averaging circuit 15, a neural network 20, and a decision circuit 30 connected together. The preprocessing follows ① and ② below.

① The input signal waveform is divided in time into four equal blocks, as shown in FIG. 8.

② As shown in FIG. 5, the input signal waveform of each block is passed through the band-pass filter 10 having multiple (n) channels (n = 8 in this embodiment), and a frequency characteristic such as those shown in FIGS. 9(A) to 9(D) is obtained for each block, that is, for each fixed time interval. The output of the band-pass filter 10 is averaged by the averaging circuit 15 for each block, that is, for each fixed time interval.

This preprocessing yields the temporal change of the average frequency characteristic of the input signal within a fixed time.

Further, in practicing the present invention, the analysis result of the preprocessing section may be the average frequency characteristic and the average pitch frequency of the input signal. It may also be the temporal change of the average frequency characteristic of the input signal within a fixed time, together with the temporal change of the average pitch frequency. The pitch frequency of voice is the reciprocal of the repetition period (pitch period) of the glottal wave. Adding the pitch frequency, a basic parameter of the vocal cords that differs between individuals, as an input to the neural network improves the detection rate of voiced sounds for unspecified speakers. A detection system with a preprocessing section for this case is shown in FIG. 6. That is, the detection system 1 is composed of an n-channel band-pass filter 10, a pitch extraction section 11, an averaging circuit 15, a neural network 20, and a decision circuit 30 connected together. The preprocessing follows ① and ② below.

① The input signal is divided in time into four equal blocks, as shown in FIG. 8.

② As shown in FIG. 6, the input signal waveform is passed through the band-pass filter 10 having multiple (n) channels (n = 8 in this embodiment), and a frequency characteristic such as those shown in FIGS. 9(A) to 9(D) is obtained for each block, that is, for each fixed time interval. In parallel with the processing by the band-pass filter 10, the input signal waveform is passed through the pitch extraction section 11, and a pitch frequency is obtained for each block, that is, for each fixed time interval. The outputs of the band-pass filter 10 and the pitch extraction section 11 are averaged by the averaging circuit 15 for each block.

This preprocessing yields the temporal change of the average frequency characteristic of the input signal within a fixed time, together with the temporal change of the average pitch frequency.
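The block-wise averaging and the pitch extraction described above can be sketched as follows, purely as an illustration. The specification does not say how the pitch extraction section works; the autocorrelation method, the 50 to 500 Hz search range, and the names `block_average` and `pitch_frequency` are assumptions of this sketch.

```python
def block_average(frames, n_blocks=4):
    """Split a time sequence of channel-output vectors into equal blocks and average each block.

    frames: list of per-time-step lists, one value per filter channel.
    Frames left over after equal division are dropped.
    """
    size = len(frames) // n_blocks
    if size == 0:
        raise ValueError("need at least n_blocks frames")
    averaged = []
    for b in range(n_blocks):
        block = frames[b * size:(b + 1) * size]
        n_ch = len(block[0])
        averaged.append([sum(f[c] for f in block) / len(block) for c in range(n_ch)])
    return averaged


def pitch_frequency(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate pitch as the reciprocal of the lag that maximizes the autocorrelation."""
    lag_start = max(1, int(sample_rate / f_max))
    lag_stop = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_start, lag_stop + 1):
        val = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # 0.0 signals that no positive autocorrelation peak was found.
    return sample_rate / best_lag if best_lag else 0.0
```

With four blocks and n = 8 channels, `block_average` yields 4 × 8 = 32 values; if one pitch value per block is appended, the network input grows to 36 values. These input sizes follow from the block and channel counts stated above, although the specification does not state them explicitly.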

Further, in practicing the present invention, the analysis result of the preprocessing section may be the average frequency characteristic of the high-frequency-emphasized input signal. It may also be the temporal change of the average frequency characteristic of the high-frequency-emphasized input signal within a fixed time. High-frequency emphasis compensates for the average slope of the spectrum of the voice waveform and prevents energy from concentrating in the low-frequency range. A detection system with a preprocessing section for this case is shown in FIG. 7. That is, the detection system 1 is composed of a high-frequency emphasis section 10A, an n-channel band-pass filter 10, an averaging circuit 15, a neural network 20, and a decision circuit 30 connected together. The preprocessing follows ① and ② below.

① The input signal is divided in time into four equal blocks, as shown in FIG. 8.

② As shown in FIG. 7, the input signal waveform is passed through the high-frequency emphasis section 10A, which consists of a high-frequency emphasis filter, to apply high-frequency emphasis. The emphasized voice waveform is then passed through the band-pass filter 10 having multiple (n) channels (n = 8 in this embodiment), and a frequency characteristic such as those shown in FIGS. 9(A) to 9(D) is obtained for each block, that is, for each fixed time interval. The output of the band-pass filter 10 is averaged by the averaging circuit 15 for each block.

This preprocessing yields the temporal change of the average frequency characteristic of the high-frequency-emphasized input signal within a fixed time.

The high-frequency emphasis of the present invention need not be applied before input to the band-pass filter 10 as described above; it may instead be applied after output from the band-pass filter 10.

[Effects of the Invention] As described above, the present invention provides a voice detection method that can detect the presence of voice in a noisy environment at a high detection rate, even when the noise amplitude is large and strongly affects voice detection, and that can easily be processed in a short time.
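A common way to realize high-frequency emphasis in software is a first-order difference filter, y[t] = x[t] - a * x[t-1]. The sketch below assumes this filter and a coefficient a = 0.95; the specification does not describe the emphasis filter of section 10A, so both the filter form and the name `high_frequency_emphasis` are illustrative assumptions.

```python
def high_frequency_emphasis(samples, coeff=0.95):
    """First-order pre-emphasis: attenuates slowly varying components, boosts rapid changes."""
    emphasized = []
    prev = 0.0  # the sample before the first one is taken as zero
    for s in samples:
        emphasized.append(s - coeff * prev)
        prev = s
    return emphasized
```

A constant (low-frequency) input is reduced to a small residual after the first sample, whereas a sample-to-sample alternating (high-frequency) input is nearly doubled in amplitude, which is the spectral tilt compensation described above.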

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram showing an example of a voice detection system to which the present invention is applied, FIG. 2 is a schematic diagram showing neural networks, FIG. 3 is a schematic diagram showing a hierarchical neural network, FIG. 4 is a schematic diagram showing the structure of a unit, FIG. 5 is a schematic diagram showing another example of the voice detection system, FIG. 6 is a schematic diagram showing yet another example of the voice detection system, FIG. 7 is a schematic diagram showing yet another example of the voice detection system, FIG. 8 is a schematic diagram showing an input signal, and FIG. 9 is a schematic diagram showing the output of the band-pass filter.

1: detection system; 10: band-pass filter; 10A: high-frequency emphasis section; 11: pitch extraction section; 15: averaging circuit; 20: neural network; 21: input layer; 22: output layer; 30: decision circuit.

Patent applicant: Sekisui Chemical Co., Ltd. Representative: Kaoru Hirota

[Figure labels recovered from the drawings: FIG. 2 (A), (B); FIG. 3; FIG. 4; "input pattern"]

Claims

[Claims]

(1) A voice detection method in which an input signal received from a signal input section is frequency-analyzed by a preprocessing section, the analysis result of the preprocessing section is input to a neural network, and whether the input signal contains a voiced sound is determined from the output of the neural network.

(2) The voice detection method according to claim 1, wherein the analysis result of the preprocessing section is the frequency characteristic of the input signal.

(3) The voice detection method according to claim 2, wherein the analysis result of the preprocessing section is the temporal change of the frequency characteristic of the input signal within a fixed time.

(4) The voice detection method according to claim 1, wherein the analysis result of the preprocessing section is the average frequency characteristic of the input signal.

(5) The voice detection method according to claim 4, wherein the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the input signal within a fixed time.

(6) The voice detection method according to claim 1, wherein the analysis result of the preprocessing section is the average frequency characteristic and the average pitch frequency of the input signal.

(7) The voice detection method according to claim 6, wherein the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the input signal within a fixed time and the temporal change of the average pitch frequency.

(8) The voice detection method according to claim 1, wherein the analysis result of the preprocessing section is the average frequency characteristic of the high-frequency-emphasized input signal.

(9) The voice detection method according to claim 8, wherein the analysis result of the preprocessing section is the temporal change of the average frequency characteristic of the high-frequency-emphasized input signal within a fixed time.

(10) The voice detection method according to any one of claims 1 to 9, wherein the neural network is a hierarchical neural network.