JPS6147437B2

Info

Publication number: JPS6147437B2
Application number: JP55174341A
Authority: JP
Inventors: Satoru Kabasawa; Hidekazu Tsuboka; Yoshiteru Mifune
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1980-12-10
Filing date: 1980-12-10
Publication date: 1986-10-18
Also published as: JPS5797599A

Description

[Detailed Description of the Invention] In the present invention, speech is assumed to be represented, as shown in equation (1), as a sequence of feature vectors obtained by sampling the input at fixed time intervals.

$$X_1, X_2, \ldots, X_N \tag{1}$$

In the above expression, each $X_i$ ($i = 1, 2, \ldots, N$) is an $m$-dimensional vector, expressed as

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \tag{2}$$

Here, the feature vector can be regarded, for example, as the time-sampled outputs $x_1(t), x_2(t), \ldots, x_m(t)$ of an $m$-channel band-pass filter.

A section of speech represented by one feature vector is called a frame. The subscripts $1, 2, \ldots, N$ in equation (1) are parameters representing time (that is, the frame).

A sequence of feature parameters is, for example, a sequence of energies. When the sequence of speech feature vectors in equation (1) is the output of a filter bank, the energy of the feature vector $X_i$ is defined as

$$P_i = \sum_{j=1}^{m} x_{ij}^2 \tag{3}$$

From equation (3), the energy sequence corresponding to the feature-vector sequence of equation (1) is expressed as

$$P_1, P_2, \ldots, P_N \tag{4}$$
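As an illustration of equations (3) and (4), the following minimal Python sketch computes the energy of each frame from its filter-bank outputs. The function names `frame_energy` and `energy_sequence` and the sample values are assumptions made for this example and do not appear in the specification.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Equation (3): sum of the squared channel outputs of one frame X_i."""
    return sum(x * x for x in frame)


def energy_sequence(frames: Sequence[Sequence[float]]) -> List[float]:
    """Equation (4): P_1..P_N for the feature vectors X_1..X_N."""
    return [frame_energy(frame) for frame in frames]


# Three hypothetical frames from a 3-channel filter bank (m = 3, N = 3).
frames = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0], [0.5, 0.5, 0.0]]
print(energy_sequence(frames))  # [9.0, 25.0, 0.5]
```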

Sections in which the energy sequence of equation (4) exceeds a preset threshold are detected as speech sections, and this detection is one of the important problems in automatic speech recognition. The problem of speech section detection can be divided into two sub-problems: detection of the start point and detection of the end point.

The present invention concerns the second of these two sub-problems; its object is detection of the end point.

In speech section detection, the average input level varies from utterance to utterance, not only between different speakers but also for the same speaker, and noise added to the uttered speech causes irregular fluctuations in the energy sequence. Because these fluctuations prevent the speech section from being detected correctly, the uttered speech is often misrecognized.

The object of the present invention is to detect speech sections more accurately by setting the threshold adaptively in response to fluctuations in the energy sequence, thereby absorbing variation in the average input level and irregular fluctuations caused by noise, and by observing the sequence for a preset time.

In the present invention, however, the sequence of equation (4) need not be the energy given by equation (3); it is equally valid to define it by, for example,

$$P_i = \left( \sum_{j=1}^{m} x_{ij}^2 \right)^{1/2} \tag{5}$$

$$P_i = \sum_{j=1}^{m} |x_{ij}| \tag{6}$$

That is, $P_i$ may be any quantity that is equivalent to the energy of equation (3) and can express the characteristics of the speech input level.
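The three level measures of equations (3), (5) and (6) can be compared side by side. The sketch below is only illustrative; the function names and the sample frame are assumptions introduced here.

```python
import math
from typing import Sequence


def level_energy(frame: Sequence[float]) -> float:
    """Equation (3): sum of squares."""
    return sum(x * x for x in frame)


def level_root(frame: Sequence[float]) -> float:
    """Equation (5): square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in frame))


def level_abs_sum(frame: Sequence[float]) -> float:
    """Equation (6): sum of absolute values."""
    return sum(abs(x) for x in frame)


frame = [3.0, -4.0]  # hypothetical 2-channel frame
print(level_energy(frame), level_root(frame), level_abs_sum(frame))  # 25.0 5.0 7.0
```

All three grow with the input level, which is the only property the text requires of $P_i$. Their numeric ranges differ, so a constant such as the floor value in a threshold rule would presumably need to be chosen to match whichever measure is used.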

A conventionally proposed method of detecting the end of a speech section from an energy sequence is described below, together with its problems.

In the conventionally proposed method of detecting the end of a speech section, if the predetermined energy-level threshold is too high, irregular fluctuations due to noise and variation in the average level during utterance tend to cause the end to be falsely detected earlier than the correct end of the speech section. Conversely, if the predetermined threshold is too low, false detections due to variation in the average level during utterance are largely eliminated, but irregular fluctuations due to noise tend to cause the end to be falsely detected later than the correct end.

Therefore, to detect the end of a speech section correctly, it is necessary to observe both the variation in the input level of the uttered speech and the amount of noise added to it, and to determine an energy-level threshold adapted to each individual utterance.

The end detection method described below was devised against this background. An embodiment of a configuration that realizes the present invention is shown and described below.

FIG. 1 shows an example of a configuration for detecting the end of a speech section using an adaptively determined threshold.

The local maximum determination unit, indicated by 1, observes the variation of the input level and, each time it detects a local maximum of the energy level, passes that local maximum to the threshold determination unit, indicated by 2. When the output of the local maximum determination unit 1, that is, the local maximum, exceeds the current threshold, the threshold determination unit 2 determines a new threshold based on that local maximum. If a local maximum occurs in a section where the energy sequence does not exceed the current threshold, the threshold is left unchanged. Letting the threshold be $\theta$ and the local maximum be $\alpha$, $\theta$ is given, for some function $f$, by

$$\theta = f(\alpha) \tag{7}$$

Specifically, it is given by, for example,

$$\theta = \max(\alpha / 16,\ 25) \tag{8}$$

where $\max(\cdot)$ is the function that returns the largest of the values in parentheses. The initial threshold is given by equation (8) using the first local maximum.
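The update rule of the threshold determination unit 2 can be sketched as follows. This is a minimal reading of the description, not a definitive implementation: the function names are hypothetical, and representing "no threshold yet" as `None` is an assumption made for the example.

```python
from typing import Optional


def threshold_from_peak(alpha: float) -> float:
    """Equation (8): theta = max(alpha / 16, 25)."""
    return max(alpha / 16.0, 25.0)


def update_threshold(theta: Optional[float], alpha: float) -> float:
    """Return the threshold after a local maximum alpha has been observed.

    theta is None until the first local maximum arrives (assumption).
    A peak that does not exceed the current threshold leaves it unchanged.
    """
    if theta is None or alpha > theta:
        return threshold_from_peak(alpha)
    return theta


print(update_threshold(None, 1600.0))   # 100.0 (first peak sets the threshold)
print(update_threshold(100.0, 80.0))    # 100.0 (peak below threshold is ignored)
print(update_threshold(100.0, 3200.0))  # 200.0 (higher peak raises the threshold)
print(update_threshold(None, 160.0))    # 25.0  (floor of equation (8) applies)
```

Read literally, this rule recomputes the threshold from any peak that exceeds the current threshold, so a peak only slightly above the current threshold would lower it. Whether that is intended is not stated in the text.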

The state determination unit 3 determines, based on the output of the threshold determination unit 2, that is, the threshold, whether the energy sequence is larger or smaller than the threshold. The state sequence holding unit, indicated by 4, holds the output of the state determination unit 3. Once the energy sequence has remained at or below the threshold continuously for a preset time, that is, for at least a preset number of frames, the point at which the energy sequence first fell to or below the threshold is output to the end detection unit, indicated by 5. Based on the output of the state sequence holding unit 4, the end detection unit 5 detects the end frame from the energy sequence and outputs it.
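The hold logic of units 3 to 5 can be illustrated in isolation. The sketch below assumes, for simplicity only, a single fixed threshold `theta`; in the described configuration the threshold is adaptive. The names `below_states`, `find_end` and `hold_frames` are hypothetical.

```python
from typing import List, Optional, Sequence


def below_states(P: Sequence[float], theta: float) -> List[bool]:
    """State determination: True where the level is at or below the threshold."""
    return [p <= theta for p in P]


def find_end(below: Sequence[bool], hold_frames: int) -> Optional[int]:
    """Return the first frame of the first run of at least hold_frames
    consecutive below-threshold frames, or None if no such run exists."""
    run_start: Optional[int] = None
    for i, is_below in enumerate(below):
        if not is_below:
            run_start = None          # run interrupted: discard the candidate
            continue
        if run_start is None:
            run_start = i             # candidate end: first frame of the run
        if i - run_start + 1 >= hold_frames:
            return run_start
    return None


P = [300.0, 40.0, 40.0, 900.0, 150.0, 90.0, 60.0, 30.0]
states = below_states(P, theta=200.0)
print(find_end(states, hold_frames=3))  # 4
print(find_end(states, hold_frames=5))  # None
```

The two-frame dip at frames 1 and 2 is rejected because the level rises again before the hold time elapses; the run starting at frame 4 is accepted, and its first frame is reported as the end.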

FIG. 2 is an example of an energy sequence used to explain concretely the operation of the configuration shown in FIG. 1. In the figure, the horizontal axis is time, that is, frames, and the vertical axis is the energy level. The energy sequence corresponds to discrete points on the curve shown in the figure, but for convenience of notation it is drawn as a continuous curve.

The local maximum determination unit 1 detects the local maximum $\alpha_1$ and passes $\alpha_1$ to the threshold determination unit 2. The threshold determination unit 2 determines the threshold $\theta_1$ based on equation (8). The state determination unit 3 determines, based on $\theta_1$, whether the energy sequence is larger or smaller than $\theta_1$ and outputs the result. The state sequence holding unit 4 holds the output of the state determination unit 3 and observes whether energy at or below $\theta_1$ continues for the preset time $T$. In the example of FIG. 2, the energy sequence exceeds $\theta_1$ again within $T$, so the time $t_1$ at which the energy sequence first fell below $\theta_1$ is not output to the end detection unit. Similarly, the threshold $\theta_2$ is determined from the local maximum $\alpha_2$, but nothing is output to the end detection unit 5. The local maximum $\alpha_4$ does not exceed the threshold $\theta$, so the threshold $\theta$ remains unchanged until the time of the local maximum $\alpha_3$. For the threshold $\theta_3$ determined from the local maximum $\alpha_3$, however, the energy remains at or below $\theta_3$ continuously for $T$ or longer, so the time $t_3$ at which the energy sequence first fell below $\theta_3$ is detected as the end and is output from the end detection unit 5.
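The walkthrough above can be reproduced on synthetic data. The following Python sketch chains the five units of FIG. 1 into one causal loop. It is an illustration under stated assumptions rather than the patented circuit: the peak test (a frame higher than its predecessor and not lower than its successor), the one-frame delay it implies, the name `detect_end`, the parameter `hold_frames` standing in for $T$, and the energy values are all introduced here.

```python
from typing import List, Optional, Sequence, Tuple


def detect_end(
    P: Sequence[float], hold_frames: int
) -> Tuple[Optional[int], List[Tuple[int, float]]]:
    """Return (end_frame, threshold_updates) for the level sequence P.

    threshold_updates lists (peak_frame, new_theta) pairs in order.
    """
    theta: Optional[float] = None
    updates: List[Tuple[int, float]] = []
    run_start: Optional[int] = None

    for i in range(len(P)):
        # Unit 1: frame k = i - 1 is a local maximum if it rose and did not keep rising.
        k = i - 1
        if k >= 1 and P[k - 1] < P[k] >= P[i]:
            alpha = P[k]
            # Unit 2: update only when the peak exceeds the current threshold.
            if theta is None or alpha > theta:
                theta = max(alpha / 16.0, 25.0)  # equation (8)
                updates.append((k, theta))

        # Unit 3: state of the current frame (no threshold yet counts as "above").
        if theta is None or P[i] > theta:
            run_start = None
            continue

        # Units 4 and 5: hold the run and report its first frame once it is long enough.
        if run_start is None:
            run_start = i
        if i - run_start + 1 >= hold_frames:
            return run_start, updates

    return None, updates


# Hypothetical sequence: a first peak, a short dip, a larger peak, then decay.
P = [10, 800, 300, 40, 40, 900, 3200, 1000, 150, 90, 60, 30, 20]
end, updates = detect_end(P, hold_frames=4)
print(updates)  # [(1, 50.0), (6, 200.0)]
print(end)      # 8
```

In this run the dip at frames 3 and 4 falls below the first threshold of 50 but lasts only two frames, so it is rejected, much as $t_1$ is in FIG. 2. After the second peak raises the threshold to 200, the level stays at or below it for four frames, and frame 8, where it first dropped, is reported as the end.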

As described above, the present invention is an end detection method that adaptively determines the threshold for end detection in accordance with variation in the average speech input level and thereby performs more accurate end detection. Compared with conventional methods, because the threshold for end detection is determined adaptively according to variation in the average speech input level, false end detections caused by variation in the average input level or by irregular noise-induced fluctuations in the energy sequence can be reduced, and the end can be detected more accurately. The invention has these and other excellent features.

Accurate detection of speech sections leads to improved accuracy in automatic speech recognition. The present invention, which absorbs irregular fluctuations in the energy sequence caused by input-level variation, noise and the like and detects the end more accurately, is therefore extremely effective in automatic speech recognition.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the method for detecting the end of a speech section, and FIG. 2 is a diagram of an energy sequence for explaining the operation of the block diagram of FIG. 1.

1: local maximum determination unit; 2: threshold determination unit; 3: state determination unit; 4: state sequence holding unit; 5: end detection unit.

Claims

[Claims]

1. A device for detecting the end of a speech section, comprising: means for sampling a speech input at fixed time intervals to generate a sequence of speech feature vectors $X_1, X_2, \ldots, X_N$; means for generating, from $X_1, X_2, \ldots, X_N$, a sequence of feature parameters $P_1, P_2, \ldots, P_N$ representing the speech input level; means for setting a threshold for $P_1, P_2, \ldots, P_N$; means for determining the end by detecting that $P_1, P_2, \ldots, P_N$ do not exceed the threshold; threshold determining means for comparing a local maximum of the sequence of feature parameters $P_i$ ($i = 1, 2, \ldots, N$) with the threshold and setting the larger value as the threshold; and means for determining, when the feature parameter $P_i$ ($i = 1, 2, \ldots, N$) does not exceed the threshold set by the threshold determining means for a fixed period of time, the end of the speech section to be the point in time at which $P_i$ first became smaller than the threshold.