JP3109978B2

JP3109978B2 - Voice section detection device

Info

Publication number: JP3109978B2
Application number: JP07106650A
Authority: JP
Inventors: 中直也田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1995-04-28
Filing date: 1995-04-28
Publication date: 2000-11-20
Anticipated expiration: 2015-11-20
Also published as: JPH08305388A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a voice section detection device used in a speech coding apparatus that encodes and transmits only voice sections.

[0002]

[Prior Art] Voice section detection devices that divide input speech into frames of a predetermined length and detect whether each frame is a voice section are known. In this specification, input speech is classified by its nature as follows. A voice section is a portion of the input speech that carries some information to be transmitted as speech and therefore needs to be encoded by the speech coding apparatus. A non-voice section is the portion of the input speech that remains after the voice sections are removed and does not need to be encoded. A sound section is a portion of the input speech in which the audio signal is at or above a set threshold; whether that signal contains information to be transmitted is irrelevant, so it may be mere noise. A silent section is the portion of the input speech that remains after the sound sections are removed.

[0003] The most basic method of detecting voice sections is to compare the average voice power of each frame with a predetermined threshold, judge frames whose average voice power exceeds the threshold to be sound sections, and treat the sound sections directly as voice sections. When there is no background noise, or its level is very low, voice sections and sound sections nearly coincide, so voice sections can be detected accurately. When the background noise level is high, however, more sections are judged to be sound sections and voice sections can no longer be detected correctly. Raising the threshold to reduce the number of sections judged to be sound sections causes voice sections to be judged silent, which produces dropouts in the sound. Devices have also been developed that use feature quantities of the input speech in the threshold comparison, such as the stability of its spectral parameters, the prediction error from linear prediction analysis, or its zero-crossing count, or that vary the threshold for sound-section determination based on the average voice power of the sound and silent sections. For stationary background noise such as white noise, these devices can detect voice sections correctly even when the noise is present at a high level, for example an SN ratio of about 20 dB.
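The basic power-threshold method described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular device: the function names `frame_powers` and `sound_sections`, the sample values, and the threshold are all assumptions chosen for the example.

```python
def frame_powers(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean power."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def sound_sections(samples, frame_len, threshold):
    """Mark a frame as a sound section when its mean power exceeds the threshold."""
    return [p > threshold for p in frame_powers(samples, frame_len)]


# Two quiet frames followed by one loud frame (hypothetical sample values).
signal = [0.01, -0.01, 0.01, -0.01] * 2 + [0.5, -0.5, 0.5, -0.5]
flags = sound_sections(signal, frame_len=4, threshold=0.01)
```

With a fixed threshold like this, raising the noise level of the quiet frames above the threshold would mark every frame as a sound section, which is the failure mode the paragraph describes.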

[0004]

[Problems to Be Solved by the Invention] With the conventional devices described above, however, both the average voice power and the speech feature quantities fluctuate strongly under general environmental noise such as factory noise or street noise, so that, for example, the judged value frequently moves above and below the threshold. It has therefore been difficult to detect voice sections correctly.

[0005] The present invention solves this conventional problem. Its object is to provide a voice section detection device that can correctly detect voice sections in input speech even when strongly fluctuating environmental noise is present at a high level, for example an SN ratio of about 20 dB.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention comprises periodicity detecting means that analyzes the input speech to detect periodicity, sound section detecting means that detects sound sections based on power information of the input speech, and voice section determining means that detects voice sections from the current and past detection results of these two detecting means according to predetermined rules for distinguishing voice sections from non-voice sections. The sound section detecting means comprises average voice power calculating means that calculates the average voice power of each frame of the input speech, short-time power ratio calculating means that calculates the ratio between the average voice power of the previous frame and that of the current frame, and long-time power ratio calculating means that calculates the ratio between a long-time average voice power, obtained by further averaging the per-frame average voice power over m frames, and the average voice power of the current frame. The voice section determining means comprises state counter updating means that has a state counter indicating the state of the input speech and updates the state counter according to predetermined rules, a determination map containing those rules, and comparison determining means that determines voice sections by comparing the value of the state counter with a predetermined threshold.

[0007] The present invention is further characterized in that the state counter updating means updates the value of the state counter by referring to the determination map, which contains rules that determine the amount by which the state counter is increased or decreased on the basis of the periodicity determination value from the periodicity determining means, the short-time power ratio from the short-time power ratio calculating means, the long-time power ratio from the long-time power ratio calculating means, and the value of the state counter, which holds a value estimating the current state of the input speech based on past determination results.

[0008]

[Operation] With the above configuration, the present invention can correctly detect voice sections even when strongly fluctuating environmental noise is present in the input speech at a high level, for example an SN ratio of about 20 dB.

[0009]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a voice section detection device according to one embodiment of the present invention. In FIG. 1, 101 is periodicity detecting means that detects the periodicity of the input speech; 102 is average voice power calculating means that calculates the average voice power of each frame of the input speech; 103 is short-time power ratio calculating means that calculates the ratio between the average voice power of the previous frame and that of the current frame; 104 is long-time power ratio calculating means that calculates a long-time average voice power by further averaging the per-frame average voice power over m frames and then calculates the ratio between the long-time average voice power and the average voice power of the current frame; 105 is state counter updating means that has a state counter indicating the state of the input speech and updates it according to predetermined rules; 106 is a determination map containing the predetermined rules; and 107 is comparison determining means that determines voice sections by comparing the state counter value with a predetermined threshold. Further, 108 is the input speech, 109 is the periodicity determination value, 110 is the average voice power of the current frame, 111 is the short-time power ratio, 112 is the long-time power ratio, 113 is the state counter value, and 114 is the voice section determination output as the determination result. The average voice power calculating means 102, the short-time power ratio calculating means 103, and the long-time power ratio calculating means 104 constitute sound section detecting means 115, and the state counter updating means 105, the determination map 106, and the comparison determining means 107 constitute voice section determining means 116.

[0010] Next, the operation of the above embodiment is described. In FIG. 1, the input speech 108 is fed to the periodicity detecting means 101 and the average voice power calculating means 102. The periodicity detecting means 101 analyzes the input speech and outputs the periodicity determination value 109, which may be either binary information indicating the presence or absence of periodicity or continuous-valued information indicating the degree of periodicity. The average voice power calculating means 102 calculates and outputs the average voice power 110 of the current frame. The short-time power ratio calculating means 103 calculates the ratio between the average voice power 110 and the average voice power of the previous frame that it holds, outputs the result as the short-time power ratio 111, and then replaces the held previous-frame value with the average voice power of the current frame. Similarly, the long-time power ratio calculating means 104 calculates the ratio between the average voice power 110 and the long-time average power, obtained by further averaging the average voice powers of the past m frames that it holds, outputs the result as the long-time power ratio 112, and then updates the held average voice powers of the past m frames with the average voice power 110 of the current frame. The state counter updating means 105 updates the state counter by referring to the determination map 106, which contains rules that determine the amount by which the state counter is increased or decreased on the basis of the periodicity determination value 109, the short-time power ratio 111, the long-time power ratio 112, and the value of the state counter, which holds a value estimating the current state of the input speech based on past determination results. The comparison determining means 107 compares the updated state counter value 113 with a predetermined threshold and determines whether the current frame is a voice section or a non-voice section.
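The calculation of the average voice power 110, the short-time power ratio 111, and the long-time power ratio 112 might be sketched as below. Several details are assumptions made for illustration and are not stated in the specification: the class name `PowerRatioTracker`, taking each ratio as current power divided by the reference power, expressing the ratios in dB, the small power floor that avoids division by zero, and averaging over however many frames are held before m frames have been seen.

```python
import math
from collections import deque


class PowerRatioTracker:
    """Tracks the short-time and long-time power ratios, one frame at a time."""

    def __init__(self, m=5, floor=1e-12):
        self.floor = floor              # assumed guard against division by zero
        self.prev_power = None          # average power of the previous frame
        self.history = deque(maxlen=m)  # average powers of the past m frames

    def update(self, frame):
        """Return (power, short_ratio_db, long_ratio_db) for the current frame."""
        power = max(sum(x * x for x in frame) / len(frame), self.floor)

        # Ratios are taken as current / reference and expressed in dB (assumption).
        prev = self.prev_power if self.prev_power is not None else power
        short_db = 10.0 * math.log10(power / prev)

        long_avg = sum(self.history) / len(self.history) if self.history else power
        long_db = 10.0 * math.log10(power / long_avg)

        # The held values are updated only after both ratios have been computed.
        self.prev_power = power
        self.history.append(power)
        return power, short_db, long_db
```

Updating the held values after the ratios are computed mirrors the order given in the text, where each calculating means outputs its ratio first and then refreshes its stored powers with the current frame.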

[0011] Next, the principle of voice section detection using the speech feature quantities employed in the above embodiment, namely the periodicity determination value 109, the average voice power 110, the short-time power ratio 111, and the long-time power ratio 112, is explained below.

[0012] FIG. 2 shows how each speech feature quantity changes in the above embodiment when speech with street noise added at an SN ratio of 20 dB is input: 201 shows the per-frame average voice power 110, 202 the short-time power ratio 111, 203 the long-time power ratio 112, and 204 the periodicity determination value 109. Traces 201, 202, and 203 are plotted in decibels [dB]; in trace 204, sections judged to be periodic (stationary sections) appear as peaks and sections judged not to be periodic (non-stationary sections) appear as valleys. The frame length of the input speech was 20 ms, and the number of frames m used to calculate the long-time average voice power was 5. Both the short-time power ratio 111 and the long-time power ratio 112 change greatly where the average voice power 110 changes greatly, that is, at the rise and fall of speech. They are therefore feature quantities well suited to detecting the rise and fall of speech. The two differ in that the short-time power ratio 111 follows the average voice power even when it changes abruptly and repeatedly within a short time, but tends to fluctuate too much, whereas the long-time power ratio 112 changes stably but tends to be unable to follow repeated abrupt changes. By combining the characteristics of both, the rise and fall of speech can be detected more accurately not only when there is no background noise but also when strongly fluctuating noise is added, for example at an SN ratio of about 20 dB. The periodicity determination value 109, in turn, is a feature quantity effective for detecting the stationary portions of speech.

[0013] The operating principle of the voice section detection device of this embodiment is to detect the rise and fall of speech from the combination of the short-time power ratio and the long-time power ratio, to detect the stationary sections of speech by periodicity detection, and to detect voice sections by judging the two together. An example of voice section detection based on this principle is described below. Here, the state counter held by the state counter updating means 105 in FIG. 1 takes values from 0 to 18, and a frame is judged to be a voice section when the state counter value is in the range 0 to 5.

[0014] FIG. 3 shows an example of the determination map 106. The plane with the short-time power ratio on the vertical axis and the long-time power ratio on the horizontal axis is divided into nine map regions 301, region 1 through region 9, and each region is assigned a state counter increment or decrement 302. When the state counter value held before the update indicates a non-voice section, the state counter updating means 105 looks up which region of the determination map 106 the received short-time power ratio 111 and long-time power ratio 112 fall in, and updates the state counter by the corresponding increment or decrement. If the updated state counter value is in the range 0 to 5, the current frame is judged to be a voice section; that is, the rise of a voice section is detected. When the state counter value before the update indicates a voice section, stationary-section detection is performed using the periodicity determination value 109. If the frame is judged to be a stationary section, the state counter is cleared to 0; if it is judged to be a non-stationary section, the determination map 106 is consulted and the state counter is updated in the same way as when the state counter value indicates a non-voice section. If the updated state counter value is in the range 6 to 18, the current frame is judged to be a non-voice section; that is, the fall of a voice section is detected.
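The state counter update described above might look like the following sketch. The counter range (0 to 18) and the voice range (0 to 5) come from the text, but the region boundaries `BOUNDS_DB` and the increments in `DELTAS` are hypothetical placeholders, since the actual values appear only in FIG. 3; the 3 x 3 grid layout, the binary `periodic` flag, and clamping the counter to its range are likewise assumptions.

```python
COUNTER_MIN, COUNTER_MAX = 0, 18   # counter range given in the text
VOICE_MAX = 5                      # counter values 0..5 mean "voice section"

# Hypothetical region boundaries (dB) and increments; the real values are in FIG. 3.
BOUNDS_DB = (-3.0, 3.0)
DELTAS = [
    # long ratio: low, mid, high
    [+3, +2, 0],    # short ratio low  (power falling)
    [+2, +1, -2],   # short ratio mid
    [0, -2, -18],   # short ratio high (power rising)
]


def band(ratio_db):
    """Map a ratio in dB to band index 0 (low), 1 (mid), or 2 (high)."""
    low, high = BOUNDS_DB
    return 0 if ratio_db < low else (1 if ratio_db <= high else 2)


def update_counter(counter, periodic, short_db, long_db):
    """Return (new_counter, is_voice) for one frame."""
    in_voice = counter <= VOICE_MAX
    if in_voice and periodic:
        counter = COUNTER_MIN          # stationary frame inside a voice section
    else:
        counter += DELTAS[band(short_db)][band(long_db)]
        counter = max(COUNTER_MIN, min(COUNTER_MAX, counter))  # assumed clamp
    return counter, counter <= VOICE_MAX
```

For instance, with these placeholder values a frame with sharply rising power moves a counter of 18 down to 0 and so marks the rise of a voice section, while repeated falling-power frames without periodicity push the counter past 5 and mark the fall.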

[0015] FIG. 4 shows the voice section detection results obtained with the device of this embodiment: 401 shows the change in average voice power of speech with street noise added at an SN ratio of 20 dB, 402 shows the voice sections detected from that speech, 403 shows the change in average voice power of speech with no background noise added, and 404 shows the voice sections detected from that speech. In 402 and 404, peaks indicate voice sections and valleys indicate non-voice sections. As the figure shows, the detected sections differ somewhat, but voice sections are detected stably regardless of whether background noise is present.

[0016] The frames used in the above embodiment need not coincide with the speech frames of the speech coding apparatus with which the device is combined. If delay is acceptable, the frames of the voice section detection device can be offset so that they precede the frames of the speech coding apparatus, reading the input speech ahead. This detects changes in the power of the input speech earlier and allows voice sections to be detected even more accurately.

[0017] In the above embodiment, the average voice power is not used directly as a feature quantity for voice section detection, so voice sections can be detected almost without being affected by the input level of the input speech. Voice sections can therefore be detected even when the input level is very low. Depending on the application, however, it may be necessary to judge sections whose average voice power is at or below a predetermined threshold to be non-voice sections. For such applications, a determination rule that depends directly on the average voice power can be added to the determination map so that a frame is judged to be a non-voice section whenever its average voice power is at or below the predetermined threshold.

[0018] If the voice section detection device of the present invention is used in combination with a speech coding apparatus that has means for detecting the periodicity of speech, generally called pitch extraction, such as CELP (Code Excited Linear Prediction coding) or MBE (Multi Band Excitation coding), the pitch information obtained in the course of speech coding can serve as the periodicity determination value. No separate periodicity detecting means is then needed, and the amount of computation required for voice section detection is greatly reduced. The voice section detection device of the present invention is therefore very well suited to combination with a speech coding apparatus that has pitch extraction means.

[0019]

[Effects of the Invention] As described above, according to the present invention, a voice section detection device that divides input speech into frames of a predetermined length and detects whether each frame is a voice section comprises periodicity detecting means that analyzes the input speech to detect periodicity, sound section detecting means that detects sound sections based on power information of the input speech, and voice section determining means that detects voice sections from the current and past detection results of these two detecting means according to predetermined rules for distinguishing voice sections from non-voice sections. Voice sections can therefore be detected accurately even when the background noise level is high and fluctuates strongly.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a voice section detection device according to one embodiment of the present invention.

FIG. 2 is a characteristic diagram showing the speech feature quantities in one embodiment of the present invention.

FIG. 3 is a schematic diagram showing the determination map in one embodiment of the present invention.

FIG. 4 is a characteristic diagram showing the voice section detection results in one embodiment of the present invention.

[Explanation of symbols]

101: periodicity detecting means
102: average voice power calculating means
103: short-time power ratio calculating means
104: long-time power ratio calculating means
105: state counter updating means
106: determination map
107: comparison determining means
108: input speech
109: periodicity determination value
110: average voice power of the current frame
111: short-time power ratio
112: long-time power ratio
113: state counter value
114: determination result
115: sound section detecting means
116: voice section determining means
201: average voice power
202: short-time power ratio
203: long-time power ratio
204: periodicity determination value
301: map region on the determination map
302: state counter increment or decrement
401: average voice power of speech with background noise added
402: voice section detection result for speech with background noise added
403: average voice power of speech with no background noise added
404: voice section detection result for speech with no background noise added

Continuation of front page

(56) References:
JP-A-2-238493 (JP, A)
JP-A-60-200300 (JP, A)
JP-A-1-159697 (JP, A)
JP-A-60-57396 (JP, A)
JP-A-60-499 (JP, A)
JP-A-63-235999 (JP, A)
JP-A-63-163495 (JP, A)
JP-A-1-255897 (JP, A)
Japanese Patent No. 2648779 (JP, B2)
JP-B-1-21519 (JP, B2)
JP-B-4-64074 (JP, B2)
Yumi Takizawa et al., "Development of a Noise-Resistant Speech Recognition Device (1): On the Section Detection Method," Proceedings of the 1989 Spring Meeting of the Acoustical Society of Japan, I, 3-7-15, pp. 117-118 (issued March 14, 1989)

(58) Field surveyed (Int. Cl.7, DB name): G10L 11/00 - 21/06

Claims

(57) [Claims]

1. A voice section detection device that divides input speech into frames of a predetermined length and detects whether each frame is a voice section, comprising: periodicity detecting means that analyzes the input speech to detect periodicity; sound section detecting means that detects sound sections based on power information of the input speech; and voice section determining means that detects voice sections from the current and past detection results of these two detecting means according to predetermined rules for distinguishing voice sections from non-voice sections, wherein the sound section detecting means comprises average voice power calculating means that calculates the average voice power of each frame of the input speech, short-time power ratio calculating means that calculates the ratio between the average voice power of the previous frame and the average voice power of the current frame, and long-time power ratio calculating means that calculates the ratio between a long-time average voice power, obtained by further averaging the per-frame average voice power over m frames, and the average voice power of the current frame, and wherein the voice section determining means comprises state counter updating means that has a state counter indicating the state of the input speech and updates the state counter according to predetermined rules, a determination map containing the rules, and comparison determining means that determines voice sections by comparing the value of the state counter with a predetermined threshold.

2. The voice section detection device according to claim 1, wherein the state counter updating means updates the value of the state counter by referring to the determination map, which contains rules that determine the amount by which the state counter is increased or decreased on the basis of the periodicity determination value from the periodicity determining means, the short-time power ratio from the short-time power ratio calculating means, the long-time power ratio from the long-time power ratio calculating means, and the value of the state counter, which holds a value estimating the current state of the input speech based on past determination results.