JP6531449B2

JP6531449B2 - Voice processing apparatus, program and method, and exchange apparatus

Info

Publication number: JP6531449B2
Application number: JP2015058103A
Authority: JP
Inventors: 石田　斉
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2015-03-20
Filing date: 2015-03-20
Publication date: 2019-06-19
Anticipated expiration: 2035-03-20
Also published as: JP2016177176A

Description

The present invention relates to a voice processing device, program, and method, and to an exchange device, and can be applied, for example, to a voice detection device that determines the voiced sections and silent sections of a voice signal.

Conventionally, a voice processing device that processes voice signals, such as a telephone terminal or a switchboard, uses a voice detection function that distinguishes voiced sections, in which a person (the speaker) is speaking, from silent sections, in which the speaker is not speaking. In voice processing, this function is also called VAD (Voice Activity Detection). In conventional voice processing devices, voice detection is required by various kinds of voice signal processing, such as AGC (Automatic Gain Control), noise suppression, and speech recognition.

The simplest way to implement voice detection in a conventional voice processing device is to examine the power per unit time. In general, power tends to be high in voiced sections and low in silent sections. A conventional device can therefore determine voiced/silent with a certain degree of accuracy by computing the power per unit time and comparing it with a suitable threshold. However, even a silent section contains background noise, so the captured audio is never completely silent. When voiced/silent is determined from the power per unit time, the threshold must therefore be set above the power of the background noise. Depending on the environment in which the audio is captured, however, the power of the background noise and the power of the signal (the target sound) can change dynamically.
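The conventional fixed-threshold scheme described above can be sketched roughly as follows. This is only an illustration: the function names, the sample values, and the threshold are hypothetical and do not come from the patent.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame of PCM samples."""
    return sum(s * s for s in samples) / len(samples)


def is_voiced_fixed(samples, threshold):
    """Conventional decision: voiced if the frame power exceeds a fixed threshold."""
    return frame_power(samples) > threshold


# Hypothetical frames: low-amplitude noise and higher-amplitude speech.
quiet = [10, -12, 8, -9] * 20
loud = [900, -1100, 1000, -950] * 20

# Hypothetical threshold; it has to sit above the background noise power.
THRESHOLD = 10_000.0
```

With a fixed threshold, a rise in background noise power above `THRESHOLD` would cause every frame to be judged voiced, which is the weakness the passage points out.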

Patent Documents 1 and 2 describe conventional techniques that address this problem. They estimate changes in the background noise and dynamically change the voiced/silent threshold based on the estimate. They also accumulate the power per unit time over a fixed period, represent it as a histogram, and statistically estimate the power of the background noise.

Patent Document 1: Japanese Examined Patent Publication No. Hei 1-14599
Patent Document 2: Japanese Patent No. 3255584

However, with the techniques of Patent Documents 1 and 2, in an environment with a poor S/N ratio the voiced/silent threshold can become larger than the power of the signal, in which case voice cannot be detected accurately.

In view of these problems, there is a need for a voice processing device, program, and method, and an exchange device, that can detect voice with higher accuracy even in environments where the background noise power is large.

The voice processing device of the first aspect of the present invention comprises: (1) level value calculating means for calculating a level value of an input voice signal for each frame of a predetermined time unit; (2) frequency counting means for counting, for the level values calculated by the level value calculating means, the frequency of occurrence of each level value; (3) level value estimating means for estimating a background noise level value and a target sound signal level value from the frequency of occurrence of each level value; and (4) determining means for performing a determination process that determines whether the input voice signal is in a voiced section or a silent section, based on the estimated background noise level value and the estimated target sound signal level value.

The voice processing program of the second aspect of the present invention causes a computer to function as: (1) level value calculating means for calculating a level value of an input voice signal for each frame of a predetermined time unit; (2) frequency counting means for counting, for the level values calculated by the level value calculating means, the frequency of occurrence of each level value; (3) level value estimating means for estimating a background noise level value and a target sound signal level value from the frequency of occurrence of each level value; and (4) determining means for performing a determination process that determines whether the input voice signal is in a voiced section or a silent section, based on the estimated background noise level value and the estimated target sound signal level value.

The third aspect of the present invention is a voice processing method performed by a voice processing device, wherein (1) the device has level value calculating means, frequency counting means, level value estimating means, and determining means; (2) the level value calculating means calculates a level value of an input voice signal for each frame of a predetermined time unit; (3) the frequency counting means counts, for the level values calculated by the level value calculating means, the frequency of occurrence of each level value; (4) the level value estimating means estimates a background noise level value and a target sound signal level value from the frequency of occurrence of each level value; and (5) the determining means performs a determination process that determines whether the input voice signal is in a voiced section or a silent section, based on the estimated background noise level value and the estimated target sound signal level value.

The exchange device of the fourth aspect of the present invention (1) performs exchange processing of voice communication between a plurality of terminals and has exchange processing means for adjusting the level of a voice signal transmitted to a terminal, or of a voice signal received from a terminal, to a desired level, wherein (2) the exchange processing means uses the voice processing device of the first aspect of the present invention to adjust the level of the voice signal transmitted to the terminal, or of the voice signal received from the terminal, to the desired level.

According to the present invention, it is possible to realize a voice processing device, program, and method, and an exchange device, that can detect voice with higher accuracy even in environments where the background noise power is large.

FIG. 1 is a block diagram showing the functional configuration of the voice processing device (voice detection device) according to the first embodiment.
FIG. 2 is a graph showing the histogram (frequency distribution) held by the frequency counting unit according to the first embodiment.
FIG. 3 is a graph showing the histogram smoothed by the level estimation unit according to the first embodiment.
FIG. 4 is a graph showing the convexity of the histogram as quantified by the level estimation unit according to the first embodiment.
FIG. 5 is a block diagram showing the functional configuration of the voice processing device (speech onset detection device) according to the second embodiment.
FIG. 6 is a block diagram showing the functional configuration of the voice processing device (background noise reduction device) according to the third embodiment.
FIG. 7 is a block diagram showing the functional configuration of the voice processing device (adaptive gain control device) according to the fourth embodiment.
FIG. 8 is a block diagram showing the functional configuration of the voice processing device (voice processing device with a jitter buffer) according to the fifth embodiment.
FIG. 9 is a block diagram showing the functional configuration of the exchange device according to the sixth embodiment.

(A) First Embodiment
A first embodiment of the voice processing device, program, and method according to the present invention is described in detail below with reference to the drawings. This embodiment describes an example in which the voice processing device, program, and method of the present invention are applied to a voice detection device.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the overall configuration of the voice detection device 1 of this embodiment.

When a voice signal is input, the voice detection device 1 detects the voiced sections of that signal and outputs the result.

The format (data format) of the voice signal input to the voice detection device 1 is not limited; various data formats, such as PCM (Pulse Code Modulation), can be used. In this embodiment, frames each containing 10 msec of PCM voice data are input to the voice detection device 1 as the voice signal. That is, voice data is supplied to the voice detection device 1 frame by frame as the input voice signal. The sampling frequency and bit rate of the input voice signal (voice data) are likewise not limited. In the example of this embodiment, the input is assumed to be monaural 16-bit PCM voice data with a sampling frequency of 8 kHz. When voice data encoded with a given codec (for example, ITU-T G.711) is input to the voice detection device 1, a component that performs decoding may be added.
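For the example parameters above (8 kHz, 16-bit, monaural, 10 msec frames), the frame size works out to 80 samples or 160 bytes. The following sketch shows that arithmetic and one possible way to unpack a frame. The little-endian byte order and the name `decode_frame` are assumptions made for illustration; the patent does not specify them.

```python
import struct

SAMPLE_RATE_HZ = 8000   # sampling frequency from the embodiment
FRAME_MS = 10           # frame length from the embodiment
BYTES_PER_SAMPLE = 2    # 16-bit PCM

SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000    # 80 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE   # 160 bytes


def decode_frame(raw):
    """Unpack one frame of signed 16-bit mono PCM.

    Little-endian byte order is an assumption, not stated in the patent.
    """
    if len(raw) != BYTES_PER_FRAME:
        raise ValueError("unexpected frame length")
    return list(struct.unpack("<%dh" % SAMPLES_PER_FRAME, raw))
```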

The signal format (data format) output by the voice detection device 1 is likewise not limited. For example, the voice detection device 1 may output either a signal indicating a voiced section (for example, "1" or "True") or a signal indicating a silent section (for example, "0" or "False").

Next, the internal configuration of the voice detection device 1 is described.

The voice detection device 1 has a high-pass filter (hereinafter "HPF") 10, a level calculation unit 11, a frequency counting unit 12, a level estimation unit 13, and a voice determination unit 14.

The HPF 10 performs filter processing that attenuates the power of the low-band (low-frequency) components, that is, components at or below a predetermined frequency, contained in the input voice signal. Background noise often contains relatively large power in the low band. Attenuating the low-band components with the HPF 10 before voice detection therefore improves the S/N ratio of the voice signal used for voice detection. The frequency band attenuated by the HPF 10 is not limited; for example, the HPF 10 may attenuate components at or below 300 Hz. Hereinafter, the signal output from the HPF 10 (the signal whose low-band components have been attenuated) is also called the input voice signal x. The voice detection device 1 may also be configured without the HPF 10, in which case the voice signal (frame) input to the voice detection device 1 is itself processed as the input voice signal x.
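The patent does not specify how the HPF 10 is designed. As one possible illustration only, the sketch below uses a generic first-order (RC-style) high-pass filter with a 300 Hz cutoff at 8 kHz. The class name, the filter order, and the discretization are all assumptions.

```python
import math


class HighPassFilter:
    """First-order high-pass filter; state is kept across frames."""

    def __init__(self, cutoff_hz=300.0, sample_rate_hz=8000.0):
        rc = 1.0 / (2.0 * math.pi * cutoff_hz)
        dt = 1.0 / sample_rate_hz
        self.alpha = rc / (rc + dt)
        self.prev_in = 0.0
        self.prev_out = 0.0

    def process(self, frame):
        """Filter one frame of samples and return the filtered samples."""
        out = []
        for x in frame:
            # y[n] = alpha * (y[n-1] + x[n] - x[n-1])
            y = self.alpha * (self.prev_out + x - self.prev_in)
            self.prev_in, self.prev_out = x, y
            out.append(y)
        return out
```

A constant (0 Hz) input decays toward zero at the output, while a rapidly alternating input passes with only modest attenuation. A practical design would likely use a steeper filter.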

The level calculation unit 11 calculates the voice level (power level) of the input voice signal x. In this embodiment, the level calculation unit 11 calculates the voice level of each frame, one frame at a time. A specific example of this voice level calculation is described later.

The frequency counting unit 12 counts the distribution (frequency distribution) of levels (power) in the input voice signal x. Specifically, the frequency counting unit 12 holds, as a histogram (frequency distribution), the number of occurrences of each level (as calculated by the level calculation unit 11) in the input voice signal x. In this embodiment, the frequency counting unit 12 includes a counter unit 121 containing a counter for each level of the input voice signal x. In FIG. 1, the counter unit 121 is shown as having N+1 counters CT (CT_0 to CT_N), where N is an arbitrary integer. Each time the level calculation unit 11 calculates one level, the frequency counting unit 12 increments (adds 1 to) the counter CT in the counter unit 121 that corresponds to that level. The number of counters CT in the counter unit 121, the spacing of the corresponding levels, and so on are not limited.

In the counter unit 121 of this embodiment, one counter CT is assumed to be provided for every 1 dB. For example, suppose that the counter CT_0 corresponds to M dB (M is an arbitrary integer). The counters CT_0, CT_1, CT_2, ..., CT_N then correspond to M dB, M+1 dB, M+2 dB, ..., M+N dB, respectively, in 1 dB steps. The counter unit 121 can thus hold a histogram (frequency distribution) for levels in the range M dB to M+N dB. In this embodiment, the counter unit 121 is assumed to hold a histogram in 1 dB steps over the range 10 dB to 70 dB.

As described above, the counter unit 121 of the frequency counting unit 12 holds a histogram that counts the number of occurrences of each level in the input voice signal x. Hereinafter, the histogram (frequency distribution) held by the counter unit 121 is called the histogram H, and the value of the counter CT corresponding to an arbitrary level v is written H(v).

Based on the histogram H held by the counter unit 121, the level estimation unit 13 estimates the level of the background noise contained in the input voice signal (hereinafter the "background noise level") and the level of the voice, that is, the target sound (hereinafter the "signal level").

The voice determination unit 14 determines, from the background noise level and signal level estimated by the level estimation unit 13, whether the frame currently being processed (the most recently acquired frame) belongs to a voiced section or a silent section. It then outputs content corresponding to the result (either a signal indicating a voiced section or a signal indicating a silent section).

(A-2) Operation of the First Embodiment
Next, an example of the specific operation of the voice detection device 1 of the first embodiment configured as above (the voice processing method according to the embodiment) is described.

When one frame of voice data is input to the voice detection device 1, the HPF 10 first applies high-pass filtering (processing that attenuates the power of frequency components below a predetermined frequency). The HPF 10 outputs the processed voice signal (frame) as the input voice signal x.

The level calculation unit 11 calculates the power of the input voice signal x for each frame. For example, it may calculate the level by logarithmically converting the power of one frame of the input voice signal x. The power that serves as the reference point (0 dB) for the level calculation may be set as appropriate. The level calculation unit 11 may also calculate the level of the current frame from a moving average with the voice levels of past frames, which suppresses small frame-to-frame fluctuations in level.
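A minimal sketch of such a level calculation follows, assuming the level is 10·log10 of the mean squared amplitude relative to a configurable reference power, and assuming a simple unweighted moving average over the last few frame levels. The names `frame_level_db` and `SmoothedLevel`, the window length, and the small floor that avoids log10(0) are illustrative choices, not taken from the patent.

```python
import math
from collections import deque


def frame_level_db(samples, ref_power=1.0):
    """Level of one frame in dB relative to ref_power (the 0 dB reference)."""
    power = sum(s * s for s in samples) / len(samples)
    power = max(power, 1e-12)  # floor so that an all-zero frame does not hit log10(0)
    return 10.0 * math.log10(power / ref_power)


class SmoothedLevel:
    """Moving average of the most recent frame levels (window is hypothetical)."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, samples):
        self.history.append(frame_level_db(samples))
        return sum(self.history) / len(self.history)
```

For instance, a frame whose samples all have amplitude 100 has power 10,000 and therefore a level of 40 dB with `ref_power=1.0`.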

The frequency counting unit 12 increments the counter CT that corresponds to the level calculated by the level calculation unit 11. This updates the histogram H held by the counter unit 121.

At this point, the frequency counting unit 12 rounds the level calculated by the level calculation unit 11 using a predetermined method and increments the counter CT that corresponds to the rounded value (level). As described above, the counters CT in the counter unit 121 of this embodiment are set at 1 dB intervals. The frequency counting unit 12 may therefore, for example, round any level of at least 9.5 dB and less than 10.5 dB to 10 dB and increment the counter CT corresponding to 10 dB.
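The rounding and counting step can be sketched as below, using the 10 dB to 70 dB range of this embodiment. `math.floor(level + 0.5)` is used instead of Python's built-in `round`, because `round` rounds halves to the nearest even integer and would map 10.5 to 10, contrary to the "at least 9.5 dB and less than 10.5 dB" example. Ignoring levels outside the counter range is an assumption; the patent does not say how such levels are handled.

```python
import math

M_DB = 10   # level of counter CT_0 in this embodiment
N = 60      # CT_0..CT_N cover 10 dB to 70 dB
histogram = [0] * (N + 1)


def round_level(level_db):
    """Round to the nearest integer dB, with halves rounded up."""
    return math.floor(level_db + 0.5)


def count_level(level_db):
    """Increment the counter for the rounded level (out-of-range levels are ignored)."""
    k = round_level(level_db) - M_DB
    if 0 <= k <= N:
        histogram[k] += 1
```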

In this way, the counter unit 121 of the frequency counting unit 12 holds the histogram H using counters CT set in 1 dB steps.

FIG. 2 is a graph showing the histogram H held by the frequency counting unit 12 (counter unit 121). It shows the histogram H that was held by the frequency counting unit 12 (counter unit 121) when a voice signal was actually input to the voice detection device 1.

In the graph of FIG. 2, the horizontal axis indicates the level of the input voice signal x, and the vertical axis indicates the number of occurrences of each level (the value of the counter CT for each level).

At fixed intervals, the level estimation unit 13 estimates the background noise level and the signal level contained in the input voice signal x based on the histogram H held by the counter unit 121. The voice determination unit 14 then obtains the threshold used for voice determination from the estimated background noise level and signal level.

The voice determination unit 14 performs the threshold calculation process (threshold update process) using, for example, a histogram H based on the frames of the input voice signal x from the most recent predetermined time. For example, it may perform the threshold calculation process based on the frames from the most recent 10 sec. The timing at which the voice determination unit 14 performs the threshold calculation process is not limited. For example, the voice determination unit 14 may perform voice determination at every predetermined period (for example, every 10 sec).

The timing at which the voice determination unit 14 performs the threshold calculation process, the number of samples in the histogram H held by the frequency counting unit 12, and so on are not limited. For example, when the voice determination unit 14 performs the threshold calculation process at each predetermined period, it may also initialize each counter CT of the counter unit 121 (reset the counter value to 0).
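The two options mentioned above, a histogram over the most recent frames and a periodic reset of the counters, can be sketched as follows. With 10 msec frames, 10 sec corresponds to 1000 frames. The class `WindowedHistogram` and its sliding-window bookkeeping are an illustrative assumption; the patent leaves the mechanism open.

```python
from collections import deque


class WindowedHistogram:
    """Histogram over the most recent `window_frames` frames, with optional reset."""

    def __init__(self, num_bins=61, window_frames=1000):
        self.counts = [0] * num_bins
        self.window = deque()
        self.window_frames = window_frames

    def add(self, bin_index):
        """Count a new frame and drop the oldest one once the window is full."""
        self.window.append(bin_index)
        self.counts[bin_index] += 1
        if len(self.window) > self.window_frames:
            oldest = self.window.popleft()
            self.counts[oldest] -= 1

    def reset(self):
        """Reset every counter to 0, as in the periodic initialization option."""
        self.window.clear()
        self.counts = [0] * len(self.counts)
```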

Next, an example of the process by which the level estimation unit 13 estimates the background noise level and the signal level is described.

As described above, the graph in FIG. 2 shows the histogram H that was held by the frequency counting unit 12 (counter unit 121) when a voice signal was actually input to the voice detection device 1. The actual distribution of signal levels (levels in voiced sections) and of background noise levels (levels in silent sections) in this histogram H was then examined. In the range of levels B1 to B2, a first peak formed mainly by the background noise level distribution was found. In the range of levels B3 to B4, which is higher than the range B1 to B2, a second peak formed mainly by the signal level (voiced-section level) distribution was found.

The histogram H in FIG. 2 thus contains a first peak (in the range of levels B1 to B2) formed mainly by the background noise level distribution and a second peak (in the range of levels B3 to B4, higher than the first peak) formed mainly by the signal level distribution. In other words, the histogram H in FIG. 2 is a bimodal histogram with two peaks.

Repeated experiments by the applicant have shown that the appearance of these two peaks in the histogram H held by the frequency counting unit 12 holds generally (it is reproducible).

In this embodiment, therefore, the level estimation unit 13 detects the first peak, formed mainly by the background noise level distribution, and the second peak, formed mainly by the signal level distribution, and the voice determination unit 14 performs voice determination based on the two detected peaks.

Next, an example of the specific procedure by which the level estimation unit 13 and the voice determination unit 14 perform voice determination based on the two peaks is described.

The curve given by H(v) contains fine irregularities, so the level estimation unit 13 smooths it to remove them. The method of smoothing H(v) is not limited; for example, a weighted average may be used.

Specifically, the level estimation unit 13 may smooth H(v) using equation (1) below, in which Hs(v) denotes the value of H(v) after smoothing. The level estimation unit 13 performs the smoothing by obtaining Hs(v) for every level that makes up the histogram H.
Hs(v) = {H(v-2) + 2H(v-1) + 3H(v) + 2H(v+1) + H(v+2)} / 9 ... (1)
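Equation (1) is a five-point weighted average with weights 1, 2, 3, 2, 1. A minimal sketch follows. Treating bins outside the histogram range as 0 is an assumption, since the patent does not describe the handling at the ends of the range.

```python
def smooth_histogram(h):
    """Apply equation (1): weights 1, 2, 3, 2, 1 over v-2..v+2, divided by 9."""
    weights = (1, 2, 3, 2, 1)
    n = len(h)
    hs = []
    for v in range(n):
        total = 0
        for offset, w in zip(range(-2, 3), weights):
            i = v + offset
            if 0 <= i < n:  # assumption: bins outside the range count as 0
                total += w * h[i]
        hs.append(total / 9)
    return hs
```

Because the weights sum to 9, a flat region of the histogram is left unchanged away from the edges, while an isolated spike is spread over five neighboring levels.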

FIG. 3 is a graph showing the smoothing of H(v) by the level estimation unit 13.

In FIG. 3, the dotted line shows the curve of H(v) before smoothing, and the solid line shows the curve after smoothing.

Next, to detect the two peaks from the smoothed Hs(v), the level estimation unit 13 quantifies the convexity of Hs(v). The specific method of quantifying the convexity of Hs(v) is not limited. In this embodiment, the level estimation unit 13 is described as quantifying the convexity of Hs(v) using a finite-difference approximation of the second derivative. Specifically, the level estimation unit 13 quantifies the convexity of Hs(v) using equation (2) below, in which C(v) denotes the convexity of Hs(v). The level estimation unit 13 quantifies the convexity by obtaining C(v) for each level that makes up the histogram H. A section in which C(v) is positive is one where the curve is convex upward (in the positive direction).

The level estimation unit 13 then treats each section in which C(v) is positive as one peak and searches for peaks over the entire range.

As described above, the histogram H normally has a bimodal distribution, with a first peak formed mainly by the distribution of the background noise level and a second peak formed mainly by the distribution of the signal level (a peak at a higher level than the first peak). The level estimation unit 13 can therefore normally detect two peaks based on the convexity of the histogram H (Hs(v)). Of the two detected peaks, the level estimation unit 13 regards the one at the lower level as the first peak, related to the background noise level, and the one at the higher level as the second peak, related to the signal level. When only one peak is found, the level estimation unit 13 may regard that peak as the peak related to the background noise level. When three or more peaks are detected, the level estimation unit 13 may select the two peaks with the widest sections (the widest sections in which C(v) is positive) and, of those two, regard the one at the lower level as the first peak, related to the background noise level, and the one at the higher level as the second peak, related to the signal level.

$$C(v) = H_s(v) - \frac{H_s(v-10) + H_s(v+10)}{2} \tag{2}$$
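
The convexity measure of equation (2) and the peak selection rules described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes that the histogram is a list indexed by level bin, that the offset of 10 in equation (2) is counted in bins, and that bins outside the histogram count as zero. The function names are hypothetical.

```python
def convexity(hs, offset=10):
    """C(v) = Hs(v) - (Hs(v - offset) + Hs(v + offset)) / 2 for every bin v.

    Assumption: bins outside the histogram are treated as 0.
    """
    n = len(hs)

    def at(i):
        return hs[i] if 0 <= i < n else 0.0

    return [hs[v] - (at(v - offset) + at(v + offset)) / 2 for v in range(n)]


def positive_sections(c):
    """Return inclusive (start, end) index pairs of maximal runs where C(v) > 0."""
    sections, start = [], None
    for v, value in enumerate(c):
        if value > 0 and start is None:
            start = v
        elif value <= 0 and start is not None:
            sections.append((start, v - 1))
            start = None
    if start is not None:
        sections.append((start, len(c) - 1))
    return sections


def pick_noise_and_signal(sections):
    """Return (PN, PS) following the selection rules in the text.

    No peak: (None, None). One peak: it is taken as PN and PS is None.
    Three or more: keep the two widest, then order them by level.
    """
    if not sections:
        return None, None
    if len(sections) == 1:
        return sections[0], None
    widest = sorted(sections, key=lambda s: s[1] - s[0], reverse=True)[:2]
    pn, ps = sorted(widest)  # lower-level section first
    return pn, ps
```

With a histogram that has a narrow bump at low levels and a broader bump at high levels, `positive_sections(convexity(hs))` yields one section per bump, and `pick_noise_and_signal` assigns the lower one to PN and the higher one to PS.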

FIG. 4 is a graph obtained by quantifying, based on equation (2) above, the convexity at each level of Hs(v) shown in FIG. 3. In FIG. 4, the graph (curve) of the quantified convexity of Hs(v) is shown by a solid line, and the graph of Hs(v) is shown by a dotted line.

In the graph shown in FIG. 4, there are two sections (peaks) in which C(v) is positive. Of these two peaks, the level estimation unit 13 therefore regards the one at the lower level as the first peak, related to the background noise level, and the one at the higher level as the second peak, related to the signal level. Hereinafter, the section of the first peak related to the background noise level (the section containing the first peak) is referred to as peak section PN, and the section of the second peak related to the signal level (the section containing the second peak) is referred to as peak section PS.

As shown in FIG. 4, the peak section PN related to the background noise level tends to be narrower than the peak section PS related to the signal level. In other words, the distribution forming the second peak (signal level) tends to have a larger variance than the distribution forming the first peak (background noise level).

Next, the level estimation unit 13 determines, for each of the peak sections PN and PS, a representative value within the section (the value to be used in the voice activity determination). The method by which the level estimation unit 13 determines the representative value of each peak section is not limited. In this embodiment, the level estimation unit 13 determines the representative value of each peak section using the centroid method. For example, the level estimation unit 13 may determine the representative value of peak section PN based on equation (3), in which LvN is the representative value of peak section PN (the estimated background noise level). Likewise, the level estimation unit 13 may determine the representative value of peak section PS based on equation (4), in which LvS is the representative value of peak section PS (the estimated signal level).

$$\mathrm{LvN} = \frac{\sum_{v \in PN} v\,H(v)}{\sum_{v \in PN} H(v)} \tag{3}$$

$$\mathrm{LvS} = \frac{\sum_{v \in PS} v\,H(v)}{\sum_{v \in PS} H(v)} \tag{4}$$
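
The centroid calculation of equations (3) and (4) can be sketched as follows. The sketch assumes that the histogram H is a list indexed by level v and that a peak section is given as an inclusive (start, end) pair; the function name and the handling of an empty section are assumptions made for illustration.

```python
def centroid(h, section):
    """Frequency-weighted mean level over an inclusive (start, end) section.

    Implements sum(v * H(v)) / sum(H(v)) for v in the section.
    Returns None when the section holds no counts (an assumption of this sketch).
    """
    start, end = section
    total = sum(h[v] for v in range(start, end + 1))
    if total == 0:
        return None
    return sum(v * h[v] for v in range(start, end + 1)) / total
```

Applying `centroid` to peak section PN gives LvN, and applying it to peak section PS gives LvS. Note that the equations use the unsmoothed histogram H(v), not Hs(v).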

Next, the process by which the voice activity determination unit 14 determines a threshold based on the estimated background noise level LvN and the estimated signal level LvS will be described.

The voice activity determination unit 14 determines whether the current processing frame belongs to a voiced section or a silent section, using the estimated background noise level LvN and the estimated signal level LvS. Here, the voice activity determination unit 14 uses LvN and LvS to obtain a threshold LvTh that is compared with the frame level Lv of the current processing frame. The threshold LvTh is obtained using equation (5), in which α is a coefficient set to any value between 0 and 1 (0 ≤ α ≤ 1). The coefficient α may be a fixed (static) value (for example, about 0.5), or it may be varied dynamically.

When only one peak is found in the histogram H, the voice activity determination unit 14 may update only the estimated background noise level LvN to a value based on the latest histogram H, continue to use the previously calculated value for the estimated signal level LvS, and obtain the threshold LvTh from these values.

$$\mathrm{LvTh} = \alpha\,\mathrm{LvN} + (1-\alpha)\,\mathrm{LvS} \tag{5}$$

In the example of this embodiment, the voice activity determination unit 14 compares the frame level Lv of the voice frame currently being processed (for example, the most recently input voice frame) with the threshold LvTh, and determines whether that voice frame belongs to a voiced section or a silent section. Specifically, the voice activity determination unit 14 determines that the voice frame belongs to a voiced section if Lv ≥ LvTh, and that it belongs to a silent section if Lv < LvTh.
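
The threshold of equation (5), the single-peak fallback, and the frame decision can be sketched together as follows. This is a minimal sketch: the function names are hypothetical, and the convention that a missing signal-level estimate is passed as `None` is an assumption of the example.

```python
def decision_threshold(lv_n, lv_s, alpha=0.5):
    """LvTh = alpha * LvN + (1 - alpha) * LvS, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    return alpha * lv_n + (1.0 - alpha) * lv_s


def update_estimates(previous, lv_n, lv_s):
    """Return the new (LvN, LvS) pair.

    When only one peak was found, lv_s is None and the previous LvS is kept.
    """
    _, prev_s = previous
    return lv_n, (prev_s if lv_s is None else lv_s)


def is_voiced(lv, lv_th):
    """A frame is voiced when Lv >= LvTh, silent otherwise."""
    return lv >= lv_th
```

For example, with LvN = -60, LvS = -20 and α = 0.5, the threshold lies midway at -40. Values of α closer to 1 move the threshold toward the background noise level, and values closer to 0 move it toward the signal level.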

(A-3) Effects of the First Embodiment
According to the first embodiment, the following effects can be achieved.

The voice activity detection device 1 of the first embodiment detects, from the histogram H, the section of the first peak, formed mainly by the distribution of the background noise level, and the section of the second peak, formed mainly by the distribution of the signal level, and estimates the background noise level and the signal level. It then sets the threshold used for the voice activity determination using both the background noise level and the signal level. In the conventional technique, only the background noise level is estimated, so an appropriate threshold sometimes cannot be set when the S/N ratio is poor. In contrast, the voice activity detection device 1 of the first embodiment estimates both the background noise level and the signal level from the histogram H and sets the threshold from both, so it can set a more appropriate threshold than before even when the S/N ratio is poor. That is, the voice activity detection device 1 of the first embodiment can detect voice activity more stably than before.

(B) Second Embodiment
Hereinafter, a second embodiment of the voice processing device, program and method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the voice processing device, program and method of the present invention are applied to a speech onset detection device.

FIG. 5 is an explanatory diagram showing the functional configuration of the speech onset detection device 100 of the present invention.

The speech onset detection device 100 is a device that detects the speech onset (the point at which speech starts) in an input voice signal (a device serving as speech onset detection means). The speech onset detection device 100 may be realized as software, for example by installing a program (including the voice processing program according to the embodiment) on a computer having a processor and a memory. It may also be incorporated into a device that performs voice processing, such as a telephone terminal.

The speech onset detection device 100 has a voice activity detection unit 101 and a speech onset detection unit 102.

The voice activity detection unit 101 performs voice activity detection on the input voice signal. In this embodiment, the voice activity detection device 1 of the first embodiment is applied as the voice activity detection unit 101. Based on the input voice signal, the voice activity detection unit 101 outputs a voiced determination or a silence determination at every predetermined period.

The speech onset detection unit 102 detects the speech onset of the input voice signal based on the detection result of the voice activity detection unit 101. At the timing when the determination result of the voice activity detection unit 101 changes from a silence determination to a voiced determination, the speech onset detection unit 102 outputs a detection signal indicating that a speech onset has been detected. At all other timings, it outputs a non-detection signal indicating that no speech onset has been detected.
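
The transition rule used by the speech onset detection unit 102 can be sketched as follows. The sketch assumes that the per-period determinations arrive as a sequence of booleans (True for a voiced determination) and that the state before the first determination is silence. Both are assumptions of the example, as is the function name.

```python
def detect_onsets(decisions, initially_voiced=False):
    """Return one flag per determination: True only on a silence -> voiced transition."""
    flags = []
    previous = initially_voiced
    for voiced in decisions:
        flags.append(voiced and not previous)
        previous = voiced
    return flags
```

Each True in the result corresponds to the detection signal, and each False corresponds to the non-detection signal.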

(C) Third Embodiment
Hereinafter, a third embodiment of the voice processing device, program and method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the voice processing device, program and method of the present invention are applied to a background noise reduction device (noise suppressor).

FIG. 6 is an explanatory diagram showing the functional configuration of the background noise reduction device 200 of the present invention.

The background noise reduction device 200 is a device that reduces background noise by lowering the level of the input voice signal in silent sections before outputting it (a device serving as background noise reduction means). The background noise reduction device 200 may be realized as software, for example by installing a program (including the voice processing program according to the embodiment) on a computer having a processor and a memory. It may also be incorporated into a device that performs voice processing, such as a telephone terminal.

The background noise reduction device 200 has a voice activity detection unit 201, a voice frame buffer 202, a voiced/silence determination buffer 203, a determination rewriting unit 204, and a gain superimposing unit 205.

The voice activity detection unit 201 performs voice activity detection on the input voice signal. In this embodiment, the voice activity detection device 1 of the first embodiment is applied as the voice activity detection unit 201. Based on the input voice signal, the voice activity detection unit 201 outputs a voiced determination or a silence determination at every predetermined period.

The voice frame buffer 202 buffers a fixed duration of frames of the input voice signal.

The voiced/silence determination buffer 203 buffers a fixed duration of determination results from the voice activity detection unit 201.

The determination rewriting unit 204 refers to the voiced/silence determination results buffered in the voiced/silence determination buffer 203. When it detects a change from a silence determination to a voiced determination, it goes back a fixed time into the past through the determination results accumulated in the voiced/silence determination buffer 203 and rewrites the silence determinations in that span as voiced determinations. The determination rewriting unit 204 is provided to prevent the beginning of speech from being clipped in the voice signal output by the background noise reduction device 200.

The gain superimposing unit 205 acquires a voice frame (for example, the oldest voice frame) from the voice frame buffer 202 and outputs it. When outputting a frame, the gain superimposing unit 205 refers to the voiced/silence determination result corresponding to that frame (the determination result accumulated in the voiced/silence determination buffer 203). If the frame belongs to a silent section, the gain superimposing unit 205 lowers the level of the frame (adjusts the gain) before outputting it. Frames belonging to a voiced section are output as they are.
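
The interaction of the two buffers, the determination rewriting, and the gain applied to silent frames can be sketched as follows. This is an illustrative sketch under stated assumptions: frames are lists of samples, the buffering delay and the look-back span are counted in frames, and the attenuation factor for silent frames is a hypothetical parameter. None of these values is specified in the text.

```python
from collections import deque


class NoiseSuppressor:
    """Delay frames, rewrite recent silence determinations at a speech onset,
    and attenuate frames that are still marked silent when they are output."""

    def __init__(self, delay_frames=3, lookback_frames=2, silence_gain=0.1):
        self.delay_frames = delay_frames        # frames held before output (assumed)
        self.lookback_frames = lookback_frames  # determinations rewritten at an onset (assumed)
        self.silence_gain = silence_gain        # attenuation for silent frames (assumed)
        self.frames = deque()                   # plays the role of voice frame buffer 202
        self.decisions = deque()                # plays the role of determination buffer 203

    def push(self, frame, voiced):
        """Add one frame with its determination; return an output frame or None."""
        # Determination rewriting: on a silence -> voiced change, mark the most
        # recent buffered determinations as voiced to avoid clipping the onset.
        if voiced and self.decisions and not self.decisions[-1]:
            for i in range(1, min(self.lookback_frames, len(self.decisions)) + 1):
                self.decisions[-i] = True
        self.frames.append(list(frame))
        self.decisions.append(voiced)
        if len(self.frames) <= self.delay_frames:
            return None  # still filling the buffers
        oldest = self.frames.popleft()
        was_voiced = self.decisions.popleft()
        if was_voiced:
            return oldest
        return [sample * self.silence_gain for sample in oldest]
```

The rewriting can only affect frames that are still buffered, so in this sketch the look-back span is useful only when it does not exceed the buffering delay.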

(D) Fourth Embodiment
Hereinafter, a fourth embodiment of the voice processing device, program and method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the voice processing device, program and method of the present invention are applied to an adaptive gain control device (AGC).

FIG. 7 is an explanatory diagram showing the functional configuration of the adaptive gain control device 300 of the fourth embodiment.

The adaptive gain control device 300 is a device that adjusts the input voice signal to a desired level (a preset constant level) and outputs it (a device serving as gain control means). The adaptive gain control device 300 may be realized as software, for example by installing a program (including the voice processing program according to the embodiment) on a computer having a processor and a memory. It may also be incorporated into a device that performs voice processing, such as a telephone terminal.

The adaptive gain control device 300 has a voice activity detection unit 301, a level calculation unit 302, a gain determination unit 303, and a gain superimposing unit 304.

The voice activity detection unit 301 performs voice activity detection on the input voice signal. In this embodiment, the voice activity detection device 1 of the first embodiment is applied as the voice activity detection unit 301. Based on the input voice signal, the voice activity detection unit 301 outputs a voiced determination or a silence determination at every predetermined period.

The level calculation unit 302 calculates the level of the input signal.

The gain determination unit 303 determines the gain to be applied based on the level of the input signal calculated by the level calculation unit 302. In doing so, the gain determination unit 303 also takes into account the detection result (the voiced/silence determination) of the voice activity detection unit 301 when determining the gain to be applied to the input signal (the gain that brings the input signal to the desired level). For example, in a silent section (a section for which the voice activity detection unit 301 outputs a silence determination), the gain determination unit 303 determines a gain that does not amplify the background noise.

The gain superimposing unit 304 applies the gain determined by the gain determination unit 303 to the input signal and outputs the result. The level of the voice signal output by the gain superimposing unit 304 is the preset desired level.
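
One way the gain determination could take the voiced/silence determination into account is sketched below. The text does not specify the gain rule, so everything here is an assumption made for illustration: levels are expressed in dB, the gain in voiced frames is the difference between a target level and the measured level (clamped to a maximum), and silent frames receive unity gain so that background noise is not amplified.

```python
def decide_gain_db(level_db, voiced, target_db=-20.0, max_gain_db=30.0):
    """Return the gain in dB for one frame (hypothetical rule).

    Voiced frames: move the level toward target_db, clamped to +/- max_gain_db.
    Silent frames: 0 dB, so that background noise is not amplified.
    """
    if not voiced:
        return 0.0
    return max(-max_gain_db, min(max_gain_db, target_db - level_db))


def apply_gain(frame, gain_db):
    """Scale every sample of the frame by the linear factor for gain_db."""
    factor = 10.0 ** (gain_db / 20.0)
    return [sample * factor for sample in frame]
```

A practical controller would normally also smooth the gain over time to avoid audible jumps. That is omitted here to keep the dependence on the voiced/silence determination visible.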

(E) Fifth Embodiment
Hereinafter, a fifth embodiment of the voice processing device, program and method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the voice processing device, program and method of the present invention are applied to a voice processing device having a jitter buffer.

The voice processing device 400 may be realized as software, for example by installing a program (including the voice processing program according to the embodiment) on a computer having a processor and a memory. It may also be incorporated into a device that performs voice processing, such as a telephone terminal.

FIG. 8 is an explanatory diagram showing the functional configuration of the voice processing device 400 of the fifth embodiment.

The voice processing device 400 has a voice activity detection unit 401, a jitter buffer 402, and PCM decoding means 403.

The jitter buffer 402 holds (buffers) voice packets (packets carrying voice frames) arriving via the IP network N in order to absorb jitter (fluctuation in arrival timing), and outputs the held voice packets at fixed intervals. The jitter buffer 402 has a storage buffer 402a that stores (holds) the voice packets, and jitter buffer control means 402b that controls the handling (for example, discarding) of the voice packets in the storage buffer 402a.

In this embodiment, voice packets in RTP (Real-time Transport Protocol) format are assumed to be input to the jitter buffer 402.

The jitter buffer 402 (storage buffer 402a) reads the sequence number in each voice packet (RTP packet) arriving from the IP network N and stores the voice packets in ascending order of sequence number. The jitter buffer 402 (storage buffer 402a) also outputs the stored voice packets in ascending order of sequence number.

When the amount (number) of voice packets stored in the storage buffer 402a is equal to or greater than a threshold, the jitter buffer control means 402b discards some of the voice packets to reduce the amount stored.

The PCM decoding means 403 decodes the payload (encoded voice data) of the voice packets supplied from the storage buffer 402a. The PCM decoding means 403 decodes the payload of each voice packet according to a predetermined codec such as ITU-T G.711 and obtains the decoded voice data (for example, a frame of voice data in PCM format).

The voice activity detection unit 401 performs voice activity detection on the input voice signal (the frames of voice data output from the PCM decoding means 403). In this embodiment, the voice activity detection device 1 of the first embodiment is applied as the voice activity detection unit 401. Based on the input voice signal, the voice activity detection unit 401 outputs a voiced determination or a silence determination at every predetermined period.

The jitter buffer control means 402b takes the determination result of the voice activity detection unit 401 into account when deciding whether to discard voice packets in the storage buffer 402a (that is, when deciding the timing at which voice packets are discarded). For example, the jitter buffer control means 402b may be allowed to decide to discard voice packets in the storage buffer 402a only while the voice activity detection unit 401 is outputting a silence determination. This allows the jitter buffer 402 to limit the effect that discarding voice packets has on the voice (for example, degradation of the decoded voice on the decoding side).
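
The discard policy described above can be sketched as follows. This is a simplified illustration: packets are (sequence number, payload) pairs, sequence-number wraparound is ignored, and the rule of discarding exactly one packet per output interval is an assumption of the sketch, as are the class and parameter names.

```python
import heapq


class JitterBuffer:
    """Keep packets ordered by sequence number and thin the backlog only
    while the voice activity detector reports silence."""

    def __init__(self, discard_threshold=4):
        self.discard_threshold = discard_threshold  # backlog size that permits discarding (assumed)
        self._heap = []                             # plays the role of storage buffer 402a

    def push(self, seq, payload):
        """Store an arriving packet; the heap keeps the smallest sequence number first."""
        heapq.heappush(self._heap, (seq, payload))

    def pop(self, silent):
        """Called once per output interval; returns the next packet or None.

        If the backlog has reached the threshold and the detector currently
        reports silence, one packet is discarded before the next is output.
        """
        if silent and len(self._heap) >= self.discard_threshold:
            heapq.heappop(self._heap)  # discarded, never played out
        if not self._heap:
            return None
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)
```

While the detector reports a voiced section, `pop` never discards, so the backlog is reduced only during silence, where a dropped packet is least audible.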

(F) Sixth Embodiment
Hereinafter, a sixth embodiment of the voice processing device, program and method, and of the exchange device, according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the voice processing device, program and method of the present invention are applied to an exchange device.

FIG. 9 is an explanatory diagram showing the functional configuration of the exchange device 500 of the sixth embodiment.

The exchange device 500 is a device that connects to a plurality of IP telephone terminals 600 via the IP network N and performs call control processing, media communication processing (processing of media data such as voice data), and the like between the IP telephone terminals 600 (a device serving as a so-called IP-PBX).

The exchange device 500 may be realized as software, for example by installing a program (including the voice processing program according to the embodiment) on a computer having a processor and a memory.

In the example of this embodiment, the exchange device 500 has a call control unit 501 and a media processing unit 502. In other words, the exchange device 500 has the call control unit 501 and the media processing unit 502 as the components that realize the exchange processing means. The exchange method used in the exchange device 500 (the call control processing method, the media data processing method, and so on) is not limited.

The call control unit 501 performs call control processing between the IP telephone terminals 600, in accordance with a call control protocol such as SIP (Session Initiation Protocol).

The media processing unit 502 performs media communication processing (processing of media data such as voice data) with the IP telephone terminals 600. For example, the media processing unit 502 receives voice data (voice packets) from one IP telephone terminal 600, processes it, and transmits it to another IP telephone terminal 600. The media processing unit 502 has an adaptive gain control unit 503. The adaptive gain control unit 503 adjusts, to a desired level (a preset constant level), the voice signal (voice data) based on voice packets received from an IP telephone terminal 600 or the voice signal (voice data) to be inserted into voice packets transmitted to an IP telephone terminal 600. In this embodiment, the adaptive gain control device 300 of the fourth embodiment is applied as the adaptive gain control unit 503. That is, the exchange device 500 processes the voice signals it transmits or receives using the adaptive gain control device 300 of the fourth embodiment.

(G) Other Embodiments
The present invention is not limited to the embodiments described above; modified embodiments such as the following examples are also possible.

(G-1) The voice processing device of the present invention (the voice activity detection device of the first embodiment) can also be applied to voice processing devices other than those exemplified in the embodiments above (for example, telephone terminals, conference terminals, and voice recording devices).

(G-2) In the voice activity determination unit 14 of the first embodiment, when the frame level Lv is close to the threshold LvTh, the voiced determination and the silence determination may alternate within a short time (so-called flapping). To prevent this, the voice activity determination unit 14 may, once a frame has been determined to belong to a voiced section, always determine the frames in a fixed time after it to belong to a voiced section as well (a so-called hangover function). A time of about 500 msec, for example, may be used as this fixed time.
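
The hangover function described in (G-2) can be sketched as follows. The sketch assumes that the hold time is expressed as a number of frames (for example, 50 frames if a frame were 10 msec long, which is an assumed frame length) and that every raw voiced determination restarts the hold. The text specifies neither detail, and the function name is hypothetical.

```python
def apply_hangover(decisions, hangover_frames):
    """Force the frames that follow a voiced frame to be voiced for a fixed span.

    decisions: raw per-frame determinations (True = voiced).
    hangover_frames: number of frames held as voiced after each voiced frame.
    """
    smoothed = []
    remaining = 0
    for voiced in decisions:
        if voiced:
            remaining = hangover_frames  # restart the hold on every voiced frame
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1               # silent frame held as voiced
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed
```

Short silent gaps between voiced frames are thereby bridged, which suppresses flapping at the cost of a delayed transition to silence.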

DESCRIPTION OF REFERENCE SIGNS: 1: voice activity detection device (voice processing device); 10: HPF; 11: level calculation unit; 12: frequency counting unit; 12…1: counter unit; 13: level estimation unit; 14: voice activity determination unit.

Claims

A voice processing device comprising:
level value calculation means for calculating a level value of an input voice signal for each frame of a predetermined time unit;
frequency counting means for counting the appearance frequency of each level value calculated by the level value calculation means;
level value estimation means for estimating a background noise level value and a target sound signal level value from the appearance frequency of each level value; and
determination means for performing a determination process that determines a voiced section or a silent section of the input voice signal based on the estimated background noise level value and the estimated target sound signal level value.

The voice processing device according to claim 1, wherein the determination means calculates a threshold based on the estimated background noise level value and the estimated target sound signal level value, and performs the determination process using the calculated threshold.

The voice processing device according to claim 1, wherein the level value estimation means detects two peak sections in the appearance frequency of each level value, estimates the background noise level value based on the appearance frequency in a first section, which is the lower-level one of the two sections, and estimates the target sound signal level value based on the appearance frequency in a second section, which is the higher-level one of the two sections.

The speech processing apparatus according to any one of claims 1 to 3, further comprising speech head detection means for detecting a speech head in the input speech signal using the determination result of the determination means.

The speech processing apparatus according to any one of claims 1 to 3, further comprising background noise reduction means for reducing background noise from the input speech signal using the determination result of the determination means.

The speech processing apparatus according to any one of claims 1 to 3, further comprising gain control means for adjusting the level of the input speech signal to a desired level in consideration of the determination result of the determination means.

The speech processing apparatus according to any one of claims 1 to 3, wherein the input speech signal arrives from a network in frame units, the apparatus further comprising:
a jitter buffer that holds frames arriving from the network and outputs the frames at predetermined intervals; and
jitter buffer control means for controlling the jitter buffer, the jitter buffer control means discarding a frame held in the jitter buffer at a timing that takes into consideration the determination result of the determination means.

An exchange apparatus that performs exchange processing of voice communication between a plurality of terminals, comprising exchange processing means for adjusting the level of a voice signal to be transmitted to a terminal or a voice signal received from a terminal to a desired level,
wherein the exchange processing means adjusts the level of the voice signal to be transmitted to the terminal or the voice signal received from the terminal to the desired level using the speech processing apparatus according to claim 6.

A speech processing program causing a computer to function as:
level value calculation means for calculating a level value of an input speech signal for each frame of a predetermined time unit;
frequency counting means for counting the appearance frequency of each level value calculated by the level value calculation means;
level value estimation means for estimating a background noise level value and a target sound signal level value from the appearance frequency of each level value; and
determination means for performing determination processing that determines a voiced section or a silent section of the input speech signal based on the estimated background noise level value and the estimated target sound signal level value.

A speech processing method performed by a speech processing apparatus
comprising level value calculation means, frequency counting means, level value estimation means, and determination means, wherein:
the level value calculation means calculates a level value of an input speech signal for each frame of a predetermined time unit;
the frequency counting means counts the appearance frequency of each level value calculated by the level value calculation means;
the level value estimation means estimates a background noise level value and a target sound signal level value from the appearance frequency of each level value; and
the determination means performs determination processing that determines a voiced section or a silent section of the input speech signal based on the estimated background noise level value and the estimated target sound signal level value.
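
The four steps recited in the claims (level calculation, frequency counting, level estimation, determination) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names, the use of mean power in dB as the level value, the 1 dB histogram bins, the minimum peak separation `min_gap`, and the midpoint threshold are all hypothetical choices that the claims do not specify.

```python
import math

def frame_level_db(samples):
    """Level value of one frame: mean power in dB (assumed definition)."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(power, 1e-10))  # floor avoids log10(0)

def count_levels(levels, lo=-100, hi=0):
    """Appearance frequency of each level value, in assumed 1 dB bins."""
    hist = [0] * (hi - lo + 1)
    for lv in levels:
        idx = min(max(int(round(lv)), lo), hi) - lo  # clamp into [lo, hi]
        hist[idx] += 1
    return hist

def estimate_levels(hist, lo=-100, min_gap=10):
    """Estimate (background noise level, target sound level) from two peaks.

    Assumed heuristic: take the most frequent bin, then the most frequent
    bin at least min_gap dB away from it; the lower of the two is treated
    as background noise and the higher as the target sound.
    """
    first = max(range(len(hist)), key=hist.__getitem__)
    far = [i for i in range(len(hist)) if abs(i - first) >= min_gap]
    second = max(far, key=hist.__getitem__)
    noise_idx, target_idx = sorted((first, second))
    return noise_idx + lo, target_idx + lo

def is_voiced(level, noise_lv, target_lv):
    """Voiced if the level exceeds the midpoint threshold (assumed rule)."""
    threshold = (noise_lv + target_lv) / 2.0
    return level > threshold
```

Because both estimates are recomputed from the observed appearance frequencies, the threshold in this sketch follows changes in the background noise and target sound levels rather than being fixed in advance.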