JP5457293B2

JP5457293B2 - Voice recognition device

Info

Publication number: JP5457293B2
Application number: JP2010159600A
Authority: JP
Inventors: 大和鈴木; 望齊藤; 徹丸本
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2010-07-14
Filing date: 2010-07-14
Publication date: 2014-04-02
Anticipated expiration: 2030-07-14
Also published as: JP2012022127A

Description

The present invention relates to a technique for controlling the input gain of a voice signal in a voice recognition device.

Known techniques for controlling the input gain of a voice signal in a voice recognition device include adjusting the input gain according to the success rate of past voice recognition (Patent Document 1), and setting the input gain according to the voice signal level in the time sections that were subject to recognition in past voice recognition (Patent Document 2).

Patent Document 1: Japanese Patent No. 2975808
Patent Document 2: Japanese Patent No. 3594356

For example, when a voice recognition device is used where the surrounding acoustic environment changes from moment to moment, such as inside an automobile, the level of the noise contained in the voice signal also changes from moment to moment.
Consequently, adjusting the input gain based only on conditions at the time of past voice recognition, such as the past recognition success rate or the voice signal level in past utterance sections, as described above, does not necessarily yield an input gain suited to the current noise conditions.
An object of the present invention is therefore to provide a voice recognition device that can set, as the input gain of the voice signal, an input gain better suited to the current noise conditions.

To achieve the above object, the present invention provides a voice recognition device comprising: a microphone; an input amplifier that amplifies the input voice signal output from the microphone; an AD converter that converts the signal amplified by the input amplifier into input voice data; a voice recognition engine that, in response to a voice recognition execution instruction, performs voice recognition processing on the input voice data output by the AD converter; a noise level detection unit; an uttered voice level detection unit; and an input gain control unit that controls the gain of the input amplifier. In the voice recognition processing, the voice recognition engine detects, as an utterance section, a time section in which the input voice data contains the user's uttered voice, and identifies the content of the uttered voice contained in the input voice data of the detected utterance section. The noise level detection unit repeatedly calculates, based on the input voice data, the level of the noise contained in the input voice signal in time sections other than the utterance section or in time sections in which the voice recognition processing is not being performed. The uttered voice level detection unit calculates, based on the input voice data, the average level of the uttered voice contained in the input voice signal over the utterance sections detected in the respective runs of the voice recognition processing. At the start of each run of the voice recognition processing, the input gain control unit estimates the level of the input voice signal in the utterance section to be detected in that run from the noise level last calculated by the noise level detection unit and the average uttered voice level detected by the uttered voice level detection unit, and sets the gain of the input amplifier so that the estimated level, once amplified by the input amplifier, becomes a level suited to the voice recognition engine.

In such a voice recognition device, the noise level detection unit repeatedly calculates the noise level. At the start of voice recognition processing, the level of the input voice signal is estimated from the most recently detected noise level, that is, the noise level at the latest point in time, and from the average level of the uttered voice in previous runs of the voice recognition processing. The gain of the input amplifier is then set so that the estimated level, once amplified by the input amplifier, becomes a level suited to the voice recognition engine.

The noise level at the latest point in time can be expected to approximate the noise level under the current noise conditions. Such a voice recognition device can therefore set a gain better suited to the current noise conditions in the input amplifier at the start of voice recognition processing.

In the above voice recognition device, it is also preferable, in order to ensure proper calculation of the noise level in time sections in which the voice recognition processing is not being performed, that the input gain control unit set the gain of the input amplifier during such time sections to a predetermined value chosen so that the maximum level the input voice signal can take, once amplified by the input amplifier, does not exceed the input range of the AD converter.

When the above voice recognition device is used together with an audio device that outputs audio sound represented by audio data, the voice recognition device may be provided with an output suppression unit that suppresses the audio device's audio output while the voice recognition processing is being performed, and the noise level detection unit may calculate the level of the noise contained in the input voice signal, in time sections in which the voice recognition processing is not being performed, based on the input voice data and the audio data.

More specifically, in the above voice recognition device, the noise level detection unit may calculate the amplitude distribution of the noise as the noise level, the uttered voice level detection unit may calculate the average amplitude distribution of the uttered voice as the average level of the uttered voice, and the input gain control unit may estimate the amplitude distribution of the input voice signal as the level of the input voice signal.

In this case, when the dynamic range of the amplitude range indicated by the estimated amplitude distribution of the input voice signal is no greater than the dynamic range of the input range of the voice recognition engine, the input gain control unit preferably sets the gain of the input amplifier so that the amplitude value at the center of that amplitude range, once amplified by the input amplifier, equals the amplitude value at the center of the input range of the voice recognition engine.

Also in this case, when the dynamic range of the amplitude range in the estimated amplitude distribution of the input voice signal exceeds the dynamic range of the input range of the voice recognition engine, the input gain control unit preferably selects, from that amplitude range, the sub-range that has the same dynamic range as the input range of the voice recognition engine and the largest total frequency within it, and sets the gain of the input amplifier so that the selected sub-range, once amplified by the input amplifier, coincides with the input range of the voice recognition engine.

As described above, the present invention can provide a voice recognition device that can set, as the input gain of the voice signal, an input gain better suited to the current noise conditions.

FIG. 1 is a block diagram showing the configuration of a voice recognition system according to an embodiment of the present invention.
FIG. 2 is a timing chart showing the operation of the voice recognition system according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the input gain control process according to the embodiment of the present invention.
FIG. 4 is a diagram showing processing examples of the input gain control process according to the embodiment of the present invention.

An embodiment of the present invention is described below.
FIG. 1 shows the configuration of the voice recognition system according to this embodiment.
As shown, the voice recognition system comprises: a DA converter 1 that converts audio data output from an audio device (not shown) into an analog audio signal; an output amplifier 2 that amplifies the audio signal with an output gain SpG; a speaker 3 that outputs the sound represented by the audio signal from the output amplifier 2 as speaker output sound; a microphone 4; an input amplifier 5 that amplifies, with an input gain G, the input voice signal representing the sound picked up by the microphone 4; an AD converter 6 that converts the input voice signal amplified by the input amplifier 5 into digital input voice data; a voice recognition engine 7 that performs voice recognition processing on the input voice data converted by the AD converter 6; a talk switch 8; an output gain control unit 9 that controls the output gain SpG of the output amplifier 2; and an input gain control unit 10 that controls the gain G of the input amplifier 5.

In this configuration, the voice recognition engine 7 starts voice recognition processing when the user presses the talk switch 8. In the voice recognition processing, it detects the utterance section, that is, the section of the input voice data that contains the user's uttered voice, and performs voice recognition (identification of what the user said) on the input voice data within the utterance section. The voice recognition engine 7 also outputs a recognition-in-progress signal Ron, which is on from the moment the user presses the talk switch 8 until the end of the utterance section, and, after the voice recognition processing ends, outputs utterance section data Son representing the time position of the utterance section detected during that processing.

The output gain control unit 9 controls the output gain SpG of the output amplifier 2 according to the volume signal Vol output from the audio device while the recognition-in-progress signal Ron is off, and sets the output gain SpG to 0 while Ron is on, thereby suppressing the speaker output sound.

The input voice signal output by the microphone 4 contains, as components, the speaker output sound a output from the speaker 3, noise b, and the user's uttered voice s.
The input gain control unit 10 comprises: a first noise amplitude distribution detection unit 11 that calculates the amplitude distribution gb(n) of the noise b; a second noise amplitude distribution detection unit 12 that also calculates the amplitude distribution gb(n) of the noise b; a noise amplitude distribution register 13 that stores the latest amplitude distribution gb(n) of the noise b; a voice amplitude distribution detection unit 14 that detects the average amplitude distribution f(n) of the uttered voice s; a voice amplitude distribution register 15 that stores the average amplitude distribution f(n) of the uttered voice s; a convolution calculator 16; and a gain control unit 17. The n in an amplitude distribution Z(n) indicates that Z(n) represents the amplitude distribution with the amplitude values (dB) discretized into n amplitude classes.
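As a rough illustration of an amplitude distribution over discretized dB classes, the following Python sketch builds a normalized histogram from a block of samples. The function name `amplitude_distribution`, the -60 dB to 0 dB span, and the equal-width classes are assumptions made for illustration only; the text does not specify them, and the sketch makes no attempt to separate the noise b from the speaker output sound a.

```python
import math

def amplitude_distribution(samples, n_classes, min_db=-60.0, max_db=0.0):
    """Normalized histogram of sample amplitudes (dB re full scale = 1.0).

    Hypothetical helper: the class layout (equal-width dB classes between
    min_db and max_db) is an assumption, not taken from the patent text.
    """
    width = (max_db - min_db) / n_classes
    counts = [0] * n_classes
    for x in samples:
        mag = abs(x)
        # Silence has no finite dB value, so it is placed in the lowest class.
        level_db = 20.0 * math.log10(mag) if mag > 0.0 else min_db
        idx = int((level_db - min_db) // width)
        idx = max(0, min(n_classes - 1, idx))  # clamp out-of-range levels
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_classes
```

Each entry of the returned list is the relative frequency of one amplitude class, so the entries sum to 1 for any non-empty input.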

The timing at which the first noise amplitude distribution detection unit 11, the second noise amplitude distribution detection unit 12, and the voice amplitude distribution detection unit 14 calculate their amplitude distributions is described with reference to FIG. 2.
In FIG. 2, the input voice signal output by the microphone 4 is denoted x, and the utterance section represented by the utterance section data Son output by the voice recognition engine 7 is denoted SonD.
As shown, in the time section before the talk switch 8 is pressed, the input voice signal x output by the microphone 4 contains the speaker output sound a and the noise b as components.
The first noise amplitude distribution detection unit 11 takes as its calculation period the period in which the recognition-in-progress signal Ron is off, which is the time section in which x contains the speaker output sound a and the noise b, and calculates the amplitude distribution gb(n) of the noise b during that period. The calculation method used by the first noise amplitude distribution detection unit 11 is described in detail later.

Next, when the talk switch 8 is pressed and the recognition-in-progress signal Ron turns on, the speaker output sound is suppressed. Therefore, from the time Ron turns on until it turns off, outside the utterance section SonD represented by the utterance section data Son, the input voice signal x output by the microphone 4 contains only the noise b as a component.

The second noise amplitude distribution detection unit 12 takes as its calculation period the part of the period in which Ron is on that lies outside the utterance section SonD represented by the utterance section data Son, which is the time section in which x contains only the noise b, and calculates the amplitude distribution gb(n) of the noise b during that period. The calculation method used by the second noise amplitude distribution detection unit 12 is described in detail later.

Since the utterance section SonD is the time section in which the user is speaking, the input voice signal x output by the microphone 4 contains the noise b and the uttered voice s as components in that section.
The voice amplitude distribution detection unit 14 takes the period in which the recognition-in-progress signal Ron is on as its calculation period. It calculates the average amplitude distribution f(n) of the uttered voice s using two portions of x: the portion within the calculation period but outside the utterance section SonD, in which x contains only the noise b, and the portion within the utterance section SonD, in which x contains both the noise b and the uttered voice s. The calculation method used by the voice amplitude distribution detection unit 14 is described in detail later.
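The timing rules described with reference to FIG. 2 can be summarized in a small Python sketch. The function names and the use of the reference numerals 11, 12, and 14 as return values are illustrative assumptions; the sketch only restates which components are present in x and which detection units are in their calculation period for a given state of Ron and SonD.

```python
def signal_components(ron: bool, in_utterance: bool) -> set:
    """Components of the microphone signal x ('a' speaker, 'b' noise, 's' voice)."""
    if not ron:
        return {"a", "b"}      # speaker output is not suppressed while Ron is off
    if in_utterance:
        return {"b", "s"}      # inside SonD: noise plus uttered voice
    return {"b"}               # Ron on, outside SonD: noise only

def active_detectors(ron: bool, in_utterance: bool) -> set:
    """Reference numerals of the detection units in their calculation period."""
    if not ron:
        return {11}            # first noise amplitude distribution detection unit
    if in_utterance:
        return {14}            # voice amplitude distribution detection unit only
    return {12, 14}            # second noise unit, plus unit 14 (whole Ron-on period)
```

The `in_utterance` flag is ignored while `ron` is false, since an utterance section is only defined within a run of voice recognition processing.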

During the calculation periods described above, the first noise amplitude distribution detection unit 11 and the second noise amplitude distribution detection unit 12 repeatedly calculate the amplitude distribution gb(n) from the input voice signal x of a fixed unit time section, and each time gb(n) is calculated they update the contents of the noise amplitude distribution register 13 with it. If a calculation period is shorter than the unit time section, gb(n) is not calculated during that period and the noise amplitude distribution register 13 is not updated.

Therefore, the amplitude distribution gb(n) of the noise b stored in the noise amplitude distribution register 13 is always the most recently calculated of the amplitude distributions gb(n) calculated by the first noise amplitude distribution detection unit 11 and the second noise amplitude distribution detection unit 12.
The voice amplitude distribution detection unit 14 calculates the average amplitude distribution f(n) of the uttered voice s each time voice recognition processing is performed, and each time it does so it updates the contents of the voice amplitude distribution register 15 with the calculated f(n). Therefore, at the start of a run of voice recognition processing, the voice amplitude distribution register 15 holds the average amplitude distribution f(n) of the uttered voice s calculated during the previous run.
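The register update rule for the noise distribution (one update per complete unit time section, none if the calculation period is too short) might be sketched in Python as follows. The function `update_noise_register` and its parameters are hypothetical; `compute_distribution` stands in for whatever calculation the detection units actually perform.

```python
def update_noise_register(register, samples, unit_len, compute_distribution):
    """Return the register contents after processing one calculation period.

    The samples are consumed in consecutive blocks of unit_len samples.
    Each complete block overwrites the register; a trailing partial block
    is ignored, so a period shorter than unit_len leaves the register as is.
    """
    for start in range(0, len(samples) - unit_len + 1, unit_len):
        register = compute_distribution(samples[start:start + unit_len])
    return register
```

Because each complete block overwrites the previous value, the register always holds the distribution of the latest complete unit time section, which matches the "most recently calculated" behavior described above.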

Next, the convolution calculator 16 convolves the amplitude distribution gb(n) of the noise b stored in the noise amplitude distribution register 13 with the average amplitude distribution f(n) of the uttered voice s stored in the voice amplitude distribution register 15 according to Equation 1, and thereby calculates the amplitude distribution h(n) of the input voice signal from the microphone 4. In Equation 1, Smax is the class number of the maximum value of the uttered voice s, and Bmax is the class number of the maximum value of the noise b.
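Equation 1 itself is not reproduced in this text. Assuming it is an ordinary discrete convolution over the class index, the operation of the convolution calculator 16 could be sketched in Python as below. The function name `convolve_distributions` is hypothetical, and the list lengths play the role of Smax + 1 and Bmax + 1 under that assumption.

```python
def convolve_distributions(f, gb):
    """Discrete convolution h(n) = sum over k of f(k) * gb(n - k).

    f  : average amplitude distribution of the uttered voice (classes 0..Smax)
    gb : latest amplitude distribution of the noise (classes 0..Bmax)
    Assumption: a plain full convolution; the patent's Equation 1 may differ.
    """
    if not f or not gb:
        return []
    h = [0.0] * (len(f) + len(gb) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(gb):
            h[i + j] += fi * gj
    return h
```

If both inputs are normalized to sum to 1, the result also sums to 1, so h(n) can again be read as a relative-frequency distribution.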

Next, the input gain control unit 10 controls the input gain G (dB) of the input amplifier 5 by the input gain control process shown in FIG. 3.
Let the range of input voice data that the voice recognition engine 7 can process properly be the rated range R, extending from Rmin to Rmax. The dynamic range Rmax/Rmin of the rated range is called the rated dynamic range D of the voice recognition engine 7.

As shown in FIG. 3, the input gain control process first sets the input gain G of the input amplifier 5 to a predetermined minimum gain Gmin (step 302) and waits for the recognition-in-progress signal Ron output by the voice recognition engine 7 to turn on, indicating that voice recognition processing has started (step 304).
When Ron turns on and voice recognition processing starts (step 304), the process takes Hmax as the maximum of the amplitude distribution h(n) of the input voice signal from the microphone 4 output by the convolution calculator 16 (the largest amplitude with a nonzero frequency) and Hmin as its minimum (the smallest amplitude with a nonzero frequency) (step 306), and checks whether the dynamic range Hmax/Hmin of the input voice signal represented by h(n) is no greater than the rated dynamic range D of the voice recognition engine 7 (step 308).

If the dynamic range Hmax/Hmin of the input voice signal is no greater than the rated dynamic range D of the voice recognition engine 7 (step 308), the process obtains the center of the range of the input voice signal, Hmid = (Hmax + Hmin)/2, and the center of the rated range of the input voice data of the voice recognition engine 7, Rmid = (Rmax + Rmin)/2 (step 310).

Next, the input gain G of the input amplifier 5 is set to Rmid/Hmid (step 312).
As a result, when the dynamic range of the rated range of the voice recognition engine 7 is at least the dynamic range of the input voice data, the input gain G of the input amplifier 5 is set as follows.
Suppose, as shown in FIG. 4a1, that the center Hmid of the amplitude distribution h(n) is offset from the center Rmid of the rated range R of the voice recognition engine 7. Here h(n) corresponds to the amplitude distribution of the input voice data that would enter the voice recognition engine 7 if an input voice signal with an amplitude distribution equal to h(n) were AD-converted without being amplified by the input amplifier 5.

In this case, with the input gain G set as in step 312, the amplitude distribution hin(n) of the input voice data obtained by amplifying an input voice signal with an amplitude distribution equal to h(n) by that gain and AD-converting it, that is, the amplitude distribution hin(n) of the input voice data entering the voice recognition engine 7, has its center at the center Rmid of the rated range R of the voice recognition engine 7, as shown in FIG. 4a2. The input voice signal actually picked up by the microphone 4 during voice recognition processing can be expected to have an amplitude distribution close to h(n).

With this setting of the input gain G, the entire amplitude distribution hin(n) of the input voice data entering the voice recognition engine 7 therefore falls in the central part of the rated range R of the voice recognition engine 7. In general, the voice recognition engine 7 recognizes input voice data accurately when its amplitude distribution lies in the central part of the rated range R.
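Steps 308 to 312 can be illustrated with a short Python sketch. It treats Hmin, Hmax, Rmin, and Rmax as positive linear amplitude values and returns G as a plain ratio; those choices, and the function name `centered_gain`, are assumptions for illustration (the text labels G in dB but defines it as the ratio Rmid/Hmid).

```python
def centered_gain(h_min, h_max, r_min, r_max):
    """Gain that maps the center of [h_min, h_max] onto the center of [r_min, r_max].

    Applies only when Hmax/Hmin <= D = Rmax/Rmin (the branch taken at step 308).
    All arguments are assumed to be positive linear amplitudes.
    """
    if h_max / h_min > r_max / r_min:
        raise ValueError("input dynamic range exceeds the rated dynamic range D")
    h_mid = (h_max + h_min) / 2   # step 310
    r_mid = (r_max + r_min) / 2   # step 310
    return r_mid / h_mid          # step 312
```

Under these assumptions the amplified range stays inside the rated range: G * Hmax <= Rmax reduces algebraically to Hmax/Hmin <= Rmax/Rmin, which is exactly the condition checked at step 308, and the same holds for G * Hmin >= Rmin.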

Returning to FIG. 3, once the input gain G has been set in step 312, the process waits for the recognition-in-progress signal Ron output by the voice recognition engine 7 to turn off, indicating that voice recognition processing has ended (step 314), and then returns to step 302.
If, on the other hand, the dynamic range Hmax/Hmin of the input voice signal exceeds the rated dynamic range D of the voice recognition engine 7 (step 308), the process calculates the range M whose dynamic range MD = Mmax/Mmin equals the rated dynamic range D of the voice recognition engine 7 and which contains the largest total frequency on the amplitude distribution h(n) of the input voice signal (the total occurrence count of the amplitude values within the range) (step 316). Here Mmin is the minimum and Mmax the maximum of the range M.

Next, the input gain G of the input amplifier 5 is set to Rmin/Mmin (step 318).
As a result, when the dynamic range of the rated range of the voice recognition engine 7 is less than the dynamic range of the input voice data, the input gain G of the input amplifier 5 is set as follows.
Suppose, as shown in FIG. 4b1, that the amplitude distribution h(n) lies so that only its end portion falls within the rated range R of the voice recognition engine 7. Here h(n) corresponds to the amplitude distribution of the input voice data that would enter the voice recognition engine 7 if an input voice signal with an amplitude distribution equal to h(n) were AD-converted without being amplified by the input amplifier 5.

In this case, with the input gain G set as in step 318, the amplitude distribution hin(n) of the input voice data obtained by amplifying an input voice signal with an amplitude distribution equal to h(n) by that gain and AD-converting it, that is, the amplitude distribution hin(n) of the input voice data entering the voice recognition engine 7, is the one for which the frequency (occurrence probability) of amplitude values within the rated range R of the voice recognition engine 7 is largest, as shown in FIG. 4b2. The input voice signal actually picked up by the microphone 4 during voice recognition processing can be expected to have an amplitude distribution close to h(n).

Therefore, with this setting of the input gain G, the range of amplitude values with high frequency (occurrence probability) in the input speech data supplied to the speech recognition engine 7, that is, the range of amplitude values considered dominant, falls within the standard range R of the speech recognition engine 7, so that the speech recognition engine 7 can perform speech recognition well.
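The window selection behind this gain setting can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: h(n) is assumed to be a list of bin probabilities over logarithmically spaced amplitude bins (so that a fixed dynamic range corresponds to a fixed number of consecutive bins), Mmin is assumed to be the lower edge of the selected window, and the names `select_gain`, `bin_lower_edges`, `r_min`, and `window_bins` are hypothetical.

```python
def select_gain(h, bin_lower_edges, r_min, window_bins):
    """Return (gain, start_index) for the window of `window_bins` consecutive
    bins of h that holds the largest total probability.

    h               -- estimated amplitude distribution h(n), one value per bin
    bin_lower_edges -- lower amplitude edge of each bin (assumed log-spaced)
    r_min           -- lower edge Rmin of the engine's standard range R
    window_bins     -- number of bins spanning the dynamic range of R
    """
    if window_bins > len(h):
        raise ValueError("window is wider than the distribution")
    best_start = 0
    best_mass = current = sum(h[:window_bins])
    # Slide the window one bin at a time, updating the running sum.
    for start in range(1, len(h) - window_bins + 1):
        current += h[start + window_bins - 1] - h[start - 1]
        if current > best_mass:
            best_mass, best_start = current, start
    m_min = bin_lower_edges[best_start]  # assumed meaning of Mmin
    return r_min / m_min, best_start     # G = Rmin / Mmin as in step 318


# Illustrative values only.
h = [0.05, 0.10, 0.40, 0.30, 0.08, 0.07]
edges = [1, 2, 4, 8, 16, 32]
gain, start = select_gain(h, edges, r_min=16, window_bins=3)
```

With these illustrative values the window covering bins 1 to 3 carries the most probability, so its lower edge is mapped onto Rmin.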

Returning to FIG. 3, once the input gain G has been set in step 318, the process waits until the speech-recognition-in-progress signal Ron output from the speech recognition engine 7 turns off and the speech recognition processing ends (step 314), and then returns to step 302.
This concludes the description of the input gain control process.
The input gain G is set to the minimum gain Gmin in step 302 to prevent the input speech signal from being saturated by the amplification of the input amplifier 5 while speech recognition processing is not being performed. Such saturation would keep the first noise amplitude distribution detection unit 11, which calculates the noise amplitude distribution gb(n) during those periods, from calculating it properly. The minimum gain Gmin is, for example, the value at which the loudest sound that the microphone 4 can pick up without distortion is converted by the AD converter 6 to the maximum value representable as input speech data.
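As a rough illustration of this choice of Gmin, the following Python sketch assumes linear (not decibel) units and a symmetric AD converter input range. The names `minimum_gain`, `mic_max_undistorted`, and `adc_full_scale` are hypothetical.

```python
def minimum_gain(mic_max_undistorted, adc_full_scale):
    """Gain that maps the largest undistorted microphone output onto the
    largest input amplitude the AD converter can represent."""
    if mic_max_undistorted <= 0:
        raise ValueError("microphone maximum must be positive")
    return adc_full_scale / mic_max_undistorted


def would_saturate(sample, gain, adc_full_scale):
    """True if the amplified sample exceeds the AD converter input range."""
    return abs(sample * gain) > adc_full_scale


g_min = minimum_gain(mic_max_undistorted=0.25, adc_full_scale=1.0)
```

At this gain the loudest undistorted microphone output just reaches full scale, so no signal within the microphone's undistorted range saturates the converter.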

Next, three calculation methods are described: how the speech amplitude distribution detection unit 14 calculates the average amplitude distribution f(n) of the uttered speech s, how the first noise amplitude distribution detection unit 11 calculates the amplitude distribution gb(n) of the noise b, and how the second noise amplitude distribution detection unit 12 calculates the amplitude distribution gb(n) of the noise b.
First, the calculation of the average amplitude distribution f(n) of the uttered speech s by the speech amplitude distribution detection unit 14 is described.
The amplitude distribution f(s) of uttered speech s is known to be a super-Gaussian distribution. Assuming that it is, the amplitude distribution of the uttered speech s can be expressed by Equation 2.

Here, α and β in Equation 2 are related to the mean μs and variance σs of the uttered speech s by Equation 3.

The mean μs and variance σs are in turn related to the power (mean square) Ps of the uttered speech s by Equation 4.

Therefore, the relationship between α and β in Equation 2 and the power Ps of the uttered speech s can be expressed by Equation 5.

Here, the peak of the amplitude distribution of the uttered speech s appears near 0, and in this case α can be taken to be approximately 1, as shown in the reference below.
Reference: T. Lotter and P. Vary, "Noise reduction by joint maximum a posteriori spectral amplitude and phase estimation with super-Gaussian speech modeling", Proc. EUSIPCO-04 (Vienna, Austria), pp. 1447-60, Sep. 2004.
With α = 1, the relationship between β and the power Ps can be expressed by Equation 6, and once β is obtained, the amplitude distribution f(s) of the uttered speech s in Equation 1 can be calculated.

Accordingly, the speech amplitude distribution detection unit 14 obtains and stores the power Pse of the uttered speech s for each calculation execution period. The power Pse for a calculation execution period is calculated as follows.
The input speech data AD-converted by the AD converter 6 is gain-adjusted by a gain of 1/G, the reciprocal of the input gain G of the input amplifier 5 that was in use when that input speech data was generated, and the result is taken as the target input speech data. The target input speech data thus represents the value of the input speech signal x before amplification by the input amplifier 5.

During the calculation execution period, the speech amplitude distribution detection unit 14 obtains and stores the target input speech data, and uses the stored target input speech data to calculate and store the power Pse of the uttered speech s as follows.
Within the calculation execution period, the time outside the utterance section indicated by the utterance section data Son is a time section in which the input speech signal x contains only the noise b as a component, so the power of the target input speech data in this time section is calculated as the power Pb. The power of the target input speech data in the utterance section, in which the input speech signal x contains both the noise b and the uttered speech s as components, is calculated as the power Pb+s. The power Pse of the uttered speech s is then calculated by subtracting the power Pb from the power Pb+s, and stored.
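The power subtraction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the utterance section data Son is assumed to be available as one boolean flag per sample, power is taken as the mean square of the samples, and the clamp to zero is an added safeguard that the description does not mention. The function names are hypothetical.

```python
def mean_power(samples):
    """Mean-square power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def utterance_power(input_speech_data, gain, in_utterance):
    """Estimate the power Pse of the uttered speech s.

    input_speech_data -- AD-converted samples (after the input amplifier)
    gain              -- input gain G in use when the samples were captured
    in_utterance      -- per-sample flags derived from Son (assumed format)
    """
    # Undo the amplifier: target input speech data = input speech data * (1/G).
    target = [x / gain for x in input_speech_data]
    noise_only = [x for x, u in zip(target, in_utterance) if not u]
    with_speech = [x for x, u in zip(target, in_utterance) if u]
    p_b = mean_power(noise_only)    # Pb: noise b only
    p_bs = mean_power(with_speech)  # Pb+s: noise b plus uttered speech s
    # Pse = Pb+s - Pb; clamped at 0 as an added safeguard (assumption).
    return max(p_bs - p_b, 0.0)


pse = utterance_power([2.0, -2.0, 6.0, -6.0], gain=2.0,
                      in_utterance=[False, False, True, True])
```

The subtraction relies on the noise and the uttered speech being uncorrelated, so that their powers add within the utterance section.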

Each time the calculation and storage of the power Pse of the uttered speech s is completed, the average of the powers Pse stored so far is used as the power Ps in Equation 6 to obtain β, and the amplitude distribution f(s) of the uttered speech s is calculated from the obtained β. The amplitude distribution f(s) is then discretized to give the average amplitude distribution f(n) of the uttered speech s.
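The step from stored powers to a discretized distribution might look like the Python sketch below. Equations 2 and 6 are not reproduced in this text, so the sketch substitutes an assumed stand-in: a zero-mean Laplacian density f(s) = (β/2)·exp(−β|s|), for which the mean square is 2/β² and hence β = sqrt(2/Ps). The actual equations of the patent may differ, and all names here are hypothetical.

```python
import math


def beta_from_power(p_s):
    """beta for the assumed Laplacian stand-in, where E[s^2] = 2 / beta^2."""
    return math.sqrt(2.0 / p_s)


def laplace_cdf(x, beta):
    """CDF of the assumed density f(s) = (beta / 2) * exp(-beta * |s|)."""
    if x < 0:
        return 0.5 * math.exp(beta * x)
    return 1.0 - 0.5 * math.exp(-beta * x)


def average_speech_distribution(stored_pse, bin_edges):
    """Discretize f(s) into f(n): the probability of each amplitude bin
    [bin_edges[i], bin_edges[i + 1])."""
    p_s = sum(stored_pse) / len(stored_pse)  # average of the stored Pse values
    beta = beta_from_power(p_s)
    return [laplace_cdf(hi, beta) - laplace_cdf(lo, beta)
            for lo, hi in zip(bin_edges, bin_edges[1:])]


f_n = average_speech_distribution(
    [1.0, 3.0], [-math.inf, -1.0, 0.0, 1.0, math.inf])
```

Discretizing through differences of the cumulative distribution keeps the bin probabilities summing to 1 whenever the bin edges cover the whole amplitude axis.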

Next, the method by which the first noise amplitude distribution detection unit 11 calculates the amplitude distribution gb(n) of the noise b is described.
During the calculation execution period of the first noise amplitude distribution detection unit 11, the input speech signal x contains the speaker output sound a and the noise b as components.
Therefore, the amplitude distribution gc(n) of the input speech signal x can be expressed by the convolution of the amplitude distribution ga(n) of the speaker output sound a with the amplitude distribution gb(n) of the noise b, as shown in Equations 7 and 8.
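The convolution relationship can be illustrated with a short Python sketch. It assumes that each distribution is a list of bin probabilities on a common, uniformly spaced amplitude grid and that the two components are independent. The function name `convolve` and the sample values are illustrative only.

```python
def convolve(ga, gb):
    """Discrete convolution gc(n) = sum over k of ga(k) * gb(n - k)."""
    gc = [0.0] * (len(ga) + len(gb) - 1)
    for i, a in enumerate(ga):
        for j, b in enumerate(gb):
            gc[i + j] += a * b
    return gc


ga = [0.6, 0.3, 0.1]   # speaker output sound a (illustrative values)
gb = [0.5, 0.3, 0.2]   # noise b (illustrative values)
gc = convolve(ga, gb)  # distribution of the combined signal x
```

Because both inputs sum to 1, the result also sums to 1, as expected of a distribution.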

In Equation 8, Amax is the class (bin) number of the maximum value of the speaker output sound a, and Bmax is the class (bin) number of the maximum value of the noise b.
Then, as in Equation 9, W is defined as the matrix representation of the amplitude distribution gb(n) of the noise b, together with the matrix representation of the amplitude distribution ga(n) of the speaker output sound a.

In this case, it follows from Equations 7 and 8 that the W that minimizes the mean square error J, over a unit time interval, of the error e given by Equation 10 is the true value of W. Here, E[X] denotes the average of X over the unit time interval.

The W that minimizes the mean square error J is obtained, as in Equation 11, as the W at which the partial derivative of J with respect to W is 0.
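Equations 9 to 11 are not reproduced in this text, so the following Python sketch uses ordinary least squares via the normal equations as a stand-in for the minimization: it builds the convolution matrix A of ga(n), then solves (AᵀA)·gb = Aᵀ·gc, which is the condition that the derivative of the squared error is zero. The exact form of Equation 11 may differ, and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def estimate_noise_distribution(gc, ga, n_noise_bins):
    """Least-squares estimate of gb such that the convolution of ga and gb
    is as close as possible to gc. Expects len(gc) == len(ga) + n_noise_bins - 1."""
    rows = len(ga) + n_noise_bins - 1
    # A[k][j] = ga[k - j], so that (A gb)[k] = sum over j of ga[k - j] * gb[j].
    a = [[ga[k - j] if 0 <= k - j < len(ga) else 0.0
          for j in range(n_noise_bins)] for k in range(rows)]
    ata = [[sum(a[k][i] * a[k][j] for k in range(rows))
            for j in range(n_noise_bins)] for i in range(n_noise_bins)]
    atc = [sum(a[k][i] * gc[k] for k in range(rows))
           for i in range(n_noise_bins)]
    return solve_linear(ata, atc)


ga = [0.6, 0.3, 0.1]
gc = [0.30, 0.33, 0.26, 0.09, 0.02]  # convolution of ga with [0.5, 0.3, 0.2]
gb_est = estimate_noise_distribution(gc, ga, n_noise_bins=3)
```

With measured histograms the fit is only approximate, and a practical implementation would probably also clip negative estimates and renormalize. Both steps are omitted here.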

The amplitude distribution gb(n) of the noise b is then determined from the W that minimizes the mean square error J.
Accordingly, the first noise amplitude distribution detection unit 11 takes as the target input speech data the input speech data AD-converted by the AD converter 6 after gain adjustment by 1/G, the reciprocal of the input gain G of the input amplifier 5 that was in use when that input speech data was generated. During the calculation execution period, it obtains and stores the target input speech data and also stores the audio data input to the DA converter 1.
During the calculation execution period, it then calculates the amplitude distribution gb(n) of the noise b for each unit time interval from the stored target input speech data and audio data, as follows.
The transfer function H from the input of the DA converter 1 to the output of the microphone 4 is calculated with reference to the output gain SpG of the output amplifier 2. H is applied to the audio data input to the DA converter 1, and the amplitude distribution of the result over the unit time interval is calculated as the amplitude distribution ga(n) of the speaker output sound a. The transfer function H is obtained, for example, by multiplying a transfer function from the input of the DA converter 1 to the output of the microphone 4, determined in advance for the case where the output amplifier 2 performs no amplification, by the output gain SpG of the output amplifier 2. Alternatively, H can be obtained in real time from the target input speech data and the audio data using an adaptive filter or the like.
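One way to picture the first option is the Python sketch below. It assumes that the pre-measured transfer function is available as a finite impulse response (FIR) at unity output gain and that SpG is a linear gain factor. The description fixes neither representation, and the names `apply_fir`, `speaker_sound_at_mic`, and `unity_gain_ir` are hypothetical.

```python
def apply_fir(signal, impulse_response):
    """Filter `signal` with a causal FIR filter (direct-form convolution,
    output truncated to the input length)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def speaker_sound_at_mic(audio_data, unity_gain_ir, sp_g):
    """Estimate the speaker output sound a as seen at the microphone output.

    unity_gain_ir -- assumed pre-measured impulse response from the DA
                     converter input to the microphone output at unity gain
    sp_g          -- output gain SpG of the output amplifier (linear factor)
    """
    h = [sp_g * c for c in unity_gain_ir]  # H = SpG * (unity-gain response)
    return apply_fir(audio_data, h)


a_est = speaker_sound_at_mic([1.0, 0.0, -1.0, 0.0],
                             unity_gain_ir=[0.5, 0.25], sp_g=2.0)
```

The amplitude histogram of the returned samples over one unit time interval would then serve as ga(n).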

In addition, the amplitude distribution of the target input speech data over the unit time interval is calculated as the amplitude distribution gc(n) of the input speech signal x.

Then, from the amplitude distributions ga(n) and gc(n) calculated as above for the unit time interval, the amplitude distribution gb(n) of the noise b is calculated according to Equation 11.
Next, the method by which the second noise amplitude distribution detection unit 12 calculates the amplitude distribution gb(n) of the noise b is described.
The second noise amplitude distribution detection unit 12 likewise takes as the target input speech data the input speech data AD-converted by the AD converter 6 after gain adjustment by 1/G, the reciprocal of the input gain G of the input amplifier 5 that was in use when that input speech data was generated. During the calculation execution period, it obtains and stores the target input speech data, and uses the stored data to calculate the amplitude distribution gb(n) of the noise b for each unit time interval as follows.
During the calculation execution period of the second noise amplitude distribution detection unit 12, the input speech signal x contains only the noise b as a component. The second noise amplitude distribution detection unit 12 therefore takes the amplitude distribution of the target input speech data over the unit time interval directly as the amplitude distribution gb(n) of the noise b.
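Taking the amplitude distribution directly amounts to a normalized histogram of the gain-compensated samples, as in the Python sketch below. The bin layout (signed amplitudes, half-open bins) and the name `noise_distribution` are assumptions made for illustration.

```python
from bisect import bisect_right


def noise_distribution(input_speech_data, gain, bin_edges):
    """Normalized amplitude histogram gb(n) of one unit time interval.

    input_speech_data -- AD-converted samples captured with input gain `gain`
    bin_edges         -- ascending edges; bin i covers [edges[i], edges[i+1])
    """
    counts = [0] * (len(bin_edges) - 1)
    for x in input_speech_data:
        target = x / gain  # target input speech data (gain 1/G applied)
        i = bisect_right(bin_edges, target) - 1
        if 0 <= i < len(counts):  # samples outside the edges are ignored
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


gb = noise_distribution([2.0, 0.0, -2.0, 0.4], gain=2.0,
                        bin_edges=[-2.0, -0.5, 0.5, 2.0])
```

The same histogram, applied to the target input speech data captured while audio is playing, would give gc(n) for the first noise amplitude distribution detection unit 11.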

The embodiment of the present invention has been described above.
As described above, according to this embodiment, the first noise amplitude distribution detection unit 11 and the second noise amplitude distribution detection unit 12 repeatedly calculate the noise amplitude distribution gb(n) in time sections other than utterance sections. At the start of speech recognition processing, the amplitude distribution h(n) of the input speech signal is estimated from two inputs: the most recently detected noise amplitude distribution gb(n), that is, the noise amplitude distribution at the most recent point in time, and the average amplitude distribution f(n) of the uttered speech from previous executions of speech recognition processing. The input gain G of the input amplifier 5 is then set so that the amplitude distribution hin(n), obtained by amplifying the estimated amplitude distribution h(n) with the input amplifier 5, is at a level suited to the speech recognition engine 7.

The noise amplitude distribution gb(n) at the most recent point in time can be expected to approximate the noise amplitude distribution gb(n) under the current noise conditions. Such a speech recognition device can therefore set in the input amplifier 5, at the start of speech recognition processing, an input gain G better suited to the current noise conditions.

In the embodiment above, the input gain G of the input amplifier 5 is set based on amplitude distributions, but it may instead be set based on another characteristic value Z representing the sound level (for example, the distribution of amplitude peak values, the maximum amplitude value, or the average amplitude).

In this case, the first noise amplitude distribution detection unit 11 and the second noise amplitude distribution detection unit 12 calculate a noise characteristic value Z in place of the noise amplitude distribution gb(n), and the speech amplitude distribution detection unit 14 calculates an average characteristic value Z of the uttered speech. The input gain control unit 10 estimates the characteristic value Z of the input speech signal from the most recently calculated noise characteristic value Z and the average characteristic value Z of the uttered speech. Based on this estimate, and following a rule predetermined for the type of characteristic value used as Z, it sets the input gain G of the input amplifier 5 so that the amplitude range of the input speech data, obtained by amplifying the input speech signal by G and AD-converting it, fits the standard range R of the speech recognition engine 7.
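The description leaves the rule for each type of characteristic value unspecified. As one hypothetical example, the Python sketch below takes Z to be the maximum amplitude, estimates the peak of the input speech signal as the sum of the noise peak and the average speech peak (a worst-case assumption), and chooses G so that this peak lands on the upper limit of the standard range R. The rule and the names `gain_from_peaks` and `r_max` are assumptions, not taken from the patent.

```python
def gain_from_peaks(noise_peak, speech_peak, r_max):
    """Input gain that maps the estimated input peak onto the upper limit
    `r_max` of the engine's standard range R (hypothetical rule)."""
    estimated_peak = noise_peak + speech_peak  # worst-case additive estimate
    if estimated_peak <= 0:
        raise ValueError("estimated peak must be positive")
    return r_max / estimated_peak


g = gain_from_peaks(noise_peak=0.125, speech_peak=0.375, r_max=1.0)
```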

DESCRIPTION OF SYMBOLS: 1 ... DA converter, 2 ... output amplifier, 3 ... speaker, 4 ... microphone, 5 ... input amplifier, 6 ... AD converter, 7 ... speech recognition engine, 8 ... talk switch, 9 ... output gain control unit, 10 ... input gain control unit, 11 ... first noise amplitude distribution detection unit, 12 ... second noise amplitude distribution detection unit, 13 ... noise amplitude distribution register, 14 ... speech amplitude distribution detection unit, 15 ... speech amplitude distribution register, 16 ... convolution calculator, 17 ... gain control unit.

Claims

A speech recognition device that performs speech recognition, comprising:
a microphone;
an input amplifier that amplifies the input speech signal output from the microphone;
an AD converter that converts the signal amplified by the input amplifier into input speech data;
a speech recognition engine that, in response to a speech recognition execution instruction, performs speech recognition processing on the input speech data output from the AD converter;
a noise level detection unit;
an utterance level detection unit; and
an input gain control unit that controls the gain of the input amplifier, wherein
in the speech recognition processing, the speech recognition engine detects, as an utterance section, a time section in which the input speech data contains the user's uttered speech, and identifies the content of the uttered speech contained in the input speech data in the detected utterance section,
the noise level detection unit repeatedly calculates the level of the noise contained in the input speech signal, based on the input speech data in time sections other than the utterance section or in time sections in which the speech recognition processing is not performed,
the utterance level detection unit calculates, based on the input speech data, the average level of the uttered speech contained in the input speech signal in the utterance sections detected in the respective executions of the speech recognition processing, and
the input gain control unit, at the start of each execution of the speech recognition processing, estimates the level of the input speech signal in the utterance section to be detected in that execution from the noise level most recently calculated by the noise level detection unit and the average level of the uttered speech calculated by the utterance level detection unit, and sets the gain of the input amplifier so that the level obtained by amplifying the estimated level of the input speech signal with the input amplifier is a level suited to the speech recognition engine.

The speech recognition device according to claim 1, wherein,
during time sections in which the speech recognition processing is not performed, the input gain control unit sets the gain of the input amplifier to a predetermined value such that the maximum level the input speech signal can take, after amplification by the input amplifier, does not exceed the input range of the AD converter.

The speech recognition device according to claim 1 or 2, further comprising:
an audio device that outputs audio sound represented by audio data; and
an output suppression unit that suppresses the output of audio sound from the audio device while the speech recognition processing is being performed, wherein
the noise level detection unit calculates the level of the noise contained in the input speech signal based on the input speech data and the audio data in time sections in which the speech recognition processing is not performed.

The speech recognition device according to claim 1, 2, or 3, wherein
the noise level detection unit calculates a noise amplitude distribution as the noise level,
the utterance level detection unit calculates an average amplitude distribution of the uttered speech as the average level of the uttered speech, and
the input gain control unit estimates an amplitude distribution of the input speech signal as the level of the input speech signal.

The speech recognition device according to claim 4, wherein,
when the dynamic range of the amplitude distribution range indicated by the estimated amplitude distribution of the input speech signal is equal to or less than the dynamic range of the input range of the speech recognition engine, the input gain control unit sets the gain of the input amplifier so that the amplitude value at the center of that amplitude distribution range, after amplification by the input amplifier, equals the amplitude value at the center of the input range of the speech recognition engine.

The speech recognition device according to claim 4 or 5, wherein,
when the dynamic range of the amplitude distribution range of the estimated amplitude distribution of the input speech signal exceeds the dynamic range of the input range of the speech recognition engine, the input gain control unit selects, from among the portions of that amplitude distribution range having the same dynamic range as the input range of the speech recognition engine, the portion for which the total amount of power within the portion is largest, and sets the gain of the input amplifier so that the selected portion, after amplification by the input amplifier, matches the input range of the speech recognition engine.