JP4932530B2

JP4932530B2 - Acoustic processing device, acoustic processing method, acoustic processing program, verification processing device, verification processing method, and verification processing program

Info

Publication number: JP4932530B2
Application number: JP2007044081A
Authority: JP
Inventors: 洋平岡登; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-02-23
Filing date: 2007-02-23
Publication date: 2012-05-16
Anticipated expiration: 2021-09-26
Also published as: JP2007179072A

Description

The present invention relates to an acoustic processing device, an acoustic processing method, and an acoustic processing program that acoustically process input speech and transmit the result, and to a verification processing device, a verification processing method, and a verification processing program that match an encoded signal and output a speech recognition result.

FIG. 15 is a block diagram showing a conventional acoustic processing device and verification processing device described in, for example, "Compression of Acoustic Features for Speech Recognition in Network Environments", G. N. Ramaswamy and P. S. Gopalakrishnan (International Conference on Acoustics, Speech, and Signal Processing, ICASSP-98), pp. 977-980, 1998. In the figure, 1 is a speech input unit that receives a speech signal to be recognized and A/D-converts it, and 2 is an acoustic feature calculation unit that divides the digital speech signal output from the speech input unit 1 into frames at a fixed time period, analyzes them, and calculates acoustic features representing the phonetic characteristics of the speech signal.

3 is an acoustic feature compression unit that compresses the acoustic features according to a predetermined information compression method, 4 is a quantization table, 5 is a quantization/encoding unit that quantizes and encodes the acoustic features according to a predetermined quantization/encoding method while referring to the quantization table 4, and 6 is a code output unit that transmits the acoustic features encoded by the quantization/encoding unit 5 to the verification processing device.

11 is a code input unit that receives the acoustic features, as an encoded signal, transmitted from the acoustic processing device; 12 is a quantization table; 13 is a decoding/inverse quantization unit that decodes and inversely quantizes the acoustic features according to a predetermined decoding/inverse quantization method while referring to the quantization table 12; 14 is an acoustic feature restoration unit that undoes the signal compression of the acoustic features according to a predetermined information restoration method and restores the original acoustic features; 15 is a standard pattern indicating the properties of the acoustic features of the units that make up the recognition target; 16 is a language dictionary; 17 is a collation unit that matches the acoustic features restored by the acoustic feature restoration unit 14 against the standard pattern 15 and the language dictionary 16; and 18 is a recognition result output unit that outputs the speech recognition result obtained by the collation unit 17.
FIG. 16 is a flowchart showing the processing of the acoustic processing device, and FIG. 17 is a flowchart showing the processing of the verification processing device.

Next, the operation will be described.
First, when a speech signal S to be recognized is input, the speech input unit 1 of the acoustic processing device A/D-converts the speech signal S (step ST1).
S = [s(1), ..., s(N)]   (1)

When the speech input unit 1 outputs the digital speech signal S, the acoustic feature calculation unit 2 divides the speech signal S into frames at a fixed time period, analyzes them, and calculates acoustic features representing the phonetic characteristics of the speech signal S (step ST2).
That is, when the analysis period is T samples, the speech signal S of N samples is converted into a time series C of acoustic feature vectors of n frames, where n = N/T.
C = [c(1), ..., c(n)]   (2)
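The framing in step ST2 can be sketched as follows. This is a minimal illustration only: it assumes non-overlapping frames of exactly T samples, and it uses the mean squared amplitude of each frame as a stand-in for a real acoustic feature such as the mel FFT cepstrum. The names `frame_signal` and `frame_energy` are hypothetical and do not come from the patent.

```python
def frame_signal(samples, period):
    """Split samples into consecutive frames of `period` samples (n = N // T)."""
    n_frames = len(samples) // period
    return [samples[i * period:(i + 1) * period] for i in range(n_frames)]


def frame_energy(frame):
    """Placeholder one-dimensional feature: mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


signal = [0.0, 1.0, 0.0, -1.0, 2.0, 2.0, -2.0, -2.0]   # N = 8 samples
frames = frame_signal(signal, period=4)                 # T = 4, so n = 2
features = [[frame_energy(f)] for f in frames]          # C = [c(1), c(2)]
```

Each element of `features` plays the role of one vector c(t) in Equation (2).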

The acoustic features used for speech recognition are described in detail in the first volume of L. R. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition" (two volumes; Japanese translation supervised by Furui), NTT Advanced Technology, November 1995 (Reference 1).
For example, the mel FFT cepstrum can be used as the acoustic feature. The mel FFT cepstrum is obtained by mapping the short-time log spectrum of speech onto the mel scale, a frequency scale matched to human auditory characteristics, and then applying an inverse Fourier transform.

When the acoustic feature calculation unit 2 has calculated the acoustic features, the acoustic feature compression unit 3 compresses them according to a predetermined information compression method (step ST3).
That is, using preset K-th order linear prediction coefficients A = [a(1), ..., a(K)], a prediction residual vector time series V is obtained from the acoustic feature vector time series C. The prediction residual vector time series V can be expressed as in Equation (3).
V = [v(1), ..., v(n)]   (3)

The prediction residual at time t can be calculated as in Equation (4), where c(t) = 0 is assumed for t less than 1.

For example, when the simple difference between adjacent frames is taken, K = 1 and A = [-1]. Because acoustic features change continuously, the prediction residual v(t) obtained by this processing has a smaller variance than the original c(t). As a result, the code length can be shortened in the subsequent vector quantization and encoding.
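The residual computation can be sketched as follows. Equation (4) is not reproduced in this text, so the sketch assumes the form v(t) = c(t) + a(1)c(t-1) + ... + a(K)c(t-K), which reduces to the adjacent-frame difference c(t) - c(t-1) for K = 1 and A = [-1] as in the example above. The function name `prediction_residual` is hypothetical.

```python
def prediction_residual(features, coeffs):
    """Return V = [v(1), ..., v(n)], assuming v(t) = c(t) + sum_k a(k) * c(t - k).

    Frames before the start of the sequence are treated as zero vectors.
    """
    dim = len(features[0])
    residuals = []
    for t, current in enumerate(features):
        v = list(current)
        for k, a_k in enumerate(coeffs, start=1):
            if t - k >= 0:                      # c(t - k) = 0 when out of range
                previous = features[t - k]
                for d in range(dim):
                    v[d] += a_k * previous[d]
        residuals.append(v)
    return residuals


C = [[1.0, 2.0], [1.5, 2.5], [1.5, 2.0]]
V = prediction_residual(C, coeffs=[-1.0])       # K = 1, A = [-1]
```

After the first frame, the residual vectors in `V` are much smaller in magnitude than the corresponding vectors in `C`, which is the property the text relies on to shorten the code.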

When the acoustic feature compression unit 3 has compressed the acoustic features, the quantization/encoding unit 5 quantizes and encodes them according to a predetermined quantization/encoding method while referring to the quantization table 4 (step ST4).
The quantization table 4 is the table that the quantization/encoding unit 5 refers to for vector quantization or scalar quantization. In the following description, vector quantization is taken to include scalar quantization.

Specifically, the quantization/encoding unit 5 refers to the quantization table 4 and converts the prediction residual vector time series V into a code Q = [q(1), ..., q(n)] by vector quantization and encoding. Methods of vector quantization and scalar quantization are described in detail in, for example, the first volume of Reference 1.
The code output unit 6 transmits the acoustic features encoded by the quantization/encoding unit 5 to the verification processing device (step ST5).
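The table lookup in step ST4 can be sketched as nearest-neighbor vector quantization. The patent only says that the quantization table is referred to, so the squared Euclidean distance, the four-entry `codebook`, and the names `quantize` and `encode` are assumptions made for illustration.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def distance(entry):
        return sum((x - y) ** 2 for x, y in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def encode(residuals, codebook):
    """Map each residual vector v(t) to its code q(t)."""
    return [quantize(v, codebook) for v in residuals]


# Hypothetical quantization table with four two-dimensional entries.
codebook = [[0.0, 0.0], [0.5, 0.5], [1.0, 2.0], [0.0, -0.5]]
Q = encode([[1.0, 2.0], [0.4, 0.6], [0.1, -0.4]], codebook)
```

Each code q(t) here is just a table index, so a table with fewer entries gives shorter codes at the cost of larger quantization error.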

Next, the code input unit 11 of the verification processing device receives the acoustic features, as an encoded signal, transmitted from the acoustic processing device (step ST11). That is, it receives the code Q = [q(1), ..., q(n)] from the acoustic processing device.
When the code input unit 11 has received the acoustic features, the decoding/inverse quantization unit 13 decodes and inversely quantizes them according to a predetermined decoding/inverse quantization method while referring to the quantization table 12 (step ST12).
That is, referring to the quantization table 12, it decodes and inversely quantizes the code Q to obtain the signal-compressed acoustic features V'. In general, V' contains a quantization error relative to the original signal V.
V' = [v'(1), ..., v'(n)]   (5)

When the decoding/inverse quantization unit 13 has obtained the signal-compressed acoustic features V', the acoustic feature restoration unit 14 undoes the signal compression of V' according to a predetermined information restoration method and restores the original acoustic features (step ST13).
That is, it applies to the signal-compressed acoustic features V' the inverse of the transform performed by the acoustic feature compression unit 3, and restores the acoustic features C'.
C' = [c'(1), ..., c'(n)]   (6)
The inverse transform corresponding to Equation (6) is given by Equation (7).
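Steps ST12 and ST13 can be sketched as a table lookup followed by the inverse of the prediction. Equation (7) is not reproduced in this text, so the sketch assumes c'(t) = v'(t) - a(1)c'(t-1) - ... - a(K)c'(t-K), which inverts a residual of the form v(t) = c(t) + a(1)c(t-1) + ... + a(K)c(t-K). The `codebook` contents and the function names are hypothetical.

```python
def dequantize(codes, codebook):
    """Look up each code q(t) in the quantization table to obtain v'(t)."""
    return [list(codebook[q]) for q in codes]


def restore_features(residuals, coeffs):
    """Invert the prediction, assuming c'(t) = v'(t) - sum_k a(k) * c'(t - k)."""
    restored = []
    for t, v in enumerate(residuals):
        c = list(v)
        for k, a_k in enumerate(coeffs, start=1):
            if t - k >= 0:                      # c'(t - k) = 0 when out of range
                previous = restored[t - k]
                for d in range(len(c)):
                    c[d] -= a_k * previous[d]
        restored.append(c)
    return restored


codebook = [[0.0, 0.0], [0.5, 0.5], [1.0, 2.0], [0.0, -0.5]]   # same table as the sender
V_prime = dequantize([2, 1, 3], codebook)
C_prime = restore_features(V_prime, coeffs=[-1.0])             # K = 1, A = [-1]
```

Because each c'(t) is built from earlier restored frames, any quantization error in v'(t) carries forward into later frames in this sketch.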

When the acoustic feature restoration unit 14 has restored the acoustic features C', the collation unit 17 matches C' against the standard pattern 15 and the language dictionary 16 and obtains a recognition result for the input speech (steps ST14 and ST15).
The matching procedure is given below. The standard pattern 15 indicates the properties of the acoustic features of the units that make up the recognition target; a phoneme, which is a linguistic unit, is used as the unit of the standard pattern, for example. The correspondence between the recognition units of the standard pattern and the acoustic features is expressed using, for example, an HMM (hidden Markov model). The language dictionary 16 indicates the correspondence between the recognition units given by the standard pattern and the linguistic expressions of the recognition target as a whole.

FIG. 18 is an example of a language dictionary for word speech recognition. In this example, the recognition targets are the three words 赤 (aka, "red"), 青 (ao, "blue"), and 黄色 (kiiro, "yellow"), and for each word the dictionary lists its correspondence to the standard patterns and its occurrence probability. Here, the recognition unit of the standard pattern is the phoneme (in Japanese, roughly one letter of the romanized spelling). The occurrence probability is the probability, known in advance, that the recognition target word appears.

Matching procedure of the collation unit 17
(1) The acoustic features C' are matched against the entries of the standard pattern 15 that make up each recognition candidate, and a matching score is obtained.
(2) For each recognition candidate, the cumulative score up to an intermediate point or up to the end is obtained.
(3) When the final frame of the acoustic features C' is reached, the word with the highest cumulative score is taken as the speech recognition result.
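The three steps above can be sketched with a deliberately simplified scorer. This is not the HMM matching the text refers to: it swaps in a uniform alignment of frames to phonemes and uses the negative squared distance to a single template per phoneme as the frame score. The templates, dictionary entries, and probabilities are invented for illustration and are not the contents of FIG. 18.

```python
import math


def frame_score(feature, template):
    """Score of one frame: negative squared distance to the phoneme template."""
    return -sum((x - y) ** 2 for x, y in zip(feature, template))


def recognize(features, templates, dictionary):
    """Return the word with the highest cumulative score over all frames."""
    best_word, best_score = None, -math.inf
    n = len(features)
    for word, (phonemes, prior) in dictionary.items():
        score = math.log(prior)                 # occurrence probability of the word
        for t, feature in enumerate(features):
            # Uniform alignment (assumption): frame t maps to one phoneme by position.
            phoneme = phonemes[t * len(phonemes) // n]
            score += frame_score(feature, templates[phoneme])
        if score > best_score:
            best_word, best_score = word, score
    return best_word


templates = {"a": [1.0], "k": [0.0], "o": [2.0]}            # hypothetical 1-D patterns
dictionary = {"aka": (["a", "k", "a"], 0.5), "ao": (["a", "o"], 0.5)}
result = recognize([[1.0], [1.1], [2.0], [1.9]], templates, dictionary)
```

A real matcher would replace the uniform alignment with a search over HMM state sequences, but the final selection of the highest cumulative score is the same.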

When the collation unit 17 has obtained the speech recognition result as described above, the recognition result output unit 18 outputs it (step ST16).

Because the conventional acoustic processing device is configured as described above, it can compress the acoustic features without degrading speech recognition accuracy. However, it always compresses the acoustic features by the same method, without considering the local properties of the speech, the type of speech, the transmission conditions during code transmission, or the difficulty of the recognition task. As a result, there has been the problem that the acoustic features cannot always be compressed sufficiently.

The present invention has been made to solve the above problem, and an object thereof is to obtain an acoustic processing device, an acoustic processing method, and an acoustic processing program that can increase the degree of compression of acoustic features without degrading speech recognition accuracy.
Another object of the present invention is to obtain a verification processing device, a verification processing method, and a verification processing program that can obtain a speech recognition result from highly compressed acoustic features.

In the acoustic processing device according to the present invention, the output target determination means determines that an acoustic feature extracted by the feature extraction means is not to be included in the output when its variation is smaller than a reference variation, and is to be included in the output when its variation is larger than the reference variation.
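The determination described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the variation is assumed to be the squared distance from the most recently output feature, the first frame is always output, and a variation equal to the reference is treated as "include". The names `select_output_frames` and `reference` are hypothetical.

```python
def select_output_frames(features, reference):
    """Return (index, feature) pairs whose variation reaches `reference`."""
    selected = []
    last_sent = None
    for t, feature in enumerate(features):
        if last_sent is None:
            variation = float("inf")            # assumption: always output the first frame
        else:
            variation = sum((x - y) ** 2 for x, y in zip(feature, last_sent))
        if variation >= reference:              # small variation: excluded from the output
            selected.append((t, feature))
            last_sent = feature
    return selected


C = [[1.0], [1.0], [1.0], [3.0], [3.0]]
sent = select_output_frames(C, reference=0.25)
```

In this toy sequence only two of the five frames are output, which illustrates how excluding slowly changing frames reduces the transmitted information.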

In the acoustic processing method according to the present invention, an acoustic feature is determined not to be included in the output when its variation is smaller than a reference variation, and to be included in the output when its variation is larger than the reference variation.

In the verification processing method according to the present invention, it is determined whether or not the encoded signal contains an acoustic feature.

The acoustic processing program according to the present invention provides an output target determination procedure that determines that an acoustic feature extracted by the feature extraction procedure is not to be included in the output when its variation is smaller than a reference variation, and is to be included in the output when its variation is larger than the reference variation.

The verification processing program according to the present invention provides an inclusion determination procedure that determines whether or not the encoded signal contains an acoustic feature.

According to the present invention, the output target determination means determines that an acoustic feature extracted by the feature extraction means is not to be included in the output when its variation is smaller than the reference variation, and is to be included in the output when its variation is larger than the reference variation. Acoustic features in portions where the change is small can therefore be excluded from the output, so the transmitted information can be reduced efficiently. In addition, the determination of whether to include an acoustic feature in the output can be made without complicating the configuration.

According to the present invention, inclusion determination means is provided that determines whether or not the encoded signal contains an acoustic feature, so a speech recognition result can be obtained from highly compressed acoustic features.

According to the present invention, an acoustic feature is determined not to be included in the output when its variation is smaller than the reference variation, and to be included in the output when its variation is larger than the reference variation. Acoustic features in portions where the change is small can therefore be excluded from the output, so the transmitted information can be reduced efficiently. In addition, the determination of whether to include an acoustic feature in the output can be made without complicating the configuration.

According to the present invention, it is determined whether or not the encoded signal contains an acoustic feature, so a speech recognition result can be obtained from highly compressed acoustic features.

According to the present invention, an output target determination procedure is provided that determines that an acoustic feature extracted by the feature extraction procedure is not to be included in the output when its variation is smaller than the reference variation, and is to be included in the output when its variation is larger than the reference variation. Acoustic features in portions where the change is small can therefore be excluded from the output, so the transmitted information can be reduced efficiently. In addition, the determination of whether to include an acoustic feature in the output can be made without complicating the configuration.

According to the present invention, an inclusion determination procedure is provided that determines whether or not the encoded signal contains an acoustic feature, so a speech recognition result can be obtained from highly compressed acoustic features.

An embodiment of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing an acoustic processing device and a verification processing device according to Embodiment 1 of the present invention. In the figure, 21 is a speech input unit that receives a speech signal to be recognized and A/D-converts it; 22 is an acoustic feature calculation unit (feature extraction means) that divides the digital speech signal output from the speech input unit 21 into frames at a fixed time period, analyzes them, and calculates acoustic features representing the phonetic characteristics of the speech signal; 23 is a method determining unit (method determining means) that determines the information compression method and the quantization/encoding method for the acoustic features calculated by the acoustic feature calculation unit 22; 24 is an acoustic feature compression unit (signal compression means) that compresses the acoustic features according to the information compression method determined by the method determining unit 23; 25 is a quantization table; 26 is a quantization/encoding unit (quantization/encoding means) that quantizes and encodes the acoustic features according to the quantization/encoding method determined by the method determining unit 23 while referring to the quantization table 25; and 27 is a code output unit that transmits the acoustic features encoded by the quantization/encoding unit 26 to the verification processing device.

31 is a code input unit that receives the acoustic features, as an encoded signal, transmitted from the acoustic processing device; 32 is a method identification unit (method identification means) that identifies the information compression method and the quantization/encoding method of the acoustic features received by the code input unit 31; 33 is a quantization table; 34 is a decoding/inverse quantization unit (decoding/inverse quantization means) that decodes and inversely quantizes the acoustic features according to the decoding/inverse quantization method corresponding to the quantization/encoding method identified by the method identification unit 32 while referring to the quantization table 33; 35 is an acoustic feature restoration unit (decompression means) that undoes the signal compression of the acoustic features according to the information restoration method corresponding to the information compression method identified by the method identification unit 32 and restores the original acoustic features; 36 is a standard pattern indicating the properties of the acoustic features of the units that make up the recognition target; 37 is a language dictionary; 38 is a collation unit (collation means) that matches the acoustic features restored by the acoustic feature restoration unit 35 against the standard pattern 36 and the language dictionary 37; and 39 is a recognition result output unit that outputs the speech recognition result obtained by the collation unit 38.

FIG. 2 is a flowchart showing the acoustic processing method according to Embodiment 1 of the present invention, and FIG. 3 is a flowchart showing the verification processing method according to Embodiment 1 of the present invention.
The components of the acoustic processing device and the verification processing device in FIG. 1 may be implemented in hardware. Alternatively, a program describing the processing of each component may be prepared and executed by a computer (not shown). The same applies to the other embodiments described below.

Next, the operation will be described.
First, as with the conventional speech input unit 1, when a speech signal S to be recognized is input, the speech input unit 21 of the acoustic processing device A/D-converts the speech signal S (step ST21).
When the speech input unit 21 outputs the digital speech signal S, the acoustic feature calculation unit 22, like the conventional acoustic feature calculation unit 2, divides the speech signal S into frames at a fixed time period, analyzes them, and calculates acoustic features representing the phonetic characteristics of the speech signal S (step ST22).

When the acoustic feature calculation unit 22 has calculated the acoustic features, the method determining unit 23 refers to the current and past acoustic features and determines the information compression method and the quantization/encoding method for the current acoustic feature (step ST23).
For example, when determining the information compression method for the acoustic feature c(t) in the frame at time t (in this example, it is assumed that once the information compression method for c(t) is determined, the quantization/encoding method is uniquely determined), the method determining unit 23 calculates the squared value |v(t)|² of the linear prediction residual v(t) based on the acoustic features of the frames up to K time steps earlier (for K, see Equation (4)), and compares |v(t)|² with an appropriately set threshold th.

If |v(t)|² ≥ th, the unit judges that the acoustic feature c(t) cannot be compressed efficiently by linear prediction, and adopts the method in which c(t) is quantized and encoded without signal compression.
If, on the other hand, |v(t)|² < th, the unit judges that the acoustic feature c(t) can be compressed efficiently by linear prediction, and adopts the method in which only the linear prediction residual v(t) is quantized and encoded.
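The threshold decision of step ST23 can be sketched as follows. The residual is again assumed to have the form v(t) = c(t) + a(1)c(t-1) + ... + a(K)c(t-K), since Equation (4) is not reproduced in this text; the identifiers `RAW`, `RESIDUAL`, and `choose_method` are hypothetical.

```python
RAW, RESIDUAL = 0, 1   # hypothetical identifiers for the two methods


def choose_method(features, t, coeffs, threshold):
    """Return (method, vector) for frame t.

    The residual v(t) is selected when |v(t)|^2 < threshold; otherwise c(t) itself.
    """
    v = list(features[t])
    for k, a_k in enumerate(coeffs, start=1):
        if t - k >= 0:                          # c(t - k) = 0 when out of range
            for d in range(len(v)):
                v[d] += a_k * features[t - k][d]
    energy = sum(x * x for x in v)              # |v(t)|^2
    if energy >= threshold:
        return RAW, list(features[t])           # prediction is poor: send c(t)
    return RESIDUAL, v                          # prediction is good: send v(t)


C = [[1.0, 2.0], [1.5, 2.5], [4.0, 0.0]]
decisions = [choose_method(C, t, coeffs=[-1.0], threshold=1.0) for t in range(len(C))]
```

The first frame has no history, so its residual equals c(1) and it falls on the uncompressed side of the threshold in this example.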

When the method determining unit 23 has determined the information compression method, the acoustic feature compression unit 24 compresses the acoustic features calculated by the acoustic feature calculation unit 22 according to that method (step ST24). Apart from using the information compression method determined by the method determining unit 23, it is the same as the conventional acoustic feature compression unit 3.
When the acoustic feature compression unit 24 has compressed the acoustic features, the quantization/encoding unit 26 quantizes and encodes them according to the quantization/encoding method determined by the method determining unit 23 while referring to the quantization table 25 (step ST25). Apart from using the quantization/encoding method determined by the method determining unit 23, it is the same as the conventional quantization/encoding unit 5.

The code output unit 27 transmits the acoustic feature quantity encoded by the quantization/encoding unit 26 to the collation processing device (step ST26).
When transmitting the encoded acoustic feature quantity, however, it attaches header information indicating the information compression method determined by the method determination unit 23, as shown in FIG. 4. As described above, when fixing the information compression method of the acoustic feature quantity c(t) uniquely fixes the quantization/encoding method, transmitting header information that indicates the information compression method is enough for the collation side to identify the quantization/encoding method of c(t).
Note that if the acoustic feature quantity deviates only slightly from its linearly predicted value, the linear prediction residual can be encoded with a shorter code length than the acoustic feature quantity itself.
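The header mechanism described above can be illustrated with a minimal framing sketch. The one-byte method identifier, the one-byte length field, and the names `METHOD_DIRECT`, `METHOD_LINEAR_PRED`, `pack_frame`, and `unpack_frame` are all assumptions made for illustration; the specification only says that header information indicating the information compression method is attached, and FIG. 4 is not reproduced here.

```python
import struct
from typing import List, Tuple

# Hypothetical method identifiers; the specification defines no numeric values.
METHOD_DIRECT = 0        # feature quantity coded as-is
METHOD_LINEAR_PRED = 1   # linear prediction residual coded


def pack_frame(method_id: int, codes: List[int]) -> bytes:
    """Prefix one frame's quantization codes with a header byte naming the method.

    Assumed layout: [method_id: 1 byte][code count: 1 byte][codes: 1 byte each].
    """
    if not 0 <= method_id <= 255 or len(codes) > 255:
        raise ValueError("method_id and code count must each fit in one byte")
    return struct.pack("BB", method_id, len(codes)) + bytes(codes)


def unpack_frame(packet: bytes) -> Tuple[int, List[int]]:
    """Read the header first, so the receiver knows which decoder to apply."""
    method_id, count = struct.unpack_from("BB", packet, 0)
    return method_id, list(packet[2:2 + count])
```

Because the method identifier travels with every frame in this sketch, the receiving side can select its decoding and restoration methods frame by frame without any other side channel.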

Next, the code input unit 31 of the collation processing device receives, in the same way as the conventional code input unit 11, the acoustic feature quantity transmitted from the acoustic processing device as an encoded signal (step ST31).
When the code input unit 31 receives the acoustic feature quantity, the method judgment unit 32 refers to the header information attached to it and identifies the information compression method and the quantization/encoding method of the acoustic feature quantity (step ST32).

Once the method judgment unit 32 has identified the quantization/encoding method, the decoding/inverse quantization unit 34 refers to the quantization table 33 and decodes and inverse-quantizes the acoustic feature quantity received by the code input unit 31, using the decoding/inverse quantization method that corresponds to that quantization/encoding method (step ST33). Apart from using the decoding/inverse quantization method that corresponds to the quantization/encoding method identified by the method judgment unit 32, it operates in the same way as the conventional decoding/inverse quantization unit 13.

Once the decoding/inverse quantization unit 34 has decoded and inverse-quantized the acoustic feature quantity, the acoustic feature quantity restoration unit 35 releases the signal compression of the acoustic feature quantity, using the information restoration method that corresponds to the information compression method identified by the method judgment unit 32, and restores the original acoustic feature quantity (step ST34). Apart from using the information restoration method that corresponds to the information compression method identified by the method judgment unit 32, it operates in the same way as the conventional acoustic feature quantity restoration unit 14.
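Why the restoration method must correspond to the compression method can be seen in a small round trip. The sketch below assumes the simplest possible predictor, in which each frame is predicted by the previous frame and only the difference is kept; the actual linear predictor and the quantization step are not specified in this passage and are left out. The function names `compress` and `restore` are hypothetical.

```python
from typing import List, Optional


def compress(frames: List[List[float]]) -> List[List[float]]:
    """Keep the first frame as-is; replace each later frame by its residual
    against the previous frame (assumed first-order prediction)."""
    out: List[List[float]] = []
    prev: Optional[List[float]] = None
    for c in frames:
        out.append(list(c) if prev is None else [a - b for a, b in zip(c, prev)])
        prev = c
    return out


def restore(residuals: List[List[float]]) -> List[List[float]]:
    """Invert compress(): add each residual back onto the previously restored frame."""
    out: List[List[float]] = []
    prev: Optional[List[float]] = None
    for r in residuals:
        c = list(r) if prev is None else [a + b for a, b in zip(r, prev)]
        out.append(c)
        prev = c
    return out
```

In this sketch each restored frame depends on the one before it, so a residual that is lost or corrupted in transit also affects every later frame until a frame is sent without prediction.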

Once the acoustic feature quantity restoration unit 35 has restored the original acoustic feature quantity, the collation unit 38 collates it with the standard pattern 36 and the language dictionary 37, in the same way as the conventional collation unit 17, and obtains a recognition result for the input speech (steps ST35 and ST36).
When the collation unit 38 has obtained the speech recognition result, the recognition result output unit 39 outputs it (step ST37).

As is apparent from the above, Embodiment 1 is configured to determine the information compression method and the quantization/encoding method for the acoustic feature quantity extracted by the acoustic feature quantity calculation unit 22, and therefore has the effect of raising the degree of compression of the acoustic feature quantity without degrading speech recognition accuracy.

Embodiment 2.
FIG. 5 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 2 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted.
Reference numeral 41 denotes a transmission line status determination unit (method determination means) that determines the information compression method and the quantization/encoding method of the acoustic feature quantity in consideration of the line status of the output transmission path.
FIG. 6 is a flowchart showing an acoustic processing method according to Embodiment 2 of the present invention.

Next, the operation will be described.
In Embodiment 1, the method determination unit 23 refers to the current and past acoustic feature quantities to determine the information compression method and the quantization/encoding method of the current acoustic feature quantity. Alternatively, the transmission line status determination unit 41 may determine the information compression method and the quantization/encoding method of the acoustic feature quantity in consideration of the line status of the output transmission path and the variation of the acoustic feature quantity (steps ST41 and ST42).

Specifically, the transmission line status determination unit 41 measures the code error rate and the effective transmission capacity of the output transmission line.
It then determines the information compression method and the quantization/encoding method of the acoustic feature quantity according to the measurement result.

For example, when the code error rate is higher than a reference value, it adopts an information compression method and a quantization/encoding method that allow an acoustic feature quantity frame to be restored from the contents of a single time frame alone.
When the transmission capacity is smaller than a reference capacity, it calculates a linear prediction residual from the acoustic feature quantities of adjacent frames and adopts an information compression method and a quantization/encoding method with high compression efficiency.
The transmission line status determination unit 41 does not need to judge the error rate and transmission capacity of the transmission line for every time frame. For example, it may judge the error rate and transmission capacity once per utterance and assume that the line stays in the same state thereafter.
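The selection rule of Embodiment 2 can be sketched as a small decision function. The threshold values, the scheme labels, the priority given to the error-rate test when both conditions hold, and the choice made when neither condition holds are all assumptions; the text states only the two individual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    compression: str
    quantization: str


# Hypothetical scheme labels, not taken from the specification.
INTRA_FRAME = Scheme("intra_frame", "intra_frame_codebook")
PREDICTIVE = Scheme("linear_prediction_residual", "residual_codebook")


def choose_scheme(error_rate: float,
                  capacity_bps: float,
                  max_error_rate: float = 1e-3,      # assumed reference value
                  min_capacity_bps: float = 4800.0,  # assumed reference capacity
                  default: Scheme = INTRA_FRAME) -> Scheme:
    """Pick a compression/quantization scheme from measured line conditions."""
    if error_rate > max_error_rate:
        # Noisy line: every frame must be restorable on its own.
        return INTRA_FRAME
    if capacity_bps < min_capacity_bps:
        # Narrow line: favor the higher compression of residual coding.
        return PREDICTIVE
    return default
```

Since the line status need not be judged every frame, a caller could evaluate `choose_scheme` once at the start of an utterance and reuse the result for all of its frames.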

Embodiment 3.
FIG. 7 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 3 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted.
Reference numeral 42 denotes a task difficulty determination unit (method determination means) that determines the information compression method and the quantization/encoding method of the acoustic feature quantity in consideration of how difficult the input speech is to recognize.
FIG. 8 is a flowchart showing an acoustic processing method according to Embodiment 3 of the present invention.

Next, the operation will be described.
In Embodiment 1, the method determination unit 23 refers to the current and past acoustic feature quantities to determine the information compression method and the quantization/encoding method of the current acoustic feature quantity. Alternatively, the task difficulty determination unit 42 may determine the information compression method and the quantization/encoding method of the acoustic feature quantity in consideration of the recognition difficulty of the input speech (steps ST43 and ST44).

Specifically, the task difficulty determination unit 42 obtains an index of the recognition difficulty of the input speech and determines the information compression method and the quantization/encoding method of the acoustic feature quantity according to that index.
Typical indices of the difficulty of a speech recognition target are the background noise level at the time of speech input and the word perplexity, which corresponds to the number of words that can be recognized at the same time.

The information compression method and the quantization/encoding method are chosen from the index as follows. If the recognition target is easy, methods that allow a large amount of information compression are adopted. If the recognition target is difficult, methods that introduce little distortion through information compression are adopted.
An appropriate signal compression method and number of quantization bits are then fixed with these considerations in mind. The information compression method and the quantization/encoding method need not be determined for every time frame; they may be fixed for a whole utterance or a whole recognition task. Which signal compression method and number of quantization bits are appropriate is investigated in advance for each recognition condition.
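One way to read the paragraph above is as a table lookup prepared offline. In the sketch below, the two difficulty classes, the limits of 60 dB and perplexity 100, the scheme labels, and the bit counts are all hypothetical placeholders for values that the text says are to be found by prior investigation.

```python
from typing import Dict


def difficulty_class(noise_db: float,
                     perplexity: float,
                     noise_limit_db: float = 60.0,      # assumed limit
                     perplexity_limit: float = 100.0    # assumed limit
                     ) -> str:
    """Classify the recognition task from background noise and word perplexity."""
    hard = noise_db >= noise_limit_db or perplexity >= perplexity_limit
    return "hard" if hard else "easy"


# Assumed result of the offline investigation, one entry per recognition condition.
PRESET: Dict[str, Dict[str, object]] = {
    "easy": {"compression": "high_ratio", "bits_per_coefficient": 4},
    "hard": {"compression": "low_distortion", "bits_per_coefficient": 8},
}


def scheme_for_utterance(noise_db: float, perplexity: float) -> Dict[str, object]:
    """Look up the preset once per utterance or task rather than once per frame."""
    return PRESET[difficulty_class(noise_db, perplexity)]
```

A finer-grained table with more than two classes would follow the same pattern; only the classification function and the preset entries would change.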

Embodiment 4.
FIG. 9 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 4 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted.
Reference numeral 43 denotes a time frame thinning unit (output target determination means) that determines whether the acoustic feature quantity calculated by the acoustic feature quantity calculation unit 22 is to be included in the output target. Reference numeral 44 denotes a frame period determination unit (inclusion determination means) that determines whether the acoustic feature quantity of the target frame has been omitted. Reference numeral 45 denotes a collation unit (collation means) that, when the determination result of the frame period determination unit 44 indicates that the acoustic feature quantity has not been omitted, collates the acoustic feature quantity output from the acoustic feature quantity restoration unit 35 with the standard pattern 36 and the language dictionary 37, and, when the determination result indicates that it has been omitted, reuses a collation result relating to another acoustic feature quantity.
FIG. 10 is a flowchart showing an acoustic processing method according to Embodiment 4 of the present invention, and FIG. 11 is a flowchart showing a collation processing method according to Embodiment 4 of the present invention.

Next, the operation will be described.
When the acoustic feature quantity calculation unit 22 has calculated an acoustic feature quantity in the same way as in Embodiment 1, the time frame thinning unit 43 determines whether to include that acoustic feature quantity in the output target.
That is, the time frame thinning unit 43 examines how the acoustic feature quantity at the target time compares with those at the preceding and following times, in order to judge whether omitting the acoustic feature quantity of the frame at the target time would have only a small effect on speech recognition accuracy on the collation side. To do so, it first obtains the variation d between the last acoustic feature quantity c(τ) that was output without being omitted and the acoustic feature quantity c(t) at the current time (step ST51).

When the variation d(t) of the acoustic feature quantity exceeds the threshold thh, the effect on speech recognition accuracy would be large, so the time frame thinning unit 43 sets the thinning flag to "0" to indicate that the acoustic feature quantity is not omitted (steps ST52 and ST53).
In this case, the acoustic feature quantity compression unit 24 compresses the acoustic feature quantity calculated by the acoustic feature quantity calculation unit 22 according to a predetermined information compression method (step ST54), and the quantization/encoding unit 26 refers to the quantization table 25 and quantizes and encodes the acoustic feature quantity according to a predetermined quantization/encoding method (step ST55).
The code output unit 27 then transmits the acoustic feature quantity encoded by the quantization/encoding unit 26, together with the thinning flag, to the collation processing device (step ST56).

When the variation d(t) of the acoustic feature quantity is below the threshold thh, the effect on speech recognition accuracy would be small, so the time frame thinning unit 43 sets the thinning flag to "1" to indicate that the acoustic feature quantity is omitted (steps ST52 and ST57).
In this case, the acoustic feature quantity compression unit 24 does not compress the acoustic feature quantity, and the quantization/encoding unit 26 does not quantize or encode it.
The code output unit 27 then transmits only the thinning flag to the collation processing device.

Here, the variation d(t) of the acoustic feature quantity is, for example, the variance-weighted Euclidean distance between acoustic parameter vectors defined by equation (8).

In equation (8), c_i(t) is the i-th dimension of the acoustic feature quantity at time t, M is the number of dimensions of the acoustic feature vector, and var[i] is the variance of the i-th dimension of the acoustic feature quantity. As a result, no acoustic feature quantity is output in portions where it changes little, so the amount of information needed to represent the acoustic feature quantities can be reduced.
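The thinning decision can be sketched as follows. Equation (8) itself is not reproduced in this text, so the distance below is an assumed form based on its description: the square root of the sum over i = 1..M of (c_i(t) - c_i(τ))² / var[i]. Whether the original takes the square root, how a variation exactly equal to thh is handled, and the rule that the first frame is always sent are assumptions; `weighted_distance` and `thin_frames` are hypothetical names.

```python
import math
from typing import List, Optional, Sequence, Tuple


def weighted_distance(c_t: Sequence[float],
                      c_tau: Sequence[float],
                      var: Sequence[float]) -> float:
    """Variance-weighted Euclidean distance between two feature vectors (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(c_t, c_tau, var)))


def thin_frames(frames: Sequence[Sequence[float]],
                var: Sequence[float],
                thh: float) -> List[Tuple[int, Optional[Sequence[float]]]]:
    """Return one (thinning_flag, features) pair per frame.

    Flag 0: the frame is sent. Flag 1: the frame is omitted and features is None.
    Each frame is compared with the last frame that was actually sent, c(tau).
    """
    out: List[Tuple[int, Optional[Sequence[float]]]] = []
    last_sent: Optional[Sequence[float]] = None
    for c in frames:
        if last_sent is not None and weighted_distance(c, last_sent, var) < thh:
            out.append((1, None))
        else:
            out.append((0, c))
            last_sent = c
    return out
```

Comparing against the last sent frame rather than the immediately preceding one keeps a slow drift from being thinned out indefinitely, because the accumulated change eventually exceeds the threshold.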

When the code input unit 31 receives a thinning flag, the frame period determination unit 44 of the collation processing device refers to that flag and determines whether the acoustic feature quantity of the target frame has been omitted (steps ST61 and ST62).

When the acoustic feature quantity of the target frame has not been omitted, that is, when the thinning flag is "0", the decoding/inverse quantization unit 34 refers to the quantization table 33 and decodes and inverse-quantizes the acoustic feature quantity received by the code input unit 31 according to a predetermined decoding/inverse quantization method (step ST63), and the acoustic feature quantity restoration unit 35 releases the signal compression of the acoustic feature quantity according to a predetermined information restoration method and restores the original acoustic feature quantity (step ST64).

The collation unit 45 then performs collation by the same procedure as the collation unit 17 described in Embodiment 1 (steps ST65 and ST66).
Collation procedure of the collation unit 45:
(1) Collate the acoustic feature quantity C' with the entries of the standard pattern 36 that make up each recognition candidate to obtain a collation score.
(2) For each recognition candidate, obtain the cumulative score up to an intermediate point or up to the end.
(3) On reaching the final frame of the acoustic feature quantity C', take the word with the highest cumulative score as the speech recognition result.

On the other hand, when the acoustic feature quantity of the target frame has been omitted, that is, when the thinning flag is "1", the decoding/inverse quantization unit 34 does not decode or inverse-quantize an acoustic feature quantity, and the acoustic feature quantity restoration unit 35 does not release signal compression.

The collation unit 45 then reuses a collation result relating to another acoustic feature quantity.
That is, it interpolates the collation score of the omitted frame and updates the cumulative score. The interpolated collation score is, for example, the collation score for the most recently received acoustic feature quantity. This processing continues until the final frame of the acoustic feature quantities, and the word with the highest cumulative collation score is taken as the speech recognition result.
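The score reuse described above can be sketched with a deliberately simplified scorer. A real collation unit aligns frames against the states of the standard patterns; here each candidate word is scored frame by frame through a caller-supplied function, which is an assumption made to keep the example short. The name `recognize`, the stream format of (flag, features) pairs, and the handling of an omitted frame that arrives before any received frame are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Frame = Optional[Sequence[float]]


def recognize(stream: Iterable[Tuple[int, Frame]],
              candidates: List[str],
              frame_score: Callable[[str, Sequence[float]], float]) -> str:
    """Accumulate per-frame scores and return the best-scoring candidate.

    For a frame with flag 1 (omitted), the scores of the most recently
    received frame are added again instead of being recomputed.
    """
    totals: Dict[str, float] = {w: 0.0 for w in candidates}
    last: Optional[Dict[str, float]] = None
    for flag, feats in stream:
        if flag == 0 and feats is not None:
            last = {w: frame_score(w, feats) for w in candidates}
        elif last is None:
            continue  # nothing received yet to reuse (assumed behavior)
        for w in candidates:
            totals[w] += last[w]
    return max(totals, key=totals.get)
```

Reusing the previous scores keeps every candidate's cumulative score spanning the same number of frames, so the comparison at the final frame is not skewed by the omissions.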

As is apparent from the above, Embodiment 4 is configured to determine whether the acoustic feature quantity calculated by the acoustic feature quantity calculation unit 22 is to be included in the output target, so acoustic feature quantities in portions with little change can be excluded from the output target. As a result, it has the effect of reducing the transmitted information efficiently.

Embodiment 5.
FIG. 12 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 5 of the present invention. In the figure, the same reference numerals as in FIG. 9 denote the same or corresponding parts, and their description is omitted.
Reference numeral 46 denotes a silence determination unit (output target determination means) that determines whether the acoustic feature quantity calculated by the acoustic feature quantity calculation unit 22 corresponds to a silent state and, on that basis, whether to include the acoustic feature quantity in the output target. Reference numeral 47 denotes a silent frame determination unit (inclusion determination means) that determines whether the acoustic feature quantity of the target frame has been omitted.
FIG. 13 is a flowchart showing an acoustic processing method according to Embodiment 5 of the present invention, and FIG. 14 is a flowchart showing a collation processing method according to Embodiment 5 of the present invention.

Next, the operation will be described.
In Embodiment 4, the time frame thinning unit 43 compares the variation d(t) of the acoustic feature quantity with the threshold thh to determine whether to include the acoustic feature quantity in the output target. Alternatively, the silence determination unit 46 may determine whether the acoustic feature quantity corresponds to a silent state and, on that basis, whether to include it in the output target.

That is, the silence determination unit 46 determines whether the target speech frame belongs to a voiced section or a silent section by measuring, for example, the short-time power and the number of zero crossings of the speech in that frame (step ST71).
When the target speech frame belongs to a voiced section, it sets the silence flag to "0" to indicate that the acoustic feature quantity is not omitted (step ST72).
Steps ST54 to ST56 are then executed in the same way as in Embodiment 4.

On the other hand, when the target speech frame belongs to a silent section, it sets the silence flag to "1" to indicate that the acoustic feature quantity is omitted (step ST73).
The silence flag is then transmitted from the code output unit 27 to the collation processing device.
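A minimal sketch of the silence determination follows. The text names short-time power and zero-crossing count as the measured quantities but does not say how they are combined; the rule used here (a frame is silent only if both its power and its zero-crossing count are low, so that weak but rapidly alternating sounds such as fricatives are not dropped) and the default thresholds are assumptions.

```python
from typing import Sequence


def short_time_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)


def zero_crossings(samples: Sequence[float]) -> int:
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


def silence_flag(samples: Sequence[float],
                 power_floor: float = 1e-4,   # assumed threshold
                 crossing_limit: int = 40     # assumed threshold
                 ) -> int:
    """Return 1 (silent, feature omitted) or 0 (voiced, feature sent)."""
    quiet = short_time_power(samples) < power_floor
    few_crossings = zero_crossings(samples) < crossing_limit
    return 1 if (quiet and few_crossings) else 0
```

In practice both thresholds would need tuning to the microphone level and frame length, which the text leaves open.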

When the code input unit 31 receives a silence flag, the silent frame determination unit 47 of the collation processing device refers to that flag and determines whether the acoustic feature quantity of the target frame has been omitted (steps ST81 and ST82).

When the acoustic feature quantity of the target frame has not been omitted, that is, when the silence flag is "0", steps ST63 and ST64 are executed as in Embodiment 4. When the acoustic feature quantity of the target frame has been omitted, that is, when the silence flag is "1", steps ST63 and ST64 are skipped, again as in Embodiment 4.
The remaining operation is the same as in Embodiment 4, and its description is omitted.

FIG. 1 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an acoustic processing method according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing a collation processing method according to Embodiment 1 of the present invention.
FIG. 4 is an explanatory diagram showing an example of encoding an acoustic feature quantity.
FIG. 5 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 2 of the present invention.
FIG. 6 is a flowchart showing an acoustic processing method according to Embodiment 2 of the present invention.
FIG. 7 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 3 of the present invention.
FIG. 8 is a flowchart showing an acoustic processing method according to Embodiment 3 of the present invention.
FIG. 9 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 4 of the present invention.
FIG. 10 is a flowchart showing an acoustic processing method according to Embodiment 4 of the present invention.
FIG. 11 is a flowchart showing a collation processing method according to Embodiment 4 of the present invention.
FIG. 12 is a block diagram showing an acoustic processing device and a collation processing device according to Embodiment 5 of the present invention.
FIG. 13 is a flowchart showing an acoustic processing method according to Embodiment 5 of the present invention.
FIG. 14 is a flowchart showing a collation processing method according to Embodiment 5 of the present invention.
FIG. 15 is a block diagram showing a conventional acoustic processing device and collation processing device.
FIG. 16 is a flowchart showing the processing performed by the acoustic processing device.
FIG. 17 is a flowchart showing the processing performed by the collation processing device.
FIG. 18 is an explanatory diagram showing an example of a language dictionary description.

Explanation of symbols

1 speech input unit, 2 acoustic feature quantity calculation unit, 3 acoustic feature quantity compression unit, 4 quantization table, 5 quantization/encoding unit, 6 code output unit, 11 code input unit, 12 quantization table, 13 decoding/inverse quantization unit, 14 acoustic feature quantity restoration unit, 15 standard pattern, 16 language dictionary, 17 collation unit, 18 recognition result output unit, 21 speech input unit, 22 acoustic feature quantity calculation unit (feature extraction means), 23 method determination unit (method determination means), 24 acoustic feature quantity compression unit (signal compression means), 25 quantization table, 26 quantization/encoding unit (quantization/encoding means), 27 code output unit, 31 code input unit, 32 method judgment unit (method discrimination means), 33 quantization table, 34 decoding/inverse quantization unit (decoding/inverse quantization means), 35 acoustic feature quantity restoration unit (decompression means), 36 standard pattern, 37 language dictionary, 38 collation unit (collation means), 39 recognition result output unit, 41 transmission line status determination unit (method determination means), 42 task difficulty determination unit (method determination means), 43 time frame thinning unit (output target determination means), 44 frame period determination unit (inclusion determination means), 45 collation unit (collation means), 46 silence determination unit (output target determination means), 47 silent frame determination unit (inclusion determination means).

Claims

An acoustic processing device comprising: feature extraction means for analyzing input speech and extracting an acoustic feature quantity; output target determination means for determining that the acoustic feature quantity is not to be included in the output target when the variation of the acoustic feature quantity extracted by the feature extraction means is smaller than a reference variation, and that the acoustic feature quantity is to be included in the output target when the variation is larger than the reference variation; signal compression means for compressing the acoustic feature quantity according to a predetermined information compression method when the determination result of the output target determination means indicates that the acoustic feature quantity is to be included in the output target; and quantization/encoding means for quantizing and encoding the acoustic feature quantity according to a predetermined quantization/encoding method when the determination result of the output target determination means indicates that the acoustic feature quantity is to be included in the output target.

A collation processing device comprising: inclusion determination means for determining whether an acoustic feature quantity is included in an encoded signal; decoding/inverse quantization means for decoding and inverse-quantizing the acoustic feature quantity according to a predetermined decoding/inverse quantization method when the determination result of the inclusion determination means indicates that it is included; decompression means for releasing the signal compression of the acoustic feature quantity according to a predetermined information restoration method when the determination result of the inclusion determination means indicates that it is included; and collation means for collating the acoustic feature quantity output from the decompression means with a standard pattern when the determination result of the inclusion determination means indicates that it is included, and for reusing a collation result relating to another acoustic feature quantity when the determination result indicates that it is not included.

An acoustic processing method comprising: analyzing input speech and extracting an acoustic feature quantity; determining that the acoustic feature quantity is not to be included in the output target when the variation of the acoustic feature quantity is smaller than a reference variation, and that it is to be included in the output target when the variation is larger than the reference variation; and, when the determination result indicates that the acoustic feature quantity is to be included in the output target, compressing the acoustic feature quantity according to a predetermined information compression method and quantizing and encoding it according to a predetermined quantization/encoding method.

A collation processing method comprising: determining whether an acoustic feature quantity is included in an encoded signal; when the determination result indicates that it is included, decoding and inverse-quantizing the acoustic feature quantity according to a predetermined decoding/inverse quantization method, releasing the signal compression of the acoustic feature quantity according to a predetermined information restoration method, and collating the acoustic feature quantity with a standard pattern; and, when the determination result indicates that it is not included, reusing a collation result relating to another acoustic feature quantity.

An acoustic processing program that causes a computer to execute: a feature extraction procedure for extracting an acoustic feature by analyzing input speech; an output target determination procedure for determining that the acoustic feature is not to be included in the output target if the variation of the acoustic feature extracted by the feature extraction procedure is smaller than a reference variation, and that the acoustic feature is to be included in the output target if the variation is larger than the reference variation; a signal compression procedure for signal-compressing the acoustic feature according to a predetermined information compression method when the determination result of the output target determination procedure indicates that the acoustic feature is to be included in the output target; and a quantization/encoding procedure for quantizing and encoding the acoustic feature according to a predetermined quantization/encoding method when the determination result of the output target determination procedure indicates that the acoustic feature is to be included in the output target.

A verification processing program that causes a computer to execute: an inclusion determination procedure for determining whether or not an acoustic feature is included in an encoded signal; a decoding/inverse quantization procedure for decoding and inverse-quantizing the acoustic feature according to a predetermined decoding/inverse quantization scheme when the determination result of the inclusion determination procedure indicates that the acoustic feature is included; a compression release procedure for releasing the signal compression of the acoustic feature according to a predetermined information restoration method when the determination result of the inclusion determination procedure indicates that the acoustic feature is included; and a collation procedure for collating the acoustic feature output from the compression release procedure against a standard pattern when the determination result of the inclusion determination procedure indicates that the acoustic feature is included, and for reusing the collation result obtained for another acoustic feature when the determination result indicates that the acoustic feature is not included.