
JPH08328583A - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
JPH08328583A
JPH08328583A JP7136725A JP13672595A
Authority
JP
Japan
Prior art keywords
likelihood
cumulative likelihood
cumulative
frame
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP7136725A
Other languages
Japanese (ja)
Other versions
JP2853731B2 (en)
Inventor
Shinsuke Sakai
信輔 坂井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Priority to JP7136725A priority Critical patent/JP2853731B2/en
Publication of JPH08328583A publication Critical patent/JPH08328583A/en
Application granted granted Critical
Publication of JP2853731B2 publication Critical patent/JP2853731B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

PURPOSE: To provide a high-speed speech recognition device that requires only a small amount of processing to determine the pruning threshold, while keeping search efficiency stable against fluctuations in the ambient noise environment and changes in the beam-width setting. CONSTITUTION: A feature extraction unit 101 converts the speech input into a time series of feature vectors and outputs it to a recurrence formula calculation unit 104. A standard pattern storage unit 102 stores standard patterns. A cumulative likelihood storage unit 103 stores the cumulative likelihoods output from a cumulative likelihood output unit 105. The recurrence formula calculation unit 104 obtains the cumulative likelihoods up to the i-th frame from the feature vector and standard patterns of the i-th frame and the cumulative likelihoods up to the (i-1)-th frame. The cumulative likelihood output unit 105 selects the candidates up to the M-th rank from the set of input cumulative likelihoods over an initial partial section of K frames, and thereafter selects candidates using the mean value, computed over that K-frame section, of the cumulative likelihood difference between the most likely candidate and the M-th candidate, outputting the result to the cumulative likelihood storage unit 103. A result output unit 106 outputs the recognition result based on the cumulative likelihoods up to the final frame.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device.

[0002]

[Prior Art] Since a speech recognition device requires a very large amount of computation, reduction of the computation by beam search has conventionally been attempted. As methods of setting the beam width for candidate pruning in a beam search, two approaches are well known: at each pruning step, keeping a fixed number of candidates in descending order of likelihood, or keeping every candidate whose likelihood lies within a fixed range of the maximum likelihood.
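For concreteness, the two conventional pruning rules can be sketched as follows in Python; the Candidate container and the function names are illustrative only and do not come from the patent.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Candidate:
    path: tuple        # hypothesised alignment / word sequence
    likelihood: float  # cumulative likelihood of that hypothesis

def prune_top_m(candidates, m):
    """Keep the M best-scoring candidates; needs a partial sort every frame."""
    return heapq.nlargest(m, candidates, key=lambda c: c.likelihood)

def prune_fixed_width(candidates, width):
    """Keep every candidate within `width` of the best score; only a max and a scan."""
    best = max(c.likelihood for c in candidates)
    return [c for c in candidates if c.likelihood >= best - width]
```

The first rule keeps the candidate count constant but pays for the ranking step in every frame; the second avoids sorting but lets the candidate count drift with the acoustic conditions, which is exactly the trade-off the invention addresses.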

[0003] In the paper "Beam Search in Continuous Speech Recognition" by Ito et al., Proceedings of the Acoustical Society of Japan Research Meeting, October 1993, pp. 73-74, it is reported that the method of keeping a fixed number of candidates gives search efficiency that is more stable against changes in the beam-width setting. Furthermore, the number of candidates whose likelihood lies within a fixed range of the maximum likelihood can be expected to fluctuate with the ambient noise environment at the time of utterance, whereas no such fluctuation of the candidate count occurs when a fixed number of candidates is kept.

[0004]

[Problems to be Solved by the Invention] However, the above method of keeping a fixed number of candidates (say M candidates) has the drawback of a large processing load, because a ranking operation is required at each pruning step to find the M-th candidate.

[0005]

[Means for Solving the Problems] According to the invention of claim 1, there is provided a speech recognition device characterized in that, in the first several frames of the input speech, the difference between the maximum likelihood and the likelihood of the candidate at a fixed rank is determined, and thereafter this difference is used to set the threshold for candidate pruning in the beam search.

[0006] According to the invention of claim 2, there is provided a speech recognition device comprising: a feature extraction unit which analyzes a speech signal and outputs a time series of feature vectors; a standard pattern storage unit which stores standard patterns created in advance; a cumulative likelihood storage unit which holds cumulative likelihoods; a recurrence formula calculation unit which obtains new cumulative likelihoods from the cumulative likelihoods stored in the cumulative likelihood storage unit, the time series of feature vectors, and the standard patterns; a cumulative likelihood output unit which, for a certain partial sequence of the feature vector time series, outputs a fixed number of the cumulative likelihoods obtained by the recurrence formula calculation unit and accumulates the difference between the maximum cumulative likelihood and the likelihood at that fixed rank, and which, for the subsequent partial sequences, determines the cumulative likelihoods to be output by means of a threshold derived from the accumulated likelihood differences; and a result output unit which obtains the recognition result for the speech signal from the cumulative likelihoods output by the cumulative likelihood output unit.

[0007] According to the invention of claim 3, there is provided a threshold setting method for a speech recognition device characterized in that, for each of an arbitrary number of partial sequences of the input speech, the mean value of the difference between the cumulative likelihood of the M-th candidate and the maximum cumulative likelihood is determined, and during the following partial sequence the candidate pruning threshold is set using the mean difference obtained in the preceding partial sequence.

[0008]

[Embodiments] Next, the present invention will be described with reference to the drawings.

[0009] FIG. 1 is a block diagram showing one embodiment of the present invention. Referring to FIG. 1, the embodiment comprises a feature extraction unit 101, a standard pattern storage unit 102, a cumulative likelihood storage unit 103, a recurrence formula calculation unit 104, a cumulative likelihood output unit 105, and a result output unit 106.

[0010] The feature extraction unit 101 converts the speech input into a time series of feature vectors and outputs it to the recurrence formula calculation unit 104. The standard pattern storage unit 102 stores the standard patterns. The cumulative likelihood storage unit 103 stores the cumulative likelihoods output from the cumulative likelihood output unit 105; before processing starts, it holds the initial cumulative likelihood value 1.0 for all recognition path candidates. The recurrence formula calculation unit 104 obtains the cumulative likelihoods up to the i-th frame from the feature vector of the i-th frame, the standard patterns, and the cumulative likelihoods up to the (i-1)-th frame. The cumulative likelihood output unit 105 selects, from the set of input cumulative likelihoods, those to be used in the cumulative likelihood calculation of the next frame, and outputs them to the cumulative likelihood storage unit 103. The result output unit 106 outputs the recognition result based on the cumulative likelihoods up to the final frame.
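As a rough orientation, the per-frame data flow between the blocks of FIG. 1 can be sketched as follows; the `recognizer` object and every method name on it are hypothetical stand-ins for blocks 101 to 106, not an interface defined by the patent.

```python
def recognize(audio, recognizer):
    """Illustrative glue code for the block diagram of FIG. 1."""
    features = recognizer.extract_features(audio)        # 101: feature vector time series
    cumulative = recognizer.initial_likelihoods()        # 103: every path starts at 1.0
    for i, feat in enumerate(features, start=1):
        updated = recognizer.recurrence_step(i, feat, cumulative)   # 104: DP update
        cumulative = recognizer.select_candidates(i, updated)       # 105: pruning
    return recognizer.best_result(cumulative)            # 106: recognition result
```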

[0011] Next, the operation of this embodiment will be described with reference to FIGS. 1 and 2.

[0012] The input speech is converted by the feature extraction unit 101, at fixed time intervals, into feature vectors representing the frequency spectrum of the speech, which are output to the recurrence formula calculation unit 104. This fixed time interval is hereafter called a frame. At the i-th frame, the recurrence formula calculation unit 104 uses the standard patterns held in the standard pattern storage unit 102,

REF = {R_1, ..., R_N}, where R_w = {r_w(1), ..., r_w(J_w)},

to compute the local likelihood of the current frame's feature vector against each standard pattern,

l_w(i, j)  (w = 1, ..., N; j = 1, ..., J_w),

where N is the number of standard patterns and J_w is the frame length of the w-th standard pattern. Next, from these local likelihoods and the cumulative likelihood set of the (i-1)-th frame held in the cumulative likelihood storage unit 103,

G = {g_1(i-1, 1), ..., g_1(i-1, J_1), ..., g_N(i-1, 1), ..., g_N(i-1, J_N)},

the recognition path candidates of the current frame and their cumulative likelihoods are obtained by a maximization process based on dynamic programming, expressed as Equation 1 below (step 1 in FIG. 2).
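Equation 1 is not reproduced here, so the sketch below assumes a common frame-synchronous update in which state j of pattern w can be reached from state j or state j-1 of the same pattern; the patent's actual recurrence may differ. Likelihoods are kept in the linear domain to match the initial value of 1.0 mentioned above.

```python
import numpy as np

def recurrence_step(local_likelihood, prev_cumulative):
    """One frame of the DP update in block 104.

    local_likelihood[w] : NumPy array of l_w(i, j) for j = 1..J_w
    prev_cumulative[w]  : NumPy array of g_w(i-1, j), with pruned states set to 0.0
    Returns the arrays of g_w(i, j).
    """
    new_cumulative = {}
    for w, l_w in local_likelihood.items():
        g_prev = prev_cumulative[w]
        g_new = np.zeros_like(l_w)
        for j in range(len(l_w)):
            # Assumed transitions: stay in state j or advance from state j-1.
            best_pred = g_prev[j] if j == 0 else max(g_prev[j], g_prev[j - 1])
            g_new[j] = l_w[j] * best_pred
        new_cumulative[w] = g_new
    return new_cumulative
```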

[0013]

[Equation 1]

The cumulative likelihood output unit 105 compares the frame index i with a predetermined K. If i ≤ K, it finds the M-th largest cumulative likelihood, takes it as the threshold θ for candidate pruning, and computes the difference d between this threshold and the maximum likelihood. So that the average can be computed later, the accumulated value S_d of d is updated as S_d = S_d + d (steps 2, 6, and 7).

[0014] Note that S_d is initialized to 0 before the first frame.

[0015] When i = K, the average D = S_d / K of the differences over the K frames between the maximum likelihood and the candidate pruning threshold is computed (step 3).

[0016] When i > K, the threshold for candidate pruning is set to θ = g_max - D, where g_max is the maximum cumulative likelihood in the i-th frame (step 4).

[0017] In each frame, the cumulative likelihood output unit 105 outputs to the cumulative likelihood storage unit only those recognition path candidates whose likelihood is greater than the cumulative likelihood threshold θ (step 5).
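Steps 2 to 7 of FIG. 2, described in paragraphs [0013] to [0017], can be summarised in the following sketch; the class name and attribute names are illustrative, and the strict ">" comparison in the last line follows the wording of step 5.

```python
import heapq

class AdaptivePruner:
    """Candidate pruning of block 105: learn the beam width over the first K frames,
    then reuse it as a fixed offset below the per-frame maximum."""

    def __init__(self, K, M):
        self.K = K          # number of learning frames
        self.M = M          # rank defining the beam during the learning frames
        self.sum_d = 0.0    # S_d, initialised to 0 before the first frame
        self.D = None       # average gap, fixed once frame K is reached

    def prune(self, i, likelihoods):
        """Return the cumulative likelihoods kept at frame i (1-based)."""
        g_max = max(likelihoods)
        if i <= self.K:
            # Steps 2, 6, 7: the M-th largest likelihood is the threshold theta,
            # and the gap d = g_max - theta is accumulated.
            theta = heapq.nlargest(min(self.M, len(likelihoods)), likelihoods)[-1]
            self.sum_d += g_max - theta
            if i == self.K:
                self.D = self.sum_d / self.K          # step 3: D = S_d / K
        else:
            theta = g_max - self.D                    # step 4: theta = g_max - D
        return [g for g in likelihoods if g > theta]  # step 5
```

A call sequence such as `pruner = AdaptivePruner(K=10, M=100)` followed by `kept = pruner.prune(i, scores)` in every frame reproduces the loop of FIG. 2 for an illustrative choice of K and M.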

[0018] If the current frame is the final frame, the cumulative likelihood output unit 105 outputs to the result output unit 106 all recognition path candidates that have reached the end point of a standard pattern. The result output unit 106 finds the recognition path candidate with the largest cumulative likelihood and outputs the recognition result (steps 8 and 9).

[0019] The present embodiment has been described above using the example in which the mean value of the difference between the cumulative likelihood of the M-th candidate and the maximum cumulative likelihood is determined over the first K frames of the input speech. More generally, however, for each of an arbitrary number L_max of partial sequences li_1, li_2, ..., li_Lmax (L_max ≥ 1) of the input speech (these partial sequences will tentatively be called learning intervals), the above mean value can be determined, and between learning interval li_k and the next learning interval li_(k+1) the candidate pruning threshold can be set using the mean difference obtained in li_k.
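A sketch of this generalised scheme follows; the interval representation (a list of (start, end) frame indices) and the function name are illustrative assumptions, and the first learning interval is assumed to begin at the first frame so that D is always defined when it is needed.

```python
def pruning_thresholds(scores_per_frame, learning_intervals, M):
    """Compute the pruning threshold theta for every frame.

    learning_intervals : list of (start, end) frame index pairs li_1 .. li_Lmax,
                         assumed non-overlapping and given in temporal order.
    """
    thresholds = []
    D = None
    running = 0.0
    for i, scores in enumerate(scores_per_frame):
        g_max = max(scores)
        interval = next(((s, e) for s, e in learning_intervals if s <= i <= e), None)
        if interval is not None:
            s, e = interval
            if i == s:
                running = 0.0                         # start of a new learning interval
            theta = sorted(scores, reverse=True)[min(M, len(scores)) - 1]
            running += g_max - theta
            if i == e:
                D = running / (e - s + 1)             # re-estimate the average gap
        else:
            theta = g_max - D                         # reuse D from the last interval
        thresholds.append(theta)
    return thresholds
```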

[0020]

[Effects of the Invention] As described above, in the speech recognition device according to the present invention, even when the difference between the cumulative likelihood of the M-th candidate and the maximum cumulative likelihood varies greatly with fluctuations in the ambient noise environment or with changes in the setting of the beam width M, the mean value of this difference is determined from a portion of the input and then used to decide the pruning threshold. Candidate pruning comparable to keeping the candidates up to the M-th cumulative likelihood rank is therefore performed over all sections of the input, so that stable search efficiency is obtained without increasing the amount of processing needed to determine the pruning threshold.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition device of the present invention.

FIG. 2 is a flowchart showing the processing flow of the embodiment of the speech recognition device shown in FIG. 1.

[Explanation of Symbols]

101 Feature extraction unit
102 Standard pattern storage unit
103 Cumulative likelihood storage unit
104 Recurrence formula calculation unit
105 Cumulative likelihood output unit
106 Result output unit

Claims (3)

[Claims]

1. A speech recognition device characterized in that, in the first several frames of the input speech, the difference between the maximum likelihood and the likelihood of the candidate at a fixed rank is determined, and thereafter said difference is used to set a threshold for candidate pruning in a beam search.
2. A speech recognition device comprising: a feature extraction unit which analyzes a speech signal and outputs a time series of feature vectors; a standard pattern storage unit which stores standard patterns created in advance; a cumulative likelihood storage unit which holds cumulative likelihoods; a recurrence formula calculation unit which obtains new cumulative likelihoods from the cumulative likelihoods stored in said cumulative likelihood storage unit, the time series of feature vectors, and the standard patterns; a cumulative likelihood output unit which, for a certain partial sequence of the feature vector time series, outputs a fixed number of the cumulative likelihoods obtained by the recurrence formula calculation unit and accumulates the difference between the maximum cumulative likelihood and the likelihood at said fixed rank, and which, for the subsequent partial sequences, determines the cumulative likelihoods to be output by means of a threshold derived from the accumulated likelihood differences; and a result output unit which obtains the recognition result for the speech signal from the cumulative likelihoods output by said cumulative likelihood output unit.
3. A threshold setting method for a speech recognition device, characterized in that, for each of an arbitrary number of partial sequences of the input speech, the mean value of the difference between the cumulative likelihood of the M-th candidate and the maximum cumulative likelihood is determined, and during the following partial sequence the candidate pruning threshold is set using the mean difference obtained in the preceding partial sequence.
JP7136725A 1995-06-02 1995-06-02 Voice recognition device Expired - Lifetime JP2853731B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP7136725A JP2853731B2 (en) 1995-06-02 1995-06-02 Voice recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP7136725A JP2853731B2 (en) 1995-06-02 1995-06-02 Voice recognition device

Publications (2)

Publication Number Publication Date
JPH08328583A true JPH08328583A (en) 1996-12-13
JP2853731B2 JP2853731B2 (en) 1999-02-03

Family

ID=15182045

Family Applications (1)

Application Number Title Priority Date Filing Date
JP7136725A Expired - Lifetime JP2853731B2 (en) 1995-06-02 1995-06-02 Voice recognition device

Country Status (1)

Country Link
JP (1) JP2853731B2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6461660B2 (en) 2015-03-19 2019-01-30 株式会社東芝 Detection apparatus, detection method, and program

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7054935B2 (en) 1998-02-10 2006-05-30 Savvis Communications Corporation Internet content delivery network
US9203636B2 (en) 2001-09-28 2015-12-01 Level 3 Communications, Llc Distributing requests across multiple content delivery networks based on subscriber policy
US8924466B2 (en) 2002-02-14 2014-12-30 Level 3 Communications, Llc Server handoff in content delivery network
US9167036B2 (en) 2002-02-14 2015-10-20 Level 3 Communications, Llc Managed object replication and delivery
US9992279B2 (en) 2002-02-14 2018-06-05 Level 3 Communications, Llc Managed object replication and delivery
US10979499B2 (en) 2002-02-14 2021-04-13 Level 3 Communications, Llc Managed object replication and delivery
US8930538B2 (en) 2008-04-04 2015-01-06 Level 3 Communications, Llc Handling long-tail content in a content delivery network (CDN)
US9762692B2 (en) 2008-04-04 2017-09-12 Level 3 Communications, Llc Handling long-tail content in a content delivery network (CDN)
US10218806B2 (en) 2008-04-04 2019-02-26 Level 3 Communications, Llc Handling long-tail content in a content delivery network (CDN)
US10924573B2 (en) 2008-04-04 2021-02-16 Level 3 Communications, Llc Handling long-tail content in a content delivery network (CDN)

Also Published As

Publication number Publication date
JP2853731B2 (en) 1999-02-03

Similar Documents

Publication Publication Date Title
US8612225B2 (en) Voice recognition device, voice recognition method, and voice recognition program
US9536525B2 (en) Speaker indexing device and speaker indexing method
US7039588B2 (en) Synthesis unit selection apparatus and method, and storage medium
US6980955B2 (en) Synthesis unit selection apparatus and method, and storage medium
US7054810B2 (en) Feature vector-based apparatus and method for robust pattern recognition
US8630853B2 (en) Speech classification apparatus, speech classification method, and speech classification program
US20020184020A1 (en) Speech recognition apparatus
US7319960B2 (en) Speech recognition method and system
JP4531166B2 (en) Speech recognition method using reliability measure evaluation
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
US20030200086A1 (en) Speech recognition apparatus, speech recognition method, and computer-readable recording medium in which speech recognition program is recorded
US20130185068A1 (en) Speech recognition device, speech recognition method and program
JPH08234788A (en) Method and equipment for bias equalization of speech recognition
US7010483B2 (en) Speech processing system
JPH0372999B2 (en)
JPH0962291A (en) Pattern adaptive method using describing length minimum reference
JP2751856B2 (en) Pattern adaptation method using tree structure
JPH08328583A (en) 1996-12-13 Speech recognition device
Gangadharaiah et al. A novel method for two-speaker segmentation.
Shinozaki et al. Hidden mode HMM using bayesian network for modeling speaking rate fluctuation
JP4659541B2 (en) Speech recognition apparatus and speech recognition program
US7912715B2 (en) Determining distortion measures in a pattern recognition process
JP3353334B2 (en) Voice recognition device
KR101134450B1 (en) Method for speech recognition
JPH09305195A (en) Speech recognition device and speech recognition method

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 19981021