JPH05188991A - Speech recognition device

Info

Publication number: JPH05188991A
Application number: JP4005435A
Authority: JP
Inventors: Yukio Tabei; 幸雄田部井
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1992-01-16
Filing date: 1992-01-16
Publication date: 1993-07-30

Abstract

PURPOSE: To provide a speech recognition device with an improved ability to discriminate between similar word utterances. CONSTITUTION: A likelihood calculation part 6 calculates, for an input speech, a sequence of one or more candidates (i) from a connection model of phoneme HMMs. When the candidate sequence is listed in a previously stored similar word table 8, a back tracing part 9 finds the optimum state transition sequence of the candidates under the connection model of phoneme HMMs, segments the feature time series, and automatically extracts a discrimination effective section corresponding to the part where the features differ. A partial similarity calculation part 10 then uses auxiliary features within the discrimination effective section of each candidate. Finally, a total decision part 12 outputs optimum word recognition result information I based on the likelihood of the phoneme HMMs and the partial similarity of the auxiliary features.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that uses, in particular, a probabilistic model to improve the ability to discriminate between similar words.

[0002]

[Prior Art] Speech recognition devices have been actively researched and developed in recent years. These devices generally use a method based on pattern matching. An example of this kind is given in Document 1: Oki Electric Research and Development, Vol. 53, No. 2, pp. 61-66, April 1986.

[0003] This pattern matching method calculates the similarity (or distance) between standard patterns registered in a dictionary in advance and an input pattern, and identifies the category of the standard pattern that gives the maximum similarity (or minimum distance) as the recognition result for the input pattern. As described in Document 1, the DP (Dynamic Programming) matching method is generally used to calculate the similarity (or distance) between patterns of unequal length, such as a standard pattern and an input pattern. However, while the DP matching method is extremely robust to variation and deformation in the temporal structure of a time-series pattern, it is weak against spectral variation.
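As an illustration of the DP matching described above, the following is a minimal sketch, not the method of Document 1. The Euclidean frame distance, the symmetric step pattern, and the names `frame_distance`, `dp_matching_distance`, and `recognize` are all assumptions introduced for this example.

```python
from math import sqrt

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local distance)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_matching_distance(reference, observed):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(reference), len(observed)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i reference
    # frames with the first j input frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], observed[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(dictionary, observed):
    """Return the category whose standard pattern gives the minimum distance."""
    return min(dictionary, key=lambda name: dp_matching_distance(dictionary[name], observed))
```

Because the alignment may repeat frames, an input that is merely a time-stretched copy of a standard pattern still yields a distance of zero, which reflects the robustness to temporal variation noted above.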

[0004] To address this problem, there is a method based on the so-called Hidden Markov Model (hereinafter abbreviated as HMM), in which a word is represented by state transitions, an occurrence probability in each state and transition probabilities between states are given, and the input speech is recognized by fitting it to this model. This model is explained, for example, in Document 2: Seiichi Nakagawa, Kiyohiro Shikano, and Yoichi Tohkura, "Speech, Hearing and Neural Network Models", Ohmsha, August 25, 1990, pp. 45-50.

[0005] The HMM method statistically models the spectral and temporal fluctuations of utterances from a large number of training samples, and is regarded as one of the powerful methods of speech recognition.

[0006] To identify an input pattern, a speech recognition method using HMMs takes as the recognition result the category of the HMM that gives the maximum likelihood.

[0007]

[Problems to Be Solved by the Invention] These conventional speech recognition methods, which use the maximum similarity (or minimum distance) or the maximum likelihood of an HMM, have the advantage that they are not much affected by local noise in the input pattern and can recognize speech relatively stably. However, because they mainly evaluate the overall similarity of the input speech pattern, they tend to misrecognize words that contain partially similar portions.

[0008] For example, even though the input speech is the place name "Bisai" (尾西), it may be recognized as "Hisai" (久居), or the numeral "ichi" (1) may be misrecognized as "hichi" (7). A speech recognition device that reduces such misrecognition has therefore been desired.

[0009] The present invention has been made in view of the above problems, and its object is to provide a speech recognition device that can improve the ability to discriminate between similar word utterances.

[0010]

[Means for Solving the Problems] To achieve the above object, the present invention improves a speech recognition device that performs speech recognition by using the hidden Markov model method to obtain a plurality of candidate words corresponding to an input speech signal, together with their likelihoods, by providing the following characteristic means.

[0011] Specifically, the device comprises: similar word storage means that stores similar words grouped in advance; standard feature information storage means that stores in advance at least one type of standard feature information of the spectral envelope for each recognizable word; feature information sequence extraction means that extracts, from the input speech signal, spectral envelope feature information of the same types as the standard feature information and outputs it as a sequence; and extraction means that searches the similar word storage means using the candidate words and their likelihoods and, depending on the search result, performs backtrace processing to extract an optimum discrimination effective section. Using the feature information sequence within the optimum discrimination effective section and the standard feature information, the device optimally modifies the likelihoods and generates and outputs an optimum word identification result.

[0012]

[Operation] According to the present invention, the feature information sequence extraction means extracts spectral envelope feature information from the input speech signal, for example the energy difference between the low-band and high-band components of the speech spectrum, the slope of the spectrum, and the number of zero crossings of the speech signal. Using this feature information together with the standard feature information stored in advance for each recognizable word, the device optimally modifies the likelihoods of the candidate words and, based on differences such as the degree of similarity, outputs the optimum word identification result within the optimum discrimination effective section. This makes it easier to accurately identify differences in phonemic features between similar words.

[0013] Accordingly, words can be identified accurately even when the speech to be recognized includes words with similar phoneme sequences.

[0014]

[Embodiment] A preferred embodiment of the speech recognition device according to the present invention will now be described with reference to the drawings. The aim of this embodiment is to realize a speech recognition device that outputs accurate recognition results even when the speech to be recognized includes words with similar phoneme sequences, and that has a further improved ability to discriminate between similar word utterances.

[0015] To this end, the embodiment calculates a sequence of at least one word candidate (i) using a concatenated model of phoneme HMMs. If the candidate sequence is listed in a similar word table stored in advance, the embodiment obtains the optimum state transition sequence of each candidate under the concatenated phoneme HMM model, segments the feature time series, and automatically extracts the discrimination effective section corresponding to the part where the features differ. The auxiliary features in the discrimination effective section of each candidate (information such as the energy difference between the low-band and high-band components of the speech spectrum, the slope of the spectrum, and the number of zero crossings of the speech signal) are then used for recognition. In this way the phoneme HMM likelihood and a pattern matching method are used complementarily, so that the likelihood is optimally modified and optimum word recognition (identification) result information I can be output.

[0016] FIG. 1 is a functional block diagram of the speech recognition device according to this embodiment.

[0017] In FIG. 1, the speech recognition device of this embodiment comprises a filter 1, an A/D converter 2, an acoustic feature extraction unit 3, an auxiliary feature extraction unit 4, an optimum code vector selection unit 5, a likelihood calculation unit 6, a determination unit 7, a near-similarity table 8, a backtrace unit 9, a partial similarity calculation unit 10, an auxiliary feature standard pattern unit 11, and a comprehensive determination unit 12.

[0018] FIG. 2 shows the functional blocks of the optimum code vector selection unit 5.

[0019] In FIG. 2, the optimum code vector selection unit 5 comprises a code vector portion 52 and a code vector selection portion 51.

[0020] FIG. 3 shows the functional blocks of the likelihood calculation unit 6.

[0021] In FIG. 3, the likelihood calculation unit 6 comprises a phoneme HMM portion 61, a word dictionary portion 62, a word concatenation model portion 63, and a Viterbi comparison and collation portion 64.

[0022] In FIG. 1, the filter 1 band-limits the high-frequency components of the input speech signal S and supplies the resulting signal to the A/D converter 2. The A/D converter 2 quantizes the band-limited speech signal to a desired number of bits and supplies the quantized signal to the acoustic feature extraction unit 3 and the auxiliary feature extraction unit 4. The acoustic feature extraction unit 3 performs the necessary speech section detection and obtains, as acoustic features, for example low-order LPC cepstrum coefficients. An order of about 15 is assumed, for example. A method of calculating the LPC cepstrum is given, for example, in Document 3: Sadaoki Furui, "Digital Speech Processing", Tokai University Press, September 25, 1985, pp. 47-48. The time series of LPC cepstra obtained by this calculation is called the acoustic feature sequence V. The acoustic feature sequence V is supplied to the optimum code vector selection unit 5.

[0023] The auxiliary feature extraction unit 4 extracts auxiliary features from the quantized speech signal and outputs an auxiliary feature sequence B. The types of auxiliary features extracted should preferably be the same as those stored in the auxiliary feature standard pattern unit 11 described later. The auxiliary feature sequence B is a sequence of information obtained at fixed time intervals (for example, every 10 msec), such as: the energy difference (including, for example, the energy deviation) between the high-band components (for example, 3 kHz to 5 kHz) and the low-band components (for example, 100 Hz to 900 Hz) of a spectrum obtained by BPF (Band Pass Filter) bank analysis or FFT (Fast Fourier Transform) analysis; the slope of the spectrum when it is approximated by a straight line; and the number of zero crossings counted in the speech signal. In general, the number of zero crossings increases as the frequency components of the signal become higher and decreases as they become lower, so extracting the zero-crossing count gives an indication of the spectral distribution.
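Two of the auxiliary features named above can be sketched per frame as follows. This is an illustrative sketch only: the naive DFT, the decibel scale, the 12 kHz sampling rate used in the helper `tone`, and all function names are assumptions not taken from the specification, and the spectral slope feature is omitted for brevity.

```python
import cmath
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. N/2 (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def band_energy(spectrum, n, sample_rate, band):
    """Sum the power of all bins whose centre frequency lies inside the band."""
    low_hz, high_hz = band
    return sum(p for k, p in enumerate(spectrum)
               if low_hz <= k * sample_rate / n <= high_hz)

def energy_difference_db(frame, sample_rate,
                         low_band=(100.0, 900.0), high_band=(3000.0, 5000.0)):
    """High-band energy minus low-band energy, expressed in dB (assumed scale)."""
    spectrum = power_spectrum(frame)
    n = len(frame)
    eps = 1e-12  # guards against log of zero for silent bands
    high = band_energy(spectrum, n, sample_rate, high_band)
    low = band_energy(spectrum, n, sample_rate, low_band)
    return 10.0 * math.log10((high + eps) / (low + eps))

def tone(freq_hz, sample_rate=12000, n=120):
    """Hypothetical test signal: one 10 msec frame of a pure sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate + 0.3) for t in range(n)]
```

For a 500 Hz tone the energy difference is negative and the zero-crossing count is small, while for a 4 kHz tone the difference is positive and the count is large, which is the relationship between zero crossings and spectral distribution stated above.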

[0024] The auxiliary feature sequence B is used by the partial similarity calculation unit 10, described later, to extract features that are effective for discriminating between the word notations grouped in the near-similarity table. Because it is used for pattern matching, the auxiliary feature sequence B should preferably consist of features that are effective for discrimination. However, when the amount of processing or hardware is to be reduced, the same sequence as the acoustic feature sequence V may be used. The auxiliary feature sequence B is supplied to the partial similarity calculation unit 10.

[0025] The optimum code vector selection unit 5 processes the acoustic feature sequence V supplied to it, outputs optimum code vectors, and supplies them to the likelihood calculation unit 6. The code vectors in the internal code vector portion 52 are obtained in advance by selecting, from the acoustic feature sequences of a large number of training samples, the vectors that best represent those samples. The code vectors can be chosen by a clustering technique, for example the algorithm of Linde et al. (called the LBG algorithm) described in Document 4: Linde, Buzo, and Gray, "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. COM-28, No. 1, January 1980, pp. 84-95. The code vector selection portion 51 searches the code vector portion 52 for the acoustic feature sequence V, using the clustering technique or the like, selects the nearest code vectors, and supplies them to the likelihood calculation unit 6 as the optimum code vector sequence A. The likelihood calculation unit 6 calculates a candidate sequence (i) and the corresponding likelihoods from the supplied optimum code vector sequence A and supplies them to the determination unit 7. That is, in the likelihood calculation unit 6, the word concatenation model portion 63 calculates at least one candidate (i) using the phoneme HMM portion 61 and the word dictionary portion 62. The method of concatenating the word models is explained later with reference to FIGS. 5 and 6.
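The nearest code vector selection performed by the code vector selection portion 51 can be sketched as below. This is a minimal sketch assuming a squared Euclidean distance and an already trained codebook; codebook design by the LBG algorithm is not shown, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors (an assumed distortion measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_code_vectors(features, codebook):
    """Map each frame of the acoustic feature sequence V to the index of its
    nearest code vector, giving the code sequence A."""
    return [min(range(len(codebook)),
                key=lambda c: squared_distance(frame, codebook[c]))
            for frame in features]
```

For example, with the codebook `[[0, 0], [1, 1], [4, 4]]`, the frames `[[0.1, 0.0], [0.9, 1.2], [3.5, 4.1]]` map to the code sequence `[0, 1, 2]`.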

[0026] For each word model obtained, the Viterbi comparison and collation portion 64 determines the likelihood of the word model, for example by the Viterbi algorithm described in Document 2. It then outputs as candidates the likelihoods LA(i), in descending order from the highest, together with the word numbers (i) that give them. The candidate sequence is denoted i_1, i_2, ..., and the corresponding likelihoods are denoted LA(i_1), LA(i_2), .... The candidate sequence i_1, i_2, ... and the corresponding likelihoods LA(i_1), LA(i_2), ... are supplied to the determination unit 7.
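The Viterbi likelihood of one word model can be sketched as follows. This is an illustrative sketch, not the formulation of Document 2: it assumes a discrete HMM whose output probabilities are attached to states rather than to transitions, a path that starts in the first state and ends in the last, and raw (non-logarithmic) probabilities. The name `viterbi` and its parameters are hypothetical.

```python
def viterbi(codes, trans, emit):
    """Best-path likelihood and state sequence for a discrete left-to-right HMM.

    codes: observed code indices (the code sequence A)
    trans[i][j]: probability of moving from state i to state j
    emit[i][k]: probability that state i outputs code k (assumed state emission)
    """
    n_states = len(trans)
    score = [0.0] * n_states
    score[0] = emit[0][codes[0]]  # the path is assumed to start in state 0
    back = []  # back[t][j]: best predecessor of state j when entering frame t + 1
    for code in codes[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] * trans[i][j])
            new_score.append(score[best_i] * trans[best_i][j] * emit[j][code])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the final state to recover the optimum state sequence.
    path = [n_states - 1]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[-1], path
```

Running this once per word model and sorting the returned likelihoods in descending order gives a ranked candidate list of the kind described above. The returned state path is the information a later backtrace step would use.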

[0027] Next, the determination unit 7 performs reject processing by comparing the supplied likelihoods with an input reject threshold R0. If a likelihood is below the reject threshold R0, the unit determines that the input is not similar to (does not correspond to) any word and rejects it. If the likelihood is equal to or greater than the threshold R0, the unit determines whether the candidates are listed in the near-similarity table 8. Since the likelihood always takes a positive value unless numerical underflow occurs, setting the reject threshold R0 = 0 is equivalent to performing no reject processing.

[0028] The near-similarity table 8 lists word names of close similarity, that is, similar words, in groups. For place names, for example, groups such as "Bisai" (尾西) and "Hisai" (久居), or "Soka" (草加) and "Moka" (真岡), and for numerals, groups such as "ichi" (1) and "shichi" (7), are registered in advance as word notations.

[0029] FIG. 4 shows a specific example of the structure of the near-similarity table 8. The table is stored, for example, in a memory and consists of groups 1 through GN0, with the total number of groups GN0 set at the head of the table. Each group, for example the i-th group, consists of the number of entries in the group GM(i), the first word notation HYO(i, 1), the second word notation HYO(i, 2), ..., the GM(i)-th word notation HYO(i, GM(i)), an identification start ratio WSTART(i), and an identification end ratio WEND(i).

[0030] The determination unit 7 determines whether the candidates up to rank m are listed in the near-similarity table 8. For example, suppose m = 2 and the table contains GM(i) = 2, HYO(i, 1) = hisai, and HYO(i, 2) = bisai. If the likelihood calculation unit 6 supplies the pair "hisai" and "bisai" as its first and second candidates, the unit determines that they are listed in the table and outputs that result.
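The table layout of FIG. 4 and the lookup of the top m candidates can be sketched as a small data structure. The class and method names below are hypothetical, the ratio values in the usage note are invented for illustration, and the rule that all top m candidates must belong to one group is an assumption about how the lookup works.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SimilarGroup:
    words: List[str]  # HYO(i, 1) .. HYO(i, GM(i))
    wstart: float     # WSTART(i): identification start ratio, 0.0 to 1.0
    wend: float       # WEND(i): identification end ratio, 0.0 to 1.0

    @property
    def gm(self) -> int:
        """GM(i): number of word notations in this group."""
        return len(self.words)

@dataclass
class NearSimilarityTable:
    groups: List[SimilarGroup]

    @property
    def gn0(self) -> int:
        """GN0: total number of groups."""
        return len(self.groups)

    def find_group(self, candidates: List[str]) -> Optional[SimilarGroup]:
        """Return the group that contains every given candidate, or None."""
        wanted = set(candidates)
        for group in self.groups:
            if wanted <= set(group.words):
                return group
        return None
```

A table holding `SimilarGroup(["hisai", "bisai"], 0.0, 0.4)` returns that group for the candidates `["bisai", "hisai"]` and `None` for `["hisai", "moka"]`.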

[0031] If the candidates are not listed in the near-similarity table 8, the determination unit 7 outputs the first candidate i_1, which gives the maximum likelihood LA(i_1), as the word recognition result information I; that is, it outputs I = i_1. If they are listed in the near-similarity table 8, the candidates are supplied to the backtrace unit 9.
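The three outcomes of the determination unit 7 (reject, output the first candidate, or hand over to the backtrace unit 9) can be sketched as below. This is a sketch under assumptions: the candidates are taken to be sorted by descending likelihood, the reject test is applied to the top likelihood only, and the function name and return labels are hypothetical.

```python
def decide(candidates, likelihoods, similar_groups, reject_threshold=0.0, m=2):
    """Return ("reject", None), ("accept", word) or ("verify", top_m_words).

    candidates / likelihoods: assumed sorted by descending likelihood.
    similar_groups: list of word lists, one per group of the near-similarity table.
    """
    if not candidates or likelihoods[0] < reject_threshold:
        return ("reject", None)
    top_m = candidates[:m]
    if len(top_m) == m:
        for group in similar_groups:
            if set(top_m) <= set(group):
                # Similar words compete: pass them on for partial verification.
                return ("verify", top_m)
    return ("accept", candidates[0])
```

With the default `reject_threshold=0.0` no input is rejected as long as the likelihoods are positive, matching the remark that R0 = 0 disables reject processing.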

[0032] For each of the candidates i_1, i_2, ... supplied from the Viterbi comparison and collation portion 64 of the likelihood calculation unit 6, the backtrace unit 9 assumes that candidate and obtains, by backtrace processing, the optimum state transition sequence of the word concatenation model of portion 63 built from the models of the phoneme HMM portion 61. It then segments the feature time series according to the HMM state sequence. A specific example of the backtrace processing is described later with reference to FIGS. 7 and 8.

[0033] Specifically, because every phoneme HMM uses a model with the same number of states, once the optimum state sequence has been found by backtrace processing, the "time change points" (the corresponding frame numbers) at the junctions between phoneme HMMs can be determined. There are normally several such time change points, LL in number, denoted time(1), time(2), ....
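Given an optimum state sequence, the time change points can be located as sketched below. The sketch assumes that the states of the concatenated word model are numbered consecutively from 0, that each phoneme contributes `states_per_phoneme` states, and that frames are numbered from 1; the function name is hypothetical.

```python
def time_change_points(state_path, states_per_phoneme):
    """Frame numbers (1-based) at which the best path enters a new phoneme HMM."""
    points = []
    for t in range(1, len(state_path)):
        previous_phoneme = state_path[t - 1] // states_per_phoneme
        current_phoneme = state_path[t] // states_per_phoneme
        if current_phoneme != previous_phoneme:
            points.append(t + 1)
    return points
```

For the state path `[0, 0, 1, 2, 2, 3, 4, 4]` with two states per phoneme, the path enters the second phoneme at frame 4 and the third at frame 7, so the function returns `[4, 7]`.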

[0034] From the identification start ratio WSTART(i) and identification end ratio WEND(i) in the near-similarity table 8 and the number of input frames T, the section [ST, ED] that is effective for discriminating between the similar words is obtained as follows.

[0035] Within the interval [WSTART(i) * T (identification start frame), WEND(i) * T (identification end frame)],

$$ST = \min_{L = 1, \dots, LL} \mathrm{time}(L), \qquad ED = \max_{L = 1, \dots, LL} \mathrm{time}(L)$$

Obtaining this section [ST, ED] automatically extracts the part where the features differ under the assumed candidates i_1, i_2, .... In other words, ST and ED are the minimum and maximum of the time change points time(L) that lie within [WSTART(i) * T (identification start frame), WEND(i) * T (identification end frame)]. A specific example of how ST and ED are calculated is shown in FIG. 9, described later.
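The selection of ST and ED from the time change points can be sketched as follows. This is not the procedure of FIG. 9, which is not reproduced here; returning `None` when no change point falls inside the window is an assumption, as is the function name.

```python
def effective_section(change_points, wstart, wend, n_frames):
    """Return (ST, ED): the smallest and largest time change points lying in
    [WSTART * T, WEND * T], or None if the window contains no change point."""
    window_start = wstart * n_frames
    window_end = wend * n_frames
    inside = [t for t in change_points if window_start <= t <= window_end]
    if not inside:
        return None
    return min(inside), max(inside)
```

With change points `[4, 7, 15, 22]`, WSTART = 0.0, WEND = 0.5 and T = 30 frames, the window is [0, 15] and the section is (4, 15).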

[0036] The backtrace result from the backtrace unit 9 and the auxiliary feature sequence B are supplied to the partial similarity calculation unit 10.

[0037] Using the backtrace result from the backtrace unit 9 and the auxiliary feature sequence B, the partial similarity calculation unit 10 calculates the similarities LB(i_1), LB(i_2), ... (hereinafter called partial similarities) with the auxiliary feature sequence B over the section [ST, ED] that is effective for discriminating the candidates. This calculation uses the standard patterns REF(i, j, k) of the auxiliary features stored in advance in the auxiliary feature standard pattern unit 11. The number RNO of stored standard patterns corresponds to the number of recognizable words.

[0038] The partial similarity can be defined in various ways; here, the angle between the two vectors is used as one example.

[0039] Let the auxiliary feature sequence B, time-normalized from the input frame length T to the time length VLEN, be SPEC(i, j) (i = 1 to DNO, j = 1 to VLEN; DNO: number of auxiliary feature dimensions, that is, the number of auxiliary feature types; VLEN: fixed number of frames), and let the standard patterns of the auxiliary feature standard pattern unit 11 be REF(i, j, k) (i = 1 to DNO, j = 1 to VLEN, k = 1 to RNO). Then the angle r(j, k) between the vector of the auxiliary feature sequence at the j-th frame and the vector of the auxiliary feature standard pattern of the k-th category at the j-th frame is given by Equation 1 below.

[0040]

[Equation 1]

Using Equation 1, the partial similarity can be expressed by Equation 2 below.

[0041]

[Equation 2]

The partial similarity obtained in this way is supplied to the comprehensive determination unit 12.
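Equations 1 and 2 are not reproduced in this text, so the following is only one plausible reading of a partial similarity based on the angle between vectors. It assumes that the per-frame measure is the cosine of the angle, that the partial similarity is the mean of that cosine over the section, that the section bounds have already been mapped to the normalized frame axis, and that time normalization is a nearest-frame resampling. All of these choices and all names are assumptions.

```python
import math

def time_normalize(frames, vlen):
    """Resample a sequence of T frames to exactly vlen frames (nearest frame)."""
    t = len(frames)
    return [frames[min(t - 1, (j * t) // vlen)] for j in range(vlen)]

def frame_cosine(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def partial_similarity(spec, ref, st, ed):
    """Mean per-frame cosine over normalized frames st..ed (1-based, inclusive).

    spec[j - 1]: DNO-dimensional auxiliary feature vector of the input at frame j
    ref[j - 1]:  DNO-dimensional standard pattern vector of one category at frame j
    """
    values = [frame_cosine(spec[j - 1], ref[j - 1]) for j in range(st, ed + 1)]
    return sum(values) / len(values)
```

Restricting the average to the section means that two standard patterns that differ only in their first frames are separated clearly when the section covers those frames, and not at all when it covers only the shared remainder.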

[0042] The comprehensive determination unit 12 makes an overall decision using the result from the likelihood calculation unit 6 (the likelihood from the Viterbi computation) and the result from the partial similarity calculation unit 10 (the partial similarity of the auxiliary features), and outputs the word recognition result I.
A word recognition determination result I that is comprehensively determined is output using the calculation result of 0 (partial similarity of auxiliary features).

[0043] To output the word recognition result I, the unit first obtains an overall decision similarity L(i) for the calculated candidates i_1, i_2, ... from the Viterbi likelihoods LA(i_1), LA(i_2), ... and the auxiliary-feature partial similarities LB(i_1), LB(i_2), .... The overall decision similarity L(i) is defined by Equation 3 below.

[Equation 3]

Here w1 and w2 are weighting coefficients that are positive constants.

[0044] Finally, the word recognition (identification) result I is given by Equation 4 below.

[0045]

[Equation 4]

According to Equation 4, the i that gives the largest of the values L(i_k) is output as the word recognition result I. The desired word name can then be output by supplying this result I to, for example, a table. For instance, word names (or category names) are stored in the table in advance for each value of I, so that when a value of I is supplied, the corresponding word name is output.

[0046] Here, KK in Equation 4 is the number of word notation candidates in the group in the near-similarity table; setting KK = 2 or 3 is practical. The weighting coefficients w1 and w2 should preferably be set so that w2 > w1, in order to place more weight on the partial similarity.
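Equations 3 and 4 are not reproduced in this text. The sketch below assumes that the overall decision similarity is the weighted sum w1 * LA + w2 * LB and that the result is the candidate with the largest such value among the KK candidates. The weighted-sum form, the default weights, and the function name are assumptions, and any scaling needed to make likelihood and partial similarity comparable is left open.

```python
def total_decision(candidates, la, lb, w1=1.0, w2=2.0):
    """Pick the candidate maximizing the assumed score w1 * LA + w2 * LB.

    candidates: the KK candidate words i_1 .. i_KK
    la: Viterbi likelihoods LA(i_k); lb: partial similarities LB(i_k)
    """
    scores = [w1 * a + w2 * b for a, b in zip(la, lb)]
    best = max(range(len(candidates)), key=lambda k: scores[k])
    return candidates[best], scores
```

With w2 > w1, a candidate that ranks second by likelihood can still be selected when its partial similarity over the section is clearly higher, which is the complementary effect the weighting is meant to produce.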

[0047] Through this overall decision processing in the comprehensive determination unit 12, two different measures, the phoneme HMM likelihood and the similarity obtained by pattern matching, are used complementarily, which improves discrimination in the recognition of similar words.

[0048] Next, an example of the word concatenation method using the phoneme HMM portion 61, the word dictionary portion 62, and the word concatenation model portion 63 of FIG. 3 is described with reference to FIGS. 5 and 6.

[0049] FIG. 5 shows examples of the structure of the phoneme HMMs in the phoneme HMM portion 61. FIG. 5(a) shows a phoneme HMM with four states and three loops, and FIG. 5(b) shows a phoneme HMM with two states and one loop. These models are estimated and stored in advance by obtaining the state transition probabilities and output probabilities from training samples, for example by the Baum-Welch parameter estimation method described in Document 2.

【００５０】FIG. 6 shows how the HMM of a recognition target word is constructed: phoneme HMMs are connected using the phoneme HMM part 61 and the word dictionary part 62. Here the four-state, three-loop model of FIG. 5(a) is used as the example phoneme HMM. In the word dictionary part 62, the names of the recognition target words are written in orthography (phonetic symbol notation). The phoneme HMM corresponding to each phonetic symbol in the word dictionary part is selected, and the final state of the preceding phoneme HMM (FIG. 6(b)) is simply connected to the initial state of the following phoneme HMM (FIG. 6(a)) to obtain the connection model of the word. The two-state, one-loop model of FIG. 5 can be connected in the same way.
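The connection step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `connect_phoneme_hmms`, the representation of a phoneme HMM by its number of states, and the choice to join models with a single forward arc (rather than merging the two states) are all assumptions. Self-loops and probabilities are omitted for brevity.

```python
def connect_phoneme_hmms(phonemes, states_per_phoneme):
    """Build a left-to-right word model from a phoneme symbol sequence.

    phonemes: phonetic symbols from the word dictionary, e.g. list("bisai").
    states_per_phoneme: dict mapping each symbol to its number of states
        (hypothetical representation of the stored phoneme HMMs).
    Returns (states, arcs): states are (phoneme, local_index) labels and
    arcs are (from_state, to_state) forward transitions.
    """
    states, arcs = [], []
    for symbol in phonemes:
        n_states = states_per_phoneme[symbol]
        base = len(states)
        if states:
            # Connect the final state of the preceding phoneme HMM
            # to the initial state of the following one.
            arcs.append((base - 1, base))
        for k in range(n_states):
            states.append((symbol, k))
            if k + 1 < n_states:
                arcs.append((base + k, base + k + 1))
    return states, arcs


# One state per phoneme gives a five-state chain for "bisai".
one_state = {p: 1 for p in "bisa"}
word_states, word_arcs = connect_phoneme_hmms(list("bisai"), one_state)
```

Using four states per phoneme instead yields a 20-state chain for the same word, connected in the same way.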

【００５１】FIG. 7 is an explanatory diagram of word models using left-to-right HMMs for the case where each phoneme can be represented by one state.

【００５２】In FIG. 7, FIG. 7(a) shows, for "bisai" (尾西: Bisai), the state transition probabilities ($a_{11}$, $a_{12}$, ...) between states S1 to S5 and the code output probabilities ($b_{11k}$, $b_{12k}$, ...). FIG. 7(b) shows, for "hisai" (久居: Hisai), the state transition probabilities ($a_{11}^*$, $a_{12}^*$, ...) between states S1* to S5* and the code output probabilities ($b_{11k}^*$, $b_{12k}^*$, ...). Both FIGS. 7(a) and 7(b) show the case where $a_{i,i} + a_{i,i+1} = 1$.

【００５３】FIG. 8 shows a concrete example of the signal states at each component for the phoneme sequences in the example of FIG. 7.

【００５４】In FIG. 8, assume that each phoneme can be represented by one state and that the word models for "bisai" and "hisai" shown in FIGS. 7(a) and 7(b) are given in advance as elements of the word connection model part 63 (shown in FIG. 3). Assume further that the input speech is "Xisai", as shown in FIG. 8(a), where "X" denotes an undetermined phoneme. The input speech energy is shown in FIG. 8(b) and the speech section in FIG. 8(e). Suppose that the likelihood calculation unit 6 outputs "Hisai" (久居) and "Bisai" (尾西) as the first and second candidate words, and that these candidate words are listed in the near similarity table 8 (shown in FIG. 1). For the word model of "Bisai" (FIG. 7(a)), the Viterbi comparison and collation part 64 (shown in FIG. 3) then associates the states S with the input frames T as shown in FIG. 8(f). Processing this correspondence in the backtrace unit 9 (shown in FIG. 1) yields the optimum state transition sequence shown in FIG. 8(i) (thinned out in the frame direction). From the optimum state transition sequence of FIG. 8(i), the time change points time(1) to time(5) for "Bisai" in FIG. 8(g) can be obtained.
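The alignment and backtrace described above can be sketched as a standard Viterbi pass over a left-to-right HMM. The function names `viterbi_left_to_right` and `change_points` are hypothetical, the inputs are assumed to be log probabilities, and a time change point is taken here to be a frame at which the aligned state changes, which is only one possible reading of FIG. 8(g). The sketch also assumes the path starts in the first state, ends in the last state, and that there are at least as many frames as states.

```python
import math


def viterbi_left_to_right(log_self, log_next, log_emit):
    """Align frames to states of a left-to-right HMM and backtrace.

    log_self[j]: log a_{j,j}; log_next[j]: log a_{j,j+1};
    log_emit[t][j]: log output probability of frame t in state j.
    Returns (best_log_score, path) with path[t] = state index at frame t.
    """
    n_frames, n_states = len(log_emit), len(log_emit[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]  # the path must start in the first state
    for t in range(1, n_frames):
        for j in range(n_states):
            stay = score[t - 1][j] + log_self[j]
            move = score[t - 1][j - 1] + log_next[j - 1] if j > 0 else neg_inf
            if move > stay:
                score[t][j], back[t][j] = move + log_emit[t][j], j - 1
            else:
                score[t][j], back[t][j] = stay + log_emit[t][j], j
    # Backtrace from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[n_frames - 1][n_states - 1], path


def change_points(path):
    """Frames at which the aligned state differs from the previous frame."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


# Toy example: 3 states, 6 frames, with a_{i,i} + a_{i,i+1} = 1.
hi, lo = math.log(0.9), math.log(0.05)
emit = [[hi, lo, lo], [hi, lo, lo], [lo, hi, lo],
        [lo, hi, lo], [lo, lo, hi], [lo, lo, hi]]
half = [math.log(0.5)] * 3
_, best_path = viterbi_left_to_right(half, half, emit)
points = change_points(best_path)
```

Running the same procedure on each candidate's word model gives one set of change points per candidate, corresponding to time(1) to time(5) and time(6) to time(10) in the text.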

【００５５】Similarly, backtrace processing is performed on the word model of "Hisai", giving the optimum state transition sequence shown in FIG. 8(j). From this sequence, the time change points time(6) to time(10) for "Hisai" in FIG. 8(h) can be obtained.

【００５６】Using the time change points time(L), L = 1 to 10, together with the values in the near similarity table 8 (shown in FIG. 1), ST (the identification start frame) and ED (the identification end frame) can be determined as described above.

【００５７】FIG. 9 is a flowchart of one concrete example of the calculation of ST (the effective identification start frame) and ED (the effective identification end frame).

【００５８】In FIG. 9, first s = WSTART(i) * T (the identification start frame) and e = WEND(i) * T (the identification end frame) are calculated (S91). WSTART(i) is the identification start ratio, and T is the number of input frames or the input frame length. Next, s* is set to T and e* is set to 1 (S92), and L is set to 1 (S93). It is then determined whether s ≤ time(L) holds (S94). If it does, it is determined whether e ≥ time(L) holds (S95). If it does, it is determined whether s* ≥ time(L) holds (S96). If it does, s* is set to time(L) (S97). Next, it is determined whether e* ≤ time(L) holds (S98). If it does, e* is set to time(L) (S99). Next, L is updated to L + 1 (S100), and it is determined whether the updated L satisfies L > LL (S101). If it does, the value of s* is output as ST (the effective identification start frame) and the value of e* as ED (the effective identification end frame) (S102).

【００５９】If the condition is not satisfied in S94, S95, or S98, processing proceeds to S100. If the condition is not satisfied in S96, processing proceeds to S98. The processing of S94 to S100 is repeated as long as the condition of S101 is not satisfied.
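The flowchart steps S91 to S102 can be sketched as a single loop. The function name `effective_section` and its parameter names are hypothetical, and LL is assumed here to be the number of time change points (10 in the example of FIG. 8). In effect, the procedure returns the smallest and largest change points that fall inside the window [s, e].

```python
def effective_section(time_points, wstart, wend, num_frames):
    """Compute (ST, ED) from the time change points, following FIG. 9.

    time_points: time(1) .. time(LL), the change points of all candidates.
    wstart, wend: WSTART(i) and WEND(i), ratios taken from the table.
    num_frames: T, the number of input frames.
    """
    s = wstart * num_frames              # S91: identification start frame
    e = wend * num_frames                # S91: identification end frame
    s_star, e_star = num_frames, 1       # S92: initial values
    for t in time_points:                # S93, S100, S101: L = 1 .. LL
        if s <= t <= e:                  # S94 and S95
            if s_star >= t:              # S96
                s_star = t               # S97
            if e_star <= t:              # S98
                e_star = t               # S99
    return s_star, e_star                # S102: ST and ED


# Illustrative change points for two candidates over 40 frames.
points = [5, 12, 20, 31, 40, 6, 13, 21, 30, 40]
ST, ED = effective_section(points, wstart=0.25, wend=0.5, num_frames=40)
```

If no change point lies inside the window, the initial values (T, 1) are returned unchanged, so a caller would need to treat ST > ED as an empty section; how the original device handles that case is not stated in the text.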

【００６０】According to the embodiment described above, a sequence of at least one word candidate (i) is calculated using the connection model of phoneme HMMs. If the candidate sequence is listed in the near similarity table 8, the optimum state transition sequence of each candidate under the connection model of phoneme HMMs is obtained, the feature time series is segmented, and the identification effective section [ST, ED] corresponding to the part where the features differ is extracted automatically. The auxiliary features of the candidates within this identification effective section are then used for recognition. The phoneme HMM and the pattern matching method are thus used in a complementary manner to change the likelihood to an optimum value, so that an accurate recognition result can be output even when the speech to be recognized includes words with similar phoneme sequences. A speech recognition device with improved ability to discriminate similar word speech can therefore be realized.

【００６１】In the above embodiment, the acoustic feature extraction unit 3 of FIG. 1 was described using the LPC cepstrum of Document 3 above as an example, but it is not limited to this.

【００６２】In the above embodiment, the optimum code vector selection unit 5 of FIG. 2 was described using the clustering method of Document 4 above as an example, but it is not limited to this method.

【００６３】In the above embodiment, the near similarity table 8 is configured as shown in FIG. 4, but it is not limited to this configuration.

【００６４】An example of the connection model of phoneme HMMs was described with reference to FIG. 6 of the above embodiment, but the connection method is not limited to this example.

【００６５】In FIG. 1 of the above embodiment, the comprehensive judgment unit 12 outputs the number I of the word recognition judgment result as the judgment result, but the name of the I-th category may be output instead as the final recognition judgment result.

【００６６】The method of calculating ST (the effective identification start frame) and ED (the effective identification end frame) was described with reference to FIG. 9 of the above embodiment, but it is not limited to this procedure.

【００６７】[0067]

[Effects of the Invention] As described in detail above, according to the present invention, the apparatus comprises the feature information sequence extraction means, the similar word storage means, the standard feature information storage means, and the extraction means, and uses the feature information sequence of the optimum identification effective section together with the standard feature information to change the likelihood optimally and to generate and output the optimum word identification result. Consequently, words can be recognized accurately even when the speech signal to be recognized includes words with similar phoneme sequences, and a speech recognition device with improved ability to discriminate similar word speech can be realized.

[Brief description of drawings]

FIG. 1 is a functional block diagram of the speech recognition device according to the embodiment.

FIG. 2 is a functional block diagram of the optimum code vector selection unit of the speech recognition device according to the embodiment.

FIG. 3 is a functional block diagram of the likelihood calculation unit of the speech recognition device according to the embodiment.

FIG. 4 is a configuration diagram of the near similarity table of the speech recognition device according to the embodiment.

FIG. 5 is an explanatory diagram of the phoneme HMMs of the speech recognition device according to the embodiment.

FIG. 6 is a diagram of the connection model of phoneme HMMs of the speech recognition device according to the embodiment.

FIG. 7 is an explanatory diagram of word models using left-to-right HMMs of the speech recognition device according to the embodiment.

FIG. 8 is a diagram showing an example of the signal states at each part of the speech recognition device according to the embodiment.

FIG. 9 is a flowchart of the ST and ED calculation processing of the speech recognition device according to the embodiment.

[Explanation of symbols]

3 ... acoustic feature extraction unit, 4 ... auxiliary feature extraction unit, 5 ... optimum code vector selection unit, 6 ... likelihood calculation unit, 7 ... judgment unit, 8 ... near similarity table, 9 ... backtrace unit, 10 ... partial similarity calculation unit, 11 ... auxiliary feature standard pattern unit, 12 ... comprehensive judgment unit, 51 ... code vector selection part, 52 ... code vector part, 61 ... phoneme HMM part, 62 ... word dictionary part, 63 ... word connection model part, 64 ... Viterbi comparison and collation part.

Claims

[Claims]

1. A speech recognition apparatus that performs speech recognition by obtaining, using a hidden Markov model method, the likelihoods of a plurality of candidate words corresponding to an input speech signal, the apparatus comprising: similar word storage means for storing groups of similar words grouped in advance; standard feature information storage means for storing in advance at least one type of standard feature information of the spectrum envelope corresponding to recognizable words; feature information sequence extraction means for extracting, from the spectrum envelope of the input speech signal, feature information of the same type as the standard feature information and outputting it as a sequence; and extraction means for searching the similar word storage means based on the candidate words and their likelihoods and performing backtrace processing according to the search result to extract an optimum identification effective section; wherein the likelihood is optimally changed using the feature information sequence of the optimum identification effective section and the standard feature information, and an optimum word identification result is generated and output.