JPH08248974A

JPH08248974A - Speech recognition device and standard pattern learning method

Info

Publication number: JPH08248974A
Application number: JP7049435A
Authority: JP
Inventors: Akio Amano; 明雄天野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-03-09
Filing date: 1995-03-09
Publication date: 1996-09-27

Abstract

(57) [Abstract] [Purpose] To achieve highly accurate speech recognition by measuring the spread of the distribution of each standard pattern at each stage of standard pattern learning and configuring the standard patterns so that all of them have a comparable spread. [Constitution] In the standard learning process, the standard patterns are learned with a conventional standard pattern learning method. The spread of the distribution is measured for each resulting standard pattern. In the balance judgment process, the balance of the standard patterns as a set is judged from these measurements. If the balance is judged to be poor, a standard pattern with a large spread is selected from the set and split into multiple standard patterns, so that the spread of each individual standard pattern does not become large. This processing is repeated until the set of standard patterns is judged to be well balanced.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that uses basic units of spoken language, such as syllables or phonemes (consonants and vowels), as standard patterns, and to a method for creating those standard patterns.

[0002]

[Prior Art] Several units can serve as the standard pattern unit in a speech recognition device, including words, syllables, and phonemes (units subdivided down to vowels and consonants). The choice of unit depends on the size of the target vocabulary, the ability of the standard patterns to represent speech phenomena, and the amount of training speech data available for learning the standard patterns.

[0003] First, consider the number of standard patterns. In a small-vocabulary speech recognition device that recognizes only the ten digits, for example, word-unit (single-digit) standard patterns can be used and only about 10 are needed. In a device that recognizes arbitrary Japanese sentences or a large vocabulary of words (for example, the names of all Japanese people), however, preparing one standard pattern per word is impractical because there would be far too many. Standard patterns are therefore prepared for syllables or phonemes and concatenated to build word standard patterns, or recognition is performed syllable by syllable and the results are post-processed to obtain word or sentence recognition results. Japanese has about 120 syllables and about 40 phonemes, so a relatively small number of standard patterns covers the whole language. In short, small speech units cover a wide range of recognition targets with few standard patterns, whereas large speech units require many.

[0004] Next, consider how well standard patterns represent speech phenomena. As noted above, Japanese speech can in principle be represented with about 120 syllables, but the speech pattern of a given syllable varies greatly with the preceding and following syllables, and also from speaker to speaker. A syllable-unit or phoneme-unit standard pattern represents with a single pattern what is actually a context-dependent speech pattern, so its expressive power is inherently limited. When standard patterns are registered for large units such as words, on the other hand, the context-dependent variation of syllables and phonemes is absorbed into the word standard pattern, which therefore represents the speech phenomena faithfully. From the standpoint of expressive power (in particular, capturing context-dependent variation of the speech pattern), large speech units such as words are therefore preferable.

[0005] Next, consider the amount of training speech data available. As noted above, small speech units keep the number of standard patterns low, while large units increase the total. If the total amount of training speech data is fixed, small units therefore leave more training data per standard pattern and yield more reliable standard patterns. Conversely, for the same level of reliability, larger units require more training speech data.

[0006] For these reasons, the conventional view is to use word-unit standard patterns when the vocabulary is small, and standard patterns for syllables, phonemes, or comparable units when the vocabulary is large.

[0007] Improvements in the prior art include methods that use different standard patterns depending on the preceding and following context, and methods that apply clustering to improve reliability when creating such context-dependent standard patterns. These are described briefly below.

[0008] An example of using different standard patterns depending on context is described in "Automatic Speech Recognition," Kluwer Academic Publishers, Norwell, MA, 1989, pp. 95-97. That example targets English, adopts the English phone as the basic unit of recognition, and keeps different phone standard patterns depending on the neighboring phones. Preparing standard patterns in this way reduces recognition errors caused by context-dependent variation of the speech pattern. English has about 40 phones, and making the standard patterns context dependent causes their number to grow combinatorially to more than several thousand. Learning that many standard patterns reliably would require an enormous number of training speech samples, which is not realistic. To address this, the cited example groups similar contexts together to limit the total number of standard patterns.

[0009] A similar example for Japanese is described in "Automatic Generation of Hidden Markov Networks by the Successive State Splitting Algorithm," Transactions of the Institute of Electronics, Information and Communication Engineers, D-II, Vol. J76-D-II, No. 10, pp. 2155-2164, October 1993. That example uses Japanese phonemes (consonants and vowels) as the basic unit of recognition, starts from HMMs that do not depend on phonemic context, and successively splits states to accommodate context-dependent variation. Splitting a state corresponds to replacing one model with several, and is equivalent to providing different models for different phonemic contexts. In that example, the phonemic contexts by which a model is split are determined automatically from the distribution of the training samples.

[0010]

[Problem to Be Solved by the Invention] In the prior art described above, providing different standard patterns for different phonemic contexts yields standard patterns that properly represent context-dependent variation of the speech pattern, which improves recognition accuracy. Taking phonemic context into account increases the number of standard patterns, reduces the number of training speech samples per standard pattern, and so lowers the reliability of each pattern. The prior art counters this by using clustering or similar techniques to reduce the number of standard patterns.

[0011] To reduce recognition errors, however, it is not enough to improve the accuracy and reliability of individual standard patterns. It is also important that the standard patterns of competing categories have comparable accuracy and reliability, so that the set of standard patterns is uniformly accurate and reliable as a whole. Even if some standard patterns are made highly accurate and reliable, a standard pattern of low accuracy or reliability within the set has an adverse effect and causes recognition errors.

[0012] The object of the present invention is to provide a means of creating a set of standard patterns that is well balanced as a whole, a point given insufficient consideration in the prior art described above.

[0013]

[Means for Solving the Problem] The above object is achieved by measuring the spread of the distribution of each standard pattern at each stage of standard pattern learning and, so that all standard patterns have a comparable spread, splitting any standard pattern with a large spread into multiple standard patterns. Each of the resulting standard patterns is kept from having a large spread, so that overall every standard pattern has a comparable spread.

[0014]

[Operation] In the standard learning process, the standard patterns are learned with a conventional standard pattern learning method. The spread of the distribution is measured for each resulting standard pattern. In the balance judgment process, the balance of the standard patterns as a set is judged from these measurements. The set is judged poorly balanced when the spreads are uneven, and well balanced when the spreads of the standard patterns are roughly equal. If the balance is judged to be poor, a standard pattern with a large spread is selected from the set and split into multiple standard patterns, so that the spread of each individual standard pattern does not become large. This processing yields standard patterns that are balanced as a set and enables highly accurate speech recognition.

[0015]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. The invention applies to both word speech recognition and continuous speech recognition. For simplicity, word speech recognition is used as the example here.

[0016] FIG. 1 is a block diagram showing the configuration of an embodiment of the word speech recognition device of the present invention. Input speech is converted into an electrical signal by speech input means 1. The signal is then analyzed by speech analysis means 2, which outputs a time series of feature vectors. Meanwhile, standard pattern connecting means 7 concatenates the standard patterns of the basic recognition units, stored in advance in standard pattern storage means 5, according to the information in word dictionary 6 to form word standard patterns. Matching means 3 matches the standard patterns created by standard pattern connecting means 7 against the feature vector time series of the input speech and computes a score for each word to be recognized. Judgment means 4 outputs the recognition result based on these scores.

[0017] Next, the standard patterns of the basic recognition units used in the present invention are described. The invention adopts a probabilistic model as the standard pattern. FIG. 2 shows this model, a Hidden Markov Model (hereinafter HMM). Each circle in the figure represents a state and each arrow a transition between states. The symbol aij attached to an arrow is the probability of a transition from state i to state j, and bij(k) is the probability that a feature vector belonging to the k-th class is output when the transition from state i to state j occurs. Given the feature vector time series of the input speech, the state transition probabilities and output probabilities can be used to compute the probability that this HMM generated the series. Matching means 3 in FIG. 1 performs this probability computation. For details, the known method described in "Automatic Speech Recognition," Kluwer Academic Publishers, Norwell, MA, 1989, pp. 95-97, may be used.
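The probability computation of paragraph [0017] can be illustrated with a minimal sketch of the forward recursion for an HMM whose symbols are emitted on transitions. This is an illustration under stated assumptions, not the procedure of the cited reference: the function name, the nested-list layout of `a` and `b`, and the choice of state 0 as the start state and the last state as the final state are all hypothetical.

```python
def sequence_probability(a, b, observations, start=0, final=None):
    """Forward recursion for an HMM that emits a symbol on each transition.

    a[i][j]      : probability of a transition from state i to state j
    b[i][j][k]   : probability of emitting symbol class k on that transition
    observations : list of symbol class indices (quantized feature vectors)
    """
    n = len(a)
    if final is None:
        final = n - 1  # assumption: the last state is the final state
    # alpha[i] = probability of having emitted the symbols seen so far
    # and being in state i
    alpha = [0.0] * n
    alpha[start] = 1.0
    for k in observations:
        nxt = [0.0] * n
        for i in range(n):
            if alpha[i] == 0.0:
                continue
            for j in range(n):
                nxt[j] += alpha[i] * a[i][j] * b[i][j][k]
        alpha = nxt
    return alpha[final]


# Two-state left-to-right example with two symbol classes.
a = [[0.5, 0.5],
     [0.0, 1.0]]
b = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 1.0]]]
p = sequence_probability(a, b, [0, 1])
```

In practice the products underflow for long utterances, so implementations usually work with log probabilities or scaled values.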

[0018] Next, the method of concatenating standard patterns in the speech recognition device of the present invention is described with reference to FIG. 3, which illustrates how standard patterns are concatenated according to word dictionary 6. Because the device uses HMMs, which are state transition models, as standard patterns, concatenation is straightforward: the transitions leaving the final state of the preceding model are directed to the first state of the following model. FIG. 3 uses Japanese syllables as the basic unit of recognition and takes the dictionary word "Hitachi" (/hitachi/) as an example. Standard pattern storage means 5 stores HMMs for the Japanese syllables (the speech units corresponding to the kana characters "あ" (a), "い" (i), ..., "ん" (n)). To build the HMM for the word "Hitachi," word dictionary 6 is first consulted to find that the word consists of the syllable sequence /hi/, /ta/, /chi/. Standard pattern connecting means 7 then reads the HMMs for /hi/, /ta/, and /chi/ in order from standard pattern storage means 5 and concatenates them into one large HMM.
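The concatenation rule of paragraph [0018] can be sketched as follows. The data layout is an assumption made for illustration: each unit model is a dictionary holding a state count and a list of arcs, and the marker `EXIT` (hypothetical) denotes a transition that leaves the unit model from its final state.

```python
EXIT = -1  # hypothetical marker: an arc that leaves the unit model


def concatenate(models):
    """Chain unit HMMs into one word HMM.

    Each model is {"n_states": int, "arcs": [(src, dst, prob), ...]},
    with states numbered from 0 and dst == EXIT for arcs leaving the model.
    """
    arcs = []
    offset = 0
    for idx, m in enumerate(models):
        is_last = idx == len(models) - 1
        for src, dst, prob in m["arcs"]:
            if dst == EXIT:
                # Redirect the exit arc to the first state of the next
                # model; the last model keeps its exit.
                new_dst = EXIT if is_last else offset + m["n_states"]
            else:
                new_dst = offset + dst
            arcs.append((offset + src, new_dst, prob))
        offset += m["n_states"]
    return {"n_states": offset, "arcs": arcs}


def word_model(word, dictionary, unit_models):
    """Look up the syllable sequence of a word and chain the syllable HMMs."""
    return concatenate([unit_models[s] for s in dictionary[word]])


# Toy two-state syllable model reused for every syllable.
unit = {"n_states": 2,
        "arcs": [(0, 0, 0.5), (0, 1, 0.5), (1, 1, 0.5), (1, EXIT, 0.5)]}
dictionary = {"hitachi": ["hi", "ta", "chi"]}
unit_models = {"hi": unit, "ta": unit, "chi": unit}
w = word_model("hitachi", dictionary, unit_models)
```

Renumbering states by a running offset keeps each syllable model intact, which also makes it easy to decompose the word model back into its units after parameter estimation.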

[0019] Next, the ordinary learning method for the HMMs used as standard patterns in the speech recognition device of the present invention is described. HMM learning is carried out by estimating parameters from a large number of training speech samples. FIG. 4 is a flowchart outlining the learning flow. First, initial HMMs are created by some method (101). Parameter re-estimation using the training speech samples (102) is then repeated until a convergence condition is met (103). The method is inherently an iterative estimation algorithm, and model accuracy improves with each iteration, so the initial models need not be accurate. Several methods of creating the initial models exist; assigning random values, for example, is sufficient. Parameter re-estimation is described below. Several convergence tests are also possible; in practice it is sufficient to fix the number of iterations and stop after a set count (for example, five).

[0020] When the convergence condition is satisfied, iteration stops and the parameters of each HMM obtained by the estimation are stored (104).

[0021] Next, HMM parameter re-estimation is described. As shown in the flowchart of FIG. 4, re-estimation is repeated within the learning flow; one pass is described here with reference to the flowchart of FIG. 5. Re-estimation uses the training speech samples. If there are N training speech samples, the same parameter estimation computation is performed N times, once per sample, and the parameters of each HMM are updated to new values only after all N computations are done. For each speech sample, the HMMs of the basic recognition units are first concatenated to match the content of the utterance (203), and parameter estimation is performed on the concatenated HMM with the Forward-Backward algorithm (204). Decomposing the concatenated HMM back into its basic recognition units gives parameter estimates for the HMM of each unit (205). The HMM parameters are not updated at this point. Once estimates have been obtained for all speech samples, all of the estimates obtained so far are combined to update the parameters of the HMM of each basic recognition unit (207). For the computational details of parameter estimation (the Forward-Backward algorithm), the known method described in "Automatic Speech Recognition," Kluwer Academic Publishers, Norwell, MA, 1989, pp. 95-97, may be used.
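The accumulate-then-update structure of paragraph [0021] (steps 205 to 207) can be sketched as below. The Forward-Backward computation itself is not shown; `per_sample_counts` is a hypothetical stand-in for its output, assumed here to be expected transition counts per basic unit for each training sample.

```python
from collections import defaultdict


def update_transitions(per_sample_counts):
    """Pool expected transition counts over all samples, then normalize once.

    per_sample_counts: one dict per training sample, mapping
    (unit, src_state, dst_state) -> expected count for that sample.
    Returns {(unit, src, dst): transition probability}.
    """
    pooled = defaultdict(float)
    for counts in per_sample_counts:       # loop over the N samples
        for key, c in counts.items():
            pooled[key] += c               # accumulate only; no update yet
    # Single update after all samples: normalize each (unit, src) row.
    row_total = defaultdict(float)
    for (unit, src, _dst), c in pooled.items():
        row_total[(unit, src)] += c
    return {key: c / row_total[(key[0], key[1])]
            for key, c in pooled.items()
            if row_total[(key[0], key[1])] > 0.0}


samples = [
    {("hi", 0, 0): 1.0, ("hi", 0, 1): 1.0},
    {("hi", 0, 0): 2.0},
]
new_a = update_transitions(samples)
```

Deferring the update until every sample has been processed means all N estimates are computed against the same model, which is what makes one pass of the loop in FIG. 4 a single re-estimation step.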

[0022] Next, the main point of the present invention is described: a standard pattern learning method that takes into account the balance of the standard patterns as a set. FIG. 6 is a flowchart of this learning method. The HMMs are first learned with a conventional standard method (301). The spread of the distribution is then measured for each resulting HMM (302). The spread can be obtained, for example, with the following equation.

[0023]

[Equation 1]

[0024] where

[0025]

[Equation 2]

[0026] As shown in (Equation 1), the spread of the distribution is taken to be the sum of the variances VARk over the dimensions of the feature vector. The variance is given by (Equation 2), where Vk is the value of the k-th dimension of feature vector V and μk is its mean.
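The spread measure described in paragraph [0026] can be sketched as follows. The equations themselves do not appear in this text, so the sketch assumes one reading of the prose: the per-dimension variance is the mean squared deviation of Vk from μk over a set of feature vectors, and the spread is the sum of those variances over all dimensions. The function name and the use of sample vectors (rather than the HMM's output distribution) are assumptions.

```python
def spread(vectors):
    """Sum over dimensions k of VARk, the variance of the k-th component."""
    n = len(vectors)
    dims = len(vectors[0])
    total = 0.0
    for k in range(dims):
        mu_k = sum(v[k] for v in vectors) / n                  # mean of Vk
        var_k = sum((v[k] - mu_k) ** 2 for v in vectors) / n   # VARk
        total += var_k
    return total


s = spread([[0.0, 0.0], [2.0, 4.0]])
```

Summing per-dimension variances is the trace of the covariance matrix, so the measure ignores correlations between dimensions and is sensitive to their scaling.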

[0027] Next, it is judged whether the spreads of the HMM distributions are free of large variation, that is, whether the set is balanced (303). If the set is judged balanced, processing ends. This can be judged, for example, by checking whether the largest spread among the HMMs falls within a threshold. The threshold can be set, for example, to a constant α (for example, 1.3) times the mean of the spread (variance) values of the HMM distributions. Expressed as a formula, the test is as follows.

[0028]

[Equation 3]
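The balance test of paragraph [0027] can be sketched as below, assuming the condition is that the largest spread does not exceed α times the mean spread. The function name and the dictionary of per-model spreads are hypothetical.

```python
def is_balanced(spreads, alpha=1.3):
    """True if the largest spread is within alpha times the mean spread.

    spreads: {model_name: spread value}, one entry per HMM in the set.
    """
    values = list(spreads.values())
    threshold = alpha * sum(values) / len(values)
    return max(values) <= threshold


even = is_balanced({"a": 1.0, "i": 1.1, "u": 0.9})
uneven = is_balanced({"a": 1.0, "i": 3.0, "u": 1.0})
```

Because the threshold is relative to the mean, the test depends only on how uneven the spreads are, not on their absolute size.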

[0029] If this condition is not satisfied, the HMM with the largest spread is found (304) and split into multiple HMMs (for example, two) (305). Next, the speech samples containing the syllable or phoneme of the split HMM are selected from the training speech samples, and recognition is performed with each of the split HMMs to assign every such sample to one of them (306). Using the speech samples partitioned by this assignment, the parameters of each split HMM are estimated and updated (307). After this processing, the spread of each HMM is measured again (302) and the balance of the set is judged again (303), and this is repeated. When the set is judged balanced, processing ends.
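The loop over steps 302 to 307 can be sketched on a deliberately simplified stand-in. The assumptions are substantial: each "model" is represented only by the scalar samples assigned to it, its spread is their variance, and assignment by recognition is replaced by comparing each sample with the model mean (which is what two equal-variance copies nudged slightly above and below the mean would decide). The names `balance_by_splitting` and `max_models` are hypothetical, and the cap is added only to guarantee termination.

```python
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def balance_by_splitting(samples, alpha=1.3, max_models=16):
    """samples: {unit_name: [scalar feature values]}.

    Returns {model_name: [samples assigned to it]} after repeatedly
    splitting the widest model until the set is balanced.
    """
    models = {name: list(xs) for name, xs in samples.items()}
    while len(models) < max_models:
        spreads = {n: variance(xs) for n, xs in models.items()}   # step 302
        threshold = alpha * sum(spreads.values()) / len(spreads)
        widest = max(spreads, key=spreads.get)                    # step 304
        if spreads[widest] <= threshold:                          # step 303
            break
        xs = models.pop(widest)
        mu = sum(xs) / len(xs)
        # Steps 305-306: two copies of the model, one just below and one
        # just above the mean; each sample goes to the copy nearer to it.
        low = [x for x in xs if x < mu]
        high = [x for x in xs if x >= mu]
        if not low or not high:
            models[widest] = xs          # cannot be split any further
            break
        # Step 307: each copy is re-estimated from its own samples
        # (here simply by keeping those samples).
        models[widest + "_1"] = low
        models[widest + "_2"] = high
    return models


result = balance_by_splitting({
    "a": [0.0, 1.0, 0.0, 1.0],
    "i": [0.0, 10.0, 0.0, 10.0],
})
```

In this toy run, splitting the wide model "i" lowers the mean spread enough that "a" then exceeds the threshold and is split as well, which shows why the balance is re-judged after every split.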

[0030] When an HMM is split, a suitable number of HMMs is created from the original HMM. Splitting into two is described here. To split an HMM into two, for example, a small perturbation can be added to the HMM's output probability distribution to give a new first distribution, and the same small perturbation subtracted from the original distribution to give a new second distribution. For the first distribution, for example, the probability density is multiplied by a constant β (for example, 1.01) within a fixed range around the feature vector value at which the output probability density is greatest. Outside that range, the probability density is multiplied by a constant γ so that it becomes smaller. The value of γ is less than 1 and is set so that the increase in probability density due to β and the decrease due to γ cancel. This gives the first distribution. For the second distribution, the probability density is multiplied by the constant 2 - β (0.99 when β is 1.01) within the same range around the maximum, and by the constant 2 - γ outside it; because γ is less than 1, this factor exceeds 1 and raises the density there. FIG. 7 shows how the new distributions are obtained by this method. In FIG. 7, 401 is the original distribution, the range from 404 to 405 is the fixed range around the feature vector value at which the output probability density is greatest, 402 is the new first distribution, and 403 is the new second distribution.
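The split of paragraph [0030] can be sketched for a discrete output distribution. The assumptions: the distribution is a list of probabilities over ordered symbol classes, the "fixed range" is a window of `radius` classes on either side of the peak, and γ is derived from β so that the total probability is preserved. The function name and the `radius` parameter are hypothetical.

```python
def split_distribution(p, beta=1.01, radius=1):
    """Split a discrete output distribution p into two perturbed copies."""
    peak = max(range(len(p)), key=lambda k: p[k])
    near = [abs(k - peak) <= radius for k in range(len(p))]
    p_in = sum(pk for pk, is_near in zip(p, near) if is_near)
    p_out = sum(pk for pk, is_near in zip(p, near) if not is_near)
    if p_out <= 0.0 or beta * p_in > p_in + p_out:
        raise ValueError("cannot compensate the change outside the peak range")
    # Choose gamma so the gain near the peak is cancelled elsewhere:
    #   beta * p_in + gamma * p_out = p_in + p_out
    gamma = (p_in + p_out - beta * p_in) / p_out
    first = [pk * (beta if is_near else gamma)
             for pk, is_near in zip(p, near)]
    second = [pk * ((2.0 - beta) if is_near else (2.0 - gamma))
              for pk, is_near in zip(p, near)]
    return first, second


original = [0.1, 0.2, 0.4, 0.2, 0.1]
first, second = split_distribution(original)
```

With these factors the two copies average back to the original distribution, so they start as near-identical models that the subsequent sample assignment and re-estimation can pull apart.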

[0031] The spread of an HMM's distribution could also be measured by detecting multimodality in the shape of the HMM's output probability distribution. The HMM could then be split so that each detected mode becomes the distribution of a new HMM.

[0032]

[Effect of the Invention] As described above, the present invention makes it possible to create standard patterns that are balanced as a set, and therefore enables highly accurate speech recognition.

[0033]

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition device of the present invention.

FIG. 2 is a diagram illustrating the hidden Markov model of a basic recognition unit used in the speech recognition device of the present invention.

FIG. 3 is a diagram illustrating how the hidden Markov models of the basic recognition units used in the speech recognition device of the present invention are concatenated according to the word dictionary.

FIG. 4 is a flowchart illustrating the standard pattern learning method of the present invention.

FIG. 5 is a flowchart illustrating the parameter estimation process in the standard pattern learning method of the present invention.

FIG. 6 is a flowchart illustrating another learning flow of the standard pattern learning method of the present invention.

FIG. 7 is a diagram illustrating the hidden Markov model splitting method used in the standard pattern learning method of the present invention.

[Explanation of symbols]

1 ... voice input means
2 ... voice analysis means
3 ... collating means
4 ... determination means
5 ... standard pattern storage means
6 ... word dictionary
7 ... standard pattern connecting means
101 ... initial model creation processing
102 ... parameter re-estimation processing
204 ... Forward-Backward algorithm
302 ... processing for measuring the spread of the standard pattern distributions
303 ... balance determination processing for the set of standard patterns
305 ... processing for pluralizing a standard pattern

Claims

[Claims]

1. A speech recognition apparatus comprising: voice input means for inputting speech; voice analysis means for analyzing the input speech and outputting a time series of feature vectors; standard pattern storage means for storing standard patterns for syllables, phonemes, or basic speech units smaller than phonemes; a word dictionary that describes each recognition target word as a sequence of syllables, phonemes, or basic speech units smaller than phonemes; standard pattern connecting means for connecting the standard patterns for syllables, phonemes, or basic speech units smaller than phonemes to form a standard pattern for each recognition target word; and collating means for collating the time series of feature vectors of the input speech with the standard pattern formed by the connection, the apparatus performing recognition on the basis of the collation result output from the collating means, wherein a plurality of standard patterns are provided for a syllable, phoneme, or basic speech unit smaller than a phoneme in accordance with the degree of variation of the speech patterns of that syllable, phoneme, or basic speech unit, and wherein a syllable, phoneme, or basic speech unit having a plurality of standard patterns is collated with each of the plurality of standard patterns in the collation process, and the plurality of collation results are integrated to obtain the recognition result.

2. The speech recognition apparatus according to claim 1, wherein the standard pattern is composed of a probabilistic model, and the probabilistic model is learned by using a speech sample for learning.

3. The speech recognition apparatus according to claim 2, wherein the probabilistic model is a hidden Markov model.

4. A standard pattern learning method in which, for a learning speech sample whose utterance content is known, standard patterns of syllables, phonemes, or basic speech units smaller than phonemes are concatenated according to the utterance content to create a standard pattern representing the entire utterance content; the standard pattern representing the entire utterance content is trained using the learning speech sample; the trained standard pattern representing the entire utterance content is decomposed into syllables, phonemes, or basic speech units smaller than phonemes to obtain standard patterns of syllables, phonemes, or basic speech units smaller than phonemes; and this learning process is repeated as many times as the number of learning speech samples given, the method being characterized in that the degree of spread of the distribution of speech patterns is measured for each created standard pattern of a syllable, phoneme, or basic speech unit smaller than a phoneme, a standard pattern whose distribution is widely spread is pluralized, and the pluralized standard patterns are trained again using the learning speech samples.

5. The standard pattern learning method according to claim 4, wherein the standard pattern is constituted by a probabilistic model, and the degree of spread of the distribution of speech patterns for the standard pattern is measured on the basis of the variance values in the distribution information held by the probabilistic model.

6. The standard pattern learning method according to claim 4, wherein the standard pattern is constituted by a probabilistic model, and the degree of spread of the distribution of speech patterns for the standard pattern is measured on the basis of detection of multimodality in the distribution held by the probabilistic model.

7. The standard pattern learning method according to claim 5 or 6, wherein the standard pattern is constituted by a probabilistic model, and the standard pattern is pluralized by applying minute variations to the distribution information of the original standard pattern.

8. The standard pattern learning method according to claim 4, 5 or 6, wherein the standard pattern is constituted by a probabilistic model, and the standard pattern is pluralized by detecting multimodality in the distribution information of the original standard pattern, selecting, from among the plurality of detected distributions, distributions whose components have large values, and performing the pluralization on the basis of the selected distributions.

9. The standard pattern learning method according to claim 5, wherein the probabilistic model is a hidden Markov model.