JPH04271397A - Voice recognizer

Info

Publication number: JPH04271397A
Application number: JP3032933A
Authority: JP
Inventors: Hiroki Onishi (大西宏樹); Masanori Miyatake (宮武正典)
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1991-02-27
Filing date: 1991-02-27
Publication date: 1992-09-28

Abstract

PURPOSE: To recognize speech while taking into account the combinations of preceding and following phonemes. Starting from standard patterns for a limited number of phonemes, the recognizer learns and acquires only the standard patterns it needs for those combinations while performing recognition, even if standard patterns for all combinations are not initially available. CONSTITUTION: After a recognition result is determined during speech recognition, the phoneme pattern corresponding to the input speech pattern is extracted. Environment information indicating where in the word the phoneme pattern occurs (combination information on adjacent phonemes, concretely a triplet of phoneme labels), derived from the phoneme labels obtained from a word dictionary 16, is stored in a phoneme environment table 17, and the phoneme standard pattern is obtained and stored in a phoneme standard pattern memory 15.

Description

[Detailed description of the invention]

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes speech using segments such as phonemes or syllables as recognition units, and in particular to a speech recognition device that efficiently selects the standard patterns of the recognition-unit segments for identification processing such as pattern matching.

[0002]

[Prior Art] The speech recognition devices put to practical use so far are word speech recognition devices, which handle speech patterns word by word and use the word as the recognition unit. The speech patterns of the words to be recognized are stored in advance as word-level standard patterns, and recognition processing such as pattern matching is performed for each word.

[0003] Such a word speech recognition device achieves a practical recognition rate, but its vocabulary is limited to the words registered as standard patterns, so unregistered words cannot be recognized. Recognizing as many words as possible therefore requires registering a large number of words. Considering that the user would have to utter hundreds or thousands of words during registration, such a device cannot be called practical.

[0004] In contrast, continuous speech recognition devices, which are currently under research, use speech segments finer than words, such as phonemes or syllables (on the order of several tens of types), as recognition units. By recognizing speech segment by segment, they can recognize speech of any vocabulary.

[0005] Such a continuous speech recognition device can employ the level building method under automaton control. This method has already been adopted for continuous word recognition, in which words are the recognition units, and is described in detail in, for example, Japanese Examined Patent Publication No. Hei 1-59600.

[0006] The configuration and operation of a speech recognition device using the level building method under automaton control are explained below with reference to FIGS. 2 to 5.

[0007] The speech recognition device shown in FIGS. 2 to 5 holds standard patterns in phoneme units and recognizes continuous speech. It can be regarded as the device of the above-mentioned Japanese Examined Patent Publication No. Hei 1-59600 with continuous word recognition replaced by continuous phoneme recognition and the automaton memory replaced by a word dictionary.

[0008] First, in FIG. 2, speech input from the microphone 21 undergoes feature extraction in the speech analysis unit 22 by an analysis method such as LPC cepstrum analysis. An input speech pattern consisting of a time series of parameter vectors is created and stored in the pattern buffer 23. The input speech pattern thus obtained is passed to the phoneme recognition unit 24, where recognition is performed using the phoneme standard pattern memory 25 and the word dictionary 26.

[0009] FIG. 3 shows the internal configuration of the phoneme recognition unit 24. In this unit, the word dictionary control unit 244 reads out a phoneme label string describing a word from the word dictionary 26, which holds tree-structured chains of phonemes as shown in FIG. 5.

[0010] Specifically, the first phoneme label of the phoneme label string is taken out, the phoneme standard pattern bearing the same phoneme label is read from the phoneme standard pattern memory 25 of FIG. 4 by the phoneme standard pattern control unit 243, and DP matching by the level building method is performed from the beginning of the input speech pattern in the pattern buffer 23. The subsequent phoneme labels are then taken out one by one in the order of the phoneme label string, the phoneme standard pattern bearing each label is read from the phoneme standard pattern memory 25 by the phoneme standard pattern control unit 243, and level building continues.

[0011] When the end of the phoneme label string has been reached and level building has also reached the end of the input speech pattern in the pattern buffer 23, the matching distance value and the phoneme label string representing the word are sent to the word determination unit 242 as an output candidate.

[0012] Having received the output candidates, the word determination unit 242 selects the smallest of all the matching distance values and outputs the phoneme label string of the corresponding word as the recognition result.
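The conventional flow of paragraphs [0010] to [0012] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the level building method itself: it uses plain dynamic time warping against the concatenation of each word's phoneme templates, with Euclidean frame distance. The names `dtw_distance`, `recognize`, `dictionary`, and `templates` are hypothetical and do not come from the patent.

```python
import math

def dtw_distance(inp, ref):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(inp), len(ref)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(inp[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(inp, dictionary, templates):
    """Return (distance, word) for the dictionary word with the smallest distance."""
    best = None
    for word, labels in dictionary.items():
        # Assumption: a word reference is the concatenation of its phoneme templates.
        ref = [frame for label in labels for frame in templates[label]]
        dist = dtw_distance(inp, ref)
        if best is None or dist < best[0]:
            best = (dist, word)
    return best

# Toy one-dimensional "feature vectors", purely illustrative.
templates = {"a": [[0.0]], "k": [[1.0]], "i": [[2.0]]}
dictionary = {"aki": ["a", "k", "i"], "ika": ["i", "k", "a"]}
result = recognize([[0.0], [0.1], [1.0], [2.0]], dictionary, templates)
```

Here the role of the word determination unit is played by the running minimum in `recognize`; every occurrence of a label uses the same template, which is exactly the limitation the following paragraphs criticize.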

[0013] In this example, the phoneme standard pattern memory 25 is a table-format memory that associates the phoneme labels /a/, /k/, ..., /N/ with their corresponding phoneme standard patterns, as shown in FIG. 4, and the word dictionary 26 is a memory in which phonemes are stored in a tree structure, as shown in FIG. 5.

[0014] Such a continuous speech recognition device has the following drawback, which remains a problem to be solved.

[0015] In the conventional device described above, phoneme standard patterns are retrieved from the phoneme standard pattern memory in the order of the phoneme label string in the word dictionary and matched one after another. However, even for the same phoneme label, the phoneme pattern is deformed by the preceding and following phonemes depending on where in the word it is uttered. It is therefore inappropriate to match every occurrence against the same phoneme standard pattern merely because the phoneme label is the same.

[0016] However, if the environment of the adjacent phonemes is also to be taken into account, there are several thousand combinations of phoneme labels including the preceding and following phonemes. Creating phoneme standard patterns all at once from speech data covering every combination has the disadvantage that a large amount of speech data must be processed at the outset.

[0017]

[Problems to be Solved by the Invention] The present invention has been made in view of the above disadvantage. Its object is to realize a speech recognition device in which, provided that standard patterns for a certain number of segments are prepared as the initial state, the standard patterns for combinations of preceding and following segments can be learned and acquired as needed while recognition is being performed, even if standard patterns for all such combinations are not yet available.

[0018]

[Means for Solving the Problems] The speech recognition device of the present invention uses segments such as phonemes or syllables constituting speech as recognition units. It comprises a preceding/following environment information table, which stores standard pattern data numerically expressing the standard patterns of the recognition-unit segments in association with preceding/following environment information on the recognition-unit segments connected before and after each standard pattern's segment, and a word dictionary storing a large number of words, each consisting of a chain of segments indexed by recognition-unit segment. At recognition time, the device selects the standard pattern of each recognition unit based on the preceding/following environment information in the table and performs recognition. At initial learning time, it clusters speech uttered in advance to create the standard patterns and the preceding/following environment information table from that speech. At additional learning time, it clusters speech using speech patterns obtained by weighting the standard patterns with weighting coefficients based on the preceding/following environment information stored in the table and on frequency information derived from the number of learning samples, thereby creating new standard patterns and a new preceding/following environment information table.

[0019]

[Operation] With the speech recognition device of the present invention, when the user specifies the learning function during speech recognition, additional learning of the phoneme standard patterns and the table is performed using the speech input during the learning period. This allows unlearned phoneme standard patterns to be learned and recognition performance to improve gradually. Such additional learning can also run in parallel with normal recognition, so that recognition performance improves while the device is being used for recognition.

[0020]

[Embodiment] FIG. 1 shows the configuration of an embodiment of the speech recognition device of the present invention.

[0021] In the figure, 11 to 16 denote the microphone through the word dictionary, in the same way as 21 to 26 in FIG. 2. The device of this embodiment differs from the device of FIG. 2 in that a phoneme environment table 17 has been added.

[0022] The configuration and operation of this embodiment are explained below. First, speech input from the microphone 11 undergoes feature extraction in the speech analysis unit 12 by an analysis method such as LPC cepstrum analysis; an input speech pattern consisting of a time series of parameter vectors is created and stored in the pattern buffer 13.

[0023] In the phoneme recognition unit 14, as shown in FIG. 6, the word dictionary control unit 144 reads a triplet of phoneme labels from the phoneme label string describing a word in the word dictionary 16, according to the level information from the level building unit 141: the phoneme label at the corresponding position, the preceding phoneme label, and the following phoneme label.

[0024] The word dictionary 16 is, for example, a memory storing tree-structured phoneme label strings as shown in FIG. 7. When /k/ of the word "iwaki" is read out, the phoneme label /k/ (*2), the preceding phoneme label /a/ (*1), and the following phoneme label /i/ (*3) are obtained together as a triplet.
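The triplet read-out in paragraph [0024] can be illustrated with a short sketch. It assumes the word is held as a flat list of phoneme labels rather than the tree of FIG. 7, and it uses `"_"` for a missing neighbor at a word boundary (the source writes /_ki/ for a word-initial /k/; using the same mark at the word end is an assumption). The function name `triplet_labels` is hypothetical.

```python
def triplet_labels(phonemes, position):
    """Return (preceding, current, following) labels for one position in a word."""
    preceding = phonemes[position - 1] if position > 0 else "_"
    following = phonemes[position + 1] if position + 1 < len(phonemes) else "_"
    return preceding, phonemes[position], following

word = ["i", "w", "a", "k", "i"]          # the word "iwaki" from the text
print(triplet_labels(word, 3))            # ('a', 'k', 'i')
print(triplet_labels(["k", "i"], 0))      # ('_', 'k', 'i')
```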

[0025] Next, using this triplet of phoneme labels /a/, /k/, /i/ (the phoneme environment), the phoneme standard pattern corresponding to that phoneme environment is looked up in the phoneme environment table 17.

[0026] The phoneme environment table 17 is, for example, a memory that stores each [cluster No.] in association with its [phoneme standard pattern label] and its [phoneme environments], as shown in FIG. 11. For /a/, /k/, /i/ (written /aki/ in the figure), for example, the triplet-phoneme column is searched and /k/ of cluster No. 1 is found as the corresponding phoneme standard pattern label. The total number of such triplet phonemes is on the order of several thousand, whereas the total number of phoneme standard pattern labels is about 200 at most.

[0027] Because matching time grows with the number of phoneme standard patterns, that number is limited and multiple phoneme environments (triplet phonemes) are assigned to each phoneme standard pattern. The table describing this assignment is called the phoneme environment table. The assignment (correspondence) is determined statistically from a large number of training phoneme patterns.
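One way to picture the many-to-one assignment of paragraphs [0026] and [0027] is a table keyed by cluster number plus an inverted index keyed by phoneme environment. The sketch below is only an illustration: the table contents are invented (FIG. 11 is not reproduced in this text), and the names `ENVIRONMENT_TABLE` and `build_index` are hypothetical.

```python
# Hypothetical contents, not taken from FIG. 11.
# cluster No. -> (phoneme standard pattern label, {phoneme environment: count})
ENVIRONMENT_TABLE = {
    1: ("k", {("a", "k", "i"): 12, ("a", "k", "a"): 7}),
    2: ("k", {("i", "k", "u"): 9}),
    3: ("a", {("k", "a", "_"): 15, ("w", "a", "k"): 4}),
}

def build_index(table):
    """Invert the table: phoneme environment -> list of cluster numbers."""
    index = {}
    for cluster_no, (_label, environments) in table.items():
        for environment in environments:
            index.setdefault(environment, []).append(cluster_no)
    return index

INDEX = build_index(ENVIRONMENT_TABLE)
cluster_numbers = INDEX[("a", "k", "i")]   # clusters whose pattern covers /aki/
```

Several environments share one standard pattern, so the number of patterns to match stays small even though the number of environments runs into the thousands.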

[0028] The phoneme standard pattern control unit 143 uses the phoneme standard pattern label /k/ (cluster No. 4) indicated by the phoneme environment table 17 to read the phoneme standard pattern of cluster No. 4, phoneme label /k/, from the phoneme standard pattern memory 15, and sends it to the level building unit 141. The level building unit 141 performs DP matching by the level building method from the beginning of the input speech pattern in the pattern buffer 13 and, as shown in FIG. 10, matches the phoneme standard pattern sent from the phoneme standard pattern control unit 143 at the corresponding position of the input speech pattern.

[0029] When matching has reached the end of the phoneme label string in the word dictionary and level building has also reached the end of the input speech pattern in the pattern buffer 13, the level building unit 141 sends the matching distance value and the phoneme label string representing the word to the word determination unit 142 as an output candidate.

[0030] The word determination unit 142 selects the smallest of all the matching distance values and outputs the phoneme label string of the corresponding word (the recognition result).

[0031] In this way, a phoneme environment table is used so that, when the standard pattern for a phoneme is selected, one whose phoneme environment matches is chosen effectively.

[0032] The initial learning of the phoneme standard patterns and the phoneme environment table is explained next.

[0033] First, the data of each phoneme interval (called phoneme segment data) is collected from initial training speech data in which every phoneme interval has been labeled with its phoneme. Clustering is then performed on this phoneme segment data as follows.

[0034] (1) As the initial clusters, the phoneme segments with the same phoneme label are grouped into one cluster, and the centroid of each cluster is computed. With phoneme labels as in FIG. 8, the initial number of clusters N is 40.

[0035] (2) The average distortion of each cluster is computed, and the largest is denoted DNmax. If the number of clusters N has reached the predetermined number, the procedure ends.

[0036] (3) The cluster with the largest distortion is split in two using the phoneme segments that are farthest apart within the cluster, and the centroid of each part is computed. N is set to N + 1 and the procedure repeats from (2). The average distortion Dn of cluster n is given by

[0037]

[Math 1]

[0038] where Mn is the number of elements (phoneme segments) in cluster n, Cn is the centroid of cluster n, and Xni is the i-th element (phoneme segment) of cluster n. The final number of clusters N is set to about 128 to 256.
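Steps (1) to (3) can be sketched as a small splitting procedure. Several assumptions are made here and should not be read as the patent's method: each phoneme segment is reduced to a fixed-length feature vector (the real data are time series of parameter vectors), the distance is Euclidean, the equation image [Math 1] is not reproduced in this text so Dn is taken to be the mean distance between each Xni and Cn, and "splitting by the farthest-apart segments" is interpreted as seeding two groups with that pair. All function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def average_distortion(vectors):
    """Assumed D_n: mean Euclidean distance from each member to the centroid."""
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)

def split_in_two(vectors):
    """Seed two groups with the farthest-apart pair; assign members to the nearer seed."""
    pairs = [(math.dist(a, b), i, j)
             for i, a in enumerate(vectors)
             for j, b in enumerate(vectors) if i < j]
    _, i, j = max(pairs)
    first, second = [], []
    for v in vectors:
        nearer_first = math.dist(v, vectors[i]) <= math.dist(v, vectors[j])
        (first if nearer_first else second).append(v)
    return first, second

def cluster_segments(segments, target_n):
    """segments: list of (phoneme_label, feature_vector) pairs."""
    groups = {}
    for label, vector in segments:             # step (1): one cluster per label
        groups.setdefault(label, []).append(vector)
    clusters = list(groups.items())
    while len(clusters) < target_n:            # step (2): stop at the target count
        worst = max(range(len(clusters)),
                    key=lambda k: average_distortion(clusters[k][1]))
        label, members = clusters[worst]
        if average_distortion(members) == 0.0:
            break                              # nothing left to split
        del clusters[worst]
        clusters.extend((label, part) for part in split_in_two(members))  # step (3)
    return [(label, centroid(m), len(m)) for label, m in clusters]

segments = [("a", [0.0]), ("a", [0.2]), ("k", [5.0]), ("k", [5.1]), ("k", [9.0])]
patterns = cluster_segments(segments, target_n=3)
```

Each returned tuple holds a label, a centroid (the phoneme standard pattern), and the member count that later serves as frequency information.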

[0039] The phoneme segment data forming the centroid of each cluster obtained in this way is used as a phoneme standard pattern, and the phoneme environments of the phoneme segment data in each cluster, together with their counts, form the phoneme environment table. FIG. 11 shows an example of such a table, which holds the cluster No., the phoneme label, and the phoneme environments with their frequency information. The number following each phoneme environment is the frequency information of that environment and represents the number of elements (phoneme segments) included in the cluster at learning time.

[0040] As the frequency information, a value obtained by normalizing the cumulative number of occurrences of the phoneme environment to a maximum of 100 may be used instead, for example.

[0041] At recognition time, these occurrence counts are used to select the phoneme standard pattern. For example, the triplet phoneme label /_ki/ appears under /k/ in both cluster 5 and cluster 6, with occurrence counts of 27 and 33 respectively. Normalized to a maximum of 100, these become 81 and 100. If the selection threshold is 85, the phoneme standard pattern label for the phoneme environment /_ki/ is cluster 6, and the corresponding phoneme standard pattern is read from the phoneme standard pattern memory. Depending on the threshold, the same phoneme environment may be matched against several phoneme standard patterns; in that case the one with the smallest matching distance to the input speech pattern is finally selected.
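The worked example in paragraph [0041] can be reproduced with a few lines. The sketch assumes the normalized value is truncated to an integer, since 27/33 is about 81.8% and the text gives 81; the function name `select_clusters` is hypothetical.

```python
def select_clusters(counts, threshold):
    """counts: {cluster_no: occurrence count of one phoneme environment}."""
    peak = max(counts.values())
    # Normalize so the largest count becomes 100 (integer truncation assumed).
    normalized = {no: count * 100 // peak for no, count in counts.items()}
    selected = [no for no, score in normalized.items() if score >= threshold]
    return selected, normalized

selected, normalized = select_clusters({5: 27, 6: 33}, threshold=85)
print(normalized)  # {5: 81, 6: 100}
print(selected)    # [6]
```

With a lower threshold such as 80, both clusters would be returned, which corresponds to the case where several standard patterns are matched and the one with the smallest distance wins.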

[0042] Thus each phoneme has multiple phoneme standard patterns, and each phoneme standard pattern has multiple phoneme environments. Unlike conventional multi-template phoneme standard patterns, however, which phoneme standard pattern is selected is determined by the phoneme environment, and the selection criterion also incorporates frequency information. Standard patterns (templates) can therefore be selected effectively, unlike in the conventional multi-template method.

[0043] The additional learning of the phoneme standard patterns and the phoneme environment table is explained next. The phoneme standard patterns and phoneme environment table obtained by initial learning reflect the phoneme environments contained in the initial training data but, as noted above, do not cover all phoneme environments of speech. Additional learning learns the phoneme environments not contained in the initial training data and proceeds as follows.

[0044] (1) The phoneme segment data in the input speech, automatically segmented from the matching result at recognition time, is taken together with its phoneme environment information as additional training data 1.

[0045] (2) The phoneme standard patterns, weighted by the number of cluster elements in the phoneme environment table, are taken as additional training data 2. (This can be regarded as approximate data for the initial training data, having the average distortion of each cluster.)

(3) From additional training data 1 and 2, the phoneme standard patterns and the phoneme environment table are re-created by the same method as the initial learning described above.
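The effect of the weighting in step (2) can be shown with a small sketch: a stored centroid with weight equal to its cluster element count stands in for the original training segments it summarizes. The sketch assumes fixed-length feature vectors and weight 1 for each newly acquired segment, and it only illustrates the weighted mean, not the full re-clustering of step (3). The names `additional_training_set` and `weighted_centroid` are hypothetical.

```python
def additional_training_set(standard_patterns, new_segments):
    """Combine additional training data 1 and 2 into (label, vector, weight) items.

    standard_patterns: [(label, centroid, element_count)] from earlier learning
    new_segments:      [(label, vector)] cut out of recognized input speech
    """
    data1 = [(label, vector, 1) for label, vector in new_segments]  # assumed weight 1
    data2 = [(label, vector, count) for label, vector, count in standard_patterns]
    return data1 + data2

def weighted_centroid(items):
    """items: list of (feature_vector, weight) pairs."""
    total = sum(weight for _, weight in items)
    dim = len(items[0][0])
    return [sum(vec[d] * weight for vec, weight in items) / total for d in range(dim)]

old_patterns = [("k", [1.0], 3)]          # centroid of 3 earlier /k/ segments
new_segments = [("k", [5.0])]             # one newly acquired /k/ segment
data = additional_training_set(old_patterns, new_segments)
new_centroid = weighted_centroid([(vec, w) for _, vec, w in data])  # [2.0]
```

Because the old centroid counts three times, the result equals the mean that would have been obtained from the three original segments plus the new one, without storing those segments.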

[0046] The phoneme standard patterns and phoneme environment table created in this way cover the phoneme environments contained in both the initial training data and the additional training data: the phoneme standard patterns are obtained as new centroids, and the phoneme environment table holds the phoneme environments from both data sets together with their frequency information.

[0047] Additional learning can be repeated any number of times to acquire new phoneme environments. Because the data used for learning up to that point is represented by N (= 128 to 256) phoneme standard patterns, each round of additional learning processes only N plus the number of additional training data items, no matter how many rounds are performed. Moreover, only the N phoneme standard patterns need to be stored, and the processing load of the second and later rounds does not grow.

[0048] To acquire data for additional learning and perform the learning, an additional learning period must be specified at recognition time.

[0049] In the speech recognition device of the present invention, the additional learning period is specified as follows. When the learning function is specified while the device is in the recognition state, phoneme segment data is acquired as additional training data from the speech input during a predetermined period thereafter (until the user gives an end instruction, for a fixed time, for a fixed number of inputs, and so on), based on the recognition results or on corrections of them, and additional learning is performed. Additional learning can also run in parallel with normal recognition; in that case the speech recognition device gradually adapts to the user's voice without the user being aware of the learning.

[0050] The embodiment uses the level building method with phoneme segments as the recognition units, but the same effect is obtained by a method that first segments the speech into phonemes and then matches each phoneme interval against the phoneme standard patterns. In that case the phoneme recognition unit is configured as shown in FIG. 12, with the level building unit 141 replaced by a phoneme segmentation unit 145 and a phoneme matching unit 146.

[0051] The above description uses phoneme segments as the recognition units, but the present invention is not limited to this; the same effect is obtained with, for example, syllable segments as the recognition units.

[0052] As described above, the present invention uses a phoneme environment table to select, for a given phoneme, a standard pattern whose phoneme environment matches. Unlike the conventional method with one phoneme standard pattern per phoneme label, and unlike the simple multi-template method in which one phoneme label has several phoneme standard patterns that are all treated alike, this realizes a speech recognition device in which each phoneme standard pattern (template) is selected effectively according to the phoneme environment in the dictionary.

[0053]

[Effects of the Invention] After a recognition result has been determined during speech recognition, the speech recognition device of the present invention extracts the corresponding phoneme patterns from the input speech pattern and, using the phoneme labels obtained from the word dictionary, acquires both the environment information indicating where in the word each phoneme pattern lies and the phoneme standard pattern. Therefore, provided that standard patterns for a certain number of phonemes are prepared in the initial state of the device, standard patterns for the combinations of preceding and following phonemes can be learned and acquired as needed while recognition is performed, even if standard patterns for all such combinations are not yet available. Speech recognition with high accuracy that takes the combinations of adjacent phonemes into account can thus be realized.

[Brief Description of the Drawings]

FIG. 1 is a configuration diagram showing an embodiment of the speech recognition device according to the present invention.

FIG. 2 is a configuration diagram showing an example of a conventional speech recognition device.

FIG. 3 is a configuration diagram showing the phoneme recognition unit of the conventional speech recognition device of FIG. 2.

FIG. 4 is a configuration diagram showing the phoneme standard pattern memory of the conventional speech recognition device of FIG. 2.

FIG. 5 is a configuration diagram showing the word dictionary of the conventional speech recognition device of FIG. 2.

FIG. 6 is a configuration diagram showing the phoneme recognition unit of the speech recognition device according to the present invention.

FIG. 7 is a configuration diagram showing the word dictionary of the speech recognition device according to the present invention.

FIG. 8 is a schematic diagram showing an example of the phoneme labels of the speech recognition device according to the present invention.

FIG. 9 is a configuration diagram showing the phoneme standard pattern memory of the speech recognition device according to the present invention.

FIG. 10 is a diagram showing the matching between an input pattern and phoneme standard patterns in the speech recognition device according to the present invention.

FIG. 11 is a configuration diagram showing the phoneme environment table of the speech recognition device according to the present invention.

FIG. 12 is a configuration diagram showing another embodiment of the phoneme recognition unit of the speech recognition device according to the present invention.

[Explanation of symbols]

11 Microphone
12 Speech analysis unit
13 Pattern buffer
14 Phoneme recognition unit
15 Phoneme standard pattern memory
16 Word dictionary
17 Phoneme environment table

Claims

[Claims]

Claim 1: A speech recognition device that recognizes speech using segments such as phonemes or syllables constituting speech as recognition units, comprising: a preceding/following environment information table that stores standard pattern data numerically expressing the standard patterns of the recognition-unit segments in association with preceding/following environment information on the recognition-unit segments connected before and after each standard pattern's recognition-unit segment; and a word dictionary storing a large number of words, each consisting of a chain of segments indexed by recognition-unit segment; wherein, at recognition time, recognition is performed by selecting the standard pattern of each recognition unit based on the preceding/following environment information in the preceding/following environment information table; at initial learning time, speech uttered in advance is clustered to create the standard patterns and the preceding/following environment information table from that speech; and at additional learning time, speech is clustered using speech patterns obtained by weighting the standard patterns with weighting coefficients based on the preceding/following environment information stored in the table and on frequency information derived from the number of learning samples, thereby creating new standard patterns and a new preceding/following environment information table.