JPH08171396A

JPH08171396A - Speech recognition device

Info

Publication number: JPH08171396A
Application number: JP6316382A
Authority: JP
Inventors: Yumi Wakita (脇田 由実); Harald Singer (ハラルド・シンガー); Yoshinori Sagisaka (匂坂 芳典)
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1994-12-20
Filing date: 1994-12-20
Publication date: 1996-07-02
Anticipated expiration: 2014-11-10
Also published as: JP2975542B2

Abstract

PURPOSE: To provide a speech recognition device that corrects phoneme recognition errors more reliably and achieves a higher speech recognition rate than conventional devices. CONSTITUTION: A matching unit 5 performs phoneme matching on input uttered speech using phoneme HMMs (hidden Markov models) and outputs the recognized phoneme sequence corresponding to the speech together with its speech sections. During training, the recognized phoneme sequence and its speech sections are compared with the correct phoneme sequence and its speech sections; where the recognized sequence differs from the correct one, the pair of the erroneously recognized phoneme sequence and the correct phoneme sequence is extracted into an error tendency table. During recognition, the recognized phoneme sequence, together with its speech sections, is compared with the error phoneme sequences in the error tendency table; when an error phoneme sequence is detected, it is replaced with the corresponding correct phoneme sequence to correct the error. State sequences may be used in place of phoneme sequences.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes input uttered speech using hidden Markov models (hereinafter, HMMs).

[0002]

[Prior Art and Problems to Be Solved by the Invention] To mitigate the uncertainty of acoustic matching between reference models and input speech, conventional speech recognition devices either use a wide beam during matching and narrow down phoneme and word candidates with language constraints, or select appropriate candidates from multiple high-scoring result candidates using contextual conditions. In practice, however, limits on processing time and processing capacity mean that a sufficient beam width or number of recognition candidates cannot be secured, or, even when it can, there are too many candidates to single out the correct one.

[0003] A method of correcting errors by learning the error tendencies of recognized syllable symbol sequences relative to input syllable symbol sequences, with a maximum sequence length of three syllables (hereinafter, the first conventional example), was proposed in Kamiya et al., "Correction of segmentation and recognition errors based on occurrence probabilities of syllable chains," Proceedings of the Acoustical Society of Japan, 3-5-8, pp. 103-104, and its effectiveness has been confirmed.

[0004] In the first conventional example, however, sequences are compared at the symbol level only, so portions are treated as corresponding on the symbol sequence even when the speech sections that were actually matched differ. There is therefore no guarantee that the same error can be corrected. Limiting the maximum sequence length also risks cutting off, partway through, a sequence that could correct the same error. The correction efficiency during speech recognition is consequently relatively low.

[0005] A method that, at training time, calculates the error rate and error content of each phoneme while taking the preceding and following context into account (hereinafter, the second conventional example) is disclosed in Tanaka et al., "Speeding up phrase detection in a Japanese dictation system," IEICE Technical Committee, SP90-70, pp. 17-24, 1990. The second conventional example works as follows. The error probability of each phoneme can be calculated from the conventional confusion matrix of the phoneme recognizer, so the error probability of each set of three phonemes contained in the dictionary can be calculated in advance. The calculated error probability is then used as a weighting coefficient on the recognition score to correct the recognition result.
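The idea behind the second conventional example can be sketched in a few lines. The source does not say how the per-phoneme error probabilities are combined for a set of three phonemes, so the independence rule below, the function names, and the counts are all assumptions made for illustration only.

```python
from typing import Dict, Sequence

# confusion[spoken][recognized] = number of times this confusion was observed
Confusion = Dict[str, Dict[str, int]]

def phoneme_error_rates(confusion: Confusion) -> Dict[str, float]:
    """Error rate per spoken phoneme: 1 - (correct count / total count)."""
    rates: Dict[str, float] = {}
    for spoken, row in confusion.items():
        total = sum(row.values())
        correct = row.get(spoken, 0)
        rates[spoken] = (1.0 - correct / total) if total else 0.0
    return rates

def triple_error_probability(triple: Sequence[str],
                             rates: Dict[str, float]) -> float:
    """Probability that at least one of the three phonemes is wrong.

    Assumption (not from the source): the phonemes fail independently.
    """
    p_all_correct = 1.0
    for phoneme in triple:
        p_all_correct *= 1.0 - rates[phoneme]
    return 1.0 - p_all_correct

# Hypothetical counts, not taken from the cited paper or the patent.
confusion = {
    "k": {"k": 90, "t": 10},
    "a": {"a": 80, "o": 20},
    "N": {"N": 95, "n": 5},
}
rates = phoneme_error_rates(confusion)
print(triple_error_probability(("k", "a", "N"), rates))
```

A value computed this way could be precomputed for every three-phoneme set in the dictionary and applied as a weight on the recognition score, which is the use the text describes.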

[0006] The second conventional example, however, can only correct errors within three phonemes, including the preceding and following context. Its correction efficiency during speech recognition is therefore also relatively low.

[0007] An object of the present invention is to solve the above problems and to provide a speech recognition device that can correct phoneme recognition errors more reliably and achieve a higher speech recognition rate than the conventional examples.

[0008]

[Means for Solving the Problems] The speech recognition device according to claim 1 of the present invention comprises: matching means that performs phoneme matching on input uttered speech using phoneme hidden Markov models and outputs a recognized phoneme sequence corresponding to the uttered speech together with its speech section information; error extraction means that, based on the phoneme sequence and speech section information obtained by the matching means for training speech whose correct phoneme sequence and speech sections are known, compares the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections and, when the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, extracts the pair of the erroneously recognized phoneme sequence and the correct phoneme sequence; storage means that stores the pairs of error phoneme sequence and correct phoneme sequence extracted by the error extraction means as an error tendency table; and error correction processing means that, based on the phoneme sequence and speech section information obtained by the matching means for input uttered speech to be recognized, refers to the error tendency table stored in the storage means, compares the recognized phoneme sequence with the error phoneme sequences in the error tendency table and, when an error phoneme sequence is detected in the recognized phoneme sequence, produces an error-corrected phoneme sequence by replacing it with the corresponding correct phoneme sequence, adds the corrected sequence to the phoneme sequences obtained by the matching means, and outputs them as phoneme recognition result candidates.

[0009] The speech recognition device according to claim 2 of the present invention comprises: matching means that performs phoneme matching on input uttered speech using phoneme hidden Markov models and outputs a recognized state sequence corresponding to the uttered speech together with its speech section information; error extraction means that, based on the state sequence and speech section information obtained by the matching means for training speech whose correct state sequence and speech sections are known, compares the recognized state sequence and its speech section information with the correct state sequence and its speech sections and, when the recognized state sequence differs from the correct state sequence in mutually corresponding speech sections, extracts the pair of the erroneously recognized state sequence and the correct state sequence; storage means that stores the pairs of error state sequence and correct state sequence extracted by the error extraction means as an error tendency table; and error correction processing means that, based on the state sequence and speech section information obtained by the matching means for input uttered speech to be recognized, refers to the error tendency table stored in the storage means, compares the recognized state sequence with the error state sequences in the error tendency table and, when an error state sequence is detected in the recognized state sequence, produces an error-corrected state sequence by replacing it with the corresponding correct state sequence, adds the corrected sequence to the state sequences obtained by the matching means, and outputs them as phoneme recognition result candidates.

[0010] The speech recognition device according to claim 3 is the speech recognition device according to claim 1 or 2, wherein the input uttered speech consists of one sentence, further comprising morphological analysis means that performs morphological analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined morpheme dictionary, and outputs the optimum speech recognition result as one sentence.

[0011] The speech recognition device according to claim 4 is the speech recognition device according to claim 1 or 2, wherein the input uttered speech consists of one word, further comprising word analysis means that performs word analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined word dictionary, and outputs the optimum speech recognition result as one word.

[0012] The speech recognition device according to claim 5 is the speech recognition device according to claim 1, 3 or 4, wherein the error tendency table contains pairs of a correct phoneme sequence with the state numbers of its phoneme hidden Markov models and an error phoneme sequence with the state numbers of its phoneme hidden Markov models. The speech recognition device according to claim 6 is the speech recognition device according to any one of claims 1 to 5, wherein each phoneme hidden Markov model is a hidden Markov model in which one phoneme consists of a plurality of states.

[0013] The speech recognition device according to claim 7 is the speech recognition device according to any one of claims 1 to 6, wherein the speech section information is expressed by the start frame number, at which the phoneme begins, and the end frame number, among a plurality of frames obtained by dividing the input uttered speech into frame sections of a predetermined length. The speech recognition device according to claim 8 is the speech recognition device according to claim 7, excluding claim 7 as dependent on claim 2 or 5, wherein the phoneme sequence obtained by the matching means and the correct phoneme sequence are each expressed, for each phoneme, as a set consisting of the phoneme, its state numbers, its start frame number, and its end frame number.

[0014]

[Operation] In the speech recognition device according to claim 1 configured as above, during training the error extraction means, based on the phoneme sequence and speech section information obtained by the matching means for training speech whose correct phoneme sequence and speech sections are known, compares the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections. When the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, it extracts the pair of the erroneously recognized phoneme sequence and the correct phoneme sequence and stores it in the storage means as an error tendency table. During recognition, the error correction processing means, based on the phoneme sequence and speech section information obtained by the matching means for the input uttered speech to be recognized, refers to the error tendency table stored in the storage means and compares the recognized phoneme sequence with the error phoneme sequences in the table. When an error phoneme sequence is detected in the recognized phoneme sequence, it produces an error-corrected phoneme sequence by replacing it with the corresponding correct phoneme sequence, adds the corrected sequence to the phoneme sequences obtained by the matching means, and outputs them as phoneme recognition result candidates.
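The recognition-time step can be illustrated with a minimal sketch. It compares phoneme symbols only and ignores speech section information and state numbers, applies one substitution per candidate, and uses hypothetical table entries; none of the names below come from the patent.

```python
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]

def find_all(seq: Phonemes, pattern: Phonemes) -> List[int]:
    """Return every start index at which pattern occurs in seq."""
    n, m = len(seq), len(pattern)
    return [i for i in range(n - m + 1) if seq[i:i + m] == pattern]

def correct_candidates(recognized: Phonemes,
                       error_table: Dict[Phonemes, Phonemes]) -> List[Phonemes]:
    """Return the matcher's result plus one corrected copy per table hit."""
    candidates = [recognized]  # the original result is always kept
    for wrong, right in error_table.items():
        for start in find_all(recognized, wrong):
            fixed = recognized[:start] + right + recognized[start + len(wrong):]
            if fixed not in candidates:
                candidates.append(fixed)
    return candidates

# Hypothetical error tendency table: error sequence -> correct sequence.
error_table = {("k", "a", "N"): ("k", "o", "N"), ("n", "a"): ("w", "a")}
recognized = tuple("kaNbaNna")
for cand in correct_candidates(recognized, error_table):
    print("".join(cand))
# prints kaNbaNna, koNbaNna, kaNbaNwa (one per line)
```

Because corrected sequences are added to the candidate list rather than overwriting the original, a later stage such as dictionary lookup can still choose the uncorrected result when the substitution was wrong.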

[0015] In the speech recognition device according to claim 2, during training the error extraction means, based on the state sequence and speech section information obtained by the matching means for training speech whose correct state sequence and speech sections are known, compares the recognized state sequence and its speech section information with the correct state sequence and its speech sections. When the recognized state sequence differs from the correct state sequence in mutually corresponding speech sections, it extracts the pair of the erroneously recognized state sequence and the correct state sequence and stores it in the storage means as an error tendency table. During recognition, the error correction processing means, based on the state sequence and speech section information obtained by the matching means for the input uttered speech to be recognized, refers to the error tendency table stored in the storage means and compares the recognized state sequence with the error state sequences in the table. When an error state sequence is detected in the recognized state sequence, it produces an error-corrected state sequence by replacing it with the corresponding correct state sequence, adds the corrected sequence to the state sequences obtained by the matching means, and outputs them as phoneme recognition result candidates.

[0016] In the speech recognition device according to claim 3, the input uttered speech consists of one sentence, and the morphological analysis means performs morphological analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined morpheme dictionary, and outputs the optimum speech recognition result as one sentence. Spoken sentences can therefore be recognized.

[0017] In the speech recognition device according to claim 4, the input uttered speech consists of one word, and the word analysis means performs word analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined word dictionary, and outputs the optimum speech recognition result as one word. Spoken words can therefore be recognized.

[0018] In the speech recognition device according to claim 5, the error tendency table preferably contains pairs of a correct phoneme sequence with the state numbers of its phoneme hidden Markov models and an error phoneme sequence with the state numbers of its phoneme hidden Markov models. In the speech recognition device according to claim 6, each phoneme hidden Markov model is preferably a hidden Markov model in which one phoneme consists of a plurality of states.

[0019] In the speech recognition device according to claim 7, the speech section information is preferably expressed by the start frame number, at which the phoneme begins, and the end frame number, among a plurality of frames obtained by dividing the input uttered speech into frame sections of a predetermined length. In the speech recognition device according to claim 8, the phoneme sequence obtained by the matching means and the correct phoneme sequence are each preferably expressed, for each phoneme, as a set consisting of the phoneme, its state numbers, its start frame number, and its end frame number.

[0020]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention. The speech recognition device comprises:
(a) a feature extraction unit 3 that extracts predetermined acoustic feature parameters from the digital speech signal of uttered speech that has been input to a microphone 1 and A/D-converted by an A/D converter 2;
(b) an acoustic parameter matching unit (hereinafter, matching unit) 5 that, based on the feature parameters input from the feature extraction unit 3 via a buffer memory 4, performs phoneme matching using the phoneme HMMs stored in a phoneme HMM memory 10 and stores the phoneme sequence corresponding to the uttered speech and its speech section information in a matching result storage buffer memory (hereinafter, storage buffer memory) 11;
(c) a recognition error phoneme sequence extraction unit (hereinafter, error extraction unit) 6 that, based on the phoneme sequence and speech section information obtained by the matching unit 5 for training speech whose correct phoneme sequence and speech sections are known, compares the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections and, when the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, extracts the pair of the erroneously recognized phoneme sequence and the correct phoneme sequence;
(d) an error tendency table memory 12 that stores the pairs of error phoneme sequence and correct phoneme sequence extracted by the error extraction unit 6 as an error tendency table;
(e) a result candidate error correction processing unit (hereinafter, error correction processing unit) 7 that, based on the phoneme sequence and speech section information obtained by the matching unit 5 for the input uttered speech to be recognized, refers to the error tendency table stored in the error tendency table memory 12, compares the recognized phoneme sequence with the error phoneme sequences in the table and, when an error phoneme sequence is detected in the recognized phoneme sequence, produces an error-corrected phoneme sequence by replacing it with the corresponding correct phoneme sequence, adds the corrected sequence to the phoneme sequences obtained by the matching unit 5, and outputs them as phoneme recognition result candidates; and
(f) a morphological analysis unit 8 that, when the input uttered speech consists of one sentence, performs morphological analysis on the phoneme recognition result candidates output from the error correction processing unit 7, with reference to the morpheme dictionary in a morpheme dictionary memory 13 and the N-gram dictionary in an N-gram dictionary memory 14, and outputs the optimum speech recognition result as one sentence.

[0021] The speech recognition device of this embodiment has a learning mode, in which the error tendency table is extracted, and a recognition mode, in which uttered speech input from the microphone is recognized. Of the units 1 to 8, the error extraction unit 6 operates only in learning mode, and the error correction processing unit 7 and the morphological analysis unit 8 operate only in recognition mode.

[0022] In FIG. 1, a speaker's uttered speech is input to the microphone 1 and converted into a speech signal, which is then input to the A/D converter 2 and converted from an analog speech signal into a digital speech signal with a frame period of, for example, 10 milliseconds. Each frame, obtained by dividing the digital speech signal into frame sections of a predetermined length, is assigned a frame number, which is a serial number counted from the start of input, and the digital speech signal is input to the feature extraction unit 3. The feature extraction unit 3 performs, for example, LPC analysis on the input digital speech signal and extracts 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the matching unit 5 via the buffer memory 4.
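As a quick check on the arithmetic, the 34 dimensions are 1 (log power) + 16 (cepstrum) + 1 (Δ log power) + 16 (Δ cepstrum). The sketch below reproduces that count and converts a frame number to a time offset using the example 10 ms frame period. The function names are illustrative, and counting frames from 1 is an assumption based on Table 1, where the first phoneme starts at frame 1.

```python
FRAME_PERIOD_MS = 10   # example frame period given in the text
CEPSTRUM_ORDER = 16    # order of the cepstrum and delta-cepstrum coefficients

def feature_dimension(cepstrum_order: int = CEPSTRUM_ORDER) -> int:
    """log power + cepstrum + delta log power + delta cepstrum."""
    return 1 + cepstrum_order + 1 + cepstrum_order

def frame_start_ms(frame_number: int, period_ms: int = FRAME_PERIOD_MS) -> int:
    """Start time of a frame in milliseconds, assuming frames are numbered from 1."""
    return (frame_number - 1) * period_ms

print(feature_dimension())   # 34
print(frame_start_ms(21))    # 200
```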

[0023] Each HMM in the HMM memory 10 connected to the matching unit 5 consists of a plurality of states and arcs representing transitions between states; each arc carries a state transition probability and an output probability for the input code. In this embodiment, each phoneme HMM represents the features of a phoneme with three states per phoneme, and is a continuous HMM in which the three states are connected in sequence and which has four loops. Although this embodiment uses HMMs with three states per phoneme, the present invention is not limited to this, and HMMs with any plural number of states per phoneme may be used.

[0024] The matching unit 5 calculates, along the time series, the distance between the input feature parameter data and each phoneme HMM, and takes as matching results those whose distance is no greater than a predetermined distance. In this way it performs phoneme matching, that is, matching by DP (dynamic programming), so-called DP matching, and outputs the phoneme sequence corresponding to the uttered speech and its speech section information to the storage buffer memory 11 for storage. In this embodiment, the matching unit 5 performs matching with the Viterbi algorithm using, for example, the known frame-synchronous One Pass DP method. It thereby outputs, along the time series, the top n candidates for the most likely phoneme sequence up to each frame, and when processing reaches the final frame it outputs the phoneme recognition result candidates for the entire input uttered speech.
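For readers unfamiliar with Viterbi decoding, here is a minimal log-domain sketch that finds the single best state path through one three-state left-to-right HMM and turns it into frame sections. It is not the frame-synchronous One Pass DP n-best search the text refers to; the probabilities, the per-frame scores, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def log(p: float) -> float:
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf

def viterbi(log_init: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> List[int]:
    """Most likely state path; log_emit[t][s] scores state s at frame t."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back: List[List[int]] = []
    for t in range(1, len(log_emit)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers from the last frame
        state = ptr[state]
        path.append(state)
    return path[::-1]

def sections(path: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Collapse a path into (state, start frame, end frame), frames from 1."""
    out: List[Tuple[int, int, int]] = []
    for frame, state in enumerate(path, start=1):
        if out and out[-1][0] == state:
            out[-1] = (state, out[-1][1], frame)
        else:
            out.append((state, frame, frame))
    return out

# Hypothetical left-to-right model and per-frame state likelihoods.
init = [log(p) for p in (1.0, 0.0, 0.0)]
trans = [[log(p) for p in row]
         for row in ((0.5, 0.5, 0.0), (0.0, 0.5, 0.5), (0.0, 0.0, 1.0))]
emit = [[log(p) for p in row]
        for row in ((0.8, 0.1, 0.1), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1),
                    (0.1, 0.7, 0.2), (0.1, 0.1, 0.8), (0.1, 0.2, 0.7))]
path = viterbi(init, trans, emit)
print(path)            # [0, 0, 1, 1, 2, 2]
print(sections(path))  # [(0, 1, 2), (1, 3, 4), (2, 5, 6)]
```

The start and end frames produced by `sections` play the role of the speech section information that the matching unit stores alongside each recognized unit.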

[0025] In learning mode, the storage buffer memory 11 stores the phoneme sequence obtained as the matching result, the state numbers of the matched HMMs, and the matching section of each recognized phoneme, that is, the corresponding speech section information. In recognition mode, it stores the phoneme sequence obtained as the matching result and the HMM state numbers.

[0026] For example, when recognition is performed using HMMs in which each phoneme consists of three states and "koNbaNwa" (こんばんわ, "good evening") is recognized as "kaNbaNna", the storage buffer memory 11 stores, as shown in Table 1 below, the phoneme sequence obtained as the recognition result, the HMM state numbers, and the start and end frame numbers that form the corresponding speech section information. Three independent state numbers are assigned to each phoneme when the phoneme HMMs are created, and at recognition time the three state numbers corresponding to each phoneme are stored. In the tables below, the start frame number is written as start FN and the end frame number as end FN.

【００２７】[0027]

[Table 1]
──────────────────────────────────────────────────────
Recognized phoneme   State numbers   Start FN   End FN
──────────────────────────────────────────────────────
k                    [1, 2, 3]       1          20
a                    [13, 14, 15]    21         35
N                    [7, 8, 9]       36         45
b                    [10, 11, 12]    46         50
a                    [13, 14, 15]    51         65
N                    [7, 8, 9]       66         80
n                    [19, 20, 21]    81         88
a                    [13, 14, 15]    89         100
──────────────────────────────────────────────────────

【００２８】[0028] In the present embodiment, the speech section information of each phoneme is expressed by the start frame number at which the phoneme begins and the end frame number at which it ends, among the frames obtained by dividing the input uttered speech into frame sections of a predetermined length. The recognized phoneme sequence and the correct phoneme sequence described later are each expressed, for each phoneme, as a tuple of the phoneme, its state numbers, the start frame number, and the end frame number.
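As a minimal sketch of the record format just described, the Python snippet below models one stored entry as a (phoneme, state numbers, start FN, end FN) tuple and loads the rows of Table 1. The names `PhonemeSegment`, `phoneme_string`, and `is_contiguous` are illustrative assumptions and do not appear in this description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhonemeSegment:
    """One entry of the storage buffer: phoneme, HMM states, frame span."""
    phoneme: str
    states: Tuple[int, ...]
    start_fn: int  # start frame number
    end_fn: int    # end frame number


# Rows of Table 1 ("koNbaNwa" recognized as "kaNbaNna").
TABLE_1: List[PhonemeSegment] = [
    PhonemeSegment("k", (1, 2, 3), 1, 20),
    PhonemeSegment("a", (13, 14, 15), 21, 35),
    PhonemeSegment("N", (7, 8, 9), 36, 45),
    PhonemeSegment("b", (10, 11, 12), 46, 50),
    PhonemeSegment("a", (13, 14, 15), 51, 65),
    PhonemeSegment("N", (7, 8, 9), 66, 80),
    PhonemeSegment("n", (19, 20, 21), 81, 88),
    PhonemeSegment("a", (13, 14, 15), 89, 100),
]


def phoneme_string(segments: List[PhonemeSegment]) -> str:
    """Concatenate the phoneme labels into a single string."""
    return "".join(seg.phoneme for seg in segments)


def is_contiguous(segments: List[PhonemeSegment]) -> bool:
    """Check that each segment starts one frame after the previous one ends."""
    return all(b.start_fn == a.end_fn + 1 for a, b in zip(segments, segments[1:]))


print(phoneme_string(TABLE_1))
print(is_contiguous(TABLE_1))
```

Keeping the frame span in every record is what later allows recognized and correct sequences to be compared section by section.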

【００２９】[0029] In the learning mode, the error extraction unit 6 operates on a learning utterance whose correct phoneme sequence and speech sections are known. Based on the phoneme sequence and speech section information produced by the collation unit 5 for that utterance, it compares the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections. Where the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, it extracts the pair consisting of the recognized erroneous phoneme sequence and the correct phoneme sequence. The extracted pairs are then stored, together with their HMM state numbers, in the error tendency table memory 12 as an error tendency table.

【００３０】[0030] The error tendency table stores the correct phoneme sequences and erroneous phoneme sequences extracted by the above processing of the error extraction unit 6, together with their HMM state numbers. As an example, consider the case where the correct phoneme sequence is "koNbaNwa" and the data stored in advance in the storage buffer memory 11, before learning-mode processing, contains the correct phoneme sequence, its HMM state numbers, and its speech section information, as shown in Table 2.

【００３１】[0031]

[Table 2]
──────────────────────────────────────────────────────
Correct phoneme      State numbers   Start FN   End FN
──────────────────────────────────────────────────────
k                    [1, 2, 3]       1          20
o                    [4, 5, 6]       21         35
N                    [7, 8, 9]       36         45
b                    [10, 11, 12]    46         50
a                    [13, 14, 15]    51         65
N                    [7, 8, 9]       66         80
w                    [16, 17, 18]    81         88
a                    [13, 14, 15]    89         100
──────────────────────────────────────────────────────

【００３２】[0032] Further, suppose that, in the recognition mode, the operator utters "koNbaNwa" into the microphone 1 and the collation result from the collation unit 5 is stored in the storage buffer memory 11 in the same format as in the previous example, as shown in Table 3 below.

【００３３】[0033]

[Table 3]
──────────────────────────────────────────────────────
Recognized phoneme   State numbers   Start FN   End FN
──────────────────────────────────────────────────────
k                    [1, 2, 3]       1          20
a                    [13, 14, 15]    21         35
N                    [7, 8, 9]       36         45
b                    [10, 11, 12]    46         50
a                    [13, 14, 15]    51         65
N                    [7, 8, 9]       66         80
n                    [19, 20, 21]    81         88
a                    [13, 14, 15]    89         100
──────────────────────────────────────────────────────

【００３４】[0034] When the error extraction unit 6 compares the correct phoneme sequence of Table 2 with the recognized phoneme sequence of Table 3, the differing portions are extracted as erroneous phoneme sequences, as shown in Table 4 below, and stored in the error tendency table memory 12. The error tendency table contains pairs, each consisting of a correct phoneme sequence with its phoneme HMM state numbers and an erroneous phoneme sequence with its phoneme HMM state numbers.

【００３５】[0035]

[Table 4]
────────────────────────────────────────────────────────────────
Correct phoneme sequence and its state numbers → Erroneous phoneme sequence and its state numbers
────────────────────────────────────────────────────────────────
o, N, [4, 5, 6, 7, 8, 9] → a, N, [13, 14, 15, 7, 8, 9]
w, [9, 10, 11] → n, [19, 20, 21]
────────────────────────────────────────────────────────────────
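The comparison of Tables 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the correct and recognized segments are assumed to share identical frame spans (as they do in Tables 2 and 3), and the sketch records only the phoneme that differs, with state numbers taken from Tables 2 and 3. Table 4 additionally lists the following "N" with the first pair; the rule for including such context is not stated in this passage, so it is not reproduced. The function name `extract_error_pairs` is hypothetical.

```python
from typing import Dict, List, Tuple

# (phoneme, state numbers, start FN, end FN)
Segment = Tuple[str, Tuple[int, ...], int, int]
# (phoneme, state numbers)
Labelled = Tuple[str, Tuple[int, ...]]

CORRECT: List[Segment] = [  # Table 2
    ("k", (1, 2, 3), 1, 20), ("o", (4, 5, 6), 21, 35),
    ("N", (7, 8, 9), 36, 45), ("b", (10, 11, 12), 46, 50),
    ("a", (13, 14, 15), 51, 65), ("N", (7, 8, 9), 66, 80),
    ("w", (16, 17, 18), 81, 88), ("a", (13, 14, 15), 89, 100),
]
RECOGNIZED: List[Segment] = [  # Table 3
    ("k", (1, 2, 3), 1, 20), ("a", (13, 14, 15), 21, 35),
    ("N", (7, 8, 9), 36, 45), ("b", (10, 11, 12), 46, 50),
    ("a", (13, 14, 15), 51, 65), ("N", (7, 8, 9), 66, 80),
    ("n", (19, 20, 21), 81, 88), ("a", (13, 14, 15), 89, 100),
]


def extract_error_pairs(
    correct: List[Segment], recognized: List[Segment]
) -> List[Tuple[Labelled, Labelled]]:
    """Return (correct, erroneous) pairs for segments that cover the same
    frames but carry different phoneme labels."""
    by_section: Dict[Tuple[int, int], Labelled] = {
        (start, end): (phoneme, states)
        for phoneme, states, start, end in correct
    }
    pairs = []
    for phoneme, states, start, end in recognized:
        reference = by_section.get((start, end))
        if reference is not None and reference[0] != phoneme:
            pairs.append((reference, (phoneme, states)))
    return pairs


for right, wrong in extract_error_pairs(CORRECT, RECOGNIZED):
    print(right, "->", wrong)
```

Real recognizer output would rarely have frame boundaries that coincide exactly with the reference, so a practical version would need an overlap-based alignment instead of the exact-span lookup used here.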

【００３６】[0036] The error correction processing unit 7 operates on the phoneme sequence and speech section information produced by the collation unit 5 for the input uttered speech to be recognized. Referring to the error tendency table, it compares the recognized phoneme sequence with the erroneous phoneme sequences in the table. When an erroneous phoneme sequence is detected within the recognized phoneme sequence, it replaces that portion with the correct phoneme sequence corresponding to the erroneous one. The resulting error-corrected phoneme sequence is added to the phoneme sequence obtained as the collation result, and the whole set is output to the morphological analysis unit 8 as phoneme recognition result candidates.

【００３７】[0037] For example, suppose the error tendency table shown in Table 4 was created during learning and the phoneme sequence output by the collation unit 5 is "kaNnichina". The collation result contains the phoneme sequence "aN" and the phoneme "n", which are identical to erroneous phoneme sequences in the error tendency table. The error correction processing unit 7 therefore refers to the error tendency table and adds to the collation result, that is, to the phoneme recognition result, the candidates shown in Table 5 below, in which these phoneme sequences or phonemes are replaced with the correct ones.

【００３８】[0038]

[Table 5]
──────────────
"koNnichina"
"kaNnichiwa"
"koNnichiwa"
──────────────
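A rough Python sketch of this candidate generation is given below. It is an assumption-laden illustration, not the device's actual procedure: phonemes are matched by label only, "ch" is treated as one phoneme, and every combination of matching sites is enumerated. Because "n" also occurs in "ni", the sketch produces a superset of Table 5; this passage does not say how the matching sites are narrowed (state numbers or speech sections are plausible filters), so no such filter is applied. The names `find_sites` and `correction_candidates` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# (erroneous phoneme sequence, correct phoneme sequence), from Table 4
Rule = Tuple[Tuple[str, ...], Tuple[str, ...]]
# (start index, length of erroneous sequence, replacement)
Site = Tuple[int, int, Tuple[str, ...]]

ERROR_TABLE: List[Rule] = [
    (("a", "N"), ("o", "N")),
    (("n",), ("w",)),
]


def find_sites(phonemes: Sequence[str], table: List[Rule]) -> List[Site]:
    """Locate every position where an erroneous sequence occurs."""
    sites = []
    for wrong, right in table:
        n = len(wrong)
        for i in range(len(phonemes) - n + 1):
            if tuple(phonemes[i:i + n]) == wrong:
                sites.append((i, n, right))
    return sorted(sites)


def correction_candidates(phonemes: Sequence[str], table: List[Rule]) -> List[str]:
    """Apply every non-empty, non-overlapping combination of replacements."""
    sites = find_sites(phonemes, table)
    results: List[str] = []
    for r in range(1, len(sites) + 1):
        for chosen in combinations(sites, r):
            # Sites are sorted by start index, so checking neighbours suffices.
            if any(a[0] + a[1] > b[0] for a, b in zip(chosen, chosen[1:])):
                continue
            out: List[str] = []
            pos = 0
            for start, length, right in chosen:
                out.extend(phonemes[pos:start])
                out.extend(right)
                pos = start + length
            out.extend(phonemes[pos:])
            text = "".join(out)
            if text not in results:
                results.append(text)
    return results


recognized = ["k", "a", "N", "n", "i", "ch", "i", "n", "a"]  # "kaNnichina"
for candidate in correction_candidates(recognized, ERROR_TABLE):
    print(candidate)
```

The three candidates of Table 5 are among the strings printed; the extra ones (for example those that replace the "n" of "ni") would be rejected later by the dictionary-based analysis or by a stricter site filter.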

【００３９】[0039] When the input uttered speech consists of one sentence, the morphological analysis unit 8 performs morphological analysis on the phoneme recognition result candidates output from the error correction processing unit 7, referring to the morpheme dictionary in the morpheme dictionary memory 13 and the N-gram dictionary in the N-gram dictionary memory 14, and outputs the speech recognition result that is optimal as one sentence. Specifically, the morphological analysis unit 8 first converts the recognition result candidates, which are expressed as phoneme sequences, into hiragana notation. Next, to judge whether each hiragana result forms a valid sentence, it performs morphological analysis on each recognition result candidate with reference to the morpheme dictionary, which stores the words to be recognized in hiragana notation together with their part-of-speech names. Several morphological analysis results may be possible for a single input sentence. To handle this, the occurrence probabilities of adjacent words or parts of speech are examined in advance and stored in the N-gram dictionary memory 14 as the N-gram dictionary. Referring to this N-gram dictionary, the morphological analysis unit 8 outputs, as the most probable morphological analysis result, the sentence whose adjacent words or parts of speech are most likely among the morphological analysis results. Conversely, when an incorrect sentence is input to the morphological analysis unit 8, the analysis fails partway, either because the morpheme dictionary contains no corresponding word or because words or parts of speech that cannot be adjacent appear next to each other.
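The selection among several morphological analysis results can be illustrated with a small bigram scorer in Python. This is a sketch under assumptions: the part-of-speech bigram probabilities in `BIGRAM` are invented for illustration, the sentence-boundary markers `<s>` and `</s>` are a common convention rather than something stated here, and an unseen adjacency is treated as an analysis failure.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical part-of-speech bigram probabilities (illustrative values only).
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "interjection"): 0.2,
    ("interjection", "</s>"): 0.6,
    ("<s>", "noun"): 0.5,
    ("noun", "particle"): 0.7,
    ("particle", "</s>"): 0.1,
}


def log_score(tags: Sequence[str], bigram: Dict[Tuple[str, str], float]) -> Optional[float]:
    """Sum of log bigram probabilities, or None if any adjacency is impossible."""
    sequence = ["<s>", *tags, "</s>"]
    total = 0.0
    for prev, cur in zip(sequence, sequence[1:]):
        p = bigram.get((prev, cur), 0.0)
        if p <= 0.0:
            return None  # adjacency never observed: the analysis fails
        total += math.log(p)
    return total


def best_analysis(
    analyses: List[List[str]], bigram: Dict[Tuple[str, str], float]
) -> Optional[List[str]]:
    """Pick the analysis with the highest bigram score, if any succeeds."""
    scored = [(log_score(tags, bigram), tags) for tags in analyses]
    valid = [(score, tags) for score, tags in scored if score is not None]
    if not valid:
        return None
    return max(valid, key=lambda item: item[0])[1]


analyses = [["noun", "particle"], ["interjection"], ["particle", "noun"]]
print(best_analysis(analyses, BIGRAM))
```

Working in log probabilities avoids numerical underflow on longer sentences while preserving the ranking.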

【００４０】[0040] For example, consider the case where the following four recognition result candidates are input to the morphological analysis unit 8 as the recognition results output from the error correction processing unit 7:
(A1) "kaNnichina"
(A2) "koNnichina"
(A3) "kaNnichiwa"
(A4) "koNnichiwa"
In this case, recognition result candidates (A1), (A2), and (A3) fail in the morphological analysis unit 8, and only recognition result candidate (A4) yields an analysis result, "こんにちわ" (interjection), which the morphological analysis unit 8 outputs.
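The dictionary check in this example can be sketched as a simple segmentation routine in Python. The sketch makes several assumptions: the dictionary is keyed by phoneme strings rather than hiragana (the hiragana conversion step is skipped), its two entries are hypothetical, and "analysis" is reduced to asking whether the whole candidate can be split into dictionary words.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical morpheme dictionary: phoneme string -> part of speech.
MORPHEME_DICT: Dict[str, str] = {
    "koNnichiwa": "interjection",
    "koNbaNwa": "interjection",
}

CANDIDATES = {
    "A1": "kaNnichina",
    "A2": "koNnichina",
    "A3": "kaNnichiwa",
    "A4": "koNnichiwa",
}


def segment(text: str, dictionary: Dict[str, str]) -> Optional[List[Tuple[str, str]]]:
    """Split text into dictionary words by dynamic programming.

    Returns a list of (word, part of speech) pairs, or None when no
    segmentation exists, which corresponds to an analysis failure.
    """
    best: Dict[int, List[Tuple[str, str]]] = {0: []}
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if start in best and word in dictionary:
                best[end] = best[start] + [(word, dictionary[word])]
                break
    return best.get(len(text))


for label, candidate in CANDIDATES.items():
    print(label, segment(candidate, MORPHEME_DICT))
```

With this toy dictionary only (A4) is accepted, matching the outcome described above; a fuller dictionary would need the N-gram check to choose among competing segmentations.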

【００４１】[0041] The results of a simulation carried out by the present inventors using the speech recognition device of Fig. 1 are described next. The input to the morphological analysis unit 8 in the speech recognition device that includes the error correction processing unit 7 of the present invention is a set of recognition result candidates that includes results considered to be correct. To evaluate this error correction method, it was examined how efficiently the correct phoneme sequence is included among the recognition result candidates input to the morphological analysis unit 8. Compared with the case where no error correction is performed and the top n recognition result candidates are input to the morphological analysis, the same number of candidates contained 1.33 times as many correct phoneme sequence candidates. Compared with the first conventional example, the same number of candidates contained 1.24 times as many correct phoneme sequences. This shows that the error correction method of the present invention has better correction efficiency than the conventional method. Consequently, errors in phoneme recognition can be corrected more reliably, and a higher speech recognition rate can be obtained than with the conventional example.
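One plausible way to compute such a ratio is sketched below in Python. This is an interpretation, not the inventors' evaluation procedure: it assumes the figure of merit is the fraction of utterances whose candidate list (of equal size in both conditions) contains the correct phoneme sequence. All utterances and candidate lists are toy data invented for illustration and are unrelated to the reported 1.33 and 1.24 figures.

```python
from typing import Sequence


def inclusion_rate(
    candidate_lists: Sequence[Sequence[str]], references: Sequence[str]
) -> float:
    """Fraction of utterances whose candidate list contains the reference."""
    if len(candidate_lists) != len(references):
        raise ValueError("one candidate list is required per reference")
    hits = sum(1 for cands, ref in zip(candidate_lists, references) if ref in cands)
    return hits / len(references)


# Toy data (hypothetical): four utterances, two candidates each.
REFERENCES = ["koNnichiwa", "koNbaNwa", "arigatou", "sayonara"]
WITHOUT_CORRECTION = [
    ["kaNnichina", "kaNnichima"],
    ["koNbaNwa", "koNbaNna"],
    ["arigatou", "arigatoo"],
    ["sayanara", "sayanana"],
]
WITH_CORRECTION = [
    ["kaNnichina", "koNnichiwa"],
    ["koNbaNwa", "koNbaNna"],
    ["arigatou", "arigatoo"],
    ["sayanara", "sayanana"],
]

baseline = inclusion_rate(WITHOUT_CORRECTION, REFERENCES)
corrected = inclusion_rate(WITH_CORRECTION, REFERENCES)
print(baseline, corrected, corrected / baseline)
```

Holding the number of candidates per utterance fixed in both conditions is what makes the two rates comparable.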

【００４２】[0042] As described above, according to this embodiment, error tendencies can be extracted independently of the HMM unit, and as a result the correction efficiency is higher than in the conventional example. Furthermore, because recognition results are corrected using the error tendency table, highly accurate correction is possible even for words and sentences different from those used in error learning. Errors in phoneme recognition can therefore be corrected more reliably, and a higher speech recognition rate can be obtained than with the conventional example.

【００４３】[0043] Although the above embodiment is provided with the morphological analysis unit 8, the present invention is not limited to this, and a word analysis unit may be provided instead. When the input uttered speech consists of one word, the word analysis unit performs word analysis on the phoneme recognition result candidates output from the error correction processing unit 7, referring to a predetermined word dictionary, and outputs the speech recognition result that is optimal as one word.

【００４４】[0044] The above embodiment shows the case where each phoneme corresponds one-to-one to a set of three state numbers. The present invention is not limited to this, however, and the recognition result sequence, the correct sequence, and the error sequence may be expressed as state sequences written with HMM state numbers, state symbols, or the like. To make speech recognition more reliable, the same phoneme may be given different states depending on the phonemic environment before and after it. For example, the phoneme "N" may in one case be represented by the state number sequence [7, 8, 9] and in another by a different state number sequence [7, 32, 33]. To handle this, the recognition result sequence, the correct sequence, and the error sequence are expressed as state sequences written with HMM state numbers or state symbols. Because this modification represents phonemes in a more detailed form, it can improve the correction efficiency and perform speech recognition more reliably than the device of claim 1, which uses phoneme sequences. When sequences of HMM state numbers are used in this modification, Tables 1 to 4 contain no phoneme sequences, only sequences of state numbers. In this case as well, the phoneme HMM may represent one phoneme with a number of states other than three.
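The difference between matching on phoneme labels and matching on state numbers can be sketched in Python as follows. This is a minimal illustration under assumptions: the erroneous pattern is the state sequence [13, 14, 15, 7, 8, 9] listed for "a, N" in Table 4, the second input uses the alternative "N" states [7, 32, 33] mentioned above, and the function name `find_state_matches` is hypothetical.

```python
from typing import List, Sequence

# Erroneous state sequence for "a, N" as listed in Table 4.
ERROR_STATES = (13, 14, 15, 7, 8, 9)


def find_state_matches(states: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Return every start index at which pattern occurs inside states."""
    n = len(pattern)
    target = tuple(pattern)
    return [
        i for i in range(len(states) - n + 1)
        if tuple(states[i:i + n]) == target
    ]


# "k a N" with N realized as [7, 8, 9]: the error pattern is present.
variant_1 = [1, 2, 3, 13, 14, 15, 7, 8, 9]
# "k a N" with N realized as [7, 32, 33]: same phoneme labels, different states.
variant_2 = [1, 2, 3, 13, 14, 15, 7, 32, 33]

print(find_state_matches(variant_1, ERROR_STATES))
print(find_state_matches(variant_2, ERROR_STATES))
```

Both inputs read "kaN" at the phoneme level, yet only the first matches the stored error pattern, which is the finer discrimination this modification relies on.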

【００４５】[0045]

[Effects of the Invention] As described in detail above, the speech recognition device according to claim 1 of the present invention comprises: collating means that performs phoneme matching on the input uttered speech using phoneme hidden Markov models and outputs a recognized phoneme sequence corresponding to the uttered speech and its speech section information; error extracting means that, based on the phoneme sequence and speech section information obtained as the collation result of the collating means for a learning utterance whose correct phoneme sequence and speech sections are known, compares the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections and, when the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, extracts the pair of the recognized erroneous phoneme sequence and the correct phoneme sequence; storage means that stores the pairs of erroneous and correct phoneme sequences extracted by the error extracting means as an error tendency table; and error correction processing means that, based on the phoneme sequence and speech section information obtained as the collation result of the collating means for the input uttered speech to be recognized, refers to the error tendency table stored by the storage means, compares the recognized phoneme sequence with the erroneous phoneme sequences in the error tendency table and, when an erroneous phoneme sequence is detected within the recognized phoneme sequence, replaces it with the correct phoneme sequence corresponding to that erroneous phoneme sequence, adds the resulting error-corrected phoneme sequence to the phoneme sequence obtained as the collation result of the collating means, and outputs the result as phoneme recognition result candidates.

【００４６】[0046] Error tendencies can therefore be extracted independently of the HMM unit. As a result, the correction efficiency is higher than in the conventional example, and highly accurate correction is possible even for words and sentences different from those used in error learning. Errors in phoneme recognition can thus be corrected more reliably, and a higher speech recognition rate can be obtained than with the conventional example.

【００４７】[0047] The speech recognition device according to claim 2 of the present invention comprises: collating means that performs phoneme matching on the input uttered speech using phoneme hidden Markov models and outputs a recognized state sequence corresponding to the uttered speech and its speech section information; error extracting means that, based on the state sequence and speech section information obtained as the collation result of the collating means for a learning utterance whose correct state sequence and speech sections are known, compares the recognized state sequence and its speech section information with the correct state sequence and its speech sections and, when the recognized state sequence differs from the correct state sequence in mutually corresponding speech sections, extracts the pair of the recognized erroneous state sequence and the correct state sequence; storage means that stores the pairs of erroneous and correct state sequences extracted by the error extracting means as an error tendency table; and error correction processing means that, based on the state sequence and speech section information obtained as the collation result of the collating means for the input uttered speech to be recognized, refers to the error tendency table stored by the storage means, compares the recognized state sequence with the erroneous state sequences in the error tendency table and, when an erroneous state sequence is detected within the recognized state sequence, replaces it with the correct state sequence corresponding to that erroneous state sequence, adds the resulting error-corrected state sequence to the state sequence obtained as the collation result of the collating means, and outputs the result as phoneme recognition result candidates.

【００４８】[0048] Error tendencies can therefore be extracted independently of the HMM unit. As a result, the correction efficiency is higher than in the conventional example, and highly accurate correction is possible even for words and sentences different from those used in error learning. Errors in phoneme recognition can thus be corrected more reliably, and a higher speech recognition rate can be obtained than with the conventional example. Moreover, because phonemes are represented in a more detailed form in this case, the correction efficiency can be improved and speech recognition can be performed more reliably than with the speech recognition device of claim 1, which uses phoneme sequences.

【００４９】[0049] In the speech recognition device according to claim 3, the input uttered speech consists of one sentence, and the device further comprises morphological analysis means that performs morphological analysis on the phoneme recognition result candidates output from the error correction processing means, referring to a predetermined morpheme dictionary, and outputs the speech recognition result that is optimal as one sentence. Speech recognition of uttered sentences can thereby be performed with a higher speech recognition rate than in the conventional example.

【００５０】[0050] In the speech recognition device according to claim 4, the input uttered speech consists of one word, and the device further comprises word analysis means that performs word analysis on the phoneme recognition result candidates output from the error correction processing means, referring to a predetermined word dictionary, and outputs the speech recognition result that is optimal as one word. Speech recognition of uttered words can thereby be performed with a higher speech recognition rate than in the conventional example.

[Brief description of drawings]

[Fig. 1] A block diagram of a speech recognition device according to an embodiment of the present invention.

[Explanation of symbols]

1 … microphone, 2 … A/D converter, 3 … feature extraction unit, 4 … buffer memory, 5 … acoustic parameter collation unit, 6 … recognition error phoneme sequence extraction unit, 7 … result candidate error correction processing unit, 8 … morphological analysis unit, 10 … phoneme HMM memory, 11 … collation result storage buffer memory, 12 … error tendency table memory, 13 … morpheme dictionary memory, 14 … N-gram dictionary memory.

Continuation of front page
(72) Inventor: Harald Singer, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto Prefecture
(72) Inventor: Yoshinori Sagisaka (匂坂芳典), c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto Prefecture

Claims

[Claims]

1. A speech recognition device comprising: collating means for performing phoneme matching on input uttered speech using phoneme hidden Markov models and outputting a recognized phoneme sequence corresponding to the uttered speech and its speech section information; error extracting means for, based on the phoneme sequence and speech section information obtained as the collation result of the collating means for a learning utterance whose correct phoneme sequence and speech sections are known, comparing the recognized phoneme sequence and its speech section information with the correct phoneme sequence and its speech sections and, when the recognized phoneme sequence differs from the correct phoneme sequence in mutually corresponding speech sections, extracting a pair of the recognized erroneous phoneme sequence and the correct phoneme sequence; storage means for storing the pairs of erroneous and correct phoneme sequences extracted by the error extracting means as an error tendency table; and error correction processing means for, based on the phoneme sequence and speech section information obtained as the collation result of the collating means for input uttered speech to be recognized, referring to the error tendency table stored by the storage means, comparing the recognized phoneme sequence with the erroneous phoneme sequences in the error tendency table and, when an erroneous phoneme sequence is detected within the recognized phoneme sequence, replacing it with the correct phoneme sequence corresponding to that erroneous phoneme sequence, adding the resulting error-corrected phoneme sequence to the phoneme sequence obtained as the collation result of the collating means, and outputting the result as phoneme recognition result candidates.

2. A speech recognition device comprising: collating means for performing phoneme matching on input uttered speech using phoneme hidden Markov models and outputting a recognized state sequence corresponding to the uttered speech and its speech section information; error extracting means for, based on the state sequence and speech section information obtained as the collation result of the collating means for a learning utterance whose correct state sequence and speech sections are known, comparing the recognized state sequence and its speech section information with the correct state sequence and its speech sections and, when the recognized state sequence differs from the correct state sequence in mutually corresponding speech sections, extracting a pair of the recognized erroneous state sequence and the correct state sequence; storage means for storing the pairs of erroneous and correct state sequences extracted by the error extracting means as an error tendency table; and error correction processing means for, based on the state sequence and speech section information obtained as the collation result of the collating means for input uttered speech to be recognized, referring to the error tendency table stored by the storage means, comparing the recognized state sequence with the erroneous state sequences in the error tendency table and, when an erroneous state sequence is detected within the recognized state sequence, replacing it with the correct state sequence corresponding to that erroneous state sequence, adding the resulting error-corrected state sequence to the state sequence obtained as the collation result of the collating means, and outputting the result as phoneme recognition result candidates.

3. The speech recognition device according to claim 1 or 2, wherein the input uttered speech consists of one sentence, the device further comprising morphological analysis means for performing morphological analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined morpheme dictionary, and outputting the speech recognition result that is optimal as one sentence.

4. The speech recognition device according to claim 1, wherein the input uttered speech consists of one word, the device further comprising word analysis means for performing word analysis on the phoneme recognition result candidates output from the error correction processing means, with reference to a predetermined word dictionary, and outputting the speech recognition result that is optimal as one word.

5. The speech recognition device according to claim 1, 3, or 4, wherein the error tendency table includes pairs each consisting of a correct phoneme sequence with its phoneme hidden Markov model state numbers and an erroneous phoneme sequence with its phoneme hidden Markov model state numbers.

6. The speech recognition device according to claim 1, wherein the phoneme hidden Markov model is a hidden Markov model in which one phoneme is composed of a plurality of states.

7. The speech recognition device according to claim 1, wherein the speech section information is expressed by the start frame number at which the phoneme begins and the end frame number, among a plurality of frames obtained by dividing the input uttered speech into frame sections of a predetermined length.

8. The speech recognition device according to claim 7 as dependent on claim 2 or 5, wherein the phoneme sequence obtained as the collation result of the collating means and the correct phoneme sequence are each expressed, for each phoneme, as a tuple of the phoneme, its state numbers, the start frame number, and the end frame number.