JPH08314490A

JPH08314490A - Word spotting type speech recognition method and device

Info

Publication number: JPH08314490A
Application number: JP7123469A
Authority: JP
Inventors: Toru Imai (今井 亨); Akio Ando (安藤 彰男)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1995-05-23
Filing date: 1995-05-23
Publication date: 1996-11-29

Abstract

(57) [Abstract]
[Purpose] To provide a word-spotting speech recognition method and device with few false alarms (spurious keyword detections), high recognition accuracy, and high-speed recognition.
[Structure] Vowel candidates at each time of the input speech are obtained using standard vowel patterns and matched against the vowel sequences in a keyword pronunciation dictionary to detect the intervals of the input speech in which keywords are present (2). The likelihood of the keyword in each detected presence interval is obtained by a hidden Markov model (3), and a speech recognition result is output from the detected keyword presence intervals and the obtained likelihoods (4).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method and device, and in particular to a word-spotting speech recognition method and device that perform speech recognition by detecting keywords in input speech.

[0002]

[Prior Art] Methods such as the following (a) and (b) have been proposed for conventional word-spotting speech recognition devices.
(a) A method that uses only hidden Markov models (see Akihiro Imamura, "Spotting of Telephone Speech by HMM (Hidden Markov Model)", IEICE Technical Report SP90-18, 1990).
(b) A method that combines word preselection by a phoneme-recognition network with word determination by DP matching (for DP matching see, for example, S. Nakagawa, "Speech Recognition by Probabilistic Models", IEICE, p. 18, 1988) (Hiromi Ogura, "Word Preselection and Large-Vocabulary Word Speech Recognition Using Phoneme Posterior Probability Time Series", IEICE Technical Report SP94-87, 1995).

[0003]

[Problems to be Solved by the Invention] In conventional method (a) above, every span from an arbitrary frame to an arbitrary later frame of the input speech is hypothesized as a keyword presence interval, and the likelihood is obtained by the hidden Markov model alone. The amount of computation is therefore large, and keywords are often detected at places where they were not actually spoken (false alarms).

[0004] In conventional method (b), it is difficult to recognize all phonemes, consonants included, with high accuracy, and the accuracy of the keyword likelihood computed by DP matching is also inferior to that of a hidden Markov model, particularly when speaker-independent recognition is assumed. The object of the present invention is therefore to eliminate these drawbacks of the conventional techniques and to provide a word-spotting speech recognition method and device with few false alarms, high recognition accuracy, and high-speed recognition.

[0005]

[Means for Solving the Problems] To achieve this object, the word-spotting speech recognition method of the present invention is characterized in that vowel candidates at each time of the input speech are obtained using standard vowel patterns, these candidates are matched against the vowel sequences in a keyword pronunciation dictionary to detect the intervals of the input speech in which keywords are present, the likelihood of the keyword in each detected presence interval is obtained by a hidden Markov model, and a speech recognition result is output from the detected keyword presence intervals and the obtained likelihoods.

[0006] The word-spotting speech recognition device of the present invention is characterized by comprising: an acoustic analysis unit that performs acoustic analysis of the input speech; a presence interval detection unit that recognizes the vowels of the input speech on the basis of the analysis result of the analysis unit and a keyword pronunciation dictionary provided in advance, and detects the presence interval of each keyword in the input speech; a likelihood calculation unit that obtains the likelihood of each keyword contained in the input speech on the basis of the analysis result, the keyword pronunciation dictionary, and the detected keyword presence intervals; and a recognition result output unit that outputs a keyword recognition result on the basis of the presence interval of each keyword and the obtained likelihood.

[0007]

[Embodiment] The present invention is described in detail below by way of an embodiment, with reference to the accompanying drawings. FIG. 1 shows one embodiment of a device used to carry out the word-spotting speech recognition method of the present invention. The speech recognition device comprises an acoustic analysis unit 1 that performs acoustic analysis of the input speech; a presence interval detection unit 2 that recognizes vowels on the basis of the analysis result and the keyword pronunciation dictionary and detects the presence interval of each keyword in the input speech; a likelihood calculation unit 3 that obtains the likelihood of each keyword on the basis of the analysis result, the keyword pronunciation dictionary, and the keyword presence intervals; and a recognition result output unit 4 that outputs a keyword recognition result on the basis of the presence interval and likelihood of each keyword.

[0008] As shown in FIG. 2, the acoustic analysis unit 1 A/D-converts the input speech with the A/D converter 11, divides it with the framer 12 into frames of about 20 ms each, and computes with the calculation unit 13 values such as the average power and the LPC (linear prediction) cepstrum coefficients (coefficients representing the frequency distribution) of each frame.
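The framing and average-power steps of the acoustic analysis unit can be sketched as follows. This is a minimal illustration, not the patented implementation: the 8 kHz sampling rate, the non-overlapping frames, and the function names `frame_signal` and `average_power` are assumptions, and the LPC cepstrum computation is left out.

```python
def frame_signal(samples, sample_rate, frame_ms=20.0):
    """Split samples into non-overlapping frames of about frame_ms milliseconds.

    Non-overlapping frames are an assumption; the text only gives the frame width.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def average_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


# Toy signal: 800 samples at an assumed 8 kHz rate, i.e. 160 samples per 20 ms frame.
signal = [1.0, -1.0] * 400
frames = frame_signal(signal, sample_rate=8000)
powers = [average_power(f) for f in frames]
```

In a fuller sketch each frame would also carry its LPC cepstrum coefficients, which are the features used by the later stages.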

[0009] As shown in FIG. 3, the presence interval detection unit 2 stores in advance, in the vowel standard pattern memory unit 21, standard values such as the cepstrum coefficients of the vowels (a, i, u, e, o, and the syllabic nasal N); these are the vowel standard patterns. The comparator 22 compares the vowel standard patterns with the analysis result of the input speech, and the vowel candidate determination unit 23 determines which vowel each frame of the input speech corresponds to. Rather than deciding on a single vowel for each frame, however, the closeness between each vowel's standard pattern and the input speech is converted into a likelihood, and several vowel candidates are obtained for each frame.
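The per-frame candidate selection can be sketched as follows. The text does not say how closeness is converted into a likelihood, so this sketch assumes a squared Euclidean distance mapped to a unit-variance Gaussian log-likelihood (up to a constant). The two-dimensional templates and the name `vowel_candidates` are hypothetical.

```python
def vowel_candidates(feature, templates, top_n=2):
    """Return the top_n (vowel, log_likelihood) pairs for one frame, best first."""
    scored = []
    for vowel, ref in templates.items():
        dist2 = sum((x - r) ** 2 for x, r in zip(feature, ref))
        # Assumed mapping: unit-variance Gaussian log-likelihood, constant dropped.
        scored.append((vowel, -0.5 * dist2))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 2-D standard patterns; real ones would be cepstrum vectors.
templates = {"a": [1.0, 0.0], "i": [0.0, 1.0], "u": [-1.0, 0.0]}
cands = vowel_candidates([0.8, 0.1], templates)
```

Keeping more than one candidate per frame lets the later dictionary matching recover from a wrong top choice.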

[0010] Next, whether the vowel sequence of each keyword to be recognized is contained in the input speech is checked against the keyword pronunciation dictionary. This is also done in the vowel candidate determination unit 23. The keyword pronunciation dictionary lists each keyword to be recognized in its ordinary mixed kanji-kana spelling together with its pronunciation written in Roman letters. For example, the pronunciation "s,u,p,oo,ts,u" is entered in advance for the keyword "スポーツ" (sports), and "s,u,m,oo" for the keyword "相撲" (sumo). When a keyword vowel sequence such as "u,oo,u" or "u,oo" is contained in the input speech, the unit outputs presence interval information for each keyword, that is, which keyword spans from which frame to which frame of the input speech.
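Deriving the vowel sequences from the dictionary entries can be sketched as follows. The two pronunciations are the ones given in the text; the set of vowel symbols (including long vowels such as "oo") and the names `PRONUNCIATIONS` and `vowel_sequence` are assumptions for illustration.

```python
# Assumed symbol inventory: short vowels, long vowels, and the syllabic nasal N.
VOWEL_SYMBOLS = {"a", "i", "u", "e", "o", "aa", "ii", "uu", "ee", "oo", "N"}

# Pronunciations as given in the text, keyed by keyword spelling.
PRONUNCIATIONS = {
    "スポーツ": ["s", "u", "p", "oo", "ts", "u"],  # sports
    "相撲": ["s", "u", "m", "oo"],                 # sumo
}


def vowel_sequence(phonemes):
    """Keep only the vowel symbols of a pronunciation, in order."""
    return [p for p in phonemes if p in VOWEL_SYMBOLS]


vowel_dict = {word: vowel_sequence(ph) for word, ph in PRONUNCIATIONS.items()}
```

The resulting sequences are the "u,oo,u" and "u,oo" patterns that are searched for in the input speech.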

[0011] When checking whether a keyword's vowel sequence is contained in the input speech, for example the vowel sequence "u,oo,u" of "sports", a stretch in which an extra vowel has been inserted, such as "u,oo,a,u", is also treated as a keyword presence interval. Naturally, more than one presence interval may be output for a single keyword. This operation is carried out for every keyword in the keyword pronunciation dictionary.
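The text does not specify the matching algorithm, so the following is only one possible sketch. It searches a sequence of vowel candidate sets for a keyword's vowel sequence and tolerates a bounded number of inserted vowels. The positions here are indices into the candidate sequence rather than frame numbers, and `find_intervals` and `max_insertions` are hypothetical names.

```python
def find_intervals(observed, target, max_insertions=1):
    """Find where `target` occurs in `observed`, tolerating inserted vowels.

    observed: list of sets, the vowel candidates at each position
    target:   list of vowel symbols for one keyword
    Returns sorted (start, end) index pairs, both inclusive.
    """
    if not target:
        return []
    found = set()
    for start, cands in enumerate(observed):
        if target[0] not in cands:
            continue
        # Each stack entry: (last matched position, next target index, insertions used)
        stack = [(start, 1, 0)]
        while stack:
            pos, t, used = stack.pop()
            if t == len(target):
                found.add((start, pos))
                continue
            for skip in range(max_insertions - used + 1):
                nxt = pos + 1 + skip
                if nxt < len(observed) and target[t] in observed[nxt]:
                    stack.append((nxt, t + 1, used + skip))
    return sorted(found)


# Vowel sequence from the recognition example in [0014], one candidate per position.
observed = [{v} for v in "a,oo,u,oo,u,a,N,u,i,a,i,a,u,a".split(",")]
sports = find_intervals(observed, ["u", "oo", "u"])

# An inserted vowel "a" is tolerated, as in the "u,oo,a,u" case.
with_insertion = find_intervals([{"u"}, {"oo"}, {"a"}, {"u"}], ["u", "oo", "u"])
```

Because all matches are collected, a keyword can yield several presence intervals, each of which is then scored separately.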

[0012] As shown in FIG. 4, the likelihood calculation unit 3 stores in advance, in a memory (ROM), phoneme-unit hidden Markov models for the vowels (a, i, u, e, o) and the consonants (p, t, k, ts, etc.). These are concatenated in series (32) according to the pronunciations in the keyword pronunciation dictionary, and hidden Markov models corresponding to keywords such as "sports" and "sumo" are built in block 33 (RAM). The unit receives the presence interval of each keyword from the presence interval detection unit 2, and the likelihood calculator 34 obtains the likelihood of that interval of the input speech using the hidden Markov model of the corresponding keyword.
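The concatenation and scoring steps can be sketched as follows, under strong simplifying assumptions that are not from the text: each phoneme model is a left-to-right chain of states with a scalar Gaussian output, the features are scalars rather than cepstrum vectors, and the likelihood is the Viterbi (best-path) log-likelihood. All model parameters and names are hypothetical.

```python
import math

# Hypothetical phoneme models: a list of states (mean, variance, self_loop_prob).
PHONEME_HMMS = {
    "s": [(0.0, 1.0, 0.5)],
    "u": [(2.0, 1.0, 0.5)],
    "m": [(-1.0, 1.0, 0.5)],
    "oo": [(4.0, 1.0, 0.7)],
}


def build_keyword_hmm(phonemes):
    """Concatenate phoneme models in series, following the pronunciation."""
    states = []
    for p in phonemes:
        states.extend(PHONEME_HMMS[p])
    return states


def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_likelihood(states, frames):
    """Best-path log-likelihood; the path starts in the first state and ends in the last."""
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gauss(frames[0], states[0][0], states[0][1])
    for x in frames[1:]:
        new = [neg_inf] * len(states)
        for j, (mean, var, loop) in enumerate(states):
            stay = score[j] + math.log(loop)
            move = score[j - 1] + math.log(1 - states[j - 1][2]) if j > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[j] = best + log_gauss(x, mean, var)
        score = new
    return score[-1]


segment = [0.1, 1.9, -1.2, 4.1, 3.8]  # toy features for one presence interval
sumo_hmm = build_keyword_hmm(["s", "u", "m", "oo"])
ll_sumo = viterbi_log_likelihood(sumo_hmm, segment)
ll_reversed = viterbi_log_likelihood(build_keyword_hmm(["oo", "m", "u", "s"]), segment)
```

Restricting the scoring to the detected presence interval is what keeps the computation small compared with hypothesizing every frame pair.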

[0013] As shown in FIG. 5, the recognition result output unit 4 receives the presence interval and likelihood of each keyword in the likelihood memory unit 41 and compares the likelihood, in the comparator 43, with a preset likelihood threshold from the threshold memory unit 42. If a keyword exceeds the threshold, that keyword is regarded as having been spoken and is output as the recognition result.
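The threshold comparison can be sketched as follows. The threshold of -13.0 and the likelihood of -12.0 for "sports" are taken from the recognition example in [0014]. The other likelihoods, the frame intervals, and the name `accept_keywords` are hypothetical.

```python
def accept_keywords(hypotheses, threshold):
    """Keep hypotheses whose likelihood exceeds the threshold, best first.

    hypotheses: list of (keyword, (start_frame, end_frame), likelihood)
    """
    accepted = [h for h in hypotheses if h[2] > threshold]
    return sorted(accepted, key=lambda h: h[2], reverse=True)


hypotheses = [
    ("sports", (30, 75), -12.0),  # likelihood from the recognition example
    ("ski", (30, 60), -15.5),     # hypothetical value
    ("sumo", (30, 62), -14.0),    # hypothetical value
]
result = accept_keywords(hypotheses, threshold=-13.0)
```

With these values only "sports" survives the threshold, and the first element of the returned list is the keyword with the largest likelihood.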

[0014] A recognition example is described with reference to FIG. 6. Suppose speech such as "あのー、スポーツ番組ありますか" ("Um, are there any sports programs?") is input. The presence interval detection unit 2 first recognizes the vowels and obtains a vowel sequence such as "a,oo,u,oo,u,a,N,u,i,a,i,a,u,a". Next, the vowel sequences of keywords such as "sports (s,u,p,oo,ts,u)", "ski (s,u,k,ii)", and "sumo (s,u,m,oo)" are detected as matching parts of the input speech, and these parts are passed to the likelihood calculation unit 3 as keyword presence intervals. The likelihood calculation unit 3 obtains the likelihoods of "sports (s,u,p,oo,ts,u)", "ski (s,u,k,ii)", and "sumo (s,u,m,oo)" using hidden Markov models. The recognition result output unit 4 outputs "sports" as the keyword recognition result: it has the largest likelihood, and its likelihood of -12.0 exceeds the preset threshold, for example -13.0.

[0015] The effectiveness of the present invention was verified by a recognition experiment. The vowel standard patterns and the hidden Markov models were first trained on a commercially available female speech database and then adapted to the speakers using 50 sentences from two female speakers. LPC cepstrum coefficients were used as the features for both the vowel standard patterns and the hidden Markov models. Twenty keywords were registered in the keyword pronunciation dictionary. The same two female speakers uttered 48 sentences, each containing one keyword, and these were recognized by the recognition device of the present invention. The recognition rate was 94% (compared with 30% for conventional method (a)), there were no false alarms, and processing finished in nearly real time.

[0016] The present invention has been described above by way of one embodiment, but it is not limited to this embodiment, and it will be obvious to those skilled in the art that various modifications and changes are possible within the gist of the invention. For example, the likelihood 24 shown in FIG. 3 may instead be a distance obtained by comparing the vowel pattern of the analyzed input speech with the vowel standard patterns. The device that carries out the method of the present invention may also be implemented as programmed computer software rather than the hardware configuration described above.

[0017]

[Effects of the Invention] According to the method and device of the present invention, the presence intervals of keywords in the input speech are first obtained using vowel standard patterns, and the keyword likelihoods are then obtained using hidden Markov models. Compared with conventional word-spotting speech recognition devices, this provides a speech recognition method and device with fewer false alarms, higher recognition accuracy, and high-speed recognition.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of a device for carrying out the method of the present invention.

FIG. 2 is a diagram showing the configuration of the acoustic analysis unit shown in FIG. 1.

FIG. 3 is a diagram showing the configuration of the presence interval detection unit shown in FIG. 1.

FIG. 4 is a diagram showing the configuration of the likelihood calculation unit shown in FIG. 1.

FIG. 5 is a diagram showing the configuration of the recognition result output unit shown in FIG. 1.

FIG. 6 is a diagram for explaining a recognition example.

[Explanation of symbols]

1 Acoustic analysis unit
2 Presence interval detection unit
3 Likelihood calculation unit
4 Recognition result output unit
11 A/D converter
12 Framer
13 Power and LPC cepstrum coefficient calculation unit
21 Vowel standard pattern memory unit
22 Comparator
23 Vowel candidate determination unit
24 Likelihood
31 Phoneme-unit hidden Markov model memory unit
32 Concatenation
33 Keyword-unit hidden Markov model
34 Likelihood calculator
41 Likelihood memory unit
42 Threshold memory unit
43 Comparator

Claims

[Claims]

1. A word-spotting speech recognition method, characterized in that vowel candidates at each time of an input speech are obtained using standard vowel patterns; the vowel candidates are matched against the vowel sequences in a keyword pronunciation dictionary to detect the presence intervals of keywords in the input speech; the likelihood of the keyword in each detected presence interval is obtained by a hidden Markov model; and a speech recognition result is output from the detected keyword presence intervals and the obtained likelihoods.

2. A word-spotting speech recognition device, characterized by comprising: an acoustic analysis unit that performs acoustic analysis of an input speech; a presence interval detection unit that recognizes the vowels of the input speech on the basis of the analysis result of the analysis unit and a keyword pronunciation dictionary provided in advance, and detects the presence interval of each keyword in the input speech; a likelihood calculation unit that obtains the likelihood of each keyword contained in the input speech on the basis of the analysis result, the keyword pronunciation dictionary, and the detected keyword presence intervals; and a recognition result output unit that outputs a keyword recognition result on the basis of the presence interval of each keyword and the obtained likelihood.