JP4296290B2

JP4296290B2 - Speech recognition apparatus, speech recognition method and program

Info

Publication number: JP4296290B2
Application number: JP2003361646A
Authority: JP
Inventors: 貴克吉村; 立太寺嶌; 位好寺澤; 敏裕脇田
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2003-10-22
Filing date: 2003-10-22
Publication date: 2009-07-15
Anticipated expiration: 2023-10-22
Also published as: JP2005128130A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a program, and more particularly to a speech recognition apparatus, speech recognition method, and program for recognizing speech uttered one syllable at a time.

Various speech recognition apparatuses have been proposed to improve the speed of speech input.

Patent Document 1 describes registering, as a standard pattern, a syllable string corresponding to each syllable (for example, the syllable string "a-i-u-e-o no a" for the syllable "a"). Even when a single syllable is difficult to recognize on its own, the syllable "a" is recognized with high probability if the syllable string "a-i-u-e-o no a" is spoken.

Patent Document 2 describes recognizing speech with high accuracy by using both speech in which a word is uttered syllable by syllable and speech in which the same word is uttered continuously.

Patent Document 3 describes changing the facial expression of a character according to the reliability level so that the speech recognition result and its reliability level are easy to confirm.
Patent Document 1: JP 9-179578 A
Patent Document 2: JP 10-340096 A
Patent Document 3: JP 9-292895 A

In Patent Document 1, to have the single syllable "a" recognized, the user must utter the corresponding syllable string "a-i-u-e-o no a". This slows down speech input.

In Patent Document 2, the user must utter the same word in two different ways, which also slows down speech input.

Patent Document 3 merely feeds the reliability of the speech recognition result back to the user, which by itself cannot increase the speech input speed.

The present invention has been proposed to solve the problems described above. Its object is to provide a speech recognition apparatus, a speech recognition method, and a program that can improve the input speed of speech uttered one syllable at a time.

The speech recognition apparatus according to claim 1 comprises: speech input means for inputting speech uttered one syllable at a time; syllable recognition means for recognizing, syllable by syllable, the speech input by the speech input means; and output means for outputting the recognition result of the syllable in a manner corresponding to the reliability of the recognition result of the syllable recognition means. When the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, the output means outputs predetermined information as the recognition result of the syllable. When the reliability is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition, the output means outputs at least the vowel of the syllable as the recognition result of the syllable.

The speech recognition method according to claim 4 comprises: a syllable recognition step of recognizing, syllable by syllable, speech uttered one syllable at a time; and an output step of outputting the recognition result of the syllable in a manner corresponding to the reliability of the recognition result obtained in the syllable recognition step. In the output step, when the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, predetermined information is output as the recognition result of the syllable. When the reliability is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition, at least the vowel of the syllable is output as the recognition result of the syllable.

The speech recognition program according to claim 11 causes a computer to function as: speech input means for inputting speech uttered one syllable at a time; syllable recognition means for recognizing, syllable by syllable, the speech input by the speech input means; and output means for outputting the recognition result of the syllable in a manner corresponding to the reliability of the recognition result of the syllable recognition means. The program causes the output means to output predetermined information as the recognition result of the syllable when the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, and to output at least the vowel of the syllable as the recognition result of the syllable when the reliability is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition.

The syllable recognition means recognizes, syllable by syllable, speech that the user utters one syllable at a time. The recognition results of the syllable recognition means vary in reliability, some high and some low. If no information at all is fed back to the user when the reliability of a recognition result is low, the user often does not go on to utter the next syllable.

The output means outputs the recognition result of the syllable in a manner corresponding to the reliability of the recognition result of the syllable recognition means. That is, the output means changes how the recognition result is output according to its reliability.

According to the above invention, therefore, the user can be prompted to input the next syllable, and as a result the speech input speed can be improved.

When the reliability of the recognition result of a syllable is lower than the first threshold, the syllable has not been recognized at all. If the output means outputs no information about the syllable in this case, the user does not go on to utter the next syllable.

According to the above invention, therefore, when the reliability of the recognition result of a syllable is lower than the first threshold indicating that recognition is impossible, predetermined information is output as the recognition result of the syllable, so the user can be prompted to utter the next syllable even when the syllable could not be recognized at all. The predetermined information may be something like a backchannel response (aizuchi).

When the reliability of the recognition result of a syllable is equal to or higher than the first threshold and lower than the second threshold, the syllable has been recognized only in part, not completely. For example, the vowel of the syllable has been recognized but the consonant has not.

If the output means outputs no information about the syllable in this case, the user does not go on to utter the next syllable. When some information is to be output, outputting at least the recognized part is the more effective way to prompt the user to keep speaking.

According to the above invention, therefore, when the reliability of the recognition result is equal to or higher than the first threshold and lower than the second threshold indicating incomplete recognition, at least the vowel of the syllable is output as the recognition result of the syllable. This informs the user that part of the syllable was recognized and so prompts the user to utter the next syllable.

The speech recognition apparatus according to claim 2 is the apparatus according to claim 1, wherein the output means is at least one of sound output means for outputting sound and image output means for outputting an image.

The speech recognition method according to claim 5 is the method according to claim 4, wherein at least one of sound and an image is output in the output step.

The speech recognition apparatus according to claim 3 is the apparatus according to claim 1 or claim 2, further comprising: syllable string candidate storage means for storing a plurality of syllable string candidates; and selection means for selecting, from among the plurality of syllable string candidates stored in the syllable string candidate storage means, the syllable string candidate that best corresponds to the syllable string composed of the plurality of syllables recognized by the syllable recognition means.

The speech recognition method according to claim 6 is the method according to claim 4 or claim 5, further comprising a selection step of selecting, from among a plurality of syllable string candidates, the syllable string candidate that best corresponds to the syllable string composed of the plurality of syllables recognized in the syllable recognition step.

Recognizing a single syllable alone serves no purpose; ultimately, a syllable string composed of a plurality of syllables must be recognized. For this purpose, syllable string candidates that are meaningful words, for example nouns, are prepared in advance. The selection means then selects, from among the syllable string candidates, the one that best corresponds to the syllable string composed of the syllables already recognized.

According to the above invention, a syllable string consisting of a plurality of syllables uttered one at a time can thus be recognized accurately and reliably even if some of its syllables were unrecognizable or incompletely recognized.

The speech recognition apparatus, speech recognition method, and program according to the present invention recognize, syllable by syllable, speech uttered one syllable at a time and output each recognition result in a manner corresponding to its reliability. This prompts the user to input the next syllable and, as a result, improves the speech input speed.

The best mode for carrying out the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus recognizes speech that the user utters one syllable at a time. The present embodiment is described using an example in which the user utters "hi, ga, shi, ya, ma, do, u, bu, tsu, e, n" (ひ・が・し・や・ま・ど・う・ぶ・つ・え・ん) syllable by syllable.

The speech recognition apparatus comprises: a microphone 1 that receives speech uttered by the user and generates a speech signal; a speech segment extractor 2 that cuts out speech segments from the speech signal and extracts acoustic parameters; a single-syllable recognizer 3 that performs single-syllable speech recognition; a syllable candidate selector 4 that selects a syllable string candidate and outputs the final recognition result; a syllable string candidate dictionary database 5 that stores a dictionary of a plurality of syllable string candidates; a display device 6 that outputs the recognition result as an image; and a speaker 7 that outputs the recognition result as sound.

The single-syllable recognizer 3 performs single-syllable speech recognition and also calculates the reliability of each single-syllable recognition result. The plurality of single syllables recognized by the single-syllable recognizer 3 is referred to as the "recognition target syllable string". The recognition target syllable string may include syllables that could not be recognized or that were incompletely recognized (only the vowel recognized).

The syllable string candidate dictionary database 5 stores a plurality of syllable string candidates and the acoustic model sequence corresponding to each candidate. A syllable string candidate may be any word that carries a meaning; in the present embodiment, the candidates are assumed to be nouns such as place names and facility names.

The syllable candidate selector 4 selects syllable string candidates from the syllable string candidate dictionary database 5 based on the recognition target syllable string, matches the acoustic parameters against the acoustic model sequences, and outputs the final speech recognition result for the recognition target syllable string.

FIG. 2 is a flowchart showing the procedure of the speech recognition processing performed by the speech recognition apparatus. When the user utters speech separated into individual syllables, the microphone 1 converts the speech into a speech signal and supplies it to the speech segment extractor 2.

The speech segment extractor 2 accepts the speech signal supplied from the microphone 1 (step ST1), cuts out a speech segment from the speech signal, and performs acoustic analysis to extract feature parameters (acoustic parameters) (step ST2).

The single-syllable recognizer 3 recognizes the single syllable using the acoustic parameters extracted by the speech segment extractor 2 and calculates the reliability of the recognition result (step ST3). The display device 6 and the speaker 7 then output the recognition result in a manner corresponding to the reliability of the single-syllable recognition result.

Specifically, when the reliability of the recognition result is high, the display device 6 displays the character image of the single syllable as is. When the reliability is somewhat low (for example, equal to or higher than the first threshold and lower than the second threshold, meaning that only the vowel could be recognized), the display device 6 displays the character image of the vowel together with a predetermined image "?" beside it, indicating that the consonant could not be recognized. When the reliability is low (for example, lower than the first threshold, meaning that neither the vowel nor the consonant could be recognized), the display device 6 displays a predetermined image "*" indicating that the syllable could not be recognized and prompting input of the next syllable.

The speaker 7, for its part, outputs synthesized speech of the single syllable when the reliability of the recognition result is high, and outputs synthesized speech of the vowel alone when the reliability is somewhat low (only the vowel could be recognized). When the reliability is low (neither the vowel nor the consonant could be recognized), the speaker 7 outputs the synthesized speech "hai" ("yes"), indicating that the syllable could not be recognized and prompting input of the next syllable.
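The three-tier feedback rule for the display device 6 and the speaker 7 can be sketched as follows. This is only an illustrative sketch: the threshold values 0.3 and 0.7, the names `Feedback` and `feedback`, and the use of romanized syllables (whose last letter is taken as the vowel) are assumptions, since the description names the two thresholds without giving values or an implementation. "hai" stands for the synthesized prompt "はい".

```python
from dataclasses import dataclass

# Hypothetical values; the description defines the thresholds but not their magnitudes.
FIRST_THRESHOLD = 0.3   # below this: syllable unrecognizable
SECOND_THRESHOLD = 0.7  # below this (but >= first): recognition incomplete

VOWELS = "aiueo"


@dataclass
class Feedback:
    display: str  # what the display device would show
    speech: str   # what the speaker would synthesize


def feedback(syllable: str, confidence: float) -> Feedback:
    """Choose the output for one romanized syllable from its confidence."""
    if confidence < FIRST_THRESHOLD:
        # Nothing recognized: show "*" and say "hai" to prompt the next syllable.
        return Feedback("*", "hai")
    if confidence < SECOND_THRESHOLD:
        vowel = syllable[-1]
        if vowel in VOWELS:
            # Only the vowel is trusted: show "?" plus the vowel, speak the vowel.
            return Feedback("?" + vowel, vowel)
        # Assumption: a syllable without a vowel (e.g. "n") is treated as unrecognized.
        return Feedback("*", "hai")
    # Fully recognized: echo the syllable as is.
    return Feedback(syllable, syllable)
```

For instance, `feedback("ga", 0.5)` yields the display string `?a` and the spoken vowel `a`, which corresponds to the partially recognized case described above.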

In this way, even when the reliability of a single-syllable recognition result is not high, the speech recognition apparatus feeds the reliability of the result back to the user visually and audibly, which leads the user to believe that the system is operating correctly. As a result, the user can be prompted to input the next single syllable.

Next, the single-syllable recognizer 3 determines whether the single-syllable recognition result is correct (step ST4). At this point the user can check the recognition result through the output of the display device 6 and the speaker 7. If the user judges the recognition result to be wrong, the user can, for example, press a button (not shown) that indicates an incorrect recognition result.

The single-syllable recognizer 3 determines that the recognition result is correct if it detects no press of the button within a predetermined time, and determines that the recognition result is incorrect if it detects a press of the button. In step ST4, instead of having the user judge correctness, the single-syllable recognizer 3 may, for example, judge automatically whether the result is correct according to the reliability of the single syllable.

When the single-syllable recognizer 3 determines that the recognition result is incorrect, it performs syllable recognition again by outputting the single-syllable recognition result that was the next candidate (step ST3). The single-syllable recognizer 3 repeats steps ST3 and ST4 until the single-syllable recognition result is correct.
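The loop over steps ST3 and ST4 can be pictured as walking down a ranked candidate list until one candidate is not rejected. The following is a minimal sketch under assumptions: the function name `confirm_syllable` and the callback `is_rejected` are hypothetical, and the callback stands in for "the button was pressed within the predetermined time".

```python
from typing import Callable, Iterable, Optional


def confirm_syllable(
    ranked_candidates: Iterable[str],
    is_rejected: Callable[[str], bool],
) -> Optional[str]:
    """Present candidates in rank order and return the first one not rejected.

    is_rejected(candidate) is a stand-in for step ST4: it should return True
    when the user signals an error (button press) before the timeout expires.
    Returns None if every candidate is rejected.
    """
    for candidate in ranked_candidates:
        # Step ST3: output this candidate; step ST4: wait for a possible rejection.
        if not is_rejected(candidate):
            return candidate
    return None
```

A caller would pass the recognizer's ranked hypotheses, for example `confirm_syllable(["ka", "ga"], button_pressed)`, where `button_pressed` is whatever mechanism the system uses to detect the rejection.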

When the single-syllable recognizer 3 determines that the recognition result is correct, it stores the waveform data (acoustic parameters) of the single syllable in a memory (not shown) (step ST5).

Next, the single-syllable recognizer 3 determines whether input of all syllables is complete (step ST6). If a next single syllable is input, the single-syllable recognizer 3 determines that syllable input is not complete and returns to step ST1. If no next single syllable is input within a predetermined time after the processing of step ST5, it determines that syllable input is complete.

Thus, for example, when the syllable recognition processing for the single syllable "hi" is finished in steps ST1 to ST5, the processing returns to step ST1 and syllable recognition is performed for the next single syllable "ga". Syllable recognition is performed in this way for each single syllable of the syllable string "hi, ga, shi, ya, ma, do, u, bu, tsu, e, n".

FIG. 3 is a diagram explaining the recognition result for each single syllable of the input speech "hi, ga, shi, ya, ma, do, u, bu, tsu, e, n".

Each time the user utters a single syllable, the display device 6 outputs, in order, the images "*", "?a", "shi", "?a", "ma", "do", "u", "?u", "*", "e", and "*" as recognition results. At the same time, the speaker 7 outputs, in order, the synthesized speech "hai", "a", "shi", "a", "ma", "do", "u", "u", "hai", "e", and "hai" as recognition results.

The user can therefore keep uttering single syllables one after another without being affected by the reliability of the individual recognition results. That is, even if some of the single-syllable recognition results are wrong, the speech recognition apparatus prompts the user to keep speaking, so corrections of individual single syllables are avoided as far as possible. The plurality of single syllables recognized as shown in FIG. 3 is referred to as the "recognition target syllable string".

From the syllable string candidate dictionary stored in the syllable string candidate dictionary database 5, the syllable candidate selector 4 selects syllable string candidates that formally match the recognition target syllable string, for example candidates that have the same number of single syllables and the same syllables or vowels at the same positions (step ST7). The syllable candidate selector 4 treats an unrecognizable single syllable, shown as "*", as any single character. The syllable candidate selector 4 can thus select syllable string candidates even when the recognition target syllable string contains single syllables that could not be recognized.
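The formal matching of step ST7 can be illustrated with a small filter. This is a sketch under assumptions, not the apparatus's actual procedure: syllables are romanized, "*" marks an unrecognized syllable, "?" followed by a vowel marks a vowel-only result, and the function names and the sample dictionary are hypothetical.

```python
from typing import List, Optional, Sequence


def vowel_of(syllable: str) -> Optional[str]:
    """Return the final vowel of a romanized syllable, or None (e.g. for "n")."""
    return syllable[-1] if syllable[-1] in "aiueo" else None


def formally_matches(recognized: Sequence[str], candidate: Sequence[str]) -> bool:
    """Check the formal conditions: same length, same syllable or vowel per position."""
    if len(recognized) != len(candidate):
        return False
    for r, c in zip(recognized, candidate):
        if r == "*":
            continue  # unrecognized syllable: matches any single syllable
        if r.startswith("?"):
            if vowel_of(c) != r[1:]:
                return False  # only the vowel is known, and it must agree
        elif r != c:
            return False  # fully recognized syllable must match exactly
    return True


def select_candidates(
    recognized: Sequence[str], dictionary: List[List[str]]
) -> List[List[str]]:
    """Keep only the dictionary entries that formally match the recognized string."""
    return [entry for entry in dictionary if formally_matches(recognized, entry)]


# Recognition target syllable string from the example, in the notation assumed above.
RECOGNIZED = ["*", "?a", "shi", "?a", "ma", "do", "u", "?u", "*", "e", "*"]
TARGET = ["hi", "ga", "shi", "ya", "ma", "do", "u", "bu", "tsu", "e", "n"]
```

With these definitions, `formally_matches(RECOGNIZED, TARGET)` is true even though four of the eleven syllables were only partly recognized or not recognized at all.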

The syllable candidate selector 4 then reads the acoustic parameters of each syllable from the memory and matches the acoustic parameters of the recognition target syllable string against the acoustic models of each syllable string candidate to select the candidate that best corresponds to the recognition target syllable string. Specifically, it calculates a score between the recognition target syllable string and each syllable string candidate and selects the candidate with the highest score as the re-recognition result (final recognition result) (step ST8). The display device 6 and the speaker 7 then output the final recognition result.
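Step ST8 amounts to scoring each remaining candidate against the stored acoustic parameters and taking the maximum. The description does not specify how the score is computed, so the sketch below substitutes a deliberately simple stand-in: each syllable model is a single mean vector and the score is the negative squared Euclidean distance, summed over syllables. All names here are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def syllable_score(params: Vector, model_mean: Vector) -> float:
    """Stand-in score: negative squared distance (higher means a closer match)."""
    return -sum((p - m) ** 2 for p, m in zip(params, model_mean))


def best_candidate(
    stored_params: List[Vector],
    candidate_models: Dict[str, List[Vector]],
) -> str:
    """Return the name of the candidate whose model sequence scores highest.

    stored_params holds one acoustic parameter vector per uttered syllable
    (as saved in step ST5); candidate_models maps each candidate name to its
    per-syllable model vectors, assumed to have the same length.
    """
    def total_score(name: str) -> float:
        return sum(
            syllable_score(p, m)
            for p, m in zip(stored_params, candidate_models[name])
        )

    return max(candidate_models, key=total_score)
```

A real system would likely use a likelihood under statistical acoustic models rather than a distance to a mean vector; the selection step itself (maximum over candidate scores) stays the same.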

As described above, the speech recognition apparatus according to the embodiment of the present invention can prompt the user to keep uttering single syllables even when a recognition result does not exactly match the single syllable the user uttered. This reduces apparent misrecognition and improves the speech input speed.

In the post-processing from step ST7 onward, the speech recognition apparatus first narrows down the syllable string candidates using the recognition target syllable string, which may include unrecognizable or incompletely recognized single syllables, and then uses the acoustic parameters of the same recognition target syllable string again to output the corresponding syllable string candidate as the final recognition result. That is, by checking the recognition target syllable string against the syllable string candidates in the syllable string candidate dictionary database 5 and narrowing down the candidates for the input speech, the apparatus can recognize the recognition target syllable string with high accuracy even when some single syllables were unrecognizable or incompletely recognized.

The present invention is not limited to the embodiment described above and also applies to design modifications within the scope of the claims.

For example, a speech recognition program that executes the processing of steps ST1 to ST8 described above may be installed on a computer so that the computer performs the functions of the speech segment extractor 2, the single-syllable recognizer 3, the syllable candidate selector 4, and the syllable string candidate dictionary database 5. The computer may install a speech recognition program transmitted over a communication line, or a speech recognition program recorded on a recording medium such as an optical disk, a magnetic disk, or a semiconductor memory.

Furthermore, the display device 6 may display other symbols, letters, or characters instead of "?" and "*". Similarly, instead of "hai", the speaker 7 may output other synthesized speech that works as a backchannel response, such as "ee".

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to the embodiment of the present invention.
FIG. 2 is a flowchart showing the procedure of the speech recognition processing performed by the speech recognition apparatus.
FIG. 3 is a diagram for explaining the speech recognition result.

Explanation of symbols

1 Microphone
2 Speech segment extractor
3 Single-syllable recognizer
4 Syllable candidate selector
5 Syllable string candidate dictionary database
6 Display device
7 Speaker

Claims

A speech recognition apparatus comprising:
speech input means for inputting speech uttered one syllable at a time;
syllable recognition means for recognizing, syllable by syllable, the speech input by the speech input means; and
output means for outputting a recognition result of the syllable in a manner corresponding to a reliability of the recognition result of the syllable recognition means,
wherein the output means outputs predetermined information as the recognition result of the syllable when the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, and outputs at least a vowel of the syllable as the recognition result of the syllable when the reliability of the recognition result is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition.

The speech recognition apparatus according to claim 1, wherein the output means is at least one of sound output means for outputting sound and image output means for outputting an image.

The speech recognition apparatus according to claim 1 or 2, further comprising:
syllable string candidate storage means for storing a plurality of syllable string candidates; and
selection means for selecting, from among the plurality of syllable string candidates stored in the syllable string candidate storage means, the syllable string candidate that best corresponds to a syllable string composed of a plurality of syllables recognized by the syllable recognition means.

A speech recognition method comprising:
a syllable recognition step of recognizing, syllable by syllable, speech uttered one syllable at a time; and
an output step of outputting a recognition result of the syllable in a manner corresponding to a reliability of the recognition result obtained in the syllable recognition step,
wherein, in the output step, predetermined information is output as the recognition result of the syllable when the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, and at least a vowel of the syllable is output as the recognition result of the syllable when the reliability of the recognition result is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition.

The speech recognition method according to claim 4, wherein at least one of sound and an image is output in the output step.

The speech recognition method according to claim 4 or 5, further comprising a selection step of selecting, from among a plurality of syllable string candidates, the syllable string candidate that best corresponds to a syllable string composed of a plurality of syllables recognized in the syllable recognition step.

A speech recognition program that causes a computer to function as:
speech input means for inputting speech uttered one syllable at a time;
syllable recognition means for recognizing, syllable by syllable, the speech input by the speech input means; and
output means for outputting a recognition result of the syllable in a manner corresponding to a reliability of the recognition result of the syllable recognition means,
wherein the program causes the output means to output predetermined information as the recognition result of the syllable when the reliability of the recognition result is lower than a first threshold indicating that recognition is impossible, and to output at least a vowel of the syllable as the recognition result of the syllable when the reliability of the recognition result is equal to or higher than the first threshold and lower than a second threshold indicating incomplete recognition.