JP2006184421A

JP2006184421A - Speech recognition device and speech recognition method

Info

Publication number: JP2006184421A
Application number: JP2004376211A
Authority: JP
Inventors: Hiroshi Saito; 浩斎藤; Kengo Suzuki; 堅悟鈴木
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2004-12-27
Filing date: 2004-12-27
Publication date: 2006-07-13

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device which completes speech recognition processing within a short time after speech for recognition is input and identifies the spoken vocabulary.

SOLUTION: The speech recognition device comprises: a speech input means 102 for converting input speech into a speech signal and outputting it; a beginning/ending speech recognition means 103 capable of recognizing one syllable at the beginning and one syllable at the end of a word in the speech signal output from the speech input means 102; a vocabulary narrowing-down means 111 for narrowing down the range of the vocabulary for recognition using either or both of the beginning syllable and the ending syllable recognized by the beginning/ending speech recognition means 103; and a vocabulary speech recognition means 104 for recognizing the vocabulary in the speech signal using the vocabulary for recognition within the range narrowed down by the vocabulary narrowing-down means 111.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech recognition device and a speech recognition method.

In recent years, many voice interface methods that accept input by recognizing an operator's utterances have been proposed. They are attracting attention as a way to reduce the effort of operating switches by hand, and are particularly effective for information search operations (name search, telephone number search, and the like).

A voice interface is generally realized by registering the recognizable vocabulary in the system as a dictionary, collating the spoken information against the contents of the dictionary, and identifying the single word with the highest likelihood. Because collation is performed against the entire prepared vocabulary, the time required for collation grows as the dictionary grows. One reason is that the recognition dictionary is held on an external storage device such as a CD-ROM or DVD, so reading the dictionary takes a long time. In addition, as the number of words increases, the collation itself takes longer. In general, collation against a dictionary of several hundred words finishes within one second, but with a dictionary exceeding 10,000 words the processing may take on the order of several seconds. In that case a long time elapses between the end of the utterance and the output of the recognition result, which disrupts the tempo of the interaction and may make the interface feel unnatural.

JP 2001-083981 A

In the prior art, as an invention addressing the above problem, namely that collation for speech recognition may take on the order of several seconds and make the interface unnatural, a speech recognition system has been proposed, for example as described in Patent Document 1, in which frequently used vocabulary is stored in a location that can be accessed at high speed and infrequently used vocabulary is stored in a location with a lower access speed.

However, such a method requires the usage frequency of the vocabulary to be analyzed in advance. In information search it is generally difficult to predict how often each item will be searched for, so some constraint usually has to be imposed. For example, while the user is in Kanagawa Prefecture, information about Kanagawa Prefecture is likely to be searched for, so that information is stored in a location that can be accessed at high speed. Once such a constraint is imposed, usability suffers whenever the operator's intention departs from it, and the interface becomes less pleasant to use. Searching a large vocabulary should treat all entries as uniformly as possible.

The present invention has been made in view of these problems. The problem to be solved by the present invention is to provide a speech recognition device and a speech recognition method that finish speech recognition processing within a short time after the speech for recognition is input and identify the spoken vocabulary.

A speech recognition device is configured that, when a vocabulary item is input as speech, converts the input speech into a speech signal, recognizes either or both of the first syllable and the last syllable of the word in the speech signal, narrows down the range of the recognition target vocabulary using the recognition result, and recognizes the vocabulary in the speech signal, that is, the input vocabulary, using the recognition target vocabulary within the narrowed range. Here, “recognition target vocabulary” means vocabulary that may match the vocabulary input by speech.

By implementing the present invention, it is possible to provide a speech recognition device and a speech recognition method that finish speech recognition processing within a short time after the speech for recognition is input and identify the spoken vocabulary.

The speech recognition device and speech recognition method according to the present invention are characterized by recognizing one syllable at the beginning, at the end, or at both the beginning and end of a spoken word, narrowing down the recognition target vocabulary based on the result, and then collating the entire utterance to identify the vocabulary.

The present invention is described in detail below by way of embodiments.

(First embodiment)
In this embodiment, the present invention is applied to facility search in a vehicle navigation system. The present invention is described below with reference to the drawings.

This embodiment is a speech recognition device that is installed on the steering wheel and, as shown in FIG. 1, has the following components: a speech input start instructing means 101 that instructs the start of speech input; a speech input means 102 that includes a microphone or the like and converts input speech into a speech signal and outputs it; a beginning/ending speech recognition means 103 capable of recognizing the first syllable and the last syllable of a word in the speech signal converted by the speech input means 102; a word speech recognition means 104 that recognizes the vocabulary in the speech signal; a speech recognition dictionary 105 into which the recognition target vocabulary is loaded; a speech recognition information database 106 in which the recognition target vocabulary is stored; a response generation means 107 that generates response output as speech or screen display; a visual information display means 108 that outputs text information and map information to a monitor as response output; an auditory information presentation means 109 that outputs the generated response sentence as response output; a navigation system control means 110 that controls the navigation system according to the input content; and a vocabulary narrowing means 111 that narrows down the range of the recognition target vocabulary using either or both of the first syllable and the last syllable recognized by the beginning/ending speech recognition means 103.

This embodiment is a voice input device with which a navigation system installed in a vehicle can be operated by voice. The operation of the voice input device according to the present invention is described here using facility name search by voice as an example.

The flow of operation is described below with reference to the flowchart shown in FIG. 2.
(a) The speech input start instructing means 101 is installed on the steering wheel of the vehicle. It is used to instruct the start of speech input.
(b) As the speech input means 102, a microphone is installed at a position close to the driver's mouth, such as near the rearview mirror or on the steering column.
(c) When the speech input start button of the speech input start instructing means 101 is pressed, the speech input means 102 is activated and enters a state in which an utterance can be input (S201). At this time, the auditory information presentation means 109 outputs guidance speech prompting an utterance, such as “Please state your request” (S202). At the same time, a display prompting speech input is output on the monitor screen through the visual information display means 108. At this time, the recognition target vocabulary is transferred from the speech recognition information database 106 to the speech recognition dictionary 105 and set there.
(d) A beep is output after the guidance speech “Please state your request,” and when the beep ends the device waits for speech input. At this time, an icon indicating that speech input is possible is displayed on the screen. The operator then utters the command “facility search” (S203).
(e) The user's utterance “facility search” is input to the speech input means 102 and converted into a speech signal by it. The word speech recognition means 104 collates the speech signal against the speech recognition dictionary 105, and the command “facility search” is recognized (S204).
(f) The auditory information presentation means 109 outputs the guidance speech “Please say the facility name” (S205). At the same time, a display prompting input of the facility name is output on the monitor screen through the visual information display means 108. At this time, single-character data for the 50 sounds of the Japanese syllabary is set in the speech recognition dictionary 105.
(g) A beep is output after the guidance speech “Please say the facility name,” and when the beep ends the device waits for speech input. At this time, an icon indicating that speech input is possible is displayed on the screen. The operator then utters the facility name “Tokyo Disneyland” (S206). This utterance information (including the operator's utterance as converted into a speech signal by the speech input means 102) is temporarily stored.
(h) The beginning/ending speech recognition means 103 recognizes one syllable at the beginning and one syllable at the end of the spoken phrase. In this example, the beginning is recognized as “to” and the ending as “do” (S207).
(i) Facility name information stored in the speech recognition information database 106 whose beginning is “to” and whose ending is “do” is selected (S207). The vocabulary narrowing means 111 narrows down the vocabulary range under the condition that the beginning is “to” and the ending is “do,” and the facility names stored in the speech recognition information database 106 that fall within the narrowed range are transferred to the speech recognition dictionary 105 and set there as the recognition target vocabulary. The recognition target vocabulary is thereby narrowed down (S208).
(j) The word speech recognition means 104 executes recognition processing on the information uttered earlier in (g), and the vocabulary “Tokyo Disneyland” in the input speech signal is recognized (S209). This recognition is performed by collating the temporarily stored utterance information (the input speech signal) against the vocabulary transferred to the speech recognition dictionary 105, that is, the vocabulary beginning with “to” and ending with “do” (the recognition target vocabulary after narrowing). In this way, the input speech signal that has already undergone beginning/ending recognition by the beginning/ending speech recognition means 103 is processed for recognition a second time.
(k) The auditory information presentation means 109 outputs the guidance speech “Searching for Tokyo Disneyland” (S210), and a map of the retrieved location is displayed on the monitor screen through the visual information display means 108.
(l) After that, operations such as “register the displayed location” and “set it as the destination” can be performed by following the instructions on the screen. These are carried out by the navigation system control means 110. At this time, the map of Tokyo Disneyland is displayed (S211) and the above operation is executed (S212).
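The narrowing in steps (h) to (j) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the facility list, the function name `narrow_vocabulary`, and the treatment of a single kana character as one syllable are all hypothetical and do not come from the patent.

```python
from typing import List, Optional, Tuple

# Hypothetical facility database: (display name, kana reading) pairs.
FACILITIES: List[Tuple[str, str]] = [
    ("Tokyo Disneyland", "とうきょうでぃずにーらんど"),
    ("Tokyo Tower", "とうきょうたわー"),
    ("Toshimaen", "としまえん"),
    ("Yokohama Stadium", "よこはますたじあむ"),
    ("Tokorozawa Ground", "ところざわぐらうんど"),
]


def narrow_vocabulary(
    entries: List[Tuple[str, str]],
    first: Optional[str] = None,
    last: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Keep entries whose reading starts with `first` and ends with `last`.

    A condition passed as None is not applied, so the same function covers
    narrowing by the beginning only, the ending only, or both.
    Assumption: one kana character stands in for one syllable.
    """
    return [
        (name, reading)
        for name, reading in entries
        if (first is None or reading.startswith(first))
        and (last is None or reading.endswith(last))
    ]


# Beginning recognized as "to", ending recognized as "do" (step (h)).
candidates = narrow_vocabulary(FACILITIES, first="と", last="ど")
candidate_names = [name for name, _ in candidates]
```

Only the entries left in `candidates` would then be collated against the stored utterance in step (j), which is where the time saving is expected to come from.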

In the above, one syllable at the beginning and one syllable at the end of the word are both recognized and both used to narrow down the vocabulary, but only the first syllable, or only the last syllable, may be recognized and used instead. This further reduces the load of the single-syllable recognition processing. However, the number of recognition target words remaining afterward may then be larger, so the choice among recognizing both the beginning and the ending, recognizing only the beginning, and recognizing only the ending should be made in view of the total number of recognition target words. Specifically, when the number of recognition target words within the range narrowed by the vocabulary narrowing means 111, using the result of the beginning/ending speech recognition means 103 recognizing only one of the first and last syllables in the speech signal, exceeds a predetermined number, the beginning/ending speech recognition means 103 may also recognize the other syllable, and the vocabulary narrowing means 111 may use that result as well to narrow the range of the recognition target vocabulary further.
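A minimal sketch of this staged narrowing, assuming the beginning is recognized first and the ending is recognized only when too many candidates remain. The names `staged_narrow`, `recognize_last`, and `max_candidates` are hypothetical, and the syllable recognizer is replaced by a stub.

```python
from typing import Callable, List, Tuple


def staged_narrow(
    readings: List[str],
    first: str,
    recognize_last: Callable[[], str],
    max_candidates: int,
) -> Tuple[List[str], bool]:
    """Narrow by the first syllable; recognize the last syllable only if needed.

    Returns the narrowed readings and a flag telling whether the last
    syllable had to be recognized as well.
    """
    narrowed = [r for r in readings if r.startswith(first)]
    if len(narrowed) <= max_candidates:
        return narrowed, False  # the second recognition pass is skipped
    last = recognize_last()     # extra work only when too many candidates remain
    return [r for r in narrowed if r.endswith(last)], True


READINGS = [
    "とうきょうでぃずにーらんど",
    "とうきょうたわー",
    "としまえん",
    "よこはますたじあむ",
]

calls: List[int] = []


def stub_recognize_last() -> str:
    """Stand-in for the ending-syllable recognizer; records each invocation."""
    calls.append(1)
    return "ど"


few, used_last_few = staged_narrow(READINGS, "と", stub_recognize_last, max_candidates=5)
many, used_last_many = staged_narrow(READINGS, "と", stub_recognize_last, max_candidates=2)
```

With a limit of 5 the three readings beginning with “to” are accepted as they are; with a limit of 2 the ending is recognized as well and a single reading remains.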

In this embodiment, the syllables at the beginning and end of the word are recognized when recognizing the facility name, but the same approach may be applied when the command “facility search” is uttered. In that case, if the number of commands to be recognized is small, recognizing only the beginning or ending sound may narrow the candidates down to a single command, which makes faster recognition processing achievable. The operator may also input the facility name by a method other than speech (for example, selection by key operation).

The facility name may also be input after the vocabulary has been narrowed down and before speech recognition by the word speech recognition means 104.

In the above, before the beginning/ending speech recognition means 103 performs syllable recognition (before (h)), the range of the recognition target vocabulary is narrowed by category, namely restricted to facility names ((c) to (e)). This category-based narrowing may instead be performed after the beginning/ending speech recognition means 103 performs syllable recognition (after (h)) and before the word speech recognition means 104 starts vocabulary recognition (before (j)).

As described above, in this embodiment only the first syllable, only the last syllable, or both are recognized in the spoken vocabulary, the candidate vocabulary is narrowed down based on the result, and recognition processing is then performed again to determine the result.

With this method, even when the recognition target vocabulary is large, collation is needed only against the candidates that remain after narrowing, so recognition can be performed in a short time. That is, collation can be finished and the spoken vocabulary identified within a short time after the speech for recognition is input.

In addition, when the vocabulary is read from an external storage device such as a disc, only the narrowed-down vocabulary needs to be read, so the read time is also shortened.
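One way such a selective read might be organized is an index keyed by the pair of beginning and ending syllables, so that only the matching group is fetched. The patent does not describe a storage layout; the sketch below is an assumption, with an in-memory dictionary standing in for the external storage and hypothetical names throughout.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

Index = DefaultDict[Tuple[str, str], List[str]]


def build_index(readings: Iterable[str]) -> Index:
    """Group readings by (first character, last character).

    Assumption: one kana character stands in for one syllable, and the
    dictionary stands in for groups stored separately on a disc.
    """
    index: Index = defaultdict(list)
    for reading in readings:
        index[(reading[0], reading[-1])].append(reading)
    return index


def load_candidates(index: Index, first: str, last: str) -> List[str]:
    """Fetch only the group that matches both recognized syllables."""
    return list(index.get((first, last), []))


READINGS = [
    "とうきょうでぃずにーらんど",
    "とうきょうたわー",
    "としまえん",
    "よこはますたじあむ",
    "ところざわぐらうんど",
]

index = build_index(READINGS)
loaded = load_candidates(index, "と", "ど")
```

Here two of the five readings are loaded; the rest are never touched, which mirrors the reduced read time described above.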

The above configuration also has the effect that less memory is needed for loading the recognition target vocabulary.

Furthermore, recognizing the single syllable at the beginning or end of a word is easier than recognizing a syllable elsewhere in the word, which enables highly accurate speech recognition.

(Second embodiment)
In this embodiment, the present invention is applied to facility search in a vehicle navigation system. The present invention is described below with reference to the drawings.

This embodiment is a speech recognition device that is installed on the steering wheel and, as shown in FIG. 3, has the following components: a speech input start instructing means 301 that instructs the start of speech input; a speech input means 302 that includes a microphone or the like and converts input speech into a speech signal and outputs it; a beginning/ending speech recognition means 303 capable of recognizing the first syllable and the last syllable of a word in the speech signal converted by the speech input means 302; a word speech recognition means 304 that recognizes the vocabulary in the speech signal; a speech recognition dictionary 305 into which the recognition target vocabulary is loaded; a recognition certainty analysis means 306 that calculates the certainty of the recognition results for the beginning and ending syllables; a speech recognition information database 307 in which the recognition target vocabulary is stored; a response generation means 308 that generates response output as speech or screen display; a visual information display means 309 that outputs text information and map information to a monitor as response output; an auditory information presentation means 310 that outputs the generated response sentence as response output; a navigation system control means 311 that controls the navigation system according to the input content; and a vocabulary narrowing means 312 that narrows down the range of the recognition target vocabulary using either or both of the first syllable and the last syllable recognized by the beginning/ending speech recognition means 303.

One example of the certainty of the beginning and ending syllable recognition results calculated by the recognition certainty analysis means 306 is the recognition likelihood. The recognition likelihood is defined, for example, as the degree of pattern matching between the input utterance waveform and the reference syllable waveform, that is, as the absolute value of the correlation coefficient between the two waveforms. Here, with the two waveforms written as f(t) and g(t) (t being time), and S(x) denoting the value obtained by integrating x with respect to time t over the interval in which both waveforms exist, the correlation coefficient between the two waveforms is defined as

$$\frac{S\bigl(f(t)\,g(t)\bigr)}{\bigl[\,S\bigl(f(t)^{2}\bigr)\,S\bigl(g(t)^{2}\bigr)\bigr]^{1/2}}$$

This recognition likelihood is 1 when the two waveforms have the same shape (when f(t) = k·g(t) for a constant k) and becomes smaller as the similarity between the two waveforms decreases.
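For sampled waveforms this likelihood can be approximated with sums in place of the integrals; the common sample spacing cancels between numerator and denominator. The following is a minimal sketch under that assumption. The function name `recognition_likelihood` and the zero-energy guard are illustrative additions, not part of the patent.

```python
import math
from typing import Sequence


def recognition_likelihood(f: Sequence[float], g: Sequence[float]) -> float:
    """Absolute value of the correlation coefficient between two sampled waveforms.

    Sums over samples replace the integrals S(.) over the shared time interval.
    """
    if len(f) != len(g):
        raise ValueError("waveforms must cover the same samples")
    numerator = sum(a * b for a, b in zip(f, g))
    denominator = math.sqrt(sum(a * a for a in f) * sum(b * b for b in g))
    if denominator == 0.0:
        return 0.0  # assumption: a silent waveform matches nothing
    return abs(numerator) / denominator


wave = [0.0, 1.0, 0.0, -1.0]
scaled = [-2.0 * x for x in wave]      # f(t) = k * g(t) with k = -2
unrelated = [1.0, 0.0, -1.0, 0.0]      # orthogonal to `wave`

same_shape = recognition_likelihood(wave, scaled)
no_match = recognition_likelihood(wave, unrelated)
```

Because the absolute value is taken, a waveform scaled by a negative constant still yields a likelihood of 1, consistent with the condition f(t) = k·g(t) stated above.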

This embodiment is a voice input device with which a navigation system installed in a vehicle can be operated by voice. The operation of the present invention is described here using facility name search by voice as an example. The flow of operation is described below with reference to the flowchart of FIG. 4.
(a) The speech input start instructing means 301 is installed on the steering wheel of the vehicle. It is used to instruct the start of speech input and to instruct a cancel operation.
(b) As the speech input means 302, a microphone is installed at a position close to the driver's mouth, such as near the rearview mirror or on the steering column.
(c) When the speech input start button of the speech input start instructing means 301 is pressed, the speech input means 302 is activated and enters a state in which an utterance can be input (S401). At this time, the auditory information presentation means 310 outputs guidance speech prompting an utterance, such as “Please state your request” (S402). At the same time, a display prompting speech input is output on the monitor screen through the visual information display means 309. At this time, the recognition target vocabulary is transferred from the speech recognition information database 307 to the speech recognition dictionary 305 and set there.
(d) A beep is output after the guidance speech “Please state your request,” and when the beep ends the device waits for speech input. At this time, an icon indicating that speech input is possible is displayed on the screen. The operator then utters the command “facility search” (S403).
(e) The user's utterance is input through the speech input means 302 to the word speech recognition means 304 as a speech signal, the word speech recognition means 304 collates it against the speech recognition dictionary 305, and the command “facility search” is recognized (S404).
(f) A beep is output after the guidance speech “Please say the facility name,” and when the beep ends the device waits for speech input (S405). At this time, an icon indicating that speech input is possible is displayed on the screen. The operator then utters the facility name “Tokyo Disneyland” (S406). This utterance information is temporarily stored.
(g) The beginning/ending speech recognition means 303 recognizes one syllable at the beginning and one syllable at the end of the phrase that was spoken and converted into a speech signal by the speech input means 302. In this example, the beginning is recognized as “to” and the ending as “do” (S407).
(h) At this time, the recognition certainty analysis means 306 first calculates the certainty of the recognition result for each of the beginning and ending syllables. In this example, suppose the certainty for the first syllable is calculated as A and the certainty for the last syllable as B. Here, certainty is a measure of how close the input word is to a word prepared in the dictionary when collation for speech recognition is performed. In this example, a larger value means a more certain recognition result. The recognition certainty analysis means 306 then analyzes the magnitude relationship between A, B, and a certainty threshold T, which is a predetermined criterion (S408).

When A ≥ T and B ≥ T (S409)
The vocabulary narrowing means 312 selects only the facility name information stored in the speech recognition information database 307 whose beginning is “to” and whose ending is “do,” and transfers it to the speech recognition dictionary 305, where it is set. That is, the recognition target vocabulary is narrowed down (S413).

When A ≥ T and B < T (S410)
The vocabulary narrowing means 312 selects only the facility name information stored in the speech recognition information database 307 whose beginning is “to,” and transfers it to the speech recognition dictionary 305, where it is set. That is, the recognition target vocabulary is narrowed down (S413).

When A < T and B ≥ T (S411)
The vocabulary narrowing means 312 selects only the facility name information stored in the speech recognition information database 307 whose ending is “do,” and transfers it to the speech recognition dictionary 305, where it is set. That is, the recognition target vocabulary is narrowed down (S413).

When A < T and B < T (S412)
The speaker is prompted to repeat the utterance. That is, the response speech “Please speak again” is output, and the device returns to the speech input standby state of (f) (S405).
(i) In cases other than S412, the word speech recognition means 304 collates the information uttered earlier in (f) against the dictionary set in (h), and “Tokyo Disneyland” is recognized (S414).
(j) The auditory information presentation means 310 outputs the guidance speech “Searching for Tokyo Disneyland” (S415), and a map of the retrieved location is displayed on the monitor screen through the visual information display means 309 (S416).
(k) After that, operations such as “register the displayed location” and “set it as the destination” can be performed by following the instructions on the screen (S417). These are carried out by the navigation system control means 311.

As described above, in the present embodiment, the range to which the recognition target vocabulary is narrowed is changed according to the certainty of the recognition results for the beginning and ending of the word. Specifically:
(1) when the certainty of both the beginning and the ending recognition results is equal to or greater than the predetermined value, the recognition target vocabulary is narrowed down using both recognition results;
(2) when only the certainty of the beginning recognition result is equal to or greater than the predetermined value, the recognition target vocabulary is narrowed down using the beginning recognition result;
(3) when only the certainty of the ending recognition result is equal to or greater than the predetermined value, the recognition target vocabulary is narrowed down using the ending recognition result;
(4) when the certainty of both the beginning and the ending recognition results is less than the predetermined value, the user is prompted to speak again.
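The four cases can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `narrow_vocabulary`, the sample readings other than Tokyo Disneyland, and the numeric certainty values are hypothetical, and the first and last kana characters of a reading are used as a simplified stand-in for the first and last syllables.

```python
from typing import List, Optional


def narrow_vocabulary(readings: List[str], head: str, tail: str,
                      a: float, b: float, t: float) -> Optional[List[str]]:
    """Narrow the vocabulary using only the syllables whose certainty reaches T.

    a, b: certainty of the head and tail recognition results (larger is better).
    Returns None when neither reaches T, meaning the user should speak again.
    """
    use_head = a >= t
    use_tail = b >= t
    if not use_head and not use_tail:
        return None  # case (4): prompt for re-utterance
    # Cases (1) to (3): apply each condition only if its certainty is sufficient.
    return [r for r in readings
            if (not use_head or r.startswith(head))
            and (not use_tail or r.endswith(tail))]


# Hypothetical kana readings of facility names.
vocab = ["とうきょうでぃずにーらんど", "とうきょうたわー", "よこはまらんど"]

both = narrow_vocabulary(vocab, "と", "ど", a=0.9, b=0.9, t=0.5)
head_only = narrow_vocabulary(vocab, "と", "ど", a=0.9, b=0.1, t=0.5)
tail_only = narrow_vocabulary(vocab, "と", "ど", a=0.1, b=0.9, t=0.5)
neither = narrow_vocabulary(vocab, "と", "ど", a=0.1, b=0.1, t=0.5)
```

In this sketch a low-certainty syllable is simply ignored rather than used to exclude candidates, which mirrors the behaviour described for cases (2) and (3).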

In the above description, the narrowing of the recognition target vocabulary by category classification, here the restriction of the recognition target vocabulary to facility names ((C) to (E)), is performed before the beginning/ending speech recognition means 303 performs syllable recognition (before (G)). However, this narrowing by category classification may instead be performed after the beginning/ending speech recognition means 303 has performed syllable recognition (after (G)) and before the word speech recognition means 304 starts vocabulary recognition (before (I)).
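One way to see why the order can be changed is that both narrowing steps act as independent filters on the same vocabulary, so applying them in either order leaves the same candidates. The following Python sketch illustrates this under assumed data: the entries, the category labels, and the helper names `by_category` and `by_syllables` are hypothetical and do not appear in the patent.

```python
# Hypothetical (reading, category) entries.
entries = [
    ("とうきょうでぃずにーらんど", "facility"),
    ("とうきょうたわー", "facility"),
    ("とらっくろーど", "road"),
]


def by_category(items, category):
    """Keep only the entries that belong to the given category."""
    return [e for e in items if e[1] == category]


def by_syllables(items, head, tail):
    """Keep only the entries whose reading starts with head and ends with tail."""
    return [e for e in items if e[0].startswith(head) and e[0].endswith(tail)]


# Category narrowing before syllable recognition, as in steps (C) to (E).
category_first = by_syllables(by_category(entries, "facility"), "と", "ど")
# Category narrowing after syllable recognition and before word recognition.
syllables_first = by_category(by_syllables(entries, "と", "ど"), "facility")
```

Both orders leave the same single candidate for the word speech recognition step.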

In addition, when the vocabulary is read from an external storage device such as a disk, only the narrowed-down vocabulary needs to be read, so the reading time can also be shortened.
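The patent does not specify how the vocabulary is stored. As a rough sketch of reading only the narrowed-down entries, the example below uses an in-memory SQLite database as a stand-in for the external storage; the table layout, the index, the sample readings, and the function `load_narrowed` are all assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the external storage device (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facility (reading TEXT, head TEXT, tail TEXT)")
# An index on the head and tail columns lets the store locate matching rows
# without scanning the whole vocabulary.
conn.execute("CREATE INDEX idx_head_tail ON facility (head, tail)")

readings = ["とうきょうでぃずにーらんど", "とうきょうたわー", "よこはまらんど"]
conn.executemany(
    "INSERT INTO facility VALUES (?, ?, ?)",
    [(r, r[0], r[-1]) for r in readings],  # first/last kana as syllable proxy
)


def load_narrowed(db, head=None, tail=None):
    """Read only the readings that match the given head and/or tail."""
    clauses, params = [], []
    if head is not None:
        clauses.append("head = ?")
        params.append(head)
    if tail is not None:
        clauses.append("tail = ?")
        params.append(tail)
    where = " AND ".join(clauses) or "1"  # no condition: read everything
    rows = db.execute(f"SELECT reading FROM facility WHERE {where}", params)
    return [row[0] for row in rows]


dictionary = load_narrowed(conn, head="と", tail="ど")
```

Only the rows returned by the query would then be transferred to the speech recognition dictionary.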

Furthermore, in the present embodiment, the recognition target vocabulary can be narrowed down according to the certainty of the recognition results for the beginning and ending of the word, so the narrowing can follow the operator's utterance state, and as a result recognition performance can be improved (a recognition result with low certainty is not used as a narrowing condition, so collation need not be performed against a vocabulary that may have been narrowed incorrectly). For example, the case in which an erroneous narrowing is performed on the basis of an erroneously recognized beginning or ending can be eliminated.

The constituent requirements of the speech recognition device according to claims 1 to 3 and of the speech recognition method according to claim 7 are satisfied in the first embodiment, and the constituent requirements of the speech recognition device according to claims 4 to 6 and of the speech recognition method according to claim 8 are satisfied in the second embodiment.

A block diagram of the first embodiment.
A flowchart showing the processing flow of the first embodiment.
A block diagram of the second embodiment.
A flowchart showing the processing flow of the second embodiment.

Explanation of symbols

101: voice input start instruction means, 102: voice input means, 103: beginning/ending speech recognition means, 104: word speech recognition means, 105: speech recognition dictionary, 106: speech recognition information database, 107: response generation means, 108: visual information display means, 109: auditory information presentation means, 110: navigation system control means, 111: vocabulary narrowing means, 301: voice input start instruction means, 302: voice input means, 303: beginning/ending speech recognition means, 304: word speech recognition means, 305: speech recognition dictionary, 306: recognition probability analysis means, 307: speech recognition information database, 308: response generation means, 309: visual information display means, 310: auditory information presentation means, 311: navigation system control means, 312: vocabulary narrowing means.

Claims

A speech recognition device that recognizes vocabulary input by speech, comprising:
voice input means for converting an input voice into a speech signal and outputting it; beginning/ending speech recognition means capable of recognizing at least one of the first syllable and the last syllable of a word in the speech signal output by the voice input means; vocabulary narrowing means for narrowing down the range of the recognition target vocabulary using the recognition result of the beginning/ending speech recognition means; and word speech recognition means for recognizing the vocabulary in the speech signal using the recognition target vocabulary within the range narrowed down by the vocabulary narrowing means.

The speech recognition device according to claim 1,
wherein the beginning/ending speech recognition means recognizes only one or both of the first syllable and the last syllable of a word in the speech signal.

The speech recognition device according to claim 2,
wherein, when the number of recognition target words within the range narrowed down by the vocabulary narrowing means, using the result of the beginning/ending speech recognition means recognizing only one of the first syllable and the last syllable in the speech signal, exceeds a predetermined number, the beginning/ending speech recognition means recognizes the other of the first syllable and the last syllable in the speech signal, and the vocabulary narrowing means further narrows down the range of the recognition target vocabulary using that recognition result as well.

A speech recognition device that recognizes vocabulary input by speech, comprising:
voice input means for converting an input voice into a speech signal and outputting it; beginning/ending speech recognition means capable of recognizing at least one of the first syllable and the last syllable of a word in the speech signal output by the voice input means; recognition probability analysis means for calculating the certainty of the recognition result for the first syllable and the certainty of the recognition result for the last syllable obtained by the beginning/ending speech recognition means; vocabulary narrowing means for narrowing down the range of the recognition target vocabulary using one or both of the first syllable and the last syllable for which the calculated certainty of the recognition result is equal to or greater than a predetermined criterion; and word speech recognition means for recognizing the vocabulary in the speech signal using the recognition target vocabulary within the range narrowed down by the vocabulary narrowing means.

The speech recognition device according to claim 4,
wherein the certainty of the recognition result is the recognition likelihood obtained when the recognition result is produced.

The speech recognition device according to any one of claims 1 to 5,
wherein the range of the recognition target vocabulary is narrowed down by category classification either before the beginning/ending speech recognition means performs syllable recognition, or after that syllable recognition and before the word speech recognition means starts recognizing the vocabulary.

A speech recognition method for recognizing vocabulary input by speech, comprising:
converting an input voice into a speech signal; recognizing one or both of the first syllable and the last syllable of a word in the speech signal; narrowing down the range of the recognition target vocabulary using the recognition result; and recognizing the vocabulary in the speech signal using the recognition target vocabulary within the narrowed range.

A speech recognition method for recognizing vocabulary input by speech, comprising:
converting an input voice into a speech signal; recognizing one or both of the first syllable and the last syllable of a word in the speech signal; calculating the certainty of each recognition result; narrowing down the range of the recognition target vocabulary using one or both of the first syllable and the last syllable for which the calculated certainty is equal to or greater than a predetermined criterion; and recognizing the vocabulary in the speech signal using the recognition target vocabulary within the narrowed range.