JP2003099089A

JP2003099089A - Speech recognition/synthesis device and method

Info

Publication number: JP2003099089A
Application number: JP2001287049A
Authority: JP
Inventors: Hiroyuki Kanza (勘座 浩幸)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-09-20
Filing date: 2001-09-20
Publication date: 2003-04-04

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition/synthesis device that does not require keyword vocabulary to be registered in advance and that achieves high speech recognition accuracy. SOLUTION: The device comprises a text analysis section 1 that analyzes an input character string, a speech synthesis section 2 that generates synthesized speech corresponding to the character string based on the analysis result of section 1, a recognition vocabulary generating section 3 that generates the vocabulary to be used for speech recognition based on the analysis result of section 1, a vocabulary storage section 4 that stores the vocabulary generated by section 3, and a speech recognition section 5 that recognizes input speech by referring to the vocabulary stored in section 4. In essence, the device restricts speech recognition to the words that are read aloud in synthesized speech.

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention: The present invention relates to a speech recognition/synthesis apparatus and method that perform speech recognition processing for recognizing input speech and speech synthesis processing for converting a text character string into speech, and to a program for realizing such an apparatus and method.

[0002]

Description of the Related Art: With the spread of information processing devices such as personal computers and the progress of speech synthesis technology, there are more and more situations in which character strings are converted into speech. Because information can be obtained by ear without looking at characters, this is effective when the display area is limited, and it also makes it possible to receive information by voice while doing other work.

[0003] However, one-way processing that merely synthesizes and outputs speech limits the range of possible uses. Applications are expected to expand further if there is an element of interaction in which the user acts on the synthesized speech. Means of interacting with the user include pressing a button or selecting from a displayed menu by key input, but having the user respond by voice to synthesized speech is natural and highly convenient for the user.

[0004] Many techniques in which a user answers a spoken inquiry by voice have been developed as spoken dialogue technology. However, reading an arbitrary text aloud by text-to-speech synthesis and recognizing the speech that the user utters in response is a different technique from spoken dialogue.

[0005] One technique for performing speech recognition with respect to text read aloud by speech synthesis is disclosed in Japanese Patent Laid-Open No. 6-338946. According to the disclosed technique, speech can be replayed from a position desired by the user in accordance with information about a keyword that the user utters while a mail message is being output as speech.

[0006] For example, while the synthesized speech "... This is a notice about the meeting. Er, the date and time is from 9:00 on April 1. The place is ..." is being output, the utterance "date and time" can be recognized and the synthesized speech from "date and time" onward can be output again.

[0007] Implementing the technique of Japanese Patent Laid-Open No. 6-338946 requires the components shown in FIG. 12. The text analysis unit 51 performs morphological analysis on the input character string to obtain reading information and the like. The speech synthesis unit 52 obtains prosody information based on the result of the text analysis and generates synthesized speech. The vocabulary storage unit 54 stores, as a speech recognition dictionary, the keywords that are targets for speech re-output. The speech recognition unit 55 performs speech recognition by referring to the dictionary stored in the vocabulary storage unit 54.

[0008] The vocabulary registered in the vocabulary storage unit 54 may be only the minimum needed for the speech re-output operation, or it may be a large, unrestricted vocabulary. In either case, a fixed, predetermined vocabulary is registered in the dictionary, and the speech recognition unit 55 obtains a recognition result by matching the input speech against the registered vocabulary.

[0009]

Problems to Be Solved by the Invention: However, the technique disclosed in the above-mentioned Japanese Patent Laid-Open No. 6-338946 has the problem that the keyword vocabulary must be registered in advance. That publication also states that a speech recognition device with an unrestricted vocabulary may be used, but without a vocabulary restriction recognition performance drops and the device becomes impractical. Furthermore, because speech recognition is used only for re-outputting synthesized speech, it cannot be used for other purposes, such as selecting one item from a menu read aloud in synthesized speech.

[0010] Accordingly, an object of the present invention is to provide a speech recognition/synthesis apparatus and method, and a program therefor, that do not require keyword vocabulary to be registered in advance, that perform speech recognition with high accuracy, and that allow speech recognition to be used for multiple purposes.

[0011]

Means for Solving the Problems: The present invention basically solves the above problems by restricting speech recognition to the words that have been read aloud in synthesized speech.

[0012] Specifically, a speech recognition/synthesis apparatus according to one aspect of the present invention comprises: a text analysis unit that analyzes an input character string; a speech synthesis unit that generates synthesized speech corresponding to the character string based on the analysis result of the text analysis unit; a recognition vocabulary generation unit that generates the vocabulary used for speech recognition based on the analysis result of the text analysis unit; a vocabulary storage unit that stores the vocabulary generated by the recognition vocabulary generation unit; and a speech recognition unit that recognizes input speech by referring to the vocabulary stored in the vocabulary storage unit.

[0013] In this speech recognition/synthesis apparatus, the vocabulary used for speech recognition is generated from the character string input to the text analysis unit, and input speech is recognized by referring to this vocabulary. Therefore, unlike the conventional technique described above, the keyword vocabulary does not need to be registered in advance. Moreover, because the recognition vocabulary is generated from the character string input to the text analysis unit, words that were read aloud in synthesized speech are recognized accurately when the user speaks them.
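The arrangement described in paragraphs [0012] and [0013] can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the names `VocabularyStore`, `analyze_text`, `generate_vocabulary`, and `recognize` are hypothetical, whitespace splitting stands in for real text analysis, and the recognizer is reduced to choosing the best-scoring hypothesis among words that are in the stored vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class VocabularyStore:
    """Stands in for the vocabulary storage unit (hypothetical name)."""
    words: set = field(default_factory=set)


def analyze_text(text):
    # Toy text analysis: split on whitespace. The specification uses
    # morphological analysis; this is an assumption made for brevity.
    return text.split()


def generate_vocabulary(tokens, store):
    # Recognition vocabulary generation: register every analyzed word.
    store.words.update(tokens)


def recognize(hypotheses, store):
    """Return the best-scoring hypothesis that is in the stored vocabulary.

    hypotheses: list of (word, score) pairs from some acoustic front end
    (assumed to exist; not modeled here). Returns None if nothing matches.
    """
    allowed = [(word, score) for word, score in hypotheses if word in store.words]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]
```

Because only words from the text being read aloud are registered, an out-of-vocabulary hypothesis with a higher score (for example "data" versus "date") is simply never considered, which is the source of the accuracy gain the paragraph describes.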

[0014] The speech recognition/synthesis apparatus may include vocabulary expansion means that expands the vocabulary stored in the vocabulary storage unit based on similarity between words, so that words other than those in the input character string are also stored in the vocabulary storage unit.

[0015] The speech recognition/synthesis apparatus may further include a language likelihood estimation unit that obtains a language likelihood. In this case, the speech recognition unit performs speech recognition taking the language likelihood into account. The language likelihood can be obtained from the time-series order of the generated vocabulary, or from the importance of each word (the probability that it will be uttered). In either case, taking the language likelihood into account during speech recognition improves recognition performance.

[0016] In one embodiment, the speech recognition/synthesis apparatus further includes a menu selection unit. One or more menu items serving as options are input to the text analysis unit, and the menu selection unit selects one of these menu items based on the recognition result of the speech recognition unit. That is, speech recognition is used here for menu selection.

[0017] In this case, the speech recognition/synthesis apparatus may further include a correspondence storage unit that stores each menu item in association with the one or more words that make it up. The menu selection unit can then perform menu selection if the speech recognition result of the speech recognition unit at least partially matches any of the menu items input to the text analysis unit.

[0018] In another embodiment, the speech recognition/synthesis apparatus further includes a link information acquisition unit. Hypertext is input to the text analysis unit, and the link information acquisition unit obtains the information at the link destination of the hypertext based on the speech recognition result of the speech recognition unit. Here, speech recognition is used for following links.

[0019] In this case, the speech recognition/synthesis apparatus may further include a correspondence storage unit that stores each link character string name in association with the one or more words that make it up. The link information acquisition unit can then acquire the link destination information if the speech recognition result of the speech recognition unit at least partially matches any of the link character string names input to the text analysis unit.

[0020] The speech synthesis unit may include means for changing the speech synthesis style depending on whether or not a link is present. Because link information and other text information are output in different synthesized voices, the user can tell from the difference in voice which character strings in the text being read aloud carry links.
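As a rough illustration of paragraphs [0018] to [0020], the sketch below splits a hypertext fragment into link and non-link segments with Python's standard `html.parser` module and assigns a different voice label to each kind. The class name `LinkSegmenter`, the function `voice_for`, and the voice labels are assumptions made for this example; the specification does not say how the hypertext is parsed or how the synthesis style differs.

```python
from html.parser import HTMLParser


class LinkSegmenter(HTMLParser):
    """Collects (text, href) segments; href is None outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._href = None  # href of the <a> element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((text, self._href))


def voice_for(segment):
    # Hypothetical voice labels: link text is read in a different voice.
    _text, href = segment
    return "link_voice" if href else "normal_voice"
```

The link texts collected this way could also serve as the recognition vocabulary, with the stored `href` giving the destination to fetch when the user speaks the link name.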

[0021] The speech recognition/synthesis apparatus of the present invention having any of the above configurations can be applied to a portable terminal.

[0022] Furthermore, the speech recognition/synthesis apparatus of the present invention having any of the above configurations can also be applied to a communication system comprising a server and a terminal. In this case, according to another aspect of the present invention, among the components of the speech recognition/synthesis apparatus, the text analysis unit and the recognition vocabulary generation unit are provided on the server side, while the remaining components are provided on the terminal side.

[0023] The present invention further provides a speech recognition/synthesis method comprising the steps of: analyzing an input character string; generating synthesized speech corresponding to the character string based on the result of the analysis; generating the vocabulary used for speech recognition based on the result of the analysis; storing the generated vocabulary; and recognizing input speech by referring to the stored vocabulary.

[0024] The present invention still further provides a program that causes a computer to execute each step of the above speech recognition/synthesis method, and a computer-readable recording medium on which this program is recorded.

[0025]

Embodiments of the Invention: The present invention will now be described in detail with reference to the embodiments shown in the drawings.

[0026] <First Embodiment> FIG. 1 is a block diagram showing the configuration of the speech recognition/synthesis apparatus of the first embodiment. The text analysis unit 1 performs morphological analysis on the input character string to obtain reading information and the like. The speech synthesis unit 2 obtains prosody information based on the result of the text analysis and generates synthesized speech. The recognition vocabulary generation unit 3 generates a recognition vocabulary usable for speech recognition based on the result of the text analysis, and creates a dictionary containing this recognition vocabulary. The vocabulary storage unit 4 stores the dictionary created by the recognition vocabulary generation unit 3 and is referred to by the speech recognition unit 5 during speech recognition. The speech recognition unit 5 performs speech recognition by referring to the dictionary recorded in the vocabulary storage unit 4. This speech recognition/synthesis apparatus is configured as a computer loaded with a program for performing the processing shown in FIG. 6.

[0027] Comparing the configuration of the first embodiment shown in FIG. 1 with that of the conventional speech recognition/synthesis apparatus shown in FIG. 12 shows that the first embodiment newly adds the recognition vocabulary generation unit 3, which generates the speech recognition vocabulary from the text analysis result. The vocabulary storage unit 54 in FIG. 12 holds either predetermined keywords or a large, unrestricted vocabulary. In the first embodiment, because the recognition vocabulary generation unit 3 generates the recognition vocabulary from the result of the text analysis, a dictionary can be created that corresponds one-to-one with the words of the text being read aloud. Speech recognition can therefore be performed with the minimum necessary recognition vocabulary, which makes highly accurate recognition possible.

[0028] The flow of processing in the speech recognition/synthesis apparatus of FIG. 1 will be described with reference to the flowchart of FIG. 6.

[0029] First, one sentence is acquired (S1), and the text analysis unit 1 performs text analysis (S2). The text analysis unit 1 performs linguistic analysis of the input character string and divides it into its constituent morphemes. If there are multiple segmentation candidates, all of them are output. Each candidate is given a likelihood representing how probable that candidate is.

[0030] The text analysis technique required for speech synthesis is generally morphological analysis. Common approaches to morphological analysis include the rightward longest-match method and methods using a connection table, and they are described in the literature, for example 『自然言語解析の基礎』 (Fundamentals of Natural Language Analysis; Hozumi Tanaka, Sangyo Tosho, 1989).

[0031] Next, readings are assigned to the segmented morphemes. If a morpheme has multiple readings, all of them are output. Each reading is given a likelihood representing how probable that reading is.

[0032] For example, given the character string 「明日の天気」 ("tomorrow's weather"), it is divided into the morphemes 「明日」 (noun, "tomorrow"), 「の」 (particle), and 「天気」 (noun, "weather"). 「明日」 has two readings, 「あす」 (asu) and 「あした」 (ashita), and a likelihood is given to each.

[0033] Next, the recognition vocabulary is generated based on the result of the text analysis (S3). The information required for speech recognition is generated here. This includes information on readings, information on the likelihood of each reading, and information on the adjacency probability between morphemes.

[0034] Information on readings means, for example, that the phoneme symbol strings for the written form 「明日」 are "a s u" and "a sh i t a". Information on the likelihood of readings means how probable it is that the written form 「明日」 is read as "a s u" or as "a sh i t a". The adjacency probability between morphemes means how probable it is that 「の」 follows 「明日」.
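The three kinds of information named in paragraphs [0033] and [0034] (readings, reading likelihoods, and adjacency probabilities) could be held in structures like the ones sketched below. This is an assumed layout, not the format used in the specification: the `Morpheme` class, the function names, and in particular the numeric likelihoods 0.6 and 0.4 are made up for illustration, and the adjacency probability is estimated naively from the order of morphemes in the analyzed text.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str    # written form, e.g. "明日"
    pos: str        # part of speech
    readings: dict  # phoneme string -> likelihood (values here are invented)


def build_lexicon(morphemes):
    """Register every reading of every morpheme as (surface, phonemes, likelihood)."""
    return [(m.surface, phonemes, likelihood)
            for m in morphemes
            for phonemes, likelihood in m.readings.items()]


def adjacency_probabilities(morphemes):
    """Estimate P(next | current) from the order in which morphemes occur."""
    counts = {}
    for cur, nxt in zip(morphemes, morphemes[1:]):
        following = counts.setdefault(cur.surface, {})
        following[nxt.surface] = following.get(nxt.surface, 0) + 1
    return {cur: {nxt: n / sum(following.values()) for nxt, n in following.items()}
            for cur, following in counts.items()}


# "明日の天気" analyzed into three morphemes (likelihood values are hypothetical).
sentence = [
    Morpheme("明日", "noun", {"a s u": 0.6, "a sh i t a": 0.4}),
    Morpheme("の", "particle", {"n o": 1.0}),
    Morpheme("天気", "noun", {"t e N k i": 1.0}),
]
```

With a single short sentence every observed adjacency gets probability 1.0; a real system would presumably smooth these estimates, which is outside what the text specifies.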

[0035] Next, it is determined whether there has been input from the microphone (S4). The determination may be based solely on the level of the input signal. Alternatively, a talk switch may be provided, and the input during the several seconds after it is pressed may be treated as input for speech recognition. The input received while the talk switch is held down may also be treated as input for speech recognition.
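The three alternatives for the determination in S4 can be sketched as follows. This is a simplified illustration: the function names, the use of RMS as the "level" of the input signal, and the default window of 3.0 seconds for "several seconds" are all assumptions, since the specification gives no concrete measure or duration.

```python
import math


def rms_level(samples):
    """Root-mean-square level of a block of audio samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_recognition_input(mode, samples=(), threshold=0.0,
                         switch_held=False, seconds_since_press=None,
                         window_seconds=3.0):
    """Decide whether the current input should be passed to speech recognition.

    mode "level":         input level alone decides.
    mode "switch_window": input within window_seconds after a talk-switch press.
    mode "switch_hold":   input while the talk switch is held down.
    """
    if mode == "level":
        return rms_level(samples) >= threshold
    if mode == "switch_window":
        return (seconds_since_press is not None
                and 0.0 <= seconds_since_press <= window_seconds)
    if mode == "switch_hold":
        return switch_held
    raise ValueError(f"unknown mode: {mode}")
```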

[0036] If it is determined that there has been input from the microphone, speech recognition processing is performed (S7). Speech recognition methods are described in detail in the literature, for example 『ディジタル音声処理』 (Digital Speech Processing; Furui, Tokai University Press, 1985).

[0037] Speech recognition is performed against the recognition vocabulary generated by the recognition vocabulary generation unit 3, and the recognition result is presented (S8). The result may be presented on a screen or output as synthesized speech. When the result is output as synthesized speech, the text currently being read aloud can be paused and then resumed as soon as presentation of the recognition result is finished.

[0038] The analysis result obtained by the text analysis unit 1 is used not only to create the speech recognition dictionary but also for speech synthesis. In speech synthesis the reading must be determined uniquely, so for a word with multiple readings, such as 「明日」, only the single candidate with the highest likelihood is selected. This differs from recognition vocabulary generation, in which all of the readings are registered as vocabulary.
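The contrast drawn in paragraph [0038] comes down to taking the single most likely reading for synthesis while keeping every reading for recognition. A minimal sketch, with hypothetical function names and invented likelihood values:

```python
def reading_for_synthesis(readings):
    """Pick the single most likely reading (synthesis needs exactly one)."""
    return max(readings, key=readings.get)


def readings_for_recognition(readings):
    """Keep every reading, most likely first (recognition registers them all)."""
    return sorted(readings, key=readings.get, reverse=True)


# Readings of "明日" with made-up likelihoods.
ashita_readings = {"a s u": 0.6, "a sh i t a": 0.4}
```

Registering both readings means the user is understood whichever way they choose to pronounce the word, even though the synthesizer itself only ever says one of them.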

[0039] Based on the likelihoods of the obtained words and readings, the text analysis unit 1 sends information on the most probable reading to the speech synthesis unit 2. The speech synthesis unit 2 obtains prosody information, then generates and outputs synthesized speech (S5). Speech synthesis techniques are described in the literature, for example 『ディジタル音声処理』 (Digital Speech Processing; Furui, Tokai University Press, 1985).

[0040] If the determination in S6 finds that there is a next sentence, the process returns to S1 and the same processing is repeated. If there is no next sentence, the process ends.
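The flow of steps S1 to S8 described in paragraphs [0029] to [0040] might be arranged as in the following sketch. FIG. 6 itself is not reproduced in this text, so the exact ordering of the branches (in particular, that synthesis S5 follows the recognition branch S7/S8 within the same pass) is an assumption, as are the function name `run` and the callback parameters, which simply stand in for the units of FIG. 1.

```python
def run(sentences, analyze, build_vocab, mic_input, recognize, present, synthesize):
    """Process sentences one by one, following the step numbers in the text."""
    for sentence in sentences:                # S1: acquire one sentence
        analysis = analyze(sentence)          # S2: text analysis
        vocab = build_vocab(analysis)         # S3: recognition vocabulary generation
        audio = mic_input()                   # S4: was there microphone input?
        if audio is not None:
            result = recognize(audio, vocab)  # S7: speech recognition
            present(result)                   # S8: present the recognition result
        synthesize(analysis)                  # S5: generate and output synthesized speech
    # S6: the loop ends when there is no next sentence
```

Passing the units in as callables keeps the sketch independent of any particular analyzer, recognizer, or synthesizer.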

[0041] By providing the recognition vocabulary generation unit 3 shown in FIG. 1 with a vocabulary expansion function, vocabulary other than the character strings read aloud by text-to-speech synthesis can also be registered as recognition vocabulary. The vocabulary expansion function is described below.

[0042] A dictionary describing the similarity between words is generated in advance and stored in a storage unit (not shown). Lexical similarity expresses how similar word A and word B are semantically and pragmatically.

[0043] The 分類語彙表 (Bunrui Goihyo, a classified vocabulary table published by the National Institute for Japanese Language) classifies synonymous words hierarchically and can be regarded as one example of a dictionary describing similarity. That is, the classified vocabulary table can be used to obtain other words that are synonymous with word A.

[0044] Using this information, when word A appears in the text being read aloud, the recognition vocabulary generation unit 3 can also generate another word B that is synonymous with A as speech recognition vocabulary, thereby achieving vocabulary expansion.

[0045] Besides judging whether words A and B are themselves synonymous, the vocabulary can be expanded further by judging whether the hierarchy levels to which words A and B belong are synonymous.

[0046] The classified vocabulary table is data describing whether or not two words are synonymous, and can be regarded as expressing the degree of synonymy in binarized (discretized) form. Expressing the similarity as a continuous numerical value instead makes more flexible vocabulary expansion possible.
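The vocabulary expansion of paragraphs [0041] to [0046] can be illustrated with a continuous similarity score and a threshold, as sketched below. The similarity table, its values, the English example words, and the names `similarity` and `expand_vocabulary` are all invented for illustration; the specification does not define how similarity is computed or what threshold would be used.

```python
# Hypothetical similarity dictionary: unordered word pairs -> score in [0, 1].
SIMILARITY = {
    ("weather", "climate"): 0.9,
    ("weather", "forecast"): 0.7,
    ("weather", "news"): 0.1,
}


def similarity(a, b):
    """Look up a symmetric similarity score; unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def expand_vocabulary(words, candidates, threshold=0.5):
    """Add every candidate whose similarity to some word reaches the threshold."""
    expanded = set(words)
    for word in words:
        for candidate in candidates:
            if similarity(word, candidate) >= threshold:
                expanded.add(candidate)
    return expanded
```

A binary synonym table such as the one in paragraph [0043] corresponds to scores of exactly 0 or 1; with continuous scores, lowering the threshold trades a larger vocabulary against the recognition accuracy that a small vocabulary provides.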

[0047] <Second Embodiment> FIG. 2 is a block diagram showing the configuration of the speech recognition/synthesis apparatus of the second embodiment. The text analysis unit 11, speech synthesis unit 12, recognition vocabulary generation unit 13, vocabulary storage unit 14, and speech recognition unit 15 have the same functions as the components 1, 2, 3, 4, and 5 shown in FIG. 1, respectively. The apparatus of the second embodiment differs from that of FIG. 1 in that it includes a menu selection unit 16.

[0048] The menu selection unit 16 performs menu selection based on the recognition result of the speech recognition unit 15.

[0049] For example, suppose there is a device (computer) whose menu is "today's news, tomorrow's weather, today's TV, ...", where selecting "today's news" provides news information and selecting "tomorrow's weather" provides weather information. Incorporating the speech recognition/synthesis apparatus of this embodiment into such a device makes it possible both to read the menu aloud and to select menu items by voice. Specifically, the speech synthesis unit 12 reads aloud the menu "today's news, tomorrow's weather, today's TV, ...", and when the user listening to it says "tomorrow's weather", the speech recognition unit 15 recognizes that vocabulary and, based on the recognition result, the menu selection unit 16 selects the menu item "tomorrow's weather". As a result, information such as "Tomorrow's weather is sunny" can be presented.

[0050] The information may be presented by displaying characters on a screen or by generating and outputting synthesized speech.

[0051] <Third Embodiment> FIG. 3 is a block diagram showing the configuration of the speech recognition/synthesis apparatus of the third embodiment. The text analysis unit 21, speech synthesis unit 22, recognition vocabulary generation unit 23, vocabulary storage unit 24, speech recognition unit 25, and menu selection unit 26 have the same functions as the corresponding components 11, 12, 13, 14, 15, and 16 of FIG. 2. The third embodiment differs from the second embodiment in that it further includes a correspondence storage unit 27 and in that the menu selection unit 26 selects menu items based on data from the correspondence storage unit 27.

[0052] The correspondence storage unit 27 stores the correspondence between each menu item and the words that make up that menu item.

[0053] By referring to the correspondence storage unit 27, the menu selection unit 26 can select a menu item even when the user utters only some of the words that make up that item. For example, as in the second embodiment, suppose there is a device whose menu is "today's news, tomorrow's weather, today's TV", where selecting the item "today's news" provides news information and selecting the item "tomorrow's weather" provides weather information. When the speech recognition/synthesis apparatus of this embodiment is incorporated into this device, the correspondence storage unit 27 stores the correspondence between the menu items 「今日のニュース」 ("today's news"), 「明日の天気」 ("tomorrow's weather"), and 「今日のテレビ」 ("today's TV") and the words that make up each item: 「今日」「の」「ニュース」, 「明日」「の」「天気」, and 「今日」「の」「テレビ」. An example of the contents of the correspondence storage unit 27 in this case is shown in FIG. 7.

[0054] When the user says "news", the speech recognition unit 25 recognizes the word "news", and the constituent words recorded in the correspondence storage unit 27 are looked up. Because the menu item corresponding to "news" is "today's news", this menu item is output from the correspondence storage unit 27 and selected by the menu selection unit 26.

[0055] When a menu is read out by voice, the user is usually not aware of what the unit of the menu (the menu item) is, that is, whether it is "today's news" or simply "news". It is also difficult to utter an overly long menu item without error. It is therefore useful that a menu item can be recognized even when only some of the words that constitute it are uttered, as described above.

[0056] However, a single word may be contained in more than one menu item. For example, the word "today" appears in both "today's news" and "today's TV". This case can be handled by telling the user that there are several candidates and providing a mechanism for selecting again.
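The word-to-item lookup and the ambiguous case described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_correspondence` and `select_menu`, the English glosses used as words, and the three-way result tuple are all assumptions made for the example, and the word segmentation is assumed to come from the text analysis unit.

```python
from collections import defaultdict


def build_correspondence(menu_items):
    """Map each constituent word to the menu items that contain it.

    menu_items: dict of menu item -> list of constituent words
    (hypothetical stand-in for the contents of the correspondence storage unit).
    """
    index = defaultdict(list)
    for item, words in menu_items.items():
        for word in words:
            if item not in index[word]:
                index[word].append(item)
    return index


def select_menu(index, recognized_word):
    """Return ("selected", item), ("ambiguous", items) or ("unknown", [])."""
    candidates = index.get(recognized_word, [])
    if len(candidates) == 1:
        return ("selected", candidates[0])
    if candidates:
        # Several items share this word: the caller should ask the user again.
        return ("ambiguous", list(candidates))
    return ("unknown", [])


menu_items = {
    "today's news": ["today", "no", "news"],
    "tomorrow's weather": ["tomorrow", "no", "weather"],
    "today's TV": ["today", "no", "TV"],
}
index = build_correspondence(menu_items)
print(select_menu(index, "news"))
print(select_menu(index, "today"))
```

In this sketch, a word such as "news" resolves directly to one item, while "today" yields a list of candidates that could be read back to the user for a second selection.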

[0057] <Fourth Embodiment> FIG. 4 is a block diagram showing the configuration of a speech recognition/synthesis device according to the fourth embodiment. The text analysis unit 31, the speech synthesis unit 32, the recognition vocabulary generation unit 33, the vocabulary storage unit 34 and the speech recognition unit 35 have the same functions as the corresponding components 1, 2, 3, 4 and 5 shown in FIG. 1. The fourth embodiment differs from the first embodiment in that it includes a language likelihood estimation unit 38.

[0058] The language likelihood estimation unit 38 obtains a language likelihood based on the time-series order of the generated vocabulary. The obtained language likelihood is written to the vocabulary storage unit 34 and referred to by the speech recognition unit 35.

[0059] First, the time-series order of the vocabulary is described. The time-series order is the order in which words appear in the character string; viewed from a given point during read-out, it is an index of how far in the past each previously read word lies.

[0060] For example, suppose the character string "today's news, tomorrow's weather, today's TV, ..." is input and the word "news" is currently being read out. As shown in FIG. 8, time-series order numbers are assigned so that "today", the first word read out, receives 1, and "news", the third word, receives 3. In this example, a larger number means that the word was read out more recently. When synthesized words serve as menu options, a word with a larger time-series order number is more likely to be uttered. The rationale is that recently read words are better remembered than words read in the distant past and are therefore uttered preferentially. This tendency is even stronger in the assumed usage in which the user listens to the synthesized speech while doing other work and, on hearing something of interest, obtains the information by repeating the word back like a parrot.
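The numbering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `time_series_order` is hypothetical, the words are English glosses of the example, and the choice to keep the most recent position for a word that is read more than once (such as the particle "no") is an assumption that the text does not specify.

```python
def time_series_order(words_read):
    """Assign 1-based read-out positions; a larger number means more recent.

    For a word read more than once, the latest position is kept (assumption).
    """
    order = {}
    for position, word in enumerate(words_read, start=1):
        order[word] = position
    return order


# State immediately after "news" has been read out.
order_t0 = time_series_order(["today", "no", "news"])
print(order_t0)
```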

[0061] A mechanism is therefore needed that gives priority to words read out in the near past over words read out in the distant past. The language likelihood is used for this purpose. It is adjusted according to the difference in time-series order between the word currently being read out and a word that has already been read out: the smaller the difference, the higher the language likelihood, and the larger the difference, the lower the language likelihood.

[0062] An example of calculating the language likelihood follows. In general, the language likelihood can be obtained from the appearance probability of a word. Let N be the total number of words in the training data and C(w) the number of times the word w appears in the training data. The appearance probability P(w) is then

P(w) = C(w) / N

Here, this appearance probability is corrected using the time-series order information described above. Let α(w, t) be the correction value for the word w at a time t. α is calculated from the time-series order of the word w. The time-series order is the order in which words are read out by text-to-speech synthesis, and a smaller number denotes a word spoken further in the past. α(w, t) is therefore set to be smaller the smaller the time-series order of w is.

[0063] FIG. 8 shows an example of the time-series order of words at a time t0, immediately after the word "news" in the character string "today's news, tomorrow's weather, today's TV, ..." has been read out. The correction values α then satisfy

α("today", t0) < α("no", t0) < α("news", t0)

Using α, the appearance probability of a word at a time t during read-out of the synthesized speech is defined as follows. This definition is only one example of obtaining the appearance probability, and the invention is not limited to it.

P(w, t) = λ{(C(w) + α(w, t)) / N}

Here, λ is a constant that makes the probability values sum to 1.
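As a worked step, the value of λ follows from the normalization condition. The set V below, standing for the words over which the probabilities are normalized (for instance the stored recognition vocabulary), is a symbol introduced here for illustration; the text does not say which set is intended.

```latex
\sum_{w \in V} P(w,t)
  = \frac{\lambda}{N} \sum_{w \in V} \bigl( C(w) + \alpha(w,t) \bigr) = 1
\quad\Longrightarrow\quad
\lambda = \frac{N}{\sum_{w \in V} \bigl( C(w) + \alpha(w,t) \bigr)}
```

Under this reading, λ is the same for every word w but changes with the time t, because α(w, t) does.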

[0064] The above appearance probability takes a different value at a time t1 that differs from t0. For example, FIG. 9 shows the time-series state at a time t1, immediately after the word "weather" has been read out. Because the word "news" now lies further in the past than it did at t0, the following relationship holds.

α("news", t0) > α("news", t1)

As described above, by obtaining an appearance probability that takes the time-series state into account and using it as the language likelihood, words that are more likely to be uttered receive a higher appearance probability, so a highly accurate speech recognition function can be realized.
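The behaviour at t0 and t1 can be checked numerically with a small sketch. Everything specific here is an assumption: the text only requires that α grow with the time-series order, so the form α(w, t) = order(w) / (number of words read so far) is one arbitrary choice, and the counts C(w), the total N and the function names are made up for the example.

```python
def time_series_order(words_read):
    """1-based read-out positions; the latest position wins for repeated words."""
    order = {}
    for position, word in enumerate(words_read, start=1):
        order[word] = position
    return order


def alpha(order, word, now):
    """Recency correction in (0, 1]; larger for more recently read words (assumed form)."""
    return order[word] / now


def appearance_probabilities(counts, total_words, words_read):
    """P(w, t) = lambda * (C(w) + alpha(w, t)) / N, normalized over the words read so far."""
    order = time_series_order(words_read)
    now = len(words_read)
    raw = {w: (counts.get(w, 0) + alpha(order, w, now)) / total_words for w in order}
    lam = 1.0 / sum(raw.values())  # makes the probabilities sum to 1
    return {w: lam * p for w, p in raw.items()}


counts = {"today": 5, "no": 5, "news": 5, "tomorrow": 5, "weather": 5}  # hypothetical C(w)
N = 100  # hypothetical total number of words in the training data

read_t0 = ["today", "no", "news"]
read_t1 = ["today", "no", "news", "tomorrow", "no", "weather"]
p_t0 = appearance_probabilities(counts, N, read_t0)
p_t1 = appearance_probabilities(counts, N, read_t1)

alpha_news_t0 = alpha(time_series_order(read_t0), "news", len(read_t0))
alpha_news_t1 = alpha(time_series_order(read_t1), "news", len(read_t1))
print(alpha_news_t0, alpha_news_t1)
print(p_t0)
```

With equal counts, the ranking of the probabilities at t0 is decided entirely by recency, which is the effect the paragraph describes.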

[0065] A method that takes the time-series state into account has been described above as the way the language likelihood estimation unit 38 obtains the language likelihood, but other methods exist. One is to assign in advance, in the dictionary used for text analysis, a value representing the importance rank of each word, and to obtain the language likelihood from that value. Words that are likely to be uttered, that is, words of high importance, are given a high rank value, and reflecting that value in the language likelihood gives each recognition vocabulary entry a language likelihood that matches its importance. Words of higher importance thus become easier to recognize, which further improves recognition performance.
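The rank-based alternative can be sketched as a simple normalization of rank values. The rank numbers, the default rank for words missing from the dictionary and the name `likelihood_from_rank` are all assumptions for illustration; the text does not say how a rank is converted into a likelihood.

```python
def likelihood_from_rank(ranks, vocabulary, default_rank=1):
    """Turn per-word importance ranks into likelihoods that sum to 1.

    ranks: dict of word -> importance rank from the analysis dictionary (hypothetical values).
    Words without a rank fall back to default_rank (assumption).
    """
    weights = {w: ranks.get(w, default_rank) for w in vocabulary}
    total = sum(weights.values())
    return {w: weights[w] / total for w in vocabulary}


ranks = {"news": 3, "weather": 3, "TV": 3, "no": 1}
likelihood = likelihood_from_rank(ranks, ["today", "no", "news"])
print(likelihood)
```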

[0066] <Fifth Embodiment> FIG. 5(a) is a block diagram showing the configuration of a speech recognition/synthesis device according to the fifth embodiment. The text analysis unit 41, the speech synthesis unit 42, the recognition vocabulary generation unit 43, the vocabulary storage unit 44, the speech recognition unit 45 and the correspondence storage unit 47 have the same functions as the corresponding components 21, 22, 23, 24, 25 and 27 shown in FIG. 3. The fifth embodiment differs from the third embodiment mainly in that it includes a link information acquisition unit 46 in place of the menu selection unit.

[0067] In the fifth embodiment, a document written in hypertext is obtained, for example over the WWW (World Wide Web), and its character strings are read out as synthesized speech. A user listening to the speech can obtain the information at a link destination by uttering the character string of the anchor portion to which the link is attached (hereinafter, "link character string"). For this purpose, the speech recognition/synthesis device includes the link information acquisition unit 46, and the correspondence storage unit 47 stores the correspondence between each link character string and the character strings (words) that constitute it.

[0068] Suppose the text "today's news, tomorrow's weather, today's TV, ..." is hypertext obtained on the WWW and today's news articles are written at the link destination of "today's news". When the user utters "today's news", the content of the news can be obtained.

[0069] The following description takes a page description language typified by HTML (HyperText Markup Language) as an example. Link information refers to a description such as that shown in FIG. 10. It includes a link destination 101 and a link character string 102. In the above example, the link character string corresponds to the text "today's news, tomorrow's weather, today's TV, ..." and the link destination corresponds to the file containing the content of "today's news". The link destination 101 holds address information, typically a URL (Uniform Resource Locator). The file at this address contains the information to be obtained. In FIG. 10, 103 and 104 are link destination files. The link character string is normally the character string displayed in a WWW browser; here it is the character string that the speech synthesis unit 42 converts into speech.
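The pairing of link character strings with link destinations can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the class name `LinkExtractor`, the sample markup and the file names `news.html`, `weather.html` and `tv.html` are hypothetical and do not come from FIG. 10.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect link character string -> link destination pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only text inside an anchor belongs to a link character string.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None


# Hypothetical page; the file names are placeholders.
page = (
    "<a href='news.html'>today's news</a>, "
    "<a href='weather.html'>tomorrow's weather</a>, "
    "<a href='tv.html'>today's TV</a>"
)
extractor = LinkExtractor()
extractor.feed(page)
extractor.close()

recognized = "today's news"  # stand-in for the speech recognition result
print(extractor.links.get(recognized))
```

The resulting dictionary plays the role of the link information: once an utterance is recognized as a link character string, the matching destination can be handed to whatever component fetches the document.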

[0070] Normally, an HTML document is displayed in a WWW browser, and clicking a link character string displays the HTML document described at the link destination. In the present embodiment, the necessary information can be obtained through speech recognition and speech synthesis, independently of a WWW browser. This means that hypertext links can be followed using auditory information alone, without visual information. As a result, even while doing other work on a personal computer or the like, the user can obtain the necessary information on the WWW by listening and giving spoken instructions.

[0071] The mechanism for obtaining the information at a specified URL is the same as that used by a WWW browser and is not directly related to the present invention, so it is not described here.

[0072] To let the user determine which character strings in the text being read out carry links, it is preferable to provide the speech synthesis unit 42 with speech synthesis style changing means 42a for changing the speech synthesis style, as shown in FIG. 5(b).

[0073] With visual information, whether a character string carries a link is generally conveyed to the user by giving the character string a different color or font from the surrounding text. With speech synthesis, the same information can be conveyed by changing the speech synthesis style.

[0074] Speech synthesis styles include speakers with various voice qualities, speech rate, pitch and power. In the case of the speaker, for example, the presence or absence of a link can be conveyed by speaking linked character strings in a male voice and all other text in a female voice.

[0075] Alternatively, linked character strings can be distinguished by speaking them more slowly than other character strings, at a higher pitch, or with greater power.

[0076] Changing the speech synthesis style in this way makes it possible to tell the user whether a link is present.
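One way to picture the style change is as a per-segment plan handed to the synthesizer. The sketch below is hypothetical throughout: the parameter names, the numeric values and the function `plan_synthesis` are assumptions, and it combines several of the cues (speaker, rate, pitch, power) that the text presents as alternatives.

```python
# Hypothetical style parameters; the values are arbitrary.
NORMAL_STYLE = {"speaker": "female", "rate": 1.0, "pitch": 1.0, "power": 1.0}
LINK_STYLE = {"speaker": "male", "rate": 0.8, "pitch": 1.2, "power": 1.2}


def plan_synthesis(segments):
    """Attach a synthesis style to each (text, is_link) segment."""
    return [
        (text, dict(LINK_STYLE if is_link else NORMAL_STYLE))
        for text, is_link in segments
    ]


segments = [
    ("Here is the menu:", False),
    ("today's news", True),
    ("tomorrow's weather", True),
]
for text, style in plan_synthesis(segments):
    print(text, style)
```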

[0077] Although the correspondence storage unit 47 is provided in FIGS. 5(a) and 5(b), it may be omitted.

[0078] <Sixth Embodiment> The speech recognition/synthesis device of any of the embodiments described above, or of any of their modifications, can be mounted on a terminal device such as a mobile terminal. In that case, the text analysis unit and the recognition vocabulary generation unit may be placed on the server side and the remaining components on the terminal side. As an example of such a configuration, FIG. 11 shows a communication system in which the components of the first embodiment are divided between a server and a terminal. In the example of FIG. 11, the text analysis unit 61 and the recognition vocabulary generation unit 63 are provided on the server side, and the terminal side includes only the speech synthesis unit 62, the vocabulary storage unit 64 and the speech recognition unit 65.
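The server/terminal split can be illustrated with a small message-passing sketch. It makes several assumptions that are not in the text: the JSON message format, the names `server_prepare` and `Terminal`, and the use of whitespace splitting as a stand-in for the text analysis step (real Japanese text would need morphological analysis). Actual synthesis and recognition are replaced by trivial placeholders.

```python
import json


def server_prepare(text):
    """Server side: analyze the text and generate the recognition vocabulary."""
    words = text.replace(",", " ").split()  # stand-in for text analysis
    vocabulary = sorted(set(words))
    return json.dumps({"text": text, "vocabulary": vocabulary})


class Terminal:
    """Terminal side: stores the vocabulary and uses it for recognition."""

    def __init__(self):
        self.vocabulary = set()

    def receive(self, payload):
        message = json.loads(payload)
        self.vocabulary = set(message["vocabulary"])
        return message["text"]  # text to hand to the speech synthesis unit

    def recognize(self, heard):
        # Placeholder for recognition: accept only words in the stored vocabulary.
        return heard if heard in self.vocabulary else None


payload = server_prepare("news today, weather tomorrow")
terminal = Terminal()
text_to_speak = terminal.receive(payload)
print(text_to_speak, terminal.recognize("weather"), terminal.recognize("sports"))
```

A design consequence of this split is that the terminal needs neither the analysis dictionary nor the analysis logic; it only receives the text to speak and the vocabulary generated for it.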

[0079] The present invention has been described above through various embodiments, but the configurations of these embodiments may be combined or modified as appropriate. For example, to improve speech recognition accuracy, the language likelihood estimation unit shown in FIG. 4 may also be provided in the speech recognition/synthesis device shown in FIG. 1, FIG. 2, FIG. 3 or FIG. 5.

[0080]

[Effects of the Invention] As is clear from the above, the speech recognition/synthesis device according to the present invention need not hold predetermined keywords or a wide-ranging dictionary of words. It automatically creates the speech recognition dictionary solely from the character strings read out as synthesized speech, so high-performance speech recognition can be achieved.

[0081] Furthermore, obtaining the language likelihood with the time-series state taken into account achieves speech recognition of even higher performance.

[0082] In addition, because the same words as those read out by the synthesized speech can be recognized, operations such as menu selection can be carried out without looking at a display.

[0083] Moreover, the user can leave the synthesized speech playing while doing other work on a personal computer and, on hearing a word of interest, obtain the desired information by repeating that word back like a parrot. This enables a "while doing something else" style of use in which necessary information is obtained during other work.

[0084] By implementing the present invention, devices that could conventionally be used for only one purpose can be used for several purposes at once, providing an environment in which multiple tasks can be carried out efficiently within a given time.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech recognition/synthesis device according to a first embodiment of the present invention.

FIG. 2 is a block diagram of a speech recognition/synthesis device according to a second embodiment of the present invention.

FIG. 3 is a block diagram of a speech recognition/synthesis device according to a third embodiment of the present invention.

FIG. 4 is a block diagram of a speech recognition/synthesis device according to a fourth embodiment of the present invention.

FIG. 5(a) is a block diagram of a speech recognition/synthesis device according to a fifth embodiment of the present invention, and FIG. 5(b) is a block diagram showing a modification.

FIG. 6 is a flowchart showing an example of the processing flow of the present invention.

FIG. 7 shows an example of the correspondence storage unit.

FIG. 8 shows an example of the relationship between words and time-series order.

FIG. 9 shows an example of the relationship between words and time-series order.

FIG. 10 shows an example of link information.

FIG. 11 is a block diagram of a communication system to which the speech recognition/synthesis device of the present invention is applied.

FIG. 12 is a block diagram of a conventional speech recognition/synthesis device.

[Explanation of symbols]

1, 11, 21, 31, 41, 51, 61 … text analysis unit
2, 12, 22, 32, 42, 52, 62 … speech synthesis unit
3, 13, 23, 33, 43, 63 … recognition vocabulary generation unit
4, 14, 24, 34, 44, 54, 64 … vocabulary storage unit
5, 15, 25, 35, 45, 55, 65 … speech recognition unit
16, 26, 36 … menu selection unit
27, 37, 47 … correspondence storage unit
38 … language likelihood estimation unit
46 … link information acquisition unit
101 … link destination
102 … link character string
103, 104 … link destination files

Continuation of front page: (51) Int.Cl.7, identification code, FI, theme code (reference): G10L 15/18, G10L 3/00 E, 15/28, 551P

Claims

[Claims]

1. A speech recognition/synthesis device comprising: a text analysis unit that analyzes an input character string; a speech synthesis unit that generates synthesized speech corresponding to the character string based on an analysis result of the text analysis unit; a recognition vocabulary generation unit that generates a vocabulary to be used for speech recognition based on the analysis result of the text analysis unit; a vocabulary storage unit that stores the vocabulary generated by the recognition vocabulary generation unit; and a speech recognition unit that recognizes input speech by referring to the vocabulary stored in the vocabulary storage unit.

2. The speech recognition/synthesis device according to claim 1, further comprising a vocabulary expansion unit that expands the vocabulary stored in the vocabulary storage unit based on similarity between words, so that words other than those in the input character string are also stored in the vocabulary storage unit.

3. The speech recognition/synthesis device according to claim 1, further comprising a language likelihood estimation unit that obtains a language likelihood, wherein the speech recognition unit performs speech recognition taking the language likelihood into account.

4. The speech recognition/synthesis device according to claim 1, further comprising a menu selection unit, wherein one or more menu items serving as options are input to the text analysis unit, and the menu selection unit selects one of the one or more menu items based on a recognition result of the speech recognition unit.

5. The speech recognition/synthesis device according to claim 4, further comprising a correspondence storage unit that stores each menu item in association with one or more words constituting the menu item, wherein the menu selection unit performs menu selection when the speech recognition result of the speech recognition unit matches at least a part of a menu item input to the text analysis unit.

6. The speech recognition/synthesis device according to claim 1, further comprising a link information acquisition unit, wherein hypertext is input to the text analysis unit, and the link information acquisition unit acquires information at a link destination of the hypertext based on a speech recognition result of the speech recognition unit.

7. The speech recognition/synthesis device according to claim 6, wherein the speech synthesis unit includes means for changing the speech synthesis style depending on the presence or absence of a link.

8. The speech recognition/synthesis device according to claim 6 or 7, further comprising a correspondence storage unit that stores each link character string in association with one or more words constituting the link character string, wherein the link information acquisition unit acquires the information at the link destination when the speech recognition result of the speech recognition unit matches at least a part of a link character string input to the text analysis unit.

9. A communication system comprising the speech recognition/synthesis device according to claim 1, wherein, among the components of the speech recognition/synthesis device, the text analysis unit and the recognition vocabulary generation unit are provided on a server side, and the other components are provided on a terminal side.

10. A mobile terminal device comprising the speech recognition/synthesis device according to claim 1.

11. A speech recognition/synthesis method comprising: a step of analyzing an input character string; a step of generating synthesized speech corresponding to the character string based on a result of the analysis; a step of generating a vocabulary to be used for speech recognition based on the result of the analysis; a step of storing the generated vocabulary; and a step of recognizing input speech by referring to the stored vocabulary.

12. A program for causing a computer to execute: a step of analyzing an input character string; a step of generating synthesized speech corresponding to the character string based on a result of the analysis; a step of generating a vocabulary to be used for speech recognition based on the result of the analysis; a step of storing the generated vocabulary; and a step of recognizing input speech by referring to the stored vocabulary.

13. A computer-readable recording medium in which the program according to claim 12 is recorded.