JP2001343992A

JP2001343992A - Method and device for learning voice pattern model, computer readable recording medium with voice pattern model learning program recorded, method and device for voice recognition, and computer readable recording medium with its program recorded

Info

Publication number: JP2001343992A
Application number: JP2000162964A
Authority: JP
Inventors: Toshiyuki Hanazawa; 利行花沢
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-05-31
Filing date: 2000-05-31
Publication date: 2001-12-14
Anticipated expiration: 2020-05-31
Also published as: JP4004716B2

Abstract

PROBLEM TO BE SOLVED: To solve the problem that proper speech pattern models cannot be provided for conversational speech. SOLUTION: An m-phoneme set extraction part 10 and a model learning part 3 are provided. The extraction part 10 uses read-speech m-phoneme set models to recognize the 3-phoneme sets held in a conversational speech learning data memory 8 and extracts the m-phoneme sets with low recognition rates. The model learning part 3 uses the time series of feature vectors of the tokens held in the memory 8 to learn conversational-speech m-phoneme set models only for the m-phoneme sets extracted by the part 10.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Field of the Invention] The present invention relates to a speech pattern model learning apparatus, a speech pattern model learning method, and a computer-readable recording medium storing a speech pattern model learning program, each capable of appropriately learning speech pattern models for fast and ambiguous speech such as conversational speech. The present invention further relates to a speech recognition apparatus, a speech recognition method, and a computer-readable recording medium storing a speech recognition program, each capable of accurately recognizing fast and ambiguous speech such as conversational speech.

【０００２】[0002]

[Description of the Related Art] In general, speech recognition is performed by pattern matching between a time series of feature vectors obtained by acoustic analysis of speech and speech pattern models that model the patterns of such feature-vector time series. Hidden Markov models (HMMs) are often used as the speech pattern models.

[0003] When HMMs are used as speech pattern models, the phoneme is often used as the unit of speech pattern to be modeled. Phonemes include consonants (/s/, /h/, /f/, /p/, /t/, /k/, /z/, /b/, /z/, /g/, /m/, /n/, /r/) and vowels (/a/, /i/, /u/, /e/, /o/). If every phoneme that occurs in Japanese is modeled by an HMM, any word or sentence can be modeled by concatenating phoneme HMMs, so that isolated-word speech and continuous speech can be recognized.

[0004] When phonemes are modeled by HMMs, they are often subdivided before modeling, as follows. For example, /h/, the first phoneme of the syllables /ha/ and /hi/, is the same phoneme in both, but it is influenced by the following phoneme /a/ or /i/, so the acoustic features (hereinafter, feature vectors) of the /h/ preceding /a/ differ from those of the /h/ preceding /i/. Variants of the same phoneme whose feature vectors differ in this way are called allophones. Allophones are considered to arise mainly from the context in which the phoneme appears, that is, from differences in the following and preceding phonemes. For this reason, a widely used approach represents each phoneme not by a single model but by separate models for different contexts. In recent years in particular, the three-phoneme set (triphone) model, which takes both the preceding and the following phoneme context into account, has often been used; it was proposed in, for example, R. Schwartz, Y. Chow, "IMPROVED HIDDEN MARKOV MODELING OF PHONEMES OF CONTINUOUS SPEECH RECOGNITION", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, Vol. 3, 35.6.1-35.6.4 (hereinafter, Document 1). For example, the /k/ in /aki/ is written (a)k(i) as a three-phoneme set, and the /k/ in /hako/ is written (a)k(o). The parentheses enclose the preceding or following phoneme. Because their following phonemes differ, (a)k(i) and (a)k(o) are different three-phoneme sets. Using three-phoneme set models yields higher recognition performance than ordinary phoneme models. Notation such as (a)k(i) and (a)k(o) is hereinafter called m-phoneme set notation.
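The three-phoneme set notation described above can be illustrated with a short sketch. The function name `triphone_labels` and the boundary marker "#" are assumptions made for this example; the text does not say how a missing neighbour at the start or end of an utterance is written.

```python
from typing import List


def triphone_labels(phonemes: List[str], pad: str = "#") -> List[str]:
    """Return one "(prev)center(next)" label per phoneme.

    `pad` stands in for a missing neighbour at an utterance boundary;
    this marker is an assumption, not part of the notation in the text.
    """
    padded = [pad] + list(phonemes) + [pad]
    labels = []
    for i in range(1, len(padded) - 1):
        labels.append(f"({padded[i - 1]}){padded[i]}({padded[i + 1]})")
    return labels


# The /k/ in /aki/ and the /k/ in /hako/ receive different labels
# because their following phonemes differ.
print(triphone_labels(["a", "k", "i"])[1])       # (a)k(i)
print(triphone_labels(["h", "a", "k", "o"])[2])  # (a)k(o)
```

Both calls reproduce the examples given in the paragraph: the same center phoneme /k/ with the same preceding phoneme /a/ still yields two distinct three-phoneme sets.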

[0005] Next, a method of creating three-phoneme set models is described. FIG. 23 is a block diagram showing the configuration of an example of a conventional speech pattern model learning apparatus that learns three-phoneme set models, as disclosed in, for example, Document 1. Document 1 explains the technique using English phonemes, but exactly the same technique can be used for Japanese, so Japanese phonemes are used as examples below. In FIG. 23, 100 is a learning data memory that stores the learning data for the three-phoneme set models; 200 is a time series of speech feature vectors contained in the learning data stored in the learning data memory 100; 300 is a model learning unit that learns the three-phoneme set models; 400 is the parameters of a three-phoneme set model learned by the model learning unit 300; and 500 is a three-phoneme set model memory for storing the learned parameters 400 and related data.

[0006] Next, the operation is described. The learning data stored in the learning data memory 100 consists of time series of feature vectors and phoneme set notations indicating the utterance content, obtained by acoustic analysis of speech in which many speakers read aloud words and sentences containing a wide variety of three-phoneme set contexts, of person-to-person conversational speech, and the like. Specifically, it is a three-phoneme set table that associates a set of tokens, each obtained by cutting the feature-vector time series of a learning-data speech waveform into phoneme segments, with the three-phoneme set notations of the three-phoneme sets present in the learning data. FIG. 24 shows an example of this three-phoneme set table.

[0007] The acoustic analysis is, for example, LPC (Linear Predictive Coding) analysis, and the feature vectors are LPC cepstra. Segmentation into phoneme segments is performed by, for example, a person inspecting a spectrogram. Each token is assumed to carry a three-phoneme set notation that records the phoneme name of the token together with the names of its preceding and following phonemes. FIG. 25 shows examples of three-phoneme set notation.
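As a rough illustration of the learning data just described, the sketch below groups labelled tokens into a table keyed by three-phoneme set notation. The `Token` fields and the function `build_set_table` are hypothetical; the actual layouts of FIG. 24 and FIG. 25 are not reproduced in the text, so this is only one plausible in-memory form.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Token:
    token_id: int                      # hypothetical identifier for the token
    label: str                         # three-phoneme set notation, e.g. "(a)k(i)"
    features: List[Sequence[float]]    # one feature vector (e.g. LPC cepstrum) per frame


def build_set_table(tokens: List[Token]) -> Dict[str, List[int]]:
    """Map each three-phoneme set notation to the ids of its tokens."""
    table: Dict[str, List[int]] = defaultdict(list)
    for tok in tokens:
        table[tok.label].append(tok.token_id)
    return dict(table)


tokens = [
    Token(0, "(a)k(i)", [[0.1, 0.2], [0.2, 0.1]]),
    Token(1, "(a)k(o)", [[0.3, 0.0]]),
    Token(2, "(a)k(i)", [[0.1, 0.3]]),
]
print(build_set_table(tokens))  # {'(a)k(i)': [0, 2], '(a)k(o)': [1]}
```

Keying the table by notation makes it cheap to fetch every token of one three-phoneme set, which is the access pattern the learning procedure relies on.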

[0008] The three-phoneme set models are assumed to be continuous-density HMMs. In this case, each three-phoneme set model has the five-state left-to-right structure shown in FIG. 26, in which state 1 is the initial state and state 5 is the final state. Each three-phoneme set model consists of state transition probabilities a_ij and label output probabilities b_ij(x). The subscript ij denotes the transition from state i to state j, and a_ij is the probability that this transition occurs. In a continuous-density HMM, the label output probability b_ij(x) is represented by a multidimensional normal distribution. The state transition probabilities a_ij and the label output probabilities b_ij(x) are called the parameters of the HMM, and determining them is called learning the HMM.
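The HMM parameters named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the topology of FIG. 26: it assumes each non-final state has only a self-loop and a transition to the next state, initializes both to 0.5, and assumes a diagonal covariance for the normal distribution. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def left_to_right_transitions(n_states: int = 5) -> List[List[float]]:
    """Initial a_ij for a left-to-right HMM (0-based indices).

    Assumption: state i may only stay (i -> i) or advance (i -> i + 1),
    and the final state has no outgoing transitions.
    """
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        a[i][i] = 0.5
        a[i][i + 1] = 0.5
    return a


def log_output_prob(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """log b_ij(x) for a multidimensional normal with diagonal covariance (assumed)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )


a = left_to_right_transitions()
print(a[0])                                            # [0.5, 0.5, 0.0, 0.0, 0.0]
print(log_output_prob([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
```

Learning the HMM then means re-estimating these a_ij values and the mean and variance behind each b_ij(x) from the token data.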

[0009] Next, the model learning operation is described. (1) Learning procedure 1: The model learning unit 300 reads the three-phoneme set table held in the learning data memory 100 and selects a three-phoneme set as the learning target according to the contents of the table. If the table is written as in FIG. 24, for example, the model learning unit 300 first selects the leading three-phoneme set, (a)a(a), as the learning target.

[0010] (2) Learning procedure 2: Next, the model learning unit 300 reads from the learning data memory 100 the feature-vector time series 200 of all tokens whose three-phoneme set notation matches the three-phoneme set selected in learning procedure 1, and learns a model for the selected three-phoneme set using, for example, the forward-backward algorithm. When learning is complete, the model learning unit 300 sends the parameters of the learned model, namely the state transition probabilities a_ij and the label output probabilities b_ij(x), together with the three-phoneme set notation, to the three-phoneme set model memory 500. The three-phoneme set model memory 500 holds the parameters and the three-phoneme set notation of each learned model.

[0011] (3) Learning procedure 3: Referring to the three-phoneme set table held in the learning data memory 100, the model learning unit 300 selects the next three-phoneme set as the learning target in the order listed in the table and repeats learning procedure 2 until models have been learned for all three-phoneme sets present in the learning data. In this way, the model learning unit 300 learns a model for every three-phoneme set present in the learning data.
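Learning procedures 1 to 3 amount to a loop over the three-phoneme set table. The sketch below shows only that control flow; `train_hmm` is a hypothetical callable standing in for forward-backward training, and the dictionary returned plays the role of the three-phoneme set model memory 500.

```python
from typing import Callable, Dict, List, Sequence

FeatureSeq = List[Sequence[float]]  # feature-vector time series of one token


def train_all_sets(
    set_table: List[str],
    tokens_by_label: Dict[str, List[FeatureSeq]],
    train_hmm: Callable[[List[FeatureSeq]], dict],
) -> Dict[str, dict]:
    """Learn one model per three-phoneme set, in table order."""
    model_memory: Dict[str, dict] = {}
    for label in set_table:                    # procedures 1 and 3: next set in table order
        sequences = tokens_by_label[label]     # procedure 2: all tokens with this notation
        model_memory[label] = train_hmm(sequences)
    return model_memory


# Stub trainer used only to exercise the loop; a real trainer would
# re-estimate a_ij and b_ij(x) from the sequences.
def count_tokens(sequences: List[FeatureSeq]) -> dict:
    return {"n_tokens": len(sequences)}


memory = train_all_sets(
    ["(a)a(a)", "(a)k(i)"],
    {"(a)a(a)": [[[0.1]], [[0.2]]], "(a)k(i)": [[[0.3]]]},
    count_tokens,
)
print(memory)  # {'(a)a(a)': {'n_tokens': 2}, '(a)k(i)': {'n_tokens': 1}}
```

Because every three-phoneme set is trained independently from its own tokens, a set with few tokens gets a poorly estimated model, which is one reason the invention later restricts learning to sets with enough data.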

【００１２】[0012]

[Problems to Be Solved by the Invention] Because the conventional speech pattern model learning apparatus is configured as described above, it improves recognition performance by using three-phoneme set models, which take both the preceding and the following phoneme contexts into account, to build models that reflect the variation in phoneme feature vectors caused by differences in phoneme context. However, it cannot cope with differences in speaking style, such as sentence utterance, recitation style, and conversational style. That is, phoneme feature vectors vary not only with phoneme context but also with speaking style: utterance as an isolated word, sentence utterance, recitation style, conversational style, and so on. For example, the feature vectors vary differently when the word "reservation" is uttered alone as a word, when the text "I would like to make a reservation for tomorrow" is read aloud, and when the same text is spoken to another person. Consequently, when only speech read aloud from text is used as learning data, as has been done conventionally, the speech pattern model learning apparatus cannot provide speech pattern models appropriate for conversational-style speech.

[0013] Moreover, when speech pattern models are learned using learning data of various speaking styles simultaneously, such as speech read aloud from text and conversational speech with another person, a single model has to represent feature vectors with differing kinds of variation, so the accuracy of the speech pattern models decreases.

[0014] Furthermore, when speech pattern models are learned from separate learning data for each speaking style, such as speech read aloud from text and conversational speech with another person, the loss of accuracy can be avoided, but the number of speech pattern models increases in proportion to the number of speaking styles learned.

[0015] In addition, in fast and ambiguous speech such as conversational speech, feature vectors can be affected not only by the one phoneme on either side but also by the two phonemes on either side, so three-phoneme set models do not allow sufficient learning.

[0016] The present invention has been made to solve the problems described above. Its object is to provide a speech pattern model learning apparatus, a speech pattern model learning method, and a computer-readable recording medium storing a speech pattern model learning program that learn speech pattern models efficiently, even for conversational-style speech, without greatly increasing the number of speech pattern models.

[0017] Another object of the present invention is to provide a speech pattern model learning apparatus, a speech pattern model learning method, and a computer-readable recording medium storing a speech pattern model learning program that, for fast and ambiguous speech such as conversational speech, for which three-phoneme set models do not allow sufficient learning, efficiently learn speech pattern models that take a longer phoneme environment into account, without greatly increasing the number of speech pattern models.

[0018] A further object of the present invention is to provide a speech recognition apparatus, a speech recognition method, and a computer-readable recording medium storing a speech recognition program that accurately recognize fast and ambiguous speech such as conversational speech.

【００１９】[0019]

[Means for Solving the Problems] A speech pattern model learning apparatus according to the present invention comprises: m-phoneme set extraction means for extracting, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text; and model learning means for learning a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data.

[0020] In the speech pattern model learning apparatus according to the present invention, the m-phoneme set extraction means selects, from the conversational speech learning data, an m-phoneme set for which the number of data items having the same m-phoneme set notation is at least a predetermined number, recognizes the selected m-phoneme set using the read-speech m-phoneme set models, and extracts the selected m-phoneme set if its recognition rate is at or below a predetermined threshold.
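The extraction rule in the two preceding paragraphs can be sketched as a filter over the conversational speech learning data. Several details here are assumptions for illustration: `recognize` is a hypothetical callable that returns the label the read-speech models assign to a token, and the recognition rate is taken to be the fraction of a set's tokens recognized as that set.

```python
from typing import Callable, Dict, List


def extract_low_rate_sets(
    tokens_by_label: Dict[str, List[object]],
    recognize: Callable[[object], str],
    min_tokens: int,
    rate_threshold: float,
) -> List[str]:
    """Return the m-phoneme sets with enough tokens and a low recognition rate."""
    extracted = []
    for label, tokens in tokens_by_label.items():
        if len(tokens) < max(min_tokens, 1):       # too little data to learn from
            continue
        correct = sum(1 for tok in tokens if recognize(tok) == label)
        rate = correct / len(tokens)               # assumed definition of recognition rate
        if rate <= rate_threshold:                 # read-speech model fits poorly
            extracted.append(label)
    return extracted


# Toy data: each "token" is simply the label the recognizer would output for it.
data = {
    "(a)k(i)": ["(a)k(i)", "(a)k(o)", "(a)k(o)", "(a)k(o)"],  # rate 0.25
    "(a)k(o)": ["(a)k(o)"] * 4,                                # rate 1.0
    "(h)a(k)": ["(h)a(t)"],                                    # only one token
}
print(extract_low_rate_sets(data, lambda tok: tok, min_tokens=3, rate_threshold=0.5))
# ['(a)k(i)']
```

Only sets that pass both tests receive a new conversational-speech model, which is how the scheme avoids duplicating every read-speech model.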

[0021] A speech pattern model learning apparatus according to the present invention comprises: m-phoneme set extraction means for extracting, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a first predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text; conversational-speech m-phoneme set model learning means for learning a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data; n-phoneme set extraction means for extracting, from the conversational speech learning data, the n-phoneme sets whose recognition rate is at or below a second predetermined threshold, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models; and conversational-speech n-phoneme set model learning means for learning a conversational-speech n-phoneme set model for each extracted n-phoneme set using the conversational speech learning data.

[0022] In the speech pattern model learning apparatus according to the present invention, the n-phoneme set extraction means selects, from the conversational speech learning data, an n-phoneme set for which the number of data items having the same n-phoneme set notation is at least a predetermined number, recognizes the selected n-phoneme set using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and extracts the selected n-phoneme set if its recognition rate is at or below the second predetermined threshold.
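The two-stage arrangement described above (extract and learn m-phoneme sets first, then extract and learn n-phoneme sets with both m-phoneme model sets) might be wired together as follows. This is only a sketch of the data flow: `extract` and `train` are hypothetical callables whose signatures are assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List

Models = Dict[str, object]  # phoneme set notation -> learned model


def two_stage_learning(
    read_m_models: Models,
    extract: Callable[[List[Models], str, float], List[str]],
    train: Callable[[Iterable[str], str], Models],
    threshold_1: float,
    threshold_2: float,
) -> Dict[str, Models]:
    """Run m-phoneme set learning, then n-phoneme set learning."""
    # Stage 1: m-phoneme sets that the read-speech models recognize poorly.
    m_labels = extract([read_m_models], "m", threshold_1)
    conv_m_models = train(m_labels, "m")
    # Stage 2: n-phoneme sets still recognized poorly when both the read-speech
    # and the conversational-speech m-phoneme set models are available.
    n_labels = extract([read_m_models, conv_m_models], "n", threshold_2)
    conv_n_models = train(n_labels, "n")
    return {"read_m": read_m_models, "conv_m": conv_m_models, "conv_n": conv_n_models}
```

The second stage sees the models produced by the first, so an n-phoneme set is only added where the new conversational-speech m-phoneme set models still fall short of the second threshold.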

[0023] A speech recognition apparatus according to the present invention comprises: recognition-target vocabulary model creation means for creating speech pattern models for a recognition-target vocabulary by connecting in parallel the read-speech m-phoneme set models, the conversational-speech m-phoneme set models, and the conversational-speech n-phoneme set models learned by the speech pattern model learning apparatus described above; and recognition means for recognizing input speech using the speech pattern models for the recognition-target vocabulary created by the recognition-target vocabulary model creation means.
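One way to picture the parallel connection is as a list of alternative models per phoneme position of a word. The sketch below is an assumption-laden illustration: the function `build_word_model`, the per-position label lists, and the rule that a read-speech model exists for every m-phoneme label are all hypothetical, and the text at this point does not specify how the branches are joined inside the HMM network.

```python
from typing import Dict, List, Optional


def build_word_model(
    m_labels: List[str],
    n_labels: List[Optional[str]],
    read_m: Dict[str, object],
    conv_m: Dict[str, object],
    conv_n: Dict[str, object],
) -> List[List[object]]:
    """For each phoneme position, collect the models to be placed in parallel."""
    word = []
    for m_label, n_label in zip(m_labels, n_labels):
        branches = [read_m[m_label]]               # assumed to exist for every label
        if m_label in conv_m:                      # only extracted sets have one
            branches.append(conv_m[m_label])
        if n_label is not None and n_label in conv_n:
            branches.append(conv_n[n_label])
        word.append(branches)
    return word


word = build_word_model(
    ["(a)k(i)", "(k)i(#)"],
    ["n-set-1", None],
    read_m={"(a)k(i)": "read:(a)k(i)", "(k)i(#)": "read:(k)i(#)"},
    conv_m={"(a)k(i)": "conv-m:(a)k(i)"},
    conv_n={"n-set-1": "conv-n:n-set-1"},
)
print(word)
# [['read:(a)k(i)', 'conv-m:(a)k(i)', 'conv-n:n-set-1'], ['read:(k)i(#)']]
```

During recognition, whichever parallel branch matches the input best at each position contributes to the word score, so read-style and conversational-style pronunciations are both covered by one vocabulary model.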

[0024] A speech pattern model learning method according to the present invention extracts, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text, and learns a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data.

[0025] In the speech pattern model learning method according to the present invention, when m-phoneme sets are extracted, an m-phoneme set for which the number of data items having the same m-phoneme set notation is at least a predetermined number is selected from the conversational speech learning data, the selected m-phoneme set is recognized using the read-speech m-phoneme set models, and the selected m-phoneme set is extracted if its recognition rate is at or below a predetermined threshold.

[0026] A speech pattern model learning method according to the present invention extracts, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a first predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text; learns a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data; extracts, from the conversational speech learning data, the n-phoneme sets whose recognition rate is at or below a second predetermined threshold, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models; and learns a conversational-speech n-phoneme set model for each extracted n-phoneme set using the conversational speech learning data.

[0027] In the speech pattern model learning method according to the present invention, when n-phoneme sets are extracted, an n-phoneme set for which the number of data items having the same n-phoneme set notation is at least a predetermined number is selected from the conversational speech learning data, the selected n-phoneme set is recognized using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and the selected n-phoneme set is extracted if its recognition rate is at or below the second predetermined threshold.

[0028] A speech recognition method according to the present invention creates speech pattern models for a recognition-target vocabulary by connecting in parallel the read-speech m-phoneme set models, the conversational-speech m-phoneme set models, and the conversational-speech n-phoneme set models learned by the speech pattern model learning method, and recognizes input speech using the created speech pattern models for the recognition-target vocabulary.

[0029] A computer-readable recording medium storing a speech pattern model learning program according to the present invention comprises: an m-phoneme set extraction step of extracting, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text; and a conversational-speech m-phoneme set model learning step of learning conversational-speech m-phoneme set models for the extracted m-phoneme sets using the conversational speech learning data.

[0030] In the computer-readable recording medium storing a speech pattern model learning program according to the present invention, the m-phoneme set extraction step selects, from the conversational speech learning data, an m-phoneme set for which the number of data items having the same m-phoneme set notation is at least a predetermined number, recognizes the selected m-phoneme set using the read-speech m-phoneme set models, and extracts the selected m-phoneme set if its recognition rate is at or below a predetermined threshold.

[0031] A computer-readable recording medium storing a speech pattern model learning program according to the present invention comprises: an m-phoneme set extraction step of extracting, from conversational speech learning data, the m-phoneme sets whose recognition rate is at or below a first predetermined threshold, using read-speech m-phoneme set models learned from speech read aloud from text; a conversational-speech m-phoneme set model learning step of learning a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data; an n-phoneme set extraction step of extracting, from the conversational speech learning data, the n-phoneme sets whose recognition rate is at or below a second predetermined threshold, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models; and a conversational-speech n-phoneme set model learning step of learning a conversational-speech n-phoneme set model for each extracted n-phoneme set using the conversational speech learning data.

[0032] In the computer-readable recording medium storing a speech pattern model learning program according to the present invention, the n-phoneme set extraction step selects, from the conversational speech learning data, an n-phoneme set for which the number of data items having the same n-phoneme set notation is at least a predetermined number, recognizes the selected n-phoneme set using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and extracts the selected n-phoneme set if its recognition rate is at or below the second predetermined threshold.

[0033] A computer-readable recording medium storing a speech recognition program according to the present invention comprises: a recognition-target vocabulary model creation step of creating speech pattern models for a recognition-target vocabulary by connecting in parallel the read-speech m-phoneme set models, the conversational-speech m-phoneme set models, and the conversational-speech n-phoneme set models learned by the speech pattern model learning method; and a recognition step of recognizing input speech using the speech pattern models for the recognition-target vocabulary created in the recognition-target vocabulary model creation step.

【００３４】[0034]

[Embodiments of the Invention] An embodiment of the present invention is described below. Embodiment 1. FIG. 1 is a block diagram showing the configuration of a speech pattern model learning apparatus according to Embodiment 1 of the present invention. In the figure, 3 is a model learning unit (model learning means) that learns read-speech m-phoneme set models for the m-phoneme sets stored in the read speech learning data memory 6, using speech read aloud from text, and that learns conversational-speech m-phoneme set models for the m-phoneme sets extracted by the m-phoneme set extraction unit (m-phoneme set extraction means) 10, using the conversational speech learning data stored in the conversational speech learning data memory 8; 7 is a time series of feature vectors of read speech contained in the read speech learning data memory 6; 9 is a time series of feature vectors of conversational speech contained in the conversational speech learning data memory 8; 11 is the m-phoneme set notation of an m-phoneme set extracted by the m-phoneme set extraction unit 10; 12 is an extracted m-phoneme set notation memory; 13 is the parameters and m-phoneme set notation of a read-speech m-phoneme set model; 14 is a read-speech m-phoneme set model memory; 15 is the parameters and m-phoneme set notation of a conversational-speech m-phoneme set model; and 16 is a conversational-speech m-phoneme set model memory. The description below uses three-phoneme sets (m = 3) as an example. Typically, the read-speech m-phoneme set models and the conversational-speech m-phoneme set models used in Embodiment 1 are both continuous-density HMMs.

[0035] The read-speech training data memory 6 stores read-speech training data obtained by acoustically analyzing speech in which many speakers read aloud words and sentences covering diverse m-phoneme set contexts. The data comprise time series of feature vectors and phoneme set notations indicating the utterance content. Specifically, the read-speech training data is an m-phoneme set table that associates a set of tokens, each cut out phoneme segment by phoneme segment from the feature-vector time series obtained by acoustically analyzing the waveform of speech read from text, with a set of m-phoneme set notations. As in the prior art, this m-phoneme set table is written, for example, as shown in FIG. 24. The acoustic analysis is, for example, LPC analysis, as in the prior art, and the feature vectors are LPC cepstra. Segmentation into phoneme segments is done, for example, by a person inspecting a spectrogram. Each token held in the read-speech training data memory 6 (each token carries a token number) is given an m-phoneme set notation recording the token's phoneme name, preceding phoneme name, and following phoneme name. For m = 3, for example, each m-phoneme set notation is written as shown in FIG. 25, as in the prior art.
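As a rough illustration of the data described above, the following Python sketch models one token and parses a three-phoneme notation. The names `Triphone`, `Token`, and `parse_triphone` are hypothetical, and the ASCII format "(preceding)center(following)" is an assumption inferred from the example notation (a)a(u) and from the statement that the notation records the phoneme name and its preceding and following phoneme names; the actual layout of FIG. 25 is not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence

# Assumed notation format: "(preceding)center(following)", e.g. "(a)a(u)".
_NOTATION = re.compile(r"^\(([^()]+)\)([^()]+)\(([^()]+)\)$")


@dataclass(frozen=True)
class Triphone:
    """An m-phoneme set for m = 3 (hypothetical helper type)."""
    preceding: str
    center: str
    following: str

    def __str__(self) -> str:
        return f"({self.preceding}){self.center}({self.following})"


def parse_triphone(notation: str) -> Triphone:
    """Split a notation such as "(a)a(u)" into its three phoneme names."""
    match = _NOTATION.match(notation)
    if match is None:
        raise ValueError(f"not a three-phoneme notation: {notation!r}")
    return Triphone(*match.groups())


@dataclass
class Token:
    """One phoneme segment cut out of an utterance."""
    number: int                      # token number
    notation: str                    # m-phoneme set notation, e.g. "(a)a(u)"
    features: List[Sequence[float]]  # feature vectors, one per frame
```

Keeping the notation as a plain string on each token mirrors the table lookup in the text, where tokens are selected by matching their notation against the selected m-phoneme set.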

[0036] The conversational-speech training data memory 8 stores conversational-speech training data obtained by acoustically analyzing person-to-person conversational speech in diverse situations. The data comprise time series of feature vectors and phoneme set notations indicating the utterance content. Specifically, the conversational-speech training data is an m-phoneme set table that associates a set of tokens, each cut out phoneme segment by phoneme segment from the feature-vector time series obtained by acoustically analyzing person-to-person conversational speech waveforms, with a set of m-phoneme set notations. This m-phoneme set table has the same format as the m-phoneme set table in the read-speech training data memory 6. As with the read-speech training data, the acoustic analysis is, for example, LPC analysis, and the feature vectors are LPC cepstra. Segmentation into phoneme segments is likewise assumed to be done, for example, by a person inspecting a spectrogram. Each token held in the conversational-speech training data memory 8 (each token carries a token number) is likewise assumed to be given an m-phoneme set notation recording the token's phoneme name, preceding phoneme name, and following phoneme name. Each m-phoneme set notation is the same as in the read-speech training data memory 6.

[0037] The read-speech training data concern relatively careful and clear utterances, such as text read aloud, whereas the conversational-speech training data concern natural person-to-person conversation, so the feature vectors of the phonemes are characteristically heavily deformed.

[0038] Next, the operation is described. The speech pattern model learning device according to Embodiment 1 creates the read-speech m-phoneme set models as follows and stores them in the read-speech m-phoneme set model memory 14. In this case, the device connects input terminal A of the model learning unit 3 to output terminal B1 of the read-speech training data memory 6, so that the data held in the read-speech training data memory 6 are input to the model learning unit 3. In addition, output terminal C of the model learning unit 3 is connected to input terminal D1 of the read-speech m-phoneme set model memory 14. In this connection state, the device learns the read-speech m-phoneme set models by the following procedure.

[0039] (1) Read-speech model learning procedure 1: The model learning unit 3 reads the m-phoneme set table held in the read-speech training data memory 6 and, following the contents of the table, first selects the leading m-phoneme set as the learning target. If the m-phoneme set table for m = 3 is written, for example, as shown in FIG. 24, as in the prior art, the model learning unit 3 first selects the leading m-phoneme set, (a)a(a), as the learning target.

[0040] (2) Read-speech model learning procedure 2: The model learning unit 3 reads from the read-speech training data memory 6 the feature-vector time series 7 of every token whose m-phoneme set notation matches the m-phoneme set selected in read-speech model learning procedure 1 above or procedure 3 below, and learns a model for the selected m-phoneme set using, for example, the forward-backward algorithm. When learning finishes, the model learning unit 3 sends the model parameters, namely the state transition probabilities and the label output probabilities, together with the m-phoneme set notation 13, to the read-speech m-phoneme set model memory 14. The read-speech m-phoneme set model memory 14 holds the parameters of the trained model and its m-phoneme set notation 13.

[0041] (3) Read-speech model learning procedure 3: The model learning unit 3 then refers to the m-phoneme set table held in the read-speech training data memory 6 and, until a model has been learned for every m-phoneme set present in the read-speech training data memory 6, selects the next m-phoneme set in the order written in the table as the learning target and repeats read-speech model learning procedure 2. This completes learning of the read-speech m-phoneme set models for all m-phoneme sets.

[0042] Next, the model learning unit 3, in cooperation with the m-phoneme set extraction unit 10, learns the conversational-speech m-phoneme set models and stores the results in the conversational-speech m-phoneme set model memory 16. Before learning starts, the speech pattern model learning device connects input terminal A of the model learning unit 3 to output terminal B2 of the conversational-speech training data memory 8, so that the data held in the conversational-speech training data memory 8 are input to the model learning unit 3. In addition, output terminal C of the model learning unit 3 is connected to input terminal D2 of the conversational-speech m-phoneme set model memory 16. In this connection state, the device learns the conversational-speech m-phoneme set models by the following procedure.

[0043] This learning procedure for the conversational-speech m-phoneme set models consists of two parts: a procedure that recognizes each token stored in the conversational-speech training data memory 8 using the read-speech m-phoneme set models stored in the read-speech m-phoneme set model memory 14 and extracts the m-phoneme sets with low recognition rates, and a procedure that learns a conversational-speech m-phoneme set model for each m-phoneme set extracted in this way.

[0044] First, the procedure for extracting m-phoneme sets with low recognition rates is described.
(1) m-phoneme set extraction procedure 1: The m-phoneme set extraction unit 10 reads the parameters of all read-speech m-phoneme set models and their m-phoneme set notations from the read-speech m-phoneme set model memory 14.

[0045] (2) m-phoneme set extraction procedure 2: The m-phoneme set extraction unit 10 refers to the m-phoneme set table held in the conversational-speech training data memory 8 and, following the contents of the table, selects the leading m-phoneme set as the recognition target. If the m-phoneme set table for m = 3 is written, for example, as shown in FIG. 24, the m-phoneme set extraction unit 10 first selects the leading m-phoneme set, (a)a(a), as the recognition target.

[0046] (3) m-phoneme set extraction procedure 3: The m-phoneme set extraction unit 10 reads from the conversational-speech training data memory 8 the feature-vector time series 9 of every token whose m-phoneme set notation matches the m-phoneme set selected in m-phoneme set extraction procedure 2 above or procedure 4 below. For each token read, it computes the likelihood against every read-speech m-phoneme set model read in m-phoneme set extraction procedure 1, and takes the m-phoneme set notation of the model with the highest likelihood as the recognition result for that token. The likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all tokens read, the m-phoneme set extraction unit 10 computes the recognition rate R_t according to equation (1) below.

[0047] $$R_t = \frac{C_t}{N_t} \times 100.0 \qquad (1)$$

[0048] In equation (1), the subscript t denotes the type of the selected m-phoneme set, N_t is the number of tokens whose m-phoneme set notation is of type t, and C_t is the number of those tokens that were recognized correctly. A token is recognized correctly when its m-phoneme set notation matches the m-phoneme set notation of the m-phoneme set model that gave the highest likelihood.
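The per-token recognition described above (score the token against every model with the Viterbi algorithm, then take the notation of the best-scoring model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class `GaussianHMM`, the left-to-right topology with a single fixed self-loop probability, and the single diagonal-covariance Gaussian per state are all hypothetical simplifications of a continuous-density HMM.

```python
import math
from typing import Dict, List, Sequence

NEG_INF = float("-inf")


class GaussianHMM:
    """Left-to-right HMM, one diagonal Gaussian per state (assumed topology)."""

    def __init__(self, means: List[Sequence[float]],
                 variances: List[Sequence[float]], self_loop: float = 0.5):
        self.means = means
        self.variances = variances
        self.log_stay = math.log(self_loop)        # stay in the same state
        self.log_next = math.log(1.0 - self_loop)  # advance to the next state

    def log_emit(self, state: int, x: Sequence[float]) -> float:
        """Log density of frame x under the Gaussian of the given state."""
        total = 0.0
        for xi, mu, var in zip(x, self.means[state], self.variances[state]):
            total += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        return total


def viterbi_log_likelihood(hmm: GaussianHMM, frames: List[Sequence[float]]) -> float:
    """Log probability of the best state path that starts in the first
    state and ends in the last state."""
    if not frames:
        raise ValueError("token has no frames")
    n = len(hmm.means)
    score = [NEG_INF] * n
    score[0] = hmm.log_emit(0, frames[0])
    for x in frames[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            stay = score[s] + hmm.log_stay
            move = score[s - 1] + hmm.log_next if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + hmm.log_emit(s, x)
        score = new
    return score[-1]


def recognize(frames: List[Sequence[float]], models: Dict[str, GaussianHMM]) -> str:
    """Return the notation of the model with the highest Viterbi score."""
    return max(models, key=lambda name: viterbi_log_likelihood(models[name], frames))
```

Working in the log domain avoids numerical underflow on long tokens. A token is then counted as correctly recognized when `recognize(...)` returns the token's own notation.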

[0049] The m-phoneme set extraction unit 10 compares the recognition rate R_t with a predetermined threshold T_r and, if R_t is at or below T_r, sends the m-phoneme set notation of that m-phoneme set to the extracted m-phoneme set notation memory 12. The extracted m-phoneme set notation memory 12 holds the m-phoneme set notation it receives.
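For concreteness, the recognition rate of equation (1) and the threshold comparison can be written as a short Python sketch. The function names and the representation of results as (true notation, recognized notation) pairs are illustrative assumptions; the text does not specify a value for the threshold T_r, so the values in the usage note are arbitrary.

```python
from typing import Iterable, Tuple


def recognition_rate(results: Iterable[Tuple[str, str]]) -> float:
    """R_t = C_t / N_t * 100.0 over (true notation, recognized notation) pairs
    that all belong to one m-phoneme set type t."""
    pairs = list(results)
    if not pairs:
        raise ValueError("no tokens for this m-phoneme set")
    n_correct = sum(1 for truth, recognized in pairs if truth == recognized)
    return n_correct / len(pairs) * 100.0


def is_extracted(rate: float, threshold: float) -> bool:
    """An m-phoneme set is extracted when its rate is at or below the threshold."""
    return rate <= threshold
```

With four tokens of type (a)a(u), three of them recognized correctly, `recognition_rate` returns 75.0. The set would be extracted for a threshold of 75.0 but not for a threshold of 60.0, since the comparison is "at or below".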

[0050] (4) m-phoneme set extraction procedure 4: The m-phoneme set extraction unit 10 refers to the m-phoneme set table held in the conversational-speech training data memory 8 and, in order to extract the m-phoneme sets with low recognition rates from all m-phoneme sets present in the conversational-speech training data memory 8, selects the next m-phoneme set in the order written in the table and repeats m-phoneme set extraction procedure 3.

[0051] As described above, by carrying out m-phoneme set extraction procedures 1 to 4, the m-phoneme set extraction unit 10 extracts every m-phoneme set whose recognition rate R_t is at or below the threshold T_r and stores their m-phoneme set notations in the extracted m-phoneme set notation memory 12.

[0052] Next, the procedure for learning a conversational-speech m-phoneme set model for each m-phoneme set extracted as described above is explained.

[0053] (1) Extracted m-phoneme set model learning procedure 1: The model learning unit 3 reads the m-phoneme set notations held in the extracted m-phoneme set notation memory 12 and, following the order in which they are held, first selects the leading m-phoneme set as the learning target. If the contents of the extracted m-phoneme set notation memory 12 are, for example, as shown in FIG. 2, the model learning unit 3 selects the leading m-phoneme set, (a)a(u), as the learning target.

[0054] (2) Extracted m-phoneme set model learning procedure 2: The model learning unit 3 reads from the conversational-speech training data memory 8 the feature-vector time series 9 of every token whose m-phoneme set notation matches the m-phoneme set selected in extracted m-phoneme set model learning procedure 1 above or procedure 3 below, and learns a conversational-speech m-phoneme set model for the selected m-phoneme set using, for example, the forward-backward algorithm. The model learning unit 3 then sends the parameters of the learned model and its m-phoneme set notation to the conversational-speech m-phoneme set model memory 16. The conversational-speech m-phoneme set model memory 16 holds the model parameters and m-phoneme set notation it receives.

[0055] (3) Extracted m-phoneme set model learning procedure 3: Next, the model learning unit 3 selects the next m-phoneme set held in the extracted m-phoneme set notation memory 12, following the order in which the notations are held, and repeats extracted m-phoneme set model learning procedure 2.

[0056] Next, the method that the speech pattern model learning device according to Embodiment 1 uses to learn the m-phoneme set models is described concretely. FIG. 3 is a flowchart showing the procedure of the speech pattern model learning method according to Embodiment 1 of the present invention. As shown in FIG. 3, the learning procedure for the m-phoneme set models in this device is broadly divided into three steps.

[0057] First, in step ST101, which is the first step, the model learning unit 3 learns the read-speech m-phoneme set models and stores the resulting model parameters and m-phoneme set notations in the read-speech m-phoneme set model memory 14.

[0058] Next, in step ST102, which is the second step, the m-phoneme set extraction unit 10 recognizes each token stored in the conversational-speech training data memory 8 using the read-speech m-phoneme set models stored in the read-speech m-phoneme set model memory 14, and extracts the m-phoneme sets with low recognition rates.

[0059] Thereafter, in step ST103, which is the third step, the model learning unit 3 learns a conversational-speech m-phoneme set model for each m-phoneme set extracted in the second step, using the tokens stored in the conversational-speech training data memory 8.
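The three steps ST101 to ST103 can be sketched end to end in Python. This is a hedged illustration only: `train` and `score` are pluggable callables standing in for forward-backward training and Viterbi scoring, and the toy stand-ins `train_mean` and `score_mean` (a one-dimensional mean and a negative squared distance) are assumptions made so that the example runs without an HMM library. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Frames = List[Sequence[float]]
Token = Tuple[str, Frames]  # (m-phoneme set notation, feature-vector time series)


def group_by_notation(tokens: List[Token]) -> Dict[str, List[Frames]]:
    """Group tokens by notation, keeping first-seen (table) order."""
    groups: Dict[str, List[Frames]] = {}
    for notation, frames in tokens:
        groups.setdefault(notation, []).append(frames)
    return groups


def learn_models(tokens: List[Token], train: Callable,
                 only: Optional[Set[str]] = None) -> Dict[str, object]:
    """Train one model per m-phoneme set, optionally restricted to `only`."""
    return {n: train(fs) for n, fs in group_by_notation(tokens).items()
            if only is None or n in only}


def extract_low_rate_sets(tokens: List[Token], read_models: Dict[str, object],
                          score: Callable, threshold: float) -> List[str]:
    """Step ST102: keep the sets whose recognition rate is at or below threshold."""
    extracted = []
    for notation, token_list in group_by_notation(tokens).items():
        correct = sum(
            1 for frames in token_list
            if max(read_models, key=lambda n: score(read_models[n], frames)) == notation)
        if correct / len(token_list) * 100.0 <= threshold:
            extracted.append(notation)
    return extracted


def learn_pipeline(read_tokens, conv_tokens, train, score, threshold):
    read_models = learn_models(read_tokens, train)                                  # ST101
    extracted = extract_low_rate_sets(conv_tokens, read_models, score, threshold)   # ST102
    conv_models = learn_models(conv_tokens, train, only=set(extracted))             # ST103
    return read_models, extracted, conv_models


# Toy stand-ins (assumptions): a "model" is just the mean of 1-D features.
def train_mean(token_list: List[Frames]) -> float:
    values = [x[0] for frames in token_list for x in frames]
    return sum(values) / len(values)


def score_mean(model: float, frames: Frames) -> float:
    return -sum((x[0] - model) ** 2 for x in frames)
```

In this arrangement only the sets returned by the second step are passed to the third, which is where the saving in training work comes from.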

[0060] Next, the first to third steps are described in detail. FIG. 4 is a flowchart showing the learning procedure for the read-speech m-phoneme set models, which is the first step. This learning procedure is described in detail with reference to FIG. 4.

[0061] In step ST201, the model learning unit 3 reads the m-phoneme set table in the read-speech training data memory 6 and, following the contents of the table, selects the leading m-phoneme set as the learning target. If the m-phoneme set table is written, for example, as shown in FIG. 24, as in the prior art, the model learning unit 3 selects the leading m-phoneme set, (a)a(a), as the learning target.

[0062] Next, in step ST202, the model learning unit 3 reads from the read-speech training data memory 6 the feature-vector time series 7 of every token whose m-phoneme set notation matches the m-phoneme set selected in step ST201 or step ST206.

[0063] Then, in step ST203, the model learning unit 3 learns a read-speech m-phoneme set model for the m-phoneme set selected in step ST201 or step ST206 using, for example, the forward-backward algorithm.

[0064] Thereafter, in step ST204, once learning has finished, the model learning unit 3 sends the model parameters obtained by the learning in step ST203, namely the state transition probabilities and the label output probabilities, together with the m-phoneme set notation 13, to the read-speech m-phoneme set model memory 14. The read-speech m-phoneme set model memory 14 holds the model parameters and m-phoneme set notation 13 it receives.

[0065] Next, in step ST205, the model learning unit 3 refers to the m-phoneme set table held in the read-speech training data memory 6 and determines whether a read-speech m-phoneme set model has been learned for every m-phoneme set present in the read-speech training data memory 6. If learning has not finished for all m-phoneme sets, then in step ST206 the model learning unit 3 selects the next m-phoneme set in the order written in the m-phoneme set table as the learning target and returns to step ST202. If learning has finished for all m-phoneme sets, the model learning unit 3 ends this read-speech m-phoneme set model learning procedure.

[0066] Next, in the second step, the m-phoneme set extraction unit 10 recognizes each token stored in the conversational-speech training data memory 8 using the read-speech m-phoneme set models stored in the read-speech m-phoneme set model memory 14, and extracts the m-phoneme sets with low recognition rates. FIG. 5 is a flowchart showing the extraction procedure of this second step, which is described in detail below with reference to FIG. 5.

[0067] First, in step ST301, the m-phoneme set extraction unit 10 reads the parameters of all read-speech m-phoneme set models and their m-phoneme set notations 13 from the read-speech m-phoneme set model memory 14.

[0068] Next, in step ST302, the m-phoneme set extraction unit 10 reads the m-phoneme set table stored in the conversational-speech training data memory 8 and, following the contents of the table, selects the leading m-phoneme set as the recognition target. If the m-phoneme set table is written, for example, as shown in FIG. 24, the m-phoneme set extraction unit 10 selects the leading m-phoneme set, (a)a(a), as the recognition target.

[0069] Then, in step ST303, the m-phoneme set extraction unit 10 reads from the conversational-speech training data memory 8 the feature-vector time series 9 of every token whose m-phoneme set notation matches the m-phoneme set selected in step ST302 or step ST308.

[0070] Thereafter, in step ST304, for each of the tokens read, the m-phoneme set extraction unit 10 computes the likelihood against every read-speech m-phoneme set model read in step ST301, and takes the m-phoneme set notation of the model with the highest likelihood as the recognition result for that token. As already noted, the likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all tokens read, the m-phoneme set extraction unit 10 computes the recognition rate R_t according to equation (1) above.

[0071] Next, in step ST305, the m-phoneme set extraction unit 10 compares the recognition rate R_t obtained in step ST304 with the predetermined threshold T_r. If R_t is at or below T_r, it proceeds to step ST306 and sends the m-phoneme set notation 11 of the selected m-phoneme set to the extracted m-phoneme set notation memory 12, which holds the m-phoneme set notation 11 it receives. If R_t is greater than T_r, the m-phoneme set extraction unit 10 sends nothing to the extracted m-phoneme set notation memory 12 and proceeds to step ST307.

[0072] In step ST307, the m-phoneme set extraction unit 10 refers to the m-phoneme set table stored in the conversational-speech training data memory 8 and determines whether the recognition rate R_t has been computed for every m-phoneme set present in the conversational-speech training data memory 8. If recognition has not finished for all m-phoneme sets, it proceeds to step ST308, selects the next m-phoneme set in the order written in the m-phoneme set table as the recognition target, and returns to step ST303. If recognition has finished for all m-phoneme sets, the m-phoneme set extraction unit 10 ends this m-phoneme set extraction procedure.

[0073] In this way, by carrying out the m-phoneme set extraction procedure (steps ST301 to ST308 in FIG. 5), the m-phoneme set extraction unit 10 can extract every m-phoneme set whose recognition rate R_t is at or below the threshold T_r and store their m-phoneme set notations 11 in the extracted m-phoneme set notation memory 12.

[0074] Finally, in the third step, the model learning unit 3 learns a conversational-speech m-phoneme set model for each m-phoneme set extracted in the second step, using the tokens stored in the conversational-speech training data memory 8. FIG. 6 is a flowchart showing the learning procedure of the third step, which is described in detail below with reference to FIG. 6.

[0075] First, in step ST401, the model learning unit 3 reads the m-phoneme set notations 11 held in the extracted m-phoneme set notation memory 12 and, following the order in which they are held, first selects the leading m-phoneme set as the learning target. If the contents of the extracted m-phoneme set notation memory 12 are, for example, as shown in FIG. 2, the model learning unit 3 first selects the leading m-phoneme set, (a)a(u), as the learning target.

[0076] Next, in step ST402, the model learning unit 3 reads from the conversational-speech training data memory 8 the feature-vector time series 9 of every token whose m-phoneme set notation matches the m-phoneme set selected in step ST401 or step ST406.

[0077] Then, in step ST403, the model learning unit 3 learns a conversational-speech m-phoneme set model for the selected m-phoneme set using, for example, the forward-backward algorithm.

[0078] Thereafter, in step ST404, the model learning unit 3 sends the model parameters obtained by the learning in step ST403 and the m-phoneme set notation 15 to the conversational-speech m-phoneme set model memory 16. The conversational-speech m-phoneme set model memory 16 holds the model parameters and m-phoneme set notation 15 it receives.

[0079] Next, in step ST405, the model learning unit 3 determines whether an m-phoneme set model has been learned for every m-phoneme set held in the extracted m-phoneme set notation memory 12. If learning has not finished for all m-phoneme sets, it proceeds to step ST406, selects the next m-phoneme set in the order held in the extracted m-phoneme set notation memory 12 as the learning target, and returns to step ST402. If learning has finished for all m-phoneme sets, the model learning unit 3 ends this conversational-speech m-phoneme set model learning procedure.

[0080] When the speech pattern model learning method according to Embodiment 1 is implemented in software, a computer-readable recording medium is required on which is recorded a speech pattern model learning program for causing a computer to learn speech pattern models. The program has a first step of learning the read-speech m-phoneme set models and storing them in the read-speech m-phoneme set model memory 14; a second step of recognizing each token stored in the conversational-speech training data memory 8 using the read-speech m-phoneme set models stored in the read-speech m-phoneme set model memory 14, and extracting the m-phoneme sets with low recognition rates; and a third step of learning a conversational-speech m-phoneme set model for each of the m-phoneme sets extracted in the second step, using the tokens stored in the conversational-speech training data memory 8.

【００８１】 As described above, according to the speech pattern model learning apparatus and the speech pattern model learning method of Embodiment 1, each of the m-phoneme sets held in the conversational speech learning data memory 8 is recognized using the read speech m-phoneme set models, the m-phoneme sets with low recognition rates are extracted, and conversational speech m-phoneme set models are learned only for the extracted m-phoneme sets, using the time series of feature vectors of the tokens held in the conversational speech learning data memory 8. Consequently, without learning conversational speech m-phoneme set models for all m-phoneme sets, it is possible to efficiently learn conversational speech m-phoneme set models that can also recognize conversational speech that was difficult to recognize with the read speech m-phoneme set models learned from read speech. Although Embodiment 1 has been described with m = 3, m may be any integer other than 3, and the same effect is obtained in that case as well.

【００８２】 Embodiment 2. The speech pattern model learning apparatus according to Embodiment 2 of the present invention includes an m-phoneme set extraction unit 10 that executes improved m-phoneme set extraction procedures 1 to 4, described below, in place of m-phoneme set extraction procedures 1 to 4 of Embodiment 1. The apparatus of Embodiment 2 has the same configuration as that of Embodiment 1 shown in FIG. 1, and the components other than the m-phoneme set extraction unit 10 operate in the same way as in the apparatus of Embodiment 1, so their description is omitted below. Embodiment 2 is also described for m-phoneme sets with m = 3.

【００８３】 Next, the operation will be described. (1) Improved m-phoneme set extraction procedure 1: The m-phoneme set extraction unit 10 reads the parameters of all read speech m-phoneme set models and their m-phoneme set notations 13 from the read speech m-phoneme set model memory 14.

【００８４】 (2) Improved m-phoneme set extraction procedure 2: Next, the m-phoneme set extraction unit 10 reads the m-phoneme set table stored in the conversational speech learning data memory 8 and, following the contents of this table, selects the first m-phoneme set in the conversational speech learning data as the recognition target. If the m-phoneme set table is written as shown in FIG. 24, for example, the m-phoneme set extraction unit 10 selects the first m-phoneme set, (a)a(a), as the recognition target.

【００８５】 (3) Improved m-phoneme set extraction procedure 3: The m-phoneme set extraction unit 10 reads, from the conversational speech learning data memory 8, the time series 9 of feature vectors of all tokens whose m-phoneme set notation matches the m-phoneme set selected in improved m-phoneme set extraction procedure 2 above or improved m-phoneme set extraction procedure 4 below. If the number of tokens read, N_t (the subscript t denotes the name of the selected m-phoneme set), is less than a predetermined threshold N, the m-phoneme set extraction unit 10 sends nothing to the extracted m-phoneme set notation memory 12 and moves on to improved m-phoneme set extraction procedure 4 below. If N_t is greater than or equal to the threshold N, recognition is performed as in Embodiment 1. That is, for each token read, the likelihood is computed against every read speech m-phoneme set model read in improved m-phoneme set extraction procedure 1, and the m-phoneme set notation of the model with the highest likelihood is taken as the recognition result for that token. The likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all tokens read, the m-phoneme set extraction unit 10 computes the recognition rate R_t using equation (1) above, as in Embodiment 1. The m-phoneme set extraction unit 10 then compares the recognition rate R_t with a predetermined threshold T_r and, if R_t is less than or equal to T_r, sends the m-phoneme set notation 11 of that m-phoneme set to the extracted m-phoneme set notation memory 12. The extracted m-phoneme set notation memory 12 holds the received m-phoneme set notation 11.

【００８６】 (4) Improved m-phoneme set extraction procedure 4: The m-phoneme set extraction unit 10 refers to the m-phoneme set table held by the conversational speech learning data memory 8 and, in order to execute improved m-phoneme set extraction procedure 3 for every m-phoneme set present in the conversational speech learning data, selects the next m-phoneme set in the order listed in the m-phoneme set table and repeats improved m-phoneme set extraction procedure 3. Once every m-phoneme set present in the conversational speech learning data has been processed in this way, the m-phoneme set extraction unit 10 ends the improved m-phoneme set extraction procedure.
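Improved m-phoneme set extraction procedures 1 to 4 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the apparatus itself: the function name, the argument names, and the `score` callable (a stand-in for the Viterbi likelihood of a token under a model) are hypothetical, and the recognition rate is assumed to take the form R_t = C_t / N_t * 100, where C_t is the number of correctly recognized tokens.

```python
from typing import Callable, Dict, List, Sequence


def extract_low_recognition_sets(
    table: Sequence[str],                   # m-phoneme set table, in listed order
    tokens: Dict[str, List[object]],        # tokens grouped by m-phoneme set notation
    model_labels: Sequence[str],            # notations of the read speech models
    score: Callable[[str, object], float],  # hypothetical likelihood(model, token)
    min_tokens: int,                        # threshold N on the token count N_t
    rate_threshold: float,                  # threshold T_r on the recognition rate R_t
) -> List[str]:
    """Return the notations of sets with N_t >= N and R_t <= T_r."""
    extracted: List[str] = []
    for label in table:                     # procedures 2 and 4: walk the table in order
        toks = tokens.get(label, [])
        n_t = len(toks)
        if n_t < min_tokens or n_t == 0:    # too few tokens: send nothing
            continue
        # Procedure 3: a token is recognized as the model with the highest likelihood.
        correct = sum(
            1
            for tok in toks
            if max(model_labels, key=lambda m: score(m, tok)) == label
        )
        r_t = correct / n_t * 100.0         # assumed form of the recognition rate
        if r_t <= rate_threshold:
            extracted.append(label)
    return extracted
```

With a toy `score` that returns 1.0 when the model notation equals the token and 0.0 otherwise, a set with four tokens of which one is recognized correctly has R_t = 25 and is extracted for T_r = 50, while a set with a single token is skipped when N = 2.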

【００８７】 Next, the method by which the speech pattern model learning apparatus according to Embodiment 2 learns m-phoneme set models will be described in detail. In the apparatus of Embodiment 2, as in the apparatus of Embodiment 1, the m-phoneme set model learning procedure is broadly divided into three steps.

【００８８】 The first step is a step of learning read speech m-phoneme set models, in which the read speech m-phoneme set models are learned and the resulting model parameters and their m-phoneme set notations 13 are stored in the read speech m-phoneme set model memory 14.

【００８９】 The second step is a step of using the read speech m-phoneme set models stored in the read speech m-phoneme set model memory 14 to extract, from the m-phoneme sets listed in the m-phoneme set table held by the conversational speech learning data memory 8, the m-phoneme sets whose token count N_t is greater than or equal to the threshold N and whose recognition rate R_t is less than or equal to the threshold T_r.

【００９０】 The third step is a step of learning a conversational speech m-phoneme set model for each m-phoneme set extracted in the second step, using the tokens stored in the conversational speech learning data memory 8.

【００９１】 Of the first to third steps, the first and third steps are exactly the same as in Embodiment 1, so their description is omitted below, and the second step, the m-phoneme set extraction procedure, is described in detail. FIG. 7 is a flowchart showing the extraction procedure of the second step, and the extraction procedure is described in detail below with reference to FIG. 7.

【００９２】 First, in step ST501, the m-phoneme set extraction unit 10 reads the parameters of all read speech m-phoneme set models and their m-phoneme set notations 13 from the read speech m-phoneme set model memory 14.

【００９３】 Next, in step ST502, the m-phoneme set extraction unit 10 reads the m-phoneme set table stored in the conversational speech learning data memory 8 and selects the m-phoneme set listed first in the table as the recognition target. If the m-phoneme set table is written as shown in FIG. 24, for example, the m-phoneme set extraction unit 10 selects the first m-phoneme set, (a)a(a), as the recognition target.

【００９４】 Then, in step ST503, the m-phoneme set extraction unit 10 reads, from the conversational speech learning data memory 8, the time series 9 of feature vectors of all tokens whose m-phoneme set notation matches the m-phoneme set selected in step ST502 or step ST509.

【００９５】 After that, in step ST504, the m-phoneme set extraction unit 10 compares the number of tokens read, N_t (the subscript t denotes the name of the selected m-phoneme set), with the predetermined threshold N. If N_t < N, it sends nothing to the extracted m-phoneme set notation memory 12 and moves to step ST508. If N_t >= N, the m-phoneme set extraction unit 10 moves to step ST505.

【００９６】 In step ST505, for each token read in step ST503, the m-phoneme set extraction unit 10 computes the likelihood against every read speech m-phoneme set model read in step ST501, and takes the m-phoneme set notation of the model with the highest likelihood as the recognition result for that token. The likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all tokens read, the m-phoneme set extraction unit 10 computes the recognition rate R_t according to equation (1) above.
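The likelihood computation named in step ST505 can be illustrated with a minimal log-domain Viterbi scorer. This is a sketch under assumptions rather than the implementation used by the apparatus: the function name and the argument layout are hypothetical, and the per-frame log output densities of each HMM state are assumed to have been computed beforehand, so the evaluation of the continuous output distributions is outside the sketch.

```python
import math
from typing import Sequence


def viterbi_log_likelihood(
    log_init: Sequence[float],             # log initial probability of each state
    log_trans: Sequence[Sequence[float]],  # log_trans[p][s]: log P(state s | state p)
    log_emit: Sequence[Sequence[float]],   # log_emit[t][s]: log density of frame t in state s
) -> float:
    """Return the log likelihood of the single best state path."""
    if not log_emit:
        raise ValueError("at least one frame is required")
    n_states = len(log_init)
    # Initialization with the first frame.
    delta = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    # Recursion: keep only the best predecessor for every state.
    for frame in log_emit[1:]:
        delta = [
            max(delta[p] + log_trans[p][s] for p in range(n_states)) + frame[s]
            for s in range(n_states)
        ]
    # Termination: best score over all final states.
    return max(delta)


# Usage: a two-state left-to-right model scored on two frames.
NEG_INF = float("-inf")
score = viterbi_log_likelihood(
    [0.0, NEG_INF],
    [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]],
    [[math.log(0.8), math.log(0.1)], [math.log(0.2), math.log(0.9)]],
)
```

A token would be scored in this way against every model, and the notation of the model with the largest score would be taken as the recognition result.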

【００９７】 Next, in step ST506, the m-phoneme set extraction unit 10 compares the recognition rate R_t obtained in step ST505 with the predetermined threshold T_r. If R_t is less than or equal to T_r, the process proceeds to step ST507, where the m-phoneme set notation 11 of that m-phoneme set is sent to the extracted m-phoneme set notation memory 12. The extracted m-phoneme set notation memory 12 holds the received m-phoneme set notation 11. If the recognition rate R_t is greater than the threshold T_r, the m-phoneme set extraction unit 10 proceeds to step ST508.

【００９８】 Then, in step ST508, the m-phoneme set extraction unit 10 refers to the m-phoneme set table stored in the conversational speech learning data memory 8 and determines whether every m-phoneme set present in the conversational speech learning data memory 8 has already been selected. If an unselected m-phoneme set remains, the process proceeds to step ST509, where the next m-phoneme set is selected as the recognition target in the order listed in the m-phoneme set table, and the process returns to step ST503. If every m-phoneme set has already been selected, the m-phoneme set extraction unit 10 ends the m-phoneme set extraction procedure.

【００９９】 When the speech pattern model learning method according to Embodiment 2 is implemented in software, a computer-readable recording medium is required that stores a program for causing a computer to learn speech pattern models, the program comprising: a first step of learning read speech m-phoneme set models and storing the learning results in the read speech m-phoneme set model memory 14; a second step of using the read speech m-phoneme set models stored in the read speech m-phoneme set model memory 14 to extract, from the m-phoneme sets listed in the m-phoneme set table stored in the conversational speech learning data memory 8, the m-phoneme sets whose token count N_t is greater than or equal to the threshold N and whose recognition rate R_t is less than or equal to the threshold T_r; and a third step of learning conversational speech m-phoneme set models for the m-phoneme sets extracted in the second step, using the tokens stored in the conversational speech learning data memory 8.

【０１００】 As described above, the speech pattern model learning apparatus according to Embodiment 2 executes improved m-phoneme set extraction procedures 1 to 4 (steps ST501 to ST509 in FIG. 7), thereby extracting the m-phoneme set notations 11 of all m-phoneme sets whose token count N_t is greater than or equal to the threshold N and whose recognition rate R_t is less than or equal to the threshold T_r, and storing the extracted m-phoneme set notations 11 in the extracted m-phoneme set notation memory 12. Accordingly, when learning models for the extracted m-phoneme sets, the apparatus of Embodiment 2 learns models only for m-phoneme sets whose token count N_t is greater than or equal to the threshold N. Among the conversational speech m-phoneme sets that are poorly recognized by the read speech m-phoneme set models, it therefore avoids learning statistically unreliable models for sets whose token count N_t is less than the threshold N, and efficiently learns only statistically reliable models. Although Embodiment 2 has been described with m = 3, m may be any integer other than 3, and the same effect is obtained in that case as well.

【０１０１】 Embodiment 3. FIG. 8 is a block diagram showing the configuration of a speech pattern model learning apparatus according to Embodiment 3 of the present invention. In the figure, 30 is a model learning unit (conversational speech m-phoneme set model learning means, conversational speech n-phoneme set model learning means) that learns read speech m-phoneme set models, using speech of read-aloud text, for each m-phoneme set stored in the read speech learning data memory 6; learns conversational speech m-phoneme set models, using the conversational speech learning data stored in the conversational speech learning data memory 80, for each m-phoneme set extracted by the m-phoneme set extraction unit (m-phoneme set extraction means) 10; and further learns conversational speech n-phoneme set models, using the conversational speech learning data, for each n-phoneme set extracted from the conversational speech learning data by the n-phoneme set extraction unit (n-phoneme set extraction means) 17 using the read speech m-phoneme set models and the conversational speech m-phoneme set models. 7 is the time series of feature vectors of the read speech contained in the read speech learning data memory 6; 9 is the time series of feature vectors of the conversational speech contained in the conversational speech learning data memory 80; 11 is the m-phoneme set notation of an m-phoneme set extracted by the m-phoneme set extraction unit 10; 12 is the extracted m-phoneme set notation memory; 13 is the parameters and m-phoneme set notation of a read speech m-phoneme set model; 14 is the read speech m-phoneme set model memory; 15 is the parameters and m-phoneme set notation of a conversational speech m-phoneme set model; 16 is the conversational speech m-phoneme set model memory; 18 is the n-phoneme set notation of an n-phoneme set extracted by the n-phoneme set extraction unit 17; 19 is the extracted n-phoneme set notation memory; 20 is the parameters and n-phoneme set notation of a conversational speech n-phoneme set model; and 21 is the conversational speech n-phoneme set model memory. In FIG. 8, reference numerals identical to those in FIG. 1 denote components identical or equivalent to those of the speech pattern model learning apparatus according to Embodiment 1. The following description assumes m = 3 and n = 5. The speech pattern models used by the apparatus of Embodiment 3 are continuous-distribution HMMs, as in Embodiment 1.

【０１０２】 In addition to the data held by the conversational speech learning data memory 8 of Embodiment 1, the conversational speech learning data memory 80 holds an n-phoneme set table listing the kinds of n-phoneme sets present in the conversational speech learning data. Here n is an integer satisfying n > m, and an n-phoneme set is a set of n phonemes that takes into account phoneme differences over a longer range than an m-phoneme set. For example, with n = 5, the /p/ in /saQporo/ (Sapporo) becomes (aQ)p(or) as an n-phoneme (5-phoneme) set. Notation such as (aQ)p(or) is hereinafter called 5-phoneme set notation. FIG. 9 shows an example of a 5-phoneme set table. Each token held by the conversational speech learning data memory 80 (each token is assigned a token number) is given, together with its 3-phoneme set notation, a 5-phoneme set notation that records the phoneme name of the token, the names of the second-preceding and preceding phonemes, and the names of the following and second-following phonemes. FIG. 10 shows an example of 5-phoneme set notations given together with 3-phoneme set notations.
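The relation between a phoneme sequence and its 3-phoneme and 5-phoneme set notations can be illustrated with a short Python sketch. The function name is hypothetical, and two assumptions are made that this passage does not state: the context is taken symmetrically around the center phoneme, and a phoneme too close to the edge of the sequence is rejected, because the handling of utterance boundaries is not described here.

```python
from typing import Sequence


def phoneme_set_notation(phonemes: Sequence[str], index: int, size: int) -> str:
    """Build '(left)center(right)' with (size - 1) / 2 phonemes of context per side."""
    if size < 3 or size % 2 == 0:
        raise ValueError("size is assumed to be an odd integer of at least 3")
    half = size // 2
    if index - half < 0 or index + half >= len(phonemes):
        # Boundary handling is an open question in this sketch.
        raise ValueError("not enough context around the phoneme")
    left = "".join(phonemes[index - half:index])
    right = "".join(phonemes[index + 1:index + half + 1])
    return f"({left}){phonemes[index]}({right})"


# Usage: the /p/ in /saQporo/ (Sapporo).
sapporo = ["s", "a", "Q", "p", "o", "r", "o"]
triphone = phoneme_set_notation(sapporo, 3, 3)    # "(Q)p(o)"
pentaphone = phoneme_set_notation(sapporo, 3, 5)  # "(aQ)p(or)"
```

The 5-phoneme result reproduces the (aQ)p(or) example given in the text.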

【０１０３】 Next, the operation will be described. The speech pattern model learning apparatus according to Embodiment 3 performs model learning by executing the following five procedures in order: (1) a procedure for learning read speech m-phoneme set models; (2) a procedure for extracting m-phoneme sets with low recognition rates from those held by the conversational speech learning data memory 80; (3) a procedure for learning conversational speech m-phoneme set models for the extracted m-phoneme sets; (4) a procedure for extracting n-phoneme sets with low recognition rates from those held by the conversational speech learning data memory 80; and (5) a procedure for learning conversational speech n-phoneme set models for the extracted n-phoneme sets.

【０１０４】 First, the procedure for learning read speech m-phoneme set models will be described. The speech pattern model learning apparatus connects input terminal A of the model learning unit 30 shown in FIG. 8 to terminal B1, which is connected to the read speech learning data memory 6, so that the data in the read speech learning data memory 6 is used as input. The apparatus also connects output terminal C of the model learning unit 30 to terminal D1, which is connected to the read speech m-phoneme set model memory 14. With these connections in place, the apparatus first learns the read speech m-phoneme set models. The model learning unit 30 of the apparatus of Embodiment 3 learns the read speech m-phoneme set models according to read speech model learning procedures 1 to 3 described in Embodiment 1, and stores the resulting model parameters and their m-phoneme set notations in the read speech m-phoneme set model memory 14. When learning of the read speech m-phoneme set models has been completed for every m-phoneme set present in the read speech learning data memory 6, the model learning unit 30 ends the read speech m-phoneme set model learning procedure.

【０１０５】 Next, the speech pattern model learning apparatus uses the m-phoneme set extraction unit 10 to extract m-phoneme sets with low recognition rates from those held by the conversational speech learning data memory 80. The m-phoneme set extraction unit 10 executes the extraction of m-phoneme sets with low recognition rates according to m-phoneme set extraction procedures 1 to 4 described in Embodiment 1, and stores the m-phoneme set notations of all extracted m-phoneme sets in the extracted m-phoneme set notation memory 12.

【０１０６】 Next, the model learning unit 30 learns the conversational speech m-phoneme set models. Before learning starts, the speech pattern model learning apparatus connects input terminal A of the model learning unit 30 to output terminal B2 of the conversational speech learning data memory 80, and connects the other input terminal E of the model learning unit 30 to output terminal F1 of the extracted m-phoneme set notation memory 12. The apparatus further connects output terminal C of the model learning unit 30 to input terminal D2 of the conversational speech m-phoneme set model memory 16. With these connections in place, the model learning unit 30 learns the conversational speech m-phoneme set models.

【０１０７】 The model learning unit 30 learns the conversational speech m-phoneme set models according to extracted m-phoneme set model learning procedures 1 to 3 of Embodiment 1, and stores the resulting model parameters and their m-phoneme set notations in the conversational speech m-phoneme set model memory 16. When learning of the conversational speech m-phoneme set models has been completed for every m-phoneme set held in the extracted m-phoneme set notation memory 12, the model learning unit 30 ends the conversational speech m-phoneme set model learning procedure.

【０１０８】 Next, the speech pattern model learning apparatus uses the n-phoneme set extraction unit 17 to extract n-phoneme sets with low recognition rates from those held by the conversational speech learning data memory 80. The procedure for extracting n-phoneme sets is as follows.

【０１０９】 (1) n-phoneme set extraction procedure 1: The n-phoneme set extraction unit 17 reads the parameters of all read speech m-phoneme set models and their m-phoneme set notations 13 from the read speech m-phoneme set model memory 14. The n-phoneme set extraction unit 17 also reads the parameters of all conversational speech m-phoneme set models and their m-phoneme set notations 15 from the conversational speech m-phoneme set model memory 16.

【０１１０】 (2) n-phoneme set extraction procedure 2: Next, the n-phoneme set extraction unit 17 refers to the n-phoneme set table held by the conversational speech learning data memory 80 and, following the contents of this table, selects the first n-phoneme set as the recognition target. If n = 5 and the n-phoneme set table is written as shown in FIG. 9, for example, the n-phoneme set extraction unit 17 first selects the first n-phoneme set, (ka)a(ai), as the recognition target.

【０１１１】 (3) n-phoneme set extraction procedure 3: The n-phoneme set extraction unit 17 reads, from the conversational speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in n-phoneme set extraction procedure 2 above or n-phoneme set extraction procedure 4 below. For each token read, it computes the likelihood against every read speech m-phoneme set model and every conversational speech m-phoneme set model read in n-phoneme set extraction procedure 1, and takes the m-phoneme set notation of the m-phoneme set model with the highest likelihood as the recognition result for that token. The likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all tokens read, the n-phoneme set extraction unit 17 computes the recognition rate R_q according to equation (2) below.

【０１１２】 R_q = (C_q / N_q) * 100.0   (2)

【０１１３】 Here, the subscript q denotes the kind of the selected n-phoneme set, N_q is the number of tokens whose n-phoneme set notation is of kind q, and C_q is the number of those tokens that were recognized correctly. A token is regarded as correctly recognized when its m-phoneme set notation matches the m-phoneme set notation of the m-phoneme set model with the highest likelihood. For example, a token whose n-phoneme set notation (n = 5) is (ka)a(ai) has the m-phoneme set notation (m = 3) (a)a(a), so it is regarded as correctly recognized if the m-phoneme set notation of the m-phoneme set model with the highest likelihood is (a)a(a).
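Equation (2) and the correctness criterion can be sketched in Python as follows. The function names are hypothetical, and the reduction from 5-phoneme to 3-phoneme set notation assumes that every phoneme is written as a single character, as in the (ka)a(ai) example; notations with multi-character phoneme symbols would need a different parser.

```python
from typing import Sequence


def center_triphone(pentaphone: str) -> str:
    """Reduce '(ka)a(ai)' to '(a)a(a)', assuming single-character phoneme symbols."""
    left, rest = pentaphone[1:].split(")", 1)  # 'ka' and 'a(ai)'
    center, right = rest.split("(", 1)         # 'a' and 'ai)'
    right = right.rstrip(")")                  # 'ai'
    return f"({left[-1]}){center}({right[0]})"


def recognition_rate(recognized: Sequence[str], pentaphone: str) -> float:
    """R_q = C_q / N_q * 100.0 for the tokens of one n-phoneme set (n = 5).

    `recognized` holds, for each token, the m-phoneme set notation of the
    model that gave the highest likelihood.
    """
    n_q = len(recognized)
    if n_q == 0:
        raise ValueError("no tokens for this n-phoneme set")
    target = center_triphone(pentaphone)
    c_q = sum(1 for label in recognized if label == target)
    return c_q / n_q * 100.0


# Usage: four tokens of (ka)a(ai), two of them recognized as (a)a(a).
rate = recognition_rate(["(a)a(a)", "(a)a(i)", "(a)a(a)", "(k)a(a)"], "(ka)a(ai)")
```

Here two of four tokens match the center 3-phoneme set notation, so the rate is 50.0.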

【０１１４】ｎ音素組抽出部１７は、上記認識率Ｒ_ｑを
予め定めた閾値Ｔ_ｑと比較し、閾値Ｔ_ｑ以下であれば、
そのｎ音素組のｎ音素組表記１８を抽出ｎ音素組表記メ
モリ１９に送出する。抽出ｎ音素組表記メモリ１９は入
力されたｎ音素組表記１８を保持する。The n-phoneme set extraction unit 17 compares the recognition rate R_q with a predetermined threshold T_q and, if R_q is less than or equal to T_q, sends the n-phoneme set notation 18 of that n-phoneme set to the extracted n-phoneme set notation memory 19. The extracted n-phoneme set notation memory 19 holds the received n-phoneme set notation 18.

【０１１５】（４）ｎ音素組抽出手順４：ｎ音素組抽出
部１７は、対話音声学習データメモリ８０が保持するｎ
音素組テーブルを参照し、対話音声学習データメモリ８
０中に存在する全てのｎ音素組について上記ｎ音素組抽
出手順３を実行するために、上記ｎ音素組テーブルに記
述されている順番にしたがって次のｎ音素組を選択し、
上記ｎ音素組抽出手順３を繰り返す。(4) n-phoneme set extraction procedure 4: The n-phoneme set extraction unit 17 refers to the n-phoneme set table held in the conversational speech learning data memory 80. So that n-phoneme set extraction procedure 3 is executed for every n-phoneme set present in the conversational speech learning data memory 80, it selects the next n-phoneme set in the order described in the n-phoneme set table and repeats n-phoneme set extraction procedure 3.

【０１１６】このようにしてｎ音素組抽出部１７はｎ音
素組を抽出する手順を終了する。ｎ音素組抽出部１７
は、上記ｎ音素組抽出手順１〜４を実行することによっ
て、認識率Ｒ_ｑが閾値Ｔ_ｑ以下である全てのｎ音素組の
ｎ音素組表記１８を抽出し、抽出ｎ音素組表記メモリ１
９に格納することができる。In this way, the n-phoneme set extraction unit 17 ends the procedure for extracting n-phoneme sets. By executing n-phoneme set extraction procedures 1 to 4 above, the n-phoneme set extraction unit 17 can extract the n-phoneme set notations 18 of all n-phoneme sets whose recognition rate R_q is less than or equal to the threshold T_q and store them in the extracted n-phoneme set notation memory 19.
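Extraction procedures 1 to 4 amount to a loop over the n-phoneme set table with a threshold test. The sketch below assumes a hypothetical `recognize` callable standing in for the likelihood comparison against all m-phoneme set models, and assumes every table entry has at least one token; the names `extract_low_rate_sets` and `tokens_by_set` are illustrative only.

```python
def extract_low_rate_sets(table, tokens_by_set, recognize, threshold):
    """Return the n-phoneme sets whose recognition rate R_q <= threshold.

    table: n-phoneme set notations in table order.
    tokens_by_set: maps a notation to a list of (features, m_notation).
    recognize: hypothetical callable, features -> m_notation of the
               best-scoring m-phoneme set model.
    """
    extracted = []  # plays the role of the extracted notation memory
    for q in table:  # procedures 2 and 4: walk the table in order
        tokens = tokens_by_set[q]  # procedure 3: read all tokens of type q
        correct = sum(
            1 for features, m_notation in tokens
            if recognize(features) == m_notation
        )
        rate = correct / len(tokens) * 100.0  # equation (2)
        if rate <= threshold:
            extracted.append(q)
    return extracted


# Toy data: the "features" are already labels, so recognize is identity.
toy_tokens = {
    "(ka)a(ai)": [("(a)a(a)", "(a)a(a)"), ("(a)a(i)", "(a)a(a)")],
    "(sa)i(ta)": [("(a)i(t)", "(a)i(t)")],
}
low = extract_low_rate_sets(
    ["(ka)a(ai)", "(sa)i(ta)"], toy_tokens, lambda f: f, 50.0
)
```

In this toy run the first set scores 50.0 and is extracted, while the second scores 100.0 and is not.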

【０１１７】次にモデル学習部３０は上記のようにして
抽出した各ｎ音素組について対話音声ｎ音素組モデルを
学習する。学習を開始する前に、音声パターンモデル学
習装置は、モデル学習部３０の入力端子Ａを対話音声学
習データメモリ８０の出力端子Ｂ２に接続し、またモデ
ル学習部３０のもう一つの入力端子Ｅを抽出ｎ音素組表
記メモリ１９の出力端子Ｆ２に接続する。さらに、音声
パターンモデル学習装置は、モデル学習部３０の出力端
子Ｃを対話音声ｎ音素組モデルメモリ２１の入力端子Ｄ
３に接続する。この接続状態で、モデル学習部３０は対
話音声ｎ音素組モデルを学習する。学習手順を以下に示
す。Next, the model learning unit 30 learns a conversational speech n-phoneme set model for each n-phoneme set extracted as described above. Before learning starts, the speech pattern model learning device connects input terminal A of the model learning unit 30 to output terminal B2 of the conversational speech learning data memory 80, and connects the other input terminal E of the model learning unit 30 to output terminal F2 of the extracted n-phoneme set notation memory 19. The device further connects output terminal C of the model learning unit 30 to input terminal D3 of the conversational speech n-phoneme set model memory 21. In this connection state, the model learning unit 30 learns the conversational speech n-phoneme set models. The learning procedure is as follows.

【０１１８】（１）抽出ｎ音素組モデル学習手順１：モ
デル学習部３０は、まず、抽出ｎ音素組表記メモリ１９
に保持されている各ｎ音素組表記を読み込み、抽出ｎ音
素組表記メモリ１９に保持されていた順番にしたがっ
て、先頭のｎ音素組を学習対象として選択する。抽出ｎ
音素組表記メモリ１９の内容が例えば図１１のようであ
る場合、モデル学習部３０は先頭のｎ音素組である（ｋ
ａ）ａ（ａｉ）を学習対象として選択する。(1) Extracted n-phoneme set model learning procedure 1: The model learning unit 30 first reads each n-phoneme set notation held in the extracted n-phoneme set notation memory 19 and, following the order in which the notations are held there, selects the first n-phoneme set as the learning target. If the contents of the extracted n-phoneme set notation memory 19 are, for example, as shown in FIG. 11, the model learning unit 30 selects the first n-phoneme set, (ka)a(ai), as the learning target.

【０１１９】（２）抽出ｎ音素組モデル学習手順２：次
に、モデル学習部３０は、上記抽出ｎ音素組モデル学習
手順１または下記抽出ｎ音素組モデル学習手順３におい
て選択したｎ音素組と一致するｎ音素組表記を持つ全て
のトークンの特徴ベクトルの時系列９を対話音声学習デ
ータメモリ８０から読み込み、例えばフォワード・バッ
クワードアルゴリズムを用いて選択したｎ音素組に対す
るモデルを学習する。そして、モデル学習部３０は、学
習の結果得たモデルのパラメータとそのｎ音素組表記を
対話音声ｎ音素組モデルメモリ２１に送出する。対話音
声ｎ音素組モデルメモリ２１は受け取ったモデルのパラ
メータとそのｎ音素組表記を保持する。(2) Extracted n-phoneme set model learning procedure 2: Next, the model learning unit 30 reads, from the conversational speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in extracted n-phoneme set model learning procedure 1 above or extracted n-phoneme set model learning procedure 3 below, and learns a model for the selected n-phoneme set using, for example, the forward-backward algorithm. The model learning unit 30 then sends the model parameters obtained by learning, together with the n-phoneme set notation, to the conversational speech n-phoneme set model memory 21. The conversational speech n-phoneme set model memory 21 holds the received model parameters and n-phoneme set notation.

【０１２０】（３）抽出ｎ音素組モデル学習手順３：モ
デル学習部３０は、抽出ｎ音素組表記メモリ１９に保持
されている全てのｎ音素組について上記抽出ｎ音素組モ
デル学習手順２を実行するために、抽出ｎ音素組表記メ
モリ１９に保持されている順番にしたがって次のｎ音素
組を選択し、上記抽出ｎ音素組モデル学習手順２を繰り
返す。このようにして、モデル学習部３０は抽出ｎ音素
組モデルの学習を終了する。(3) Extracted n-phoneme set model learning procedure 3: So that extracted n-phoneme set model learning procedure 2 is executed for every n-phoneme set held in the extracted n-phoneme set notation memory 19, the model learning unit 30 selects the next n-phoneme set in the order held in the extracted n-phoneme set notation memory 19 and repeats extracted n-phoneme set model learning procedure 2. In this way, the model learning unit 30 ends the learning of the extracted n-phoneme set models.
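Learning procedures 1 to 3 can be sketched as a loop that trains one model per extracted n-phoneme set and stores it under its notation. The trainer is deliberately a stand-in: `mean_frame` simply averages feature vectors and is not the forward-backward algorithm named in the text. All names here (`train_extracted_models`, `mean_frame`, `tokens_by_set`) are assumptions for illustration.

```python
def train_extracted_models(extracted, tokens_by_set, train_fn):
    """Train one model per extracted n-phoneme set, in memory order.

    extracted: notations held in the extracted notation memory.
    tokens_by_set: maps a notation to its tokens, each token being a
                   time series (list) of feature vectors.
    train_fn: trainer taking the token list and returning parameters.
    """
    model_memory = {}  # plays the role of the n-phoneme set model memory
    for q in extracted:  # procedures 1 and 3: walk the memory in order
        # procedure 2: read all tokens of type q and learn a model
        model_memory[q] = train_fn(tokens_by_set[q])
    return model_memory


def mean_frame(token_list):
    """Stand-in trainer (NOT forward-backward): mean feature vector."""
    frames = [frame for token in token_list for frame in token]
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


toy = {
    "(ka)a(ai)": [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]],
    "(sa)i(ta)": [[[0.0, 0.0]]],
}
models = train_extracted_models(["(ka)a(ai)"], toy, mean_frame)
```

Only the sets listed in `extracted` receive a model, which mirrors the point of the procedure: sets that were not extracted are not trained.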

【０１２１】次にこの実施の形態３による音声パターン
モデル学習装置が使用する、ｍ音素組モデルとｎ音素組
モデルを学習する方法を具体的に説明する。図１２はこ
の発明の実施の形態３による音声パターンモデル学習方
法の手順を示すフローチャートである。図１２に示すと
おり、この実施の形態３による音声パターンモデル学習
方法における読み上げ音声ｍ音素組モデル、対話音声ｍ
音素組モデルおよび対話音声ｎ音素組モデルの学習手順
は大きく５つのステップに分けられる。Next, the method by which the speech pattern model learning device according to Embodiment 3 learns the m-phoneme set models and the n-phoneme set models is described concretely. FIG. 12 is a flowchart showing the procedure of the speech pattern model learning method according to Embodiment 3 of the present invention. As shown in FIG. 12, the procedure for learning the read-aloud speech m-phoneme set models, the conversational speech m-phoneme set models, and the conversational speech n-phoneme set models in this method is broadly divided into five steps.

【０１２２】すなわち、第１ステップ（図１２のステッ
プＳＴ６０１）は、読み上げ音声ｍ音素組モデルを学習
し学習の結果得たモデルのパラメータおよびｍ音素組表
記１３を読み上げ音声ｍ音素組モデルメモリ１４に格納
する、読み上げ音声ｍ音素組モデル学習手順である。That is, the first step (step ST601 in FIG. 12) is the read-aloud speech m-phoneme set model learning procedure, which learns the read-aloud speech m-phoneme set models and stores the resulting model parameters and m-phoneme set notations 13 in the read-aloud speech m-phoneme set model memory 14.

【０１２３】次の第２ステップ（図１２のステップＳＴ６
０２）は、読み上げ音声ｍ音素組モデルメモリ１４に格
納されている読み上げ音声ｍ音素組モデルを用いて対話
音声学習データメモリ８０に格納されている各トークン
の認識を行い、認識率の低いｍ音素組を抽出する手順で
ある。The second step (step ST602 in FIG. 12) is a procedure that recognizes each token stored in the conversational speech learning data memory 80 using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14, and extracts the m-phoneme sets with low recognition rates.

【０１２４】次の第３ステップ（図１２のステップＳＴ
６０３）は、対話音声学習データメモリ８０に格納され
ているトークンを用いて上記第２ステップで抽出した各
ｍ音素組について、対話音声ｍ音素組モデルを学習し学
習の結果得たモデルのパラメータおよびｍ音素組表記１
５を対話音声ｍ音素組モデルメモリ１６に格納する、対
話音声ｍ音素組モデル学習手順である。The third step (step ST603 in FIG. 12) is the conversational speech m-phoneme set model learning procedure, which uses the tokens stored in the conversational speech learning data memory 80 to learn a conversational speech m-phoneme set model for each m-phoneme set extracted in the second step, and stores the resulting model parameters and m-phoneme set notations 15 in the conversational speech m-phoneme set model memory 16.

【０１２５】次の第４ステップ（図１２のステップＳＴ
６０４）は、読み上げ音声ｍ音素組モデルメモリ１４に
格納されている読み上げ音声ｍ音素組モデルと対話音声
ｍ音素組モデルメモリ１６に格納されている対話音声ｍ
音素組モデルとを用いて対話音声学習データメモリ８０
に格納されている各トークンの認識を行い、認識率の低
いｎ音素組を抽出する手順である。The fourth step (step ST604 in FIG. 12) is a procedure that recognizes each token stored in the conversational speech learning data memory 80 using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14 and the conversational speech m-phoneme set models stored in the conversational speech m-phoneme set model memory 16, and extracts the n-phoneme sets with low recognition rates.

【０１２６】次の第５ステップ（図１２のステップＳＴ
６０５）は、対話音声学習データメモリ８０に格納され
ているトークンを用いて上記第４ステップで抽出したｎ
音素組に対する対話音声ｎ音素組モデルを学習し学習の
結果得たモデルのパラメータおよびｎ音素組表記２０を
対話音声ｎ音素組モデルメモリ２１に格納する、対話音
声ｎ音素組モデル学習手順である。The fifth step (step ST605 in FIG. 12) is the conversational speech n-phoneme set model learning procedure, which uses the tokens stored in the conversational speech learning data memory 80 to learn a conversational speech n-phoneme set model for each n-phoneme set extracted in the fourth step, and stores the resulting model parameters and n-phoneme set notations 20 in the conversational speech n-phoneme set model memory 21.

【０１２７】上記第１〜第５ステップのうち、第１、第
２および第３ステップは上記実施の形態１のものと全く
同じであるので説明を省略し、以下では第４ステップと
第５ステップを説明する。図１３は第４ステップの詳細
を示すフローチャートであり、以下では図１３を参照し
ながら第４ステップである認識率の低いｎ音素組の抽出
手順を詳細に説明する。Of the first to fifth steps, the first, second, and third steps are exactly the same as those of Embodiment 1, so their description is omitted; the fourth and fifth steps are described below. FIG. 13 is a flowchart showing the details of the fourth step. The fourth step, the procedure for extracting n-phoneme sets with low recognition rates, is described in detail below with reference to FIG. 13.

【０１２８】まず、ｎ音素組抽出部１７は、ステップＳ
Ｔ７０１において、読み上げ音声ｍ音素組モデルメモリ
１４から全ての読み上げ音声ｍ音素組モデルのパラメー
タとそのｍ音素組表記１３を読み込む。また、ｎ音素組
抽出部１７は、ステップＳＴ７０２において、対話音声
ｍ音素組モデルメモリ１６から全ての対話音声ｍ音素組
モデルのパラメータとそのｍ音素組表記１５を読み込
む。First, in step ST701, the n-phoneme set extraction unit 17 reads the parameters of all the read-aloud speech m-phoneme set models and their m-phoneme set notations 13 from the read-aloud speech m-phoneme set model memory 14. Then, in step ST702, the n-phoneme set extraction unit 17 reads the parameters of all the conversational speech m-phoneme set models and their m-phoneme set notations 15 from the conversational speech m-phoneme set model memory 16.

【０１２９】次に、ｎ音素組抽出部１７は、ステップＳ
Ｔ７０３において、対話音声学習データメモリ８０が保
持するｎ音素組テーブルを参照し、このｎ音素組テーブ
ルの記述内容にしたがって先頭のｎ音素組を認識対象と
して選択する。ｎ音素組テーブルが例えば図９のように
記述されている場合、ｎ音素組抽出部１７はまず先頭の
ｎ音素組である（ｋａ）ａ（ａｉ）を認識対象として選
択する。Next, in step ST703, the n-phoneme set extraction unit 17 refers to the n-phoneme set table held in the conversational speech learning data memory 80 and, following the contents of that table, selects the first n-phoneme set as the recognition target. If the n-phoneme set table is described, for example, as shown in FIG. 9, the n-phoneme set extraction unit 17 first selects the first n-phoneme set, (ka)a(ai), as the recognition target.

【０１３０】そして、ｎ音素組抽出部１７は、ステップ
ＳＴ７０４において、上記ステップＳＴ７０３またはス
テップＳＴ７０９において選択したｎ音素組と一致する
ｎ音素組表記を持つ全てのトークンの特徴ベクトルの時
系列９を対話音声学習データメモリ８０から読み込む。Then, in step ST704, the n-phoneme set extraction unit 17 reads, from the conversational speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in step ST703 or step ST709.

【０１３１】その後、ｎ音素組抽出部１７は、ステップ
ＳＴ７０５において、読み込んだ各トークンについて、
上記ステップＳＴ７０１と上記ステップＳＴ７０２で読
み込んだ全ての読み上げ音声ｍ音素組モデルおよび全て
の対話音声ｍ音素組モデルとの尤度を計算し、一番高い
尤度を示したｍ音素組モデルのｍ音素組表記を、当該ト
ークンの認識結果とする。なお、尤度計算には例えばビ
タビアルゴリズムを用いる。読み込んだ全てのトークン
について認識結果を求めた後、ｎ音素組抽出部１７は上
記（２）式によって認識率Ｒ_ｑを計算する。Thereafter, in step ST705, the n-phoneme set extraction unit 17 calculates, for each token read, the likelihood against all the read-aloud speech m-phoneme set models and all the conversational speech m-phoneme set models read in steps ST701 and ST702, and takes the m-phoneme set notation of the model showing the highest likelihood as the recognition result for that token. The likelihood calculation uses, for example, the Viterbi algorithm. After obtaining recognition results for all the tokens read, the n-phoneme set extraction unit 17 calculates the recognition rate R_q by equation (2) above.
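The per-token recognition in step ST705 can be illustrated with Viterbi scoring over a set of candidate models. The sketch below assumes discrete-observation HMMs for brevity, whereas the text works with time series of feature vectors; the model layout (log initial, transition, and emission tables) and the names `viterbi_log_likelihood`, `recognize`, and `lg` are assumptions made here.

```python
import math


def lg(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log_likelihood(obs, log_init, log_trans, log_emit):
    """Log probability of the best state path of a discrete HMM."""
    states = range(len(log_init))
    score = [log_init[s] + log_emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        # Best predecessor for each state, then emit the next symbol.
        score = [
            max(score[p] + log_trans[p][s] for p in states) + log_emit[s][o]
            for s in states
        ]
    return max(score)


def recognize(obs, models):
    """Return the notation of the model with the highest Viterbi score."""
    return max(
        models, key=lambda name: viterbi_log_likelihood(obs, *models[name])
    )


def left_to_right(emit):
    """Two-state left-to-right toy model with the given emission rows."""
    init = [lg(1.0), lg(0.0)]
    trans = [[lg(0.5), lg(0.5)], [lg(0.0), lg(1.0)]]
    return (init, trans, [[lg(p) for p in row] for row in emit])


toy_models = {
    "(a)a(a)": left_to_right([[0.9, 0.1], [0.8, 0.2]]),  # favors symbol 0
    "(a)a(i)": left_to_right([[0.1, 0.9], [0.2, 0.8]]),  # favors symbol 1
}
best = recognize([0, 0, 0], toy_models)
```

Here `best` is "(a)a(a)", because that toy model assigns higher probability to a run of symbol 0. A real system would score continuous feature vectors with, for example, Gaussian emission densities.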

【０１３２】次に、ｎ音素組抽出部１７は、ステップＳ
Ｔ７０６において、上記認識率Ｒ_ｑを予め定めた閾値Ｔ
_ｑと比較し、閾値Ｔ_ｑ以下であれば、ステップＳＴ７０
７に進み、そのｎ音素組のｎ音素組表記１８を抽出ｎ音
素組表記メモリ１９に送出する。抽出ｎ音素組表記メモ
リ１９は入力されたｎ音素組表記１８を保持する。一
方、上記認識率Ｒ_ｑが閾値Ｔ_ｑよりも大きいならば、ｎ
音素組抽出部１７はステップＳＴ７０８に進む。Next, in step ST706, the n-phoneme set extraction unit 17 compares the recognition rate R_q with a predetermined threshold T_q. If R_q is less than or equal to T_q, it proceeds to step ST707 and sends the n-phoneme set notation 18 of that n-phoneme set to the extracted n-phoneme set notation memory 19, which holds the received n-phoneme set notation 18. If, on the other hand, R_q is greater than T_q, the n-phoneme set extraction unit 17 proceeds to step ST708.

【０１３３】ステップＳＴ７０８では、ｎ音素組抽出部
１７は、対話音声学習データメモリ８０が保持するｎ音
素組テーブルを参照し、対話音声学習データ中に存在す
る全てのｎ音素組について認識率を計算したか否かを判
定する。そして、全てのｎ音素組について認識が終了し
ていない場合は、ｎ音素組抽出部１７は、ステップＳＴ
７０９において、上記ｎ音素組テーブルに記述されてい
る順番にしたがって次のｎ音素組を認識対象として選択
し、ステップＳＴ７０４に戻る。一方、全てのｎ音素組
について認識が終了していれば、ｎ音素組抽出部１７は
ｎ音素組を抽出する手順を終了する。In step ST708, the n-phoneme set extraction unit 17 refers to the n-phoneme set table held in the conversational speech learning data memory 80 and determines whether the recognition rate has been calculated for every n-phoneme set present in the conversational speech learning data. If recognition has not been completed for all n-phoneme sets, then in step ST709 the n-phoneme set extraction unit 17 selects the next n-phoneme set as the recognition target, in the order described in the n-phoneme set table, and returns to step ST704. If recognition has been completed for all n-phoneme sets, the n-phoneme set extraction unit 17 ends the procedure for extracting n-phoneme sets.

【０１３４】以上のように、ｎ音素組抽出部１７は、上
記ｎ音素組抽出手順（図１３におけるステップＳＴ７０
１〜ステップＳＴ７０９）を実行することによって、認
識率Ｒ_ｑが閾値Ｔ_ｑ以下である全てのｎ音素組のｎ音素
組表記１８を抽出して、抽出ｎ音素組表記メモリ１９に
格納することができる。As described above, by executing the n-phoneme set extraction procedure (steps ST701 to ST709 in FIG. 13), the n-phoneme set extraction unit 17 can extract the n-phoneme set notations 18 of all n-phoneme sets whose recognition rate R_q is less than or equal to the threshold T_q and store them in the extracted n-phoneme set notation memory 19.

【０１３５】次に、モデル学習部３０は、第５ステップ
である対話音声ｎ音素組モデル学習手順を実行する。学
習を実行する前に、音声パターンモデル学習装置は、モ
デル学習部３０の入力端子Ａを対話音声学習データの出
力端子Ｂ２に接続し、またモデル学習部３０のもう一つ
の入力端子Ｅを抽出ｎ音素組表記メモリ１９の出力端子
Ｆ２に接続する。さらに、音声パターンモデル学習装置
はモデル学習部３０の出力端子Ｃを対話音声ｎ音素組モ
デルメモリ２１の入力端子Ｄ３に接続する。この接続状
態で、モデル学習部３０は対話音声ｎ音素組モデルを学
習する。Next, the model learning unit 30 executes the fifth step, the conversational speech n-phoneme set model learning procedure. Before learning is executed, the speech pattern model learning device connects input terminal A of the model learning unit 30 to output terminal B2 of the conversational speech learning data memory 80, and connects the other input terminal E of the model learning unit 30 to output terminal F2 of the extracted n-phoneme set notation memory 19. The device further connects output terminal C of the model learning unit 30 to input terminal D3 of the conversational speech n-phoneme set model memory 21. In this connection state, the model learning unit 30 learns the conversational speech n-phoneme set models.

【０１３６】図１４は対話音声ｎ音素組モデル学習手順
の詳細を示すフローチャートであり、以下では、図１４
を参照しながら学習手順の詳細について説明する。ま
ず、モデル学習部３０は、ステップＳＴ８０１におい
て、抽出ｎ音素組表記メモリ１９に保持されているｎ音
素組表記１８を読み込み、抽出ｎ音素組表記メモリ１９
に保持されていた順番にしたがって先頭のｎ音素組を学
習対象として選択する。抽出ｎ音素組表記メモリ１９の
内容が例えば図１１のようである場合、モデル学習部３
０は先頭のｎ音素組である（ｋａ）ａ（ａｉ）を学習対
象として選択する。FIG. 14 is a flowchart showing the details of the conversational speech n-phoneme set model learning procedure, which is described in detail below with reference to FIG. 14. First, in step ST801, the model learning unit 30 reads the n-phoneme set notations 18 held in the extracted n-phoneme set notation memory 19 and, following the order in which they are held there, selects the first n-phoneme set as the learning target. If the contents of the extracted n-phoneme set notation memory 19 are, for example, as shown in FIG. 11, the model learning unit 30 selects the first n-phoneme set, (ka)a(ai), as the learning target.

【０１３７】次に、モデル学習部３０は、ステップＳＴ
８０２において、上記ステップＳＴ８０１またはステッ
プＳＴ８０６において選択したｎ音素組と一致するｎ音
素組表記を持つ全てのトークンの特徴ベクトルの時系列
９を対話音声学習データメモリ８０から読み込む。そし
て、モデル学習部３０は、ステップＳＴ８０３におい
て、例えばフォワード・バックワードアルゴリズムを用
いて選択したｎ音素組についてモデルを学習する。Next, in step ST802, the model learning unit 30 reads, from the conversational speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in step ST801 or step ST806. Then, in step ST803, the model learning unit 30 learns a model for the selected n-phoneme set using, for example, the forward-backward algorithm.

【０１３８】その後、モデル学習部３０は、ステップＳ
Ｔ８０４において、学習の結果得た上記モデルのパラメ
ータとそのｎ音素組表記を対話音声ｎ音素組モデルメモ
リ２１に送出する。対話音声ｎ音素組モデルメモリ２１
は受け取ったモデルのパラメータおよびｎ音素組表記を
保持する。Thereafter, in step ST804, the model learning unit 30 sends the model parameters obtained by learning, together with the n-phoneme set notation, to the conversational speech n-phoneme set model memory 21. The conversational speech n-phoneme set model memory 21 holds the received model parameters and n-phoneme set notation.

【０１３９】次に、モデル学習部３０は、ステップＳＴ
８０５において、抽出ｎ音素組表記メモリ１９に保持さ
れている全てのｎ音素組について対話音声ｎ音素組モデ
ルを学習したか否かを判定し、全てのｎ音素組について
学習が終了していない場合には、モデル学習部３０は、
ステップＳＴ８０６において、抽出ｎ音素組表記メモリ
１９に記述されている順番にしたがって次のｎ音素組を
学習対象として選択し、ステップＳＴ８０２に戻る。一
方、全てのｎ音素組について学習が終了しているなら
ば、モデル学習部３０は、第５ステップである対話音声
ｎ音素組モデルの学習手順を終了する。Next, in step ST805, the model learning unit 30 determines whether a conversational speech n-phoneme set model has been learned for every n-phoneme set held in the extracted n-phoneme set notation memory 19. If learning has not been completed for all n-phoneme sets, then in step ST806 the model learning unit 30 selects the next n-phoneme set as the learning target, in the order described in the extracted n-phoneme set notation memory 19, and returns to step ST802. If learning has been completed for all n-phoneme sets, the model learning unit 30 ends the fifth step, the conversational speech n-phoneme set model learning procedure.

【０１４０】なお、この実施の形態３による音声パター
ンモデル学習方法をソフトウェアで実現する場合、読み
上げ音声ｍ音素組モデルを学習し学習の結果を読み上げ
音声ｍ音素組モデルメモリ１４に格納する、読み上げ音
声ｍ音素組モデルを学習する第１の手順と、読み上げ音
声ｍ音素組モデルメモリ１４に格納されている読み上げ
音声ｍ音素組モデルを用いて対話音声学習データメモリ
８０に格納されている各トークンの認識を行う、認識率
の低いｍ音素組を抽出する第２の手順と、対話音声学習
データメモリ８０に格納されているトークンを用いて上
記第２の手順で抽出した各ｍ音素組について、対話音声
ｍ音素組モデルを学習する第３の手順と、読み上げ音声
ｍ音素組モデルメモリ１４に格納されている読み上げ音
声ｍ音素組モデルと対話音声ｍ音素組モデルメモリ１６
に格納されている対話音声ｍ音素組モデルとを用いて対
話音声学習データメモリ８０に格納されている各トーク
ンの認識を行い認識率の低いｎ音素組を抽出する第４の
手順と、対話音声学習データメモリ８０に格納されてい
るトークンを用いて抽出した各ｎ音素組について対話音
声ｎ音素組モデルを学習する第５の手順とを有した、コ
ンピュータに音声パターンモデルを学習させるためのプ
ログラムを記録したコンピュータで読み取り可能な記録
媒体が必要である。When the speech pattern model learning method according to Embodiment 3 is implemented in software, a computer-readable recording medium is required that records a program for causing a computer to learn speech pattern models. The program has: a first procedure of learning the read-aloud speech m-phoneme set models and storing the learning results in the read-aloud speech m-phoneme set model memory 14; a second procedure of recognizing each token stored in the conversational speech learning data memory 80 using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14, and extracting the m-phoneme sets with low recognition rates; a third procedure of learning a conversational speech m-phoneme set model for each m-phoneme set extracted in the second procedure, using the tokens stored in the conversational speech learning data memory 80; a fourth procedure of recognizing each token stored in the conversational speech learning data memory 80 using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14 and the conversational speech m-phoneme set models stored in the conversational speech m-phoneme set model memory 16, and extracting the n-phoneme sets with low recognition rates; and a fifth procedure of learning a conversational speech n-phoneme set model for each extracted n-phoneme set, using the tokens stored in the conversational speech learning data memory 80.

【０１４１】以上説明したように、この実施の形態３に
よる音声パターンモデル学習装置および音声パターンモ
デル学習方法では、上記ｎ音素組抽出手順（図１４のス
テップＳＴ８０１〜ステップＳＴ８０６）を行うことに
よって、認識率Ｒ_ｑが閾値Ｔ _ｑ以下である全てのｎ音素
組を抽出し、抽出した各ｎ音素組について、対話音声学
習データが保持するトークンを用いて対話音声ｎ音素組
モデルを学習するので、対話音声のように発話速度がは
やくかつ曖昧な音声で読み上げ音声ｍ音素組モデルと対
話音声ｍ音素組モデルでは十分な認識性能が得られない
各ｎ音素組について効率的に対話音声ｎ音素組モデルを
学習することができる効果を奏する。なお、この実施の
形態３では、ｍ＝３、ｎ＝５として説明したが、ｍ、ｎ
は、ｍ＜ｎなる任意の整数の組を選択してもよく、この
場合にも同様の効果を奏する。As described above, in the speech pattern model learning device and the speech pattern model learning method according to Embodiment 3, the above n-phoneme set extraction procedure (steps ST801 to ST806 in FIG. 14) extracts all n-phoneme sets whose recognition rate R_q is less than or equal to the threshold T_q, and a conversational speech n-phoneme set model is learned for each extracted n-phoneme set using the tokens held in the conversational speech learning data. This has the effect that conversational speech n-phoneme set models can be learned efficiently for each n-phoneme set for which, because the speech is fast and ambiguous as in conversational speech, the read-aloud speech m-phoneme set models and the conversational speech m-phoneme set models do not give sufficient recognition performance. Although Embodiment 3 has been described with m = 3 and n = 5, m and n may be any pair of integers satisfying m < n, and the same effect is obtained in that case as well.

【０１４２】実施の形態４．この発明の実施の形態４に
よる音声パターンモデル学習装置は、上記実施の形態３
によるｎ音素組抽出手順１〜４に代わって以下に示す改
良ｎ音素組抽出手順１〜４を実行するｎ音素組抽出部１
７を備えたものである。なお、実施の形態４による音声
パターンモデル学習装置は図８に示す上記実施の形態３
によるものと同一の構成を有しており、ｎ音素組抽出部
１７以外の構成要素は上記実施の形態３による音声パタ
ーンモデル学習装置と同じ動作をするので、以下ではそ
の他の構成要素の説明を省略する。また、この実施の形
態４においてもｍ＝３のｍ音素組およびｎ＝５のｎ音素
組を対象として説明する。Embodiment 4. The speech pattern model learning device according to Embodiment 4 of the present invention includes an n-phoneme set extraction unit 17 that executes improved n-phoneme set extraction procedures 1 to 4, shown below, in place of n-phoneme set extraction procedures 1 to 4 of Embodiment 3. The speech pattern model learning device according to Embodiment 4 has the same configuration as that of Embodiment 3 shown in FIG. 8, and the components other than the n-phoneme set extraction unit 17 operate in the same way as in the speech pattern model learning device of Embodiment 3, so the description of those other components is omitted below. Embodiment 4 is also described for m-phoneme sets with m = 3 and n-phoneme sets with n = 5.

【０１４３】次に動作について説明する。（１）改良ｎ音素組抽出手順１：ｎ音素組抽出部１７
は、読み上げ音声ｍ音素組モデルメモリ１４から全ての
読み上げ音声ｍ音素組モデルのパラメータとそのｍ音素
組表記１３を読み込む。ｎ音素組抽出部１７は、さら
に、対話音声ｍ音素組モデルメモリ１６から全ての対話
音声ｍ音素組モデルのパラメータとそのｍ音素組表記１
５を読み込む。Next, the operation is described. (1) Improved n-phoneme set extraction procedure 1: The n-phoneme set extraction unit 17 reads the parameters of all the read-aloud speech m-phoneme set models and their m-phoneme set notations 13 from the read-aloud speech m-phoneme set model memory 14. The n-phoneme set extraction unit 17 further reads the parameters of all the conversational speech m-phoneme set models and their m-phoneme set notations 15 from the conversational speech m-phoneme set model memory 16.

【０１４４】（２）改良ｎ音素組抽出手順２：次に、ｎ
音素組抽出部１７は、対話音声学習データメモリ８０に
格納されたｎ音素組テーブルを読み込み、このｎ音素組
テーブルの記述内容にしたがって、対話音声学習データ
中から先頭のｎ音素組を認識対象として選択する。ｎ音
素組テーブルが例えば図９のように記述されている場
合、ｎ音素組抽出部１７は先頭のｎ音素組である（ｋ
ａ）ａ（ａｉ）を認識対象として選択する。(2) Improved n-phoneme set extraction procedure 2: Next, the n-phoneme set extraction unit 17 reads the n-phoneme set table stored in the conversational speech learning data memory 80 and, following the contents of that table, selects the first n-phoneme set in the conversational speech learning data as the recognition target. If the n-phoneme set table is described, for example, as shown in FIG. 9, the n-phoneme set extraction unit 17 selects the first n-phoneme set, (ka)a(ai), as the recognition target.

【０１４５】（３）改良ｎ音素組抽出手順３：ｎ音素組
抽出部１７は、上記改良ｎ音素組抽出手順２または下記
改良ｎ音素組抽出手順４において選択したｎ音素組と一
致するｎ音素組表記を持つ全てのトークンの特徴ベクト
ルの時系列９を対話音声学習データメモリ８０から読み
込む。そして読み込んだトークンの数Ｎ_ｑ（添字ｑは選
択したｎ音素組の名前を示す）が予め定めた閾値Ｎ未満
であれば、抽出ｎ音素組表記メモリ１９には何も送出せ
ず、次の改良ｎ音素組抽出手順４に移る。一方、Ｎ_ｑが
予め定めた閾値Ｎ以上であれば、ｎ音素組抽出部１７は
上記実施の形態３と同様に認識を行う。すなわち、ｎ音
素組抽出部１７は、読み込んだ各トークンについて、上
記改良ｎ音素組抽出手順１で読み込んだ全ての読み上げ
音声ｍ音素組モデルおよび全ての対話音声ｍ音素組モデ
ルとの尤度を計算し、一番高い尤度を示したｍ音素組モ
デルのｍ音素組表記を、当該トークンの認識結果とす
る。なお、尤度計算には例えばビタビアルゴリズムを用
いる。読み込んだ全てのトークンに対する認識結果を求
めた後、ｎ音素組抽出部１７は、上記（２）式によって
認識率Ｒ_ｑを計算する。そして、ｎ音素組抽出部１７
は、上記認識率Ｒ_ｑを予め定めた閾値Ｔ_ｑと比較し、閾
値Ｔ_ｑ以下であれば、そのｎ音素組のｎ音素組表記を抽
出ｎ音素組表記メモリ１９に送出する。抽出ｎ音素組表
記メモリ１９は、入力されたｎ音素組表記を保持する。(3) Improved n-phoneme set extraction procedure 3: The n-phoneme set extraction unit 17 reads, from the conversational speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in improved n-phoneme set extraction procedure 2 above or improved n-phoneme set extraction procedure 4 below. If the number of tokens read, N_q (the subscript q denotes the name of the selected n-phoneme set), is less than a predetermined threshold N, nothing is sent to the extracted n-phoneme set notation memory 19 and processing moves to improved n-phoneme set extraction procedure 4. If, on the other hand, N_q is greater than or equal to the threshold N, the n-phoneme set extraction unit 17 performs recognition in the same way as in Embodiment 3. That is, for each token read, it calculates the likelihood against all the read-aloud speech m-phoneme set models and all the conversational speech m-phoneme set models read in improved n-phoneme set extraction procedure 1, and takes the m-phoneme set notation of the model showing the highest likelihood as the recognition result for that token. The likelihood calculation uses, for example, the Viterbi algorithm. After obtaining recognition results for all the tokens read, the n-phoneme set extraction unit 17 calculates the recognition rate R_q by equation (2) above. The n-phoneme set extraction unit 17 then compares the recognition rate R_q with a predetermined threshold T_q and, if R_q is less than or equal to T_q, sends the n-phoneme set notation of that n-phoneme set to the extracted n-phoneme set notation memory 19. The extracted n-phoneme set notation memory 19 holds the received n-phoneme set notation.

【０１４６】（４）改良ｎ音素組抽出手順４：ｎ音素組
抽出部１７は、対話音声学習データメモリ８０が保持す
るｎ音素組テーブルを参照し、対話音声学習データメモ
リ８０に存在する全てのｎ音素組について上記改良ｎ音
素組抽出手順３を実行するために、上記ｎ音素組テーブ
ルに記述されている順番にしたがって次のｎ音素組を認
識対象として選択し、上記改良ｎ音素組抽出手順３を繰
り返す。このようにして、対話音声学習データ中に存在
する全てのｎ音素組について認識率を求めると、ｎ音素
組抽出部１７は改良ｎ音素組抽出手順を終了する。(4) Improved n-phoneme set extraction procedure 4: The n-phoneme set extraction unit 17 refers to the n-phoneme set table held in the conversational speech learning data memory 80. So that improved n-phoneme set extraction procedure 3 is executed for every n-phoneme set present in the conversational speech learning data memory 80, it selects the next n-phoneme set as the recognition target, in the order described in the n-phoneme set table, and repeats improved n-phoneme set extraction procedure 3. Once the recognition rate has been obtained in this way for every n-phoneme set present in the conversational speech learning data, the n-phoneme set extraction unit 17 ends the improved n-phoneme set extraction procedure.
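The improved procedure adds a token-count gate ahead of the recognition-rate test. A minimal sketch follows, assuming a hypothetical `recognize` callable in place of the likelihood comparison; the names `extract_with_min_count`, `tokens_by_set`, and `min_tokens` (the threshold N) are illustrative.

```python
def extract_with_min_count(table, tokens_by_set, recognize,
                           rate_threshold, min_tokens):
    """Extract n-phoneme sets with N_q >= N and R_q <= T_q.

    table: n-phoneme set notations in table order.
    tokens_by_set: maps a notation to a list of (features, m_notation).
    recognize: hypothetical callable, features -> recognized m_notation.
    rate_threshold: T_q, in percent.
    min_tokens: N, the minimum number of tokens required.
    """
    extracted = []
    for q in table:
        tokens = tokens_by_set.get(q, [])
        # Too few tokens: send nothing and move on to the next set.
        # max(..., 1) also guards the division below.
        if len(tokens) < max(min_tokens, 1):
            continue
        correct = sum(
            1 for features, m_notation in tokens
            if recognize(features) == m_notation
        )
        rate = correct / len(tokens) * 100.0  # equation (2)
        if rate <= rate_threshold:
            extracted.append(q)
    return extracted


# Toy data: "features" are labels, so recognize is the identity.
data = {
    "(ka)a(ai)": [("x", "(a)a(a)")] * 3,  # 3 tokens, all misrecognized
    "(sa)i(ta)": [("x", "(a)i(t)")],      # 1 token, misrecognized
}
picked = extract_with_min_count(
    ["(ka)a(ai)", "(sa)i(ta)"], data, lambda f: f, 50.0, 2
)
```

With N = 2, the single-token set is skipped even though its recognition rate is 0, so only "(ka)a(ai)" is extracted. Skipping sparse sets avoids training a model from too little data.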

【０１４７】次にこの実施の形態４による音声パターン
モデル学習装置が使用する、ｍ音素組モデルとｎ音素組
モデルを学習する方法を具体的に説明する。実施の形態
４による音声パターンモデル学習装置では、上記実施の
形態３による音声パターンモデル学習装置と同様にｍ音
素組モデルとｎ音素組モデルの学習手順は大きく５つの
ステップに分けられる。Next, a method for learning the m phoneme set model and the n phoneme set model used by the speech pattern model learning apparatus according to the fourth embodiment will be specifically described. In the speech pattern model learning device according to the fourth embodiment, the learning procedure of the m phoneme set model and the n phoneme set model is roughly divided into five steps, similarly to the speech pattern model learning device according to the third embodiment.

【０１４８】まず、第１ステップは、読み上げ音声ｍ音
素組モデルを学習し学習により得た結果であるモデルの
パラメータおよびｍ音素組表記１３を読み上げ音声ｍ音
素組モデルメモリ１４に格納する、読み上げ音声ｍ音素
組モデルを学習するステップである。The first step is the step of learning the read-aloud speech m-phoneme set models, in which the read-aloud speech m-phoneme set models are learned and the model parameters obtained by learning, together with the m-phoneme set notations 13, are stored in the read-aloud speech m-phoneme set model memory 14.

【０１４９】次の第２ステップは、読み上げ音声ｍ音素
組モデルメモリ１４に格納されている読み上げ音声ｍ音
素組モデルを用いて、対話音声学習データメモリ８０に
格納されている各トークンの認識を行い、認識率の低い
ｍ音素組を抽出するステップである。The second step is the step of recognizing each token stored in the conversational speech learning data memory 80 using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14, and extracting the m-phoneme sets with low recognition rates.

【０１５０】次の第３ステップは、対話音声学習データ
メモリ８０に格納されているトークンを用いて上記第２
ステップで抽出したｍ音素組について、対話音声ｍ音素
組モデルを学習するステップである。The third step is the step of learning a conversational speech m-phoneme set model for each m-phoneme set extracted in the second step, using the tokens stored in the conversational speech learning data memory 80.

【０１５１】次の第４ステップは、読み上げ音声ｍ音素
組モデルメモリ１４に格納されている読み上げ音声ｍ音
素組モデルおよび対話音声ｍ音素組モデルメモリ１６に
格納されている対話音声ｍ音素組モデルを用いて、対話
音声学習データメモリ８０が保持するｎ音素組テーブル
に記述されたｎ音素組の中からトークンの数Ｎ_ｑが閾値
Ｎ以上でかつ認識率Ｒ_ｑが閾値Ｔ_ｑ以下であるｎ音素組
を抽出するステップである。The fourth step is the step of extracting, from the n-phoneme sets described in the n-phoneme set table held in the conversational speech learning data memory 80, those n-phoneme sets whose number of tokens N_q is greater than or equal to the threshold N and whose recognition rate R_q is less than or equal to the threshold T_q, using the read-aloud speech m-phoneme set models stored in the read-aloud speech m-phoneme set model memory 14 and the conversational speech m-phoneme set models stored in the conversational speech m-phoneme set model memory 16.

【０１５２】次の第５ステップは、対話音声学習データ
メモリ８０に格納されているトークンを用いて上記第４
ステップで抽出した各ｎ音素組について、対話音声ｎ音
素組モデルを学習するステップである。The fifth step is the step of learning a conversational speech n-phoneme set model for each n-phoneme set extracted in the fourth step, using the tokens stored in the conversational speech learning data memory 80.

【０１５３】上記第１〜第５ステップのうち、第１、第
２、第３および第５ステップは上記実施の形態３と全く
同じ手順であるので以下ではその説明を省略し、第４ス
テップであるｎ音素組の抽出手順を詳細に説明する。図
１５はこの第４ステップの抽出手順を示すフローチャー
トであり、以下では図１５を参照しながら抽出手順を詳
細に説明する。Of the first to fifth steps, the first, second, third, and fifth steps are exactly the same procedures as in Embodiment 3, so their description is omitted below, and the fourth step, the procedure for extracting n-phoneme sets, is described in detail. FIG. 15 is a flowchart showing the extraction procedure of the fourth step, which is described in detail below with reference to FIG. 15.

【０１５４】ｎ音素組抽出部１７は、まず、ステップＳ
Ｔ９０１において、読み上げ音声ｍ音素組モデルメモリ
１４から全ての読み上げ音声ｍ音素組モデルのパラメー
タとそのｍ音素組表記１３を読み込む。続いて、ｎ音素
組抽出部１７は、ステップＳＴ９０２において、対話音
声ｍ音素組モデルメモリ１６から全ての対話音声ｍ音素
組モデルのパラメータとそのｍ音素組表記１５を読み込
む。First, in step ST901, the n-phoneme set extraction unit 17 reads the parameters of all the read-aloud speech m-phoneme set models and their m-phoneme set notations 13 from the read-aloud speech m-phoneme set model memory 14. Then, in step ST902, the n-phoneme set extraction unit 17 reads the parameters of all the conversational speech m-phoneme set models and their m-phoneme set notations 15 from the conversational speech m-phoneme set model memory 16.

[0155] Next, in step ST903, the n-phoneme set extraction unit 17 reads the n-phoneme set table held by the dialogue speech learning data memory 80 and, following the contents of this table, selects the first n-phoneme set as the recognition target. If the n-phoneme set table is written as in FIG. 9, for example, the n-phoneme set extraction unit 17 selects the first n-phoneme set, (ka)a(ai), as the recognition target.

[0156] Next, in step ST904, the n-phoneme set extraction unit 17 reads, from the dialogue speech learning data memory 80, the time series 9 of feature vectors of all tokens whose n-phoneme set notation matches the n-phoneme set selected in step ST903 or step ST910.

[0157] Then, in step ST905, the n-phoneme set extraction unit 17 compares the number of tokens read, N_q (the subscript q denotes the name of the selected n-phoneme set), with a predetermined threshold N. If N_q is less than N, the unit sends nothing to the extracted n-phoneme set notation memory 19 and moves to step ST909. If N_q is equal to or greater than N, the n-phoneme set extraction unit 17 moves to step ST906.

[0158] Next, in step ST906, for each token read, the n-phoneme set extraction unit 17 computes the likelihood against all read speech m-phoneme set models and all dialogue speech m-phoneme set models read in steps ST901 and ST902, and takes the m-phoneme set notation of the m-phoneme set model with the highest likelihood as the recognition result for that token. The likelihood is computed using, for example, the Viterbi algorithm. After obtaining recognition results for all the tokens read, the n-phoneme set extraction unit 17 computes the recognition rate R_q by equation (2) above.

[0159] Next, in step ST907, the n-phoneme set extraction unit 17 compares the recognition rate R_q obtained in step ST906 with a predetermined threshold T_q. If R_q is equal to or less than T_q, the unit proceeds to step ST908 and sends the n-phoneme set notation 18 of that n-phoneme set to the extracted n-phoneme set notation memory 19, which holds the n-phoneme set notation 18 it receives. If the recognition rate R_q is greater than the threshold T_q, the n-phoneme set extraction unit 17 proceeds to step ST909.

[0160] In step ST909, the n-phoneme set extraction unit 17 refers to the n-phoneme set table stored in the dialogue speech learning data memory 80 and determines whether all n-phoneme sets present in the dialogue speech learning data memory 80 have already been selected. If an unselected n-phoneme set remains, the unit proceeds to step ST910, selects the next n-phoneme set as the recognition target in the order given in the n-phoneme set table, and returns to step ST904. If all n-phoneme sets have already been selected, the n-phoneme set extraction unit 17 ends the n-phoneme set extraction procedure.
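The extraction loop of steps ST903 to ST910 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the function name and its parameters are hypothetical, the recognizer is passed in as a callable instead of being implemented, the recognition rate is assumed to be the fraction of correctly recognized tokens (the exact form of equation (2) is not reproduced in this passage), and a single threshold value stands in for T_q.

```python
from typing import Callable, Dict, List, Sequence


def extract_n_phoneme_sets(
    table: Sequence[str],
    tokens_by_set: Dict[str, List[object]],
    recognize: Callable[[object], str],
    is_correct: Callable[[str, str], bool],
    min_tokens: int,
    max_rate: float,
) -> List[str]:
    """Return the n-phoneme sets with N_q >= min_tokens and R_q <= max_rate.

    All names here are hypothetical. `recognize` stands in for the
    likelihood-based recognition of ST906, and `is_correct` decides whether
    a recognition result counts as correct for the n-phoneme set q.
    """
    extracted: List[str] = []
    for q in table:  # ST903 / ST910: visit the sets in table order
        tokens = tokens_by_set.get(q, [])  # ST904: tokens labelled with q
        n_q = len(tokens)
        if n_q == 0 or min_tokens > n_q:  # ST905: too few tokens, skip q
            continue
        # ST906: recognize every token, then form the recognition rate.
        # Assumed form of equation (2): correct tokens divided by N_q.
        n_correct = sum(1 for tok in tokens if is_correct(q, recognize(tok)))
        r_q = n_correct / n_q
        if max_rate >= r_q:  # ST907: poorly recognized set
            extracted.append(q)  # ST908: keep its notation
    return extracted
```

Only sets that pass both tests are returned, which mirrors the role of the extracted n-phoneme set notation memory 19: a set with few tokens is skipped before any recognition is attempted.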

[0161] When the speech pattern model learning method according to Embodiment 4 is to be implemented in software, a computer-readable recording medium is required on which is recorded a program for causing a computer to learn speech pattern models, the program having: a first step of learning read speech m-phoneme set models and storing the results of the learning in the read speech m-phoneme set model memory 14; a second step of recognizing each token stored in the dialogue speech learning data memory 80 using the read speech m-phoneme set models stored in the read speech m-phoneme set model memory 14, and extracting m-phoneme sets with low recognition rates; a third step of learning a dialogue speech m-phoneme set model for each m-phoneme set extracted in the second step, using the tokens stored in the dialogue speech learning data memory 80; a fourth step of extracting, from among the n-phoneme sets described in the n-phoneme set table held by the dialogue speech learning data memory 80, the n-phoneme sets whose number of tokens N_q is equal to or greater than the threshold N and whose recognition rate R_q is equal to or less than the threshold T_q; and a step of learning a dialogue speech n-phoneme set model for each extracted n-phoneme set, using the tokens stored in the dialogue speech learning data memory 80.

[0162] As described above, the speech pattern model learning apparatus according to Embodiment 4 executes the improved n-phoneme set extraction procedure (steps ST901 to ST910 in FIG. 15), thereby extracting the n-phoneme set notations of all n-phoneme sets whose number of tokens N_q is equal to or greater than the threshold N and whose recognition rate R_q is equal to or less than the threshold T_q, and stores the n-phoneme set notations 18 of all extracted n-phoneme sets in the extracted n-phoneme set notation memory 19. Consequently, when learning the extracted n-phoneme set models, the apparatus learns models only for n-phoneme sets whose number of tokens N_q is equal to or greater than the threshold N. Among the n-phoneme sets of dialogue speech that are poorly recognized by the read speech m-phoneme set models and the dialogue speech m-phoneme set models, the apparatus thus avoids learning statistically unreliable models for sets whose number of tokens N_q is below the threshold N, and efficiently learns only statistically reliable models. Although Embodiment 4 has been described with m = 3 and n = 5, m and n may be any pair of integers for which m is less than n, and the same effect is obtained in that case as well.

[0163] Embodiment 5. FIG. 16 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 5 of the present invention. In the figure, 14 is the read speech m-phoneme set model memory; 16 is the dialogue speech m-phoneme set model memory; 21 is the dialogue speech n-phoneme set model memory; 22 is a speech signal input terminal; 23 is the speech signal input from the input terminal 22; 24 is an acoustic analysis unit that computes a time series of acoustic feature vectors of the speech signal 23; 25 is the time series of feature vectors output by the acoustic analysis unit 24; 26 is a recognition target vocabulary memory that stores the phoneme notations of the recognition target vocabulary; 27 is the phoneme set notation of the recognition target vocabulary; 28 is a recognition target vocabulary model creation unit (recognition target vocabulary model creation means) that creates speech pattern models for the recognition target vocabulary (that is, recognition target vocabulary models) by connecting in parallel the read speech m-phoneme set models, dialogue speech m-phoneme set models, and dialogue speech n-phoneme set models learned by the speech pattern model learning apparatus according to Embodiment 3 or 4; 29 is the parameters and phoneme set notation of a recognition target vocabulary model; 31 is a recognition target vocabulary model memory; 32 is a recognition unit (recognition means) that recognizes input speech using the speech pattern models for the recognition target vocabulary created by the recognition target vocabulary model creation unit 28; and 33 is the recognition result. In FIG. 16, reference numerals identical to those in FIG. 8 denote components identical or equivalent to those of the speech pattern model learning apparatus according to Embodiment 3.

[0164] The read speech m-phoneme set model memory 14 holds the parameters of all read speech m-phoneme set models created by the speech pattern model learning apparatus according to Embodiment 3 or 4, together with their m-phoneme set notations. Likewise, the dialogue speech m-phoneme set model memory 16 holds the parameters of all dialogue speech m-phoneme set models created by the speech pattern model learning apparatus according to Embodiment 3 or 4, together with their m-phoneme set notations. The dialogue speech n-phoneme set model memory 21 holds the parameters of all dialogue speech n-phoneme set models created by the speech pattern model learning apparatus according to Embodiment 3 or 4, together with their n-phoneme set notations. The following description assumes m = 3 and n = 5. It further assumes that the dialogue speech m-phoneme set model memory 16 holds the parameters and m-phoneme set notations of all dialogue speech m-phoneme set models created by the speech pattern model learning apparatus according to Embodiment 4, and that the dialogue speech n-phoneme set model memory 21 holds the parameters and n-phoneme set notations of all dialogue speech n-phoneme set models created by the speech pattern model learning apparatus according to Embodiment 4.

[0165] Next, the operation is described. Before performing recognition, the speech recognition apparatus according to Embodiment 5 creates the recognition target vocabulary models and holds them in the recognition target vocabulary model memory 31.

[0166] First, the method of creating the recognition target vocabulary models used by the speech recognition apparatus according to Embodiment 5 is described. The recognition target vocabulary model creation unit 28 creates models of the recognition target vocabulary stored in the recognition target vocabulary memory 26. The recognition target vocabulary memory 26 also holds the phoneme notation of each vocabulary item to be recognized. FIG. 17 shows an example of the contents of the recognition target vocabulary memory 26. This example targets user utterances in a hotel reservation scenario: vocabulary number 1 is "Yoyaku onegai shimasu" ("I would like to make a reservation"), vocabulary number 2 is "Ashita aitemasu ka" ("Do you have a vacancy tomorrow?"), and vocabulary number 1000 is "Eki kara chikai desu ka" ("Is it close to the station?"). The recognition target vocabulary model creation unit 28 creates the recognition target vocabulary models as follows.

[0167] (1) Recognition target vocabulary model creation procedure 1: The recognition target vocabulary model creation unit 28 selects the recognition target vocabulary item to be modeled, in the order of the vocabulary numbers listed in the recognition target vocabulary memory 26, and reads the phoneme notation 27 of that item. For example, if the contents of the recognition target vocabulary memory 26 are as shown in FIG. 17, the recognition target vocabulary model creation unit 28 first selects vocabulary number 1, /yoyakuonegaisimasu/.

[0168] (2) Recognition target vocabulary model creation procedure 2: Next, following the phoneme notation 27 of the selected recognition target vocabulary item, the recognition target vocabulary model creation unit 28 reads the parameters 13 of the read speech m-phoneme set models from the read speech m-phoneme set model memory 14 and connects the read speech m-phoneme set models in series to create a series connection model for the selected item. For example, when the phoneme notation is /yoyakuonegaisimasu/, since m = 3, a total of 18 m-phoneme set models are connected: /(#)y(o)/, /(y)o(y)/, /(o)y(a)/, /(y)a(k)/, /(a)k(u)/, /(k)u(o)/, /(u)o(n)/, /(o)n(e)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(a)i(s)/, /(i)s(i)/, /(s)i(m)/, /(i)m(a)/, /(m)a(s)/, /(a)s(u)/, /(s)u(#)/. Here /#/ denotes the silent interval before and after the utterance. In Embodiment 5, each m-phoneme set model is assumed to have a five-state structure as shown in FIG. 26, in which state 1 is the initial state and state 5 is the final state. The series connection model for /yoyakuonegaisimasu/ is as shown in FIG. 18.
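The decomposition of a phoneme notation into m-phoneme sets can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, each phoneme is assumed to be written as a single character (as in the /yoyakuonegaisimasu/ example), and m is assumed to be odd so that the left and right contexts have equal length.

```python
from typing import List


def phoneme_sets(phonemes: str, m: int = 3, pad: str = "#") -> List[str]:
    """Split a phoneme string into m-phoneme sets written as (left)center(right).

    Hypothetical helper: `pad` marks the silence before and after the
    utterance, and each character of `phonemes` is treated as one phoneme.
    """
    if m % 2 == 0 or 1 > m:
        raise ValueError("m must be a positive odd integer")
    ctx = (m - 1) // 2  # number of context phonemes on each side
    padded = pad * ctx + phonemes + pad * ctx
    sets: List[str] = []
    for i in range(ctx, ctx + len(phonemes)):
        left = padded[i - ctx:i]
        right = padded[i + 1:i + 1 + ctx]
        sets.append(f"({left}){padded[i]}({right})")
    return sets


# One m-phoneme set per phoneme of the utterance.
triphones = phoneme_sets("yoyakuonegaisimasu", m=3)
print(len(triphones), triphones[0], triphones[-1])
```

Calling the same helper with m=5 yields sets with two context phonemes on each side, such as (#y)o(ya).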

[0169] (3) Recognition target vocabulary model creation procedure 3: Next, the recognition target vocabulary model creation unit 28 refers to the m-phoneme set notations held by the dialogue speech m-phoneme set model memory 16. Of the m (= 3) phoneme sets that make up /yoyakuonegaisimasu/, namely /(#)y(o)/, /(y)o(y)/, /(o)y(a)/, /(y)a(k)/, /(a)k(u)/, /(k)u(o)/, /(u)o(n)/, /(o)n(e)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(a)i(s)/, /(i)s(i)/, /(s)i(m)/, /(i)m(a)/, /(m)a(s)/, /(a)s(u)/, /(s)u(#)/, the unit reads from the dialogue speech m-phoneme set model memory 16 the parameters 15 of the models of those m-phoneme sets that appear among the m-phoneme set notations held by that memory. It then creates a parallel connection model for the selected recognition target vocabulary item by connecting the read speech m-phoneme set model and the dialogue speech m-phoneme set model in parallel at the corresponding positions of the series connection model created in recognition target vocabulary model creation procedure 2.

[0170] Connecting in parallel means sharing the initial states and the final states of the read speech m-phoneme set model and the dialogue speech m-phoneme set model to be connected, so that a transition is possible from the single initial state into either model and, whichever model is entered, the transition ends at the common final state. For example, if the m-phoneme sets present among the m-phoneme set notations held by the dialogue speech m-phoneme set model memory 16 are the eight sets /(y)o(y)/, /(o)y(a)/, /(a)k(u)/, /(u)o(n)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(i)m(a)/, the dialogue speech m-phoneme set models are connected in parallel to the series connection model of FIG. 18, as shown in FIG. 19, to create the parallel connection model.
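The idea of sharing the initial and final states can be sketched as a small graph construction. This is a hypothetical illustration: the function name and the node labels are assumptions, each model is reduced to a plain left-to-right chain of five states, and self-loops or skip transitions that the actual topology of FIG. 26 may contain are omitted.

```python
from typing import List, Tuple

Arc = Tuple[str, str]


def parallel_connect(model_names: List[str], n_states: int = 5) -> List[Arc]:
    """Connect several unit models in parallel by sharing state 1 and state n.

    Hypothetical sketch: every model keeps its own inner states (2 to n-1),
    while the nodes "initial" and "final" are common to all models.
    """
    arcs: List[Arc] = []
    for name in model_names:
        inner = [f"{name}:{k}" for k in range(2, n_states)]
        path = ["initial"] + inner + ["final"]
        # Chain consecutive states of this model: initial -> 2 -> ... -> final.
        arcs.extend(zip(path, path[1:]))
    return arcs


# Read speech and dialogue speech models for the same m-phoneme set.
for arc in parallel_connect(["read:(y)o(y)", "dialogue:(y)o(y)"]):
    print(arc)
```

From the shared initial node there is one outgoing arc per model, and every model ends in the same final node, which is the property the paragraph above describes.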

[0171] (4) Recognition target vocabulary model creation procedure 4: Next, the recognition target vocabulary model creation unit 28 refers to the n-phoneme set notations held by the dialogue speech n-phoneme set model memory 21. Of the n (n = 5 in this example) phoneme sets that make up /yoyakuonegaisimasu/, namely /(##)y(oy)/, /(#y)o(ya)/, /(yo)y(ak)/, /(oy)a(ku)/, /(ya)k(uo)/, /(ak)u(on)/, /(ku)o(ne)/, /(uo)n(eg)/, /(on)e(ga)/, /(ne)g(ai)/, /(eg)a(is)/, /(ga)i(si)/, /(ai)s(im)/, /(is)i(ma)/, /(si)m(as)/, /(im)a(su)/, /(ma)s(u#)/, /(as)u(##)/, the unit reads from the dialogue speech n-phoneme set model memory 21 the parameters 20 of the models of those n-phoneme sets that appear among the n-phoneme set notations held by that memory, and connects these models further in parallel at the corresponding positions of the parallel connection model created in recognition target vocabulary model creation procedure 3, thereby creating the recognition target vocabulary model for the selected recognition target vocabulary item. For example, if the n-phoneme sets present among the n-phoneme set notations held by the dialogue speech n-phoneme set model memory 21 are the three sets /(#y)o(ya)/, /(yo)y(ak)/, /(ne)g(ai)/, the recognition target vocabulary model creation unit 28 connects the dialogue speech n-phoneme set models as shown in FIG. 20 to create the recognition target vocabulary model. The recognition target vocabulary model creation unit 28 sends the parameters of the completed recognition target vocabulary model and its phoneme notation 29 to the recognition target vocabulary model memory 31.
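The way procedures 2 to 4 stack alternatives at each phoneme position can be sketched as follows. The sketch is hypothetical in every name it introduces: it takes the m-phoneme and n-phoneme set notations of the word as already computed lists, represents each position as a list of alternative model labels, and leaves out model parameters entirely.

```python
from typing import List, Sequence, Set


def build_slots(
    m_sets: Sequence[str],
    n_sets: Sequence[str],
    dialogue_m: Set[str],
    dialogue_n: Set[str],
) -> List[List[str]]:
    """List, per phoneme position, the models that are connected in parallel.

    Hypothetical sketch: a read speech m-phoneme set model is always present
    (procedure 2); a dialogue speech m-phoneme set model (procedure 3) and a
    dialogue speech n-phoneme set model (procedure 4) are added only when
    their notation exists in the corresponding inventory.
    """
    if len(m_sets) != len(n_sets):
        raise ValueError("both decompositions must cover the same positions")
    slots: List[List[str]] = []
    for m_set, n_set in zip(m_sets, n_sets):
        alternatives = [f"read:{m_set}"]
        if m_set in dialogue_m:
            alternatives.append(f"dialogue:{m_set}")
        if n_set in dialogue_n:
            alternatives.append(f"dialogue:{n_set}")
        slots.append(alternatives)
    return slots


# First three positions of /yoyakuonegaisimasu/ with the example inventories.
slots = build_slots(
    ["(#)y(o)", "(y)o(y)", "(o)y(a)"],
    ["(##)y(oy)", "(#y)o(ya)", "(yo)y(ak)"],
    {"(y)o(y)", "(o)y(a)", "(a)k(u)"},
    {"(#y)o(ya)", "(yo)y(ak)", "(ne)g(ai)"},
)
for position, alternatives in enumerate(slots, start=1):
    print(position, alternatives)
```

A position with no dialogue speech model keeps only its read speech model, so the word model degrades to the plain series connection wherever the inventories offer nothing.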

[0172] (5) Recognition target vocabulary model creation procedure 5: Next, the recognition target vocabulary model creation unit 28 refers to the recognition target vocabulary memory 26 and, until recognition target vocabulary models have been created for all recognition target vocabulary items in the memory, selects the next item to be modeled in order of vocabulary number and repeats recognition target vocabulary model creation procedures 2 to 4. Once recognition target vocabulary models have been created in this way for all recognition target vocabulary items in the recognition target vocabulary memory 26, the recognition target vocabulary model creation unit 28 ends the recognition target vocabulary model creation procedure.

[0173] Next, the recognition operation of the speech recognition apparatus according to Embodiment 5 is described. Before starting the recognition operation, the recognition unit 32 reads the parameters of all recognition target vocabulary models held in the recognition target vocabulary model memory 31, together with the phoneme notation that each recognition target vocabulary model represents. For example, if the recognition target vocabulary is as shown in FIG. 17, the recognition unit 32 reads 1000 recognition target vocabulary models and their corresponding phoneme notations from the recognition target vocabulary model memory 31.

[0174] The recognition operation of the recognition unit 32 proceeds as follows. When the speech signal 23 is input from the input terminal 22, the acoustic analysis unit 24 converts the speech signal 23 into a time series 25 of feature vectors. This time series 25 of feature vectors is, for example, a time series of LPC cepstra.

[0175] The recognition unit 32 takes the time series 25 of feature vectors as input, computes its likelihood against all the recognition target vocabulary models read in advance, using, for example, the Viterbi algorithm, and outputs as the recognition result 33 the phoneme notation represented by the recognition target vocabulary model with the highest likelihood.
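The selection of the highest-likelihood vocabulary model can be sketched with a small Viterbi scorer. This is a toy illustration under several assumptions that are not in the source: observations are discrete symbols instead of LPC cepstrum vectors, each vocabulary model is a plain hidden Markov model given by log-probability tables, the best path may end in any state instead of a designated final state, and all function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (log initial probabilities, log transition matrix, log emission matrix)
Hmm = Tuple[List[float], List[List[float]], List[List[float]]]


def log_rows(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert a table of strictly positive probabilities to logs."""
    return [[math.log(p) for p in row] for row in rows]


def viterbi_score(obs: Sequence[int], hmm: Hmm) -> float:
    """Log-likelihood of the single best state path for `obs` (non-empty)."""
    log_init, log_trans, log_emit = hmm
    n = len(log_init)
    # delta[j]: best log score of any state path that ends in state j.
    delta = [log_init[j] + log_emit[j][obs[0]] for j in range(n)]
    for symbol in obs[1:]:
        delta = [
            max(delta[i] + log_trans[i][j] for i in range(n)) + log_emit[j][symbol]
            for j in range(n)
        ]
    return max(delta)


def recognize(obs: Sequence[int], models: Dict[str, Hmm]) -> str:
    """Return the notation of the model with the highest Viterbi score."""
    return max(models, key=lambda name: viterbi_score(obs, models[name]))


# Two toy word models that differ only in which symbol they prefer to emit.
init = [math.log(0.9), math.log(0.1)]
trans = log_rows([[0.5, 0.5], [0.1, 0.9]])
models = {
    "word_a": (init, trans, log_rows([[0.9, 0.1], [0.8, 0.2]])),
    "word_b": (init, trans, log_rows([[0.1, 0.9], [0.2, 0.8]])),
}
print(recognize([0, 0, 0], models))
```

In the apparatus described here the scored models would be the recognition target vocabulary models with their parallel branches, so the best path also decides, position by position, which of the parallel unit models explains the input.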

[0176] Next, the speech recognition method used by the speech recognition apparatus according to Embodiment 5 is described concretely. As noted above, in the speech recognition method according to Embodiment 5 the recognition target vocabulary models are created before recognition is performed, and the created models are held in the recognition target vocabulary model memory 31. First, the procedure for creating the recognition target vocabulary models is described.

[0177] FIG. 21 is a flowchart showing the details of the procedure for creating recognition target vocabulary models in the speech recognition method according to Embodiment 5 of the present invention. The creation procedure is described below with reference to FIG. 21.

[0178] First, in step ST1001, the recognition target vocabulary model creation unit 28 refers to the recognition target vocabulary memory 26, selects the recognition target vocabulary item with vocabulary number 1 as the item to be modeled, and reads the phoneme notation 27 of this item from the recognition target vocabulary memory 26. For example, if the contents of the recognition target vocabulary memory 26 are as shown in FIG. 17, the recognition target vocabulary model creation unit 28 first selects vocabulary number 1, /yoyakuonegaisimasu/.

[0179] Next, in step ST1002, following the phoneme notation 27 of the recognition target vocabulary item selected in step ST1001 or step ST1007, the recognition target vocabulary model creation unit 28 reads the parameters 13 of the read speech m-phoneme set models from the read speech m-phoneme set model memory 14 and connects the read speech m-phoneme set models in series to create a series connection model for the recognition target vocabulary item. For example, when the phoneme notation is /yoyakuonegaisimasu/ and m = 3, a total of 18 m-phoneme set models are connected: /(#)y(o)/, /(y)o(y)/, /(o)y(a)/, /(y)a(k)/, /(a)k(u)/, /(k)u(o)/, /(u)o(n)/, /(o)n(e)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(a)i(s)/, /(i)s(i)/, /(s)i(m)/, /(i)m(a)/, /(m)a(s)/, /(a)s(u)/, /(s)u(#)/. Here /#/ denotes the silent interval before and after the utterance. As noted above, in Embodiment 5 each m-phoneme set model is assumed to have a five-state structure as shown in FIG. 26. The series connection model for the phoneme notation /yoyakuonegaisimasu/ is therefore as shown in FIG. 18.

[0180] Next, in step ST1003, the recognition target vocabulary model creation unit 28 refers to the m-phoneme set notations held by the dialogue speech m-phoneme set model memory 16. Of the m-phoneme sets that make up the phoneme notation /yoyakuonegaisimasu/, namely /(#)y(o)/, /(y)o(y)/, /(o)y(a)/, /(y)a(k)/, /(a)k(u)/, /(k)u(o)/, /(u)o(n)/, /(o)n(e)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(a)i(s)/, /(i)s(i)/, /(s)i(m)/, /(i)m(a)/, /(m)a(s)/, /(a)s(u)/, /(s)u(#)/, the unit reads from the dialogue speech m-phoneme set model memory 16 the parameters 15 of the models of those m-phoneme sets that appear among the m-phoneme set notations held by that memory. It then creates a parallel connection model for the selected recognition target vocabulary item by connecting the read speech m-phoneme set model and the dialogue speech m-phoneme set model in parallel at the corresponding positions of the series connection model created in step ST1002.

[0181] For example, if the m-phoneme sets present among the m-phoneme set notations held by the dialogue speech m-phoneme set model memory 16 are the eight sets /(y)o(y)/, /(o)y(a)/, /(a)k(u)/, /(u)o(n)/, /(n)e(g)/, /(e)g(a)/, /(g)a(i)/, /(i)m(a)/, the recognition target vocabulary model creation unit 28 connects these dialogue speech m-phoneme set models as shown in FIG. 19 to create the parallel connection model.

[0182] Next, in step ST1004, the recognition target vocabulary model creation unit 28 refers to the n-phoneme set notations held in the conversational-speech n-phoneme set model memory 21. The phoneme notation /yoyakuonegaisimasu/ consists of the following n-phoneme sets (n = 5 in Embodiment 5): /(##)y(oy)/, /(#y)o(ya)/, /(yo)y(ak)/, /(oy)a(ku)/, /(ya)k(uo)/, /(ak)u(on)/, /(ku)o(ne)/, /(uo)n(eg)/, /(on)e(ga)/, /(ne)g(ai)/, /(eg)a(is)/, /(ga)i(si)/, /(ai)s(im)/, /(is)i(ma)/, /(si)m(as)/, /(im)a(su)/, /(ma)s(u#)/, and /(as)u(##)/. For those n-phoneme sets that are present in the n-phoneme set notations held in the conversational-speech n-phoneme set model memory 21, the unit reads the model parameters 20 from the memory 21 and connects the models further in parallel at the corresponding positions of the parallel connection model created in step ST1003, thereby creating the recognition target vocabulary model for the selected recognition target vocabulary item. For example, if three of these n-phoneme sets, /(#y)o(ya)/, /(yo)y(ak)/, and /(ne)g(ai)/, are present in the n-phoneme set notations held in the memory 21, the conversational-speech n-phoneme set models are connected as shown in FIG. 20 to create the recognition target vocabulary model.
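
The decomposition and parallel connection in step ST1004 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `phoneme_sets` and `build_word_model`, the branch tags, and the one-character-per-phoneme split are assumptions, and the 3-phoneme notation `/(#)y(o)/` is written by analogy with the 5-phoneme notation above.

```python
from typing import List, Set, Tuple


def phoneme_sets(phonemes: List[str], n: int) -> List[str]:
    """Return one n-phoneme set label per phoneme, padding word edges with '#'."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    half = (n - 1) // 2  # number of context phonemes on each side
    padded = ["#"] * half + phonemes + ["#"] * half
    labels = []
    for i in range(len(phonemes)):
        left = "".join(padded[i:i + half])
        center = padded[i + half]
        right = "".join(padded[i + half + 1:i + n])
        labels.append(f"/({left}){center}({right})/")
    return labels


def build_word_model(
    phonemes: List[str],
    conv_m_sets: Set[str],
    conv_n_sets: Set[str],
    m: int = 3,
    n: int = 5,
) -> List[List[Tuple[str, str]]]:
    """For each phoneme position, list the models connected in parallel there.

    The read-speech m-phoneme set model is always present; conversational
    m- and n-phoneme set models are added only where a trained model exists.
    """
    slots = []
    for m_label, n_label in zip(phoneme_sets(phonemes, m), phoneme_sets(phonemes, n)):
        branches = [("read_m", m_label)]
        if m_label in conv_m_sets:
            branches.append(("conv_m", m_label))
        if n_label in conv_n_sets:
            branches.append(("conv_n", n_label))
        slots.append(branches)
    return slots


word = list("yoyakuonegaisimasu")  # assumes one character per phoneme
stored_n = {"/(#y)o(ya)/", "/(yo)y(ak)/", "/(ne)g(ai)/"}
model = build_word_model(word, conv_m_sets=set(), conv_n_sets=stored_n)
```

With the three stored n-phoneme sets from the example, only the second, third, and tenth phoneme positions gain an extra parallel branch; every other position keeps its existing branches unchanged.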

[0183] Next, in step ST1005, the recognition target vocabulary model creation unit 28 sends the parameters of the recognition target vocabulary model, for which parallel connection is now complete, and its phoneme notation 29 to the recognition target vocabulary model memory 31. The recognition target vocabulary model memory 31 holds the received model parameters and the phoneme notation 29.

[0184] Next, in step ST1006, the recognition target vocabulary model creation unit 28 refers to the recognition target vocabulary memory 26 and checks whether a recognition target vocabulary model has been created for every recognition target vocabulary item in the memory 26. If an item without a model remains, the process proceeds to step ST1007, selects the next recognition target vocabulary item from the memory 26, and returns to step ST1002. If no item without a model remains in the memory 26, the recognition target vocabulary model creation unit 28 ends the model creation procedure.

[0185] Next, the speech recognition procedure of the speech recognition method according to Embodiment 5 is described in detail. As already stated, before starting the recognition operation, the recognition unit 32 reads the parameters of all recognition target vocabulary models held in the recognition target vocabulary model memory 31, together with the phoneme notation that each model represents. For example, if the recognition target vocabulary is as shown in FIG. 17, the recognition unit 32 reads 1000 recognition target vocabulary models and their corresponding phoneme notations from the recognition target vocabulary model memory 31.

[0186] FIG. 22 is a flowchart showing the details of the speech recognition procedure in the speech recognition method according to Embodiment 5 of the present invention. The procedure is described below with reference to FIG. 22. First, in step ST1201, the acoustic analysis unit 24 converts the speech signal 23 input from the input terminal 22 into a time series 25 of feature vectors. This time series 25 of feature vectors is a time series of LPC cepstra.
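
As a rough illustration of how one frame of speech can be turned into LPC cepstrum coefficients, the sketch below uses the textbook autocorrelation method, the Levinson-Durbin recursion, and the standard LPC-to-cepstrum recursion. The patent does not specify the analysis order, windowing, or frame length, so the function names and every parameter here are assumptions.

```python
from typing import List, Sequence, Tuple


def autocorrelation(frame: Sequence[float], order: int) -> List[float]:
    """Autocorrelation r[0..order] of one analysis frame."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t - lag] for t in range(lag, n))
        for lag in range(order + 1)
    ]


def levinson_durbin(r: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Solve for predictor coefficients a[1..order], where s[t] ~ sum_k a[k] * s[t-k].

    Returns (coefficients, residual prediction error).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Cepstrum of the all-pole filter 1 / (1 - sum_k a[k] z^-k)."""
    c: List[float] = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            if n - k <= len(a):
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


def lpc_cepstrum(frame: Sequence[float], order: int = 10, n_ceps: int = 10) -> List[float]:
    """Feature vector for one frame: autocorrelation -> LPC -> cepstrum."""
    coeffs, _ = levinson_durbin(autocorrelation(frame, order), order)
    return lpc_to_cepstrum(coeffs, n_ceps)
```

Applying `lpc_cepstrum` to successive frames of the input signal would give a time series of feature vectors of the kind described in step ST1201.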

[0187] Next, in step ST1202, the recognition unit 32 takes the time series 25 of feature vectors as input, computes its likelihood against every recognition target vocabulary model read in advance, for example by the Viterbi algorithm, and outputs as the recognition result 33 the phoneme notation modeled by the recognition target vocabulary model with the highest likelihood.
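
The scoring in step ST1202 can be sketched as a log-domain Viterbi pass over each vocabulary model followed by an argmax. This is a simplified sketch under stated assumptions: each model is reduced to a single left-to-right chain of states with diagonal Gaussian outputs and fixed transition probabilities, whereas the recognition target vocabulary models described here also contain parallel branches. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One state = (mean vector, variance vector) of a diagonal Gaussian output.
State = Tuple[Sequence[float], Sequence[float]]


def log_gauss(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi_log_likelihood(
    frames: Sequence[Sequence[float]],
    states: Sequence[State],
    log_stay: float = math.log(0.5),
    log_next: float = math.log(0.5),
) -> float:
    """Best-path log likelihood through a left-to-right chain.

    The path must start in the first state and end in the last state.
    """
    if not frames or not states:
        raise ValueError("frames and states must be non-empty")
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    score[0] = log_gauss(frames[0], *states[0])
    for x in frames[1:]:
        prev = score
        score = []
        for j, (mean, var) in enumerate(states):
            best = prev[j] + log_stay  # stay in the same state
            if j > 0:
                best = max(best, prev[j - 1] + log_next)  # advance by one state
            score.append(best + log_gauss(x, mean, var))
    return score[-1]


def recognize(frames: Sequence[Sequence[float]], models: Dict[str, List[State]]) -> str:
    """Return the phoneme notation whose model gives the highest likelihood."""
    return max(models, key=lambda name: viterbi_log_likelihood(frames, models[name]))
```

For a vocabulary of 1000 items, `recognize` would evaluate 1000 models per utterance; the sketch makes no attempt at pruning or sharing computation between models.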

[0188] When the speech recognition method according to Embodiment 5 is implemented in software, a computer-readable recording medium is required that records a speech recognition program for causing a computer to execute speech recognition. The program has: a recognition target vocabulary model creation step of creating speech pattern models for the recognition target vocabulary (that is, recognition target vocabulary models); an acoustic analysis step of converting the speech signal 23 input from the speech signal input terminal 22 into a time series 25 of feature vectors; and a step of taking the time series 25 of feature vectors as input, computing its likelihood against every recognition target vocabulary model read in advance, for example by the Viterbi algorithm, and outputting as the recognition result 33 the phoneme notation modeled by the recognition target vocabulary model with the highest likelihood.

[0189] As described above, the speech recognition device according to Embodiment 5 uses, as shown in FIG. 20, the read-speech m-phoneme set models, conversational-speech m-phoneme set models, and conversational-speech n-phoneme set models learned by the speech pattern model learning device according to Embodiment 3 or 4. For m-phoneme sets and n-phoneme sets that show low recognition performance on fast, ambiguous speech such as conversational speech, separate speech pattern models are created for the recognition target vocabulary so that the acoustic features of those m-phoneme sets and n-phoneme sets are modeled with high accuracy, and these models are connected in parallel with the read-speech m-phoneme set models to create the recognition target vocabulary model. Embodiment 5 can therefore recognize careful utterances such as read speech with high accuracy, and can also improve recognition accuracy for fast, ambiguous speech such as conversational speech. Although Embodiment 5 has been described with m = 3 and n = 5, m and n may be any pair of integers satisfying m < n, and the same effect is obtained in that case.

【０１９０】[0190]

[Effects of the Invention] As described above, according to the present invention, the configuration includes m-phoneme set extraction means (or an m-phoneme set extraction step) that uses read-speech m-phoneme set models, learned from speech of text read aloud, to extract from conversational speech learning data the m-phoneme sets whose recognition rate is at or below a predetermined threshold, and model learning means (or a model learning step) that learns a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data. Consequently, without learning conversational-speech m-phoneme set models for all m-phoneme sets, it is possible to efficiently learn conversational-speech m-phoneme set models that can recognize even conversational speech that was difficult to recognize with the read-speech m-phoneme set models learned from read speech.

[0191] According to the present invention, the m-phoneme set extraction means (or m-phoneme set extraction step) selects from the conversational speech learning data the m-phoneme sets for which the number of data items having the same m-phoneme set notation is at or above a predetermined number, recognizes each selected m-phoneme set using the read-speech m-phoneme set models, and extracts the selected m-phoneme set if its recognition rate is at or below a predetermined threshold. Consequently, among the conversational-speech m-phoneme sets that have a low recognition rate with the read-speech m-phoneme set models, learning of statistically unreliable models from fewer data items than the predetermined number is avoided, and only statistically reliable models are learned efficiently.
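
The selection rule described above (keep a phoneme set only if it has enough tokens and its recognition rate is at or below the threshold) can be sketched as follows. The function name, the token representation as (reference label, recognized label) pairs, and the example threshold values are assumptions made for illustration; the patent leaves the concrete threshold and minimum count as design parameters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def extract_low_rate_sets(
    tokens: Iterable[Tuple[str, str]],
    rate_threshold: float,
    min_tokens: int = 1,
) -> List[str]:
    """Return the phoneme set labels that should get a conversational-speech model.

    tokens: (reference label, label recognized with the existing models) pairs.
    A label is extracted when it has at least `min_tokens` tokens and its
    recognition rate is at or below `rate_threshold`.
    """
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for reference, recognized in tokens:
        total[reference] += 1
        if recognized == reference:
            correct[reference] += 1
    return sorted(
        label
        for label, count in total.items()
        if count >= min_tokens and correct[label] / count <= rate_threshold
    )


tokens = [
    ("/(y)o(y)/", "/(y)o(y)/"),
    ("/(y)o(y)/", "/(y)o(k)/"),
    ("/(y)o(y)/", "/(k)o(y)/"),
    ("/(a)k(u)/", "/(a)k(u)/"),
    ("/(a)k(u)/", "/(a)k(u)/"),
    ("/(n)e(g)/", "/(n)e(k)/"),
]
selected = extract_low_rate_sets(tokens, rate_threshold=0.5, min_tokens=2)
```

In this toy data, `/(n)e(g)/` is never recognized correctly but has only one token, so it is skipped as statistically unreliable, while `/(y)o(y)/` (one of three tokens correct) is selected.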

[0192] According to the present invention, the configuration includes: m-phoneme set extraction means (or an m-phoneme set extraction step) that uses read-speech m-phoneme set models, learned from speech of text read aloud, to extract from conversational speech learning data the m-phoneme sets whose recognition rate is at or below a first predetermined threshold; conversational-speech m-phoneme set model learning means (or a corresponding learning step) that learns a conversational-speech m-phoneme set model for each extracted m-phoneme set using the conversational speech learning data; n-phoneme set extraction means (or an n-phoneme set extraction step) that uses the read-speech m-phoneme set models and the conversational-speech m-phoneme set models to extract from the conversational speech learning data the n-phoneme sets whose recognition rate is at or below a second predetermined threshold; and conversational-speech n-phoneme set model learning means (or a corresponding learning step) that learns a conversational-speech n-phoneme set model for each extracted n-phoneme set using the conversational speech learning data. Consequently, conversational-speech n-phoneme set models can be learned efficiently for each n-phoneme set for which, in fast and ambiguous speech such as conversational speech, the read-speech m-phoneme set models and conversational-speech m-phoneme set models do not give sufficient recognition performance.

[0193] According to the present invention, the n-phoneme set extraction means (or n-phoneme set extraction step) selects from the conversational speech learning data the n-phoneme sets for which the number of data items having the same n-phoneme set notation is at or above a predetermined number, recognizes each selected n-phoneme set using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and extracts the selected n-phoneme set if its recognition rate is at or below the second predetermined threshold. Consequently, among the conversational-speech n-phoneme sets that have a low recognition rate with the read-speech n-phoneme set models, learning of statistically unreliable models from fewer data items than the predetermined number is avoided, and only statistically reliable models are learned efficiently.

[0194] According to the present invention, the configuration includes recognition target vocabulary model creation means (or a recognition target vocabulary model creation step) that creates a speech pattern model for the recognition target vocabulary by connecting in parallel the read-speech m-phoneme set models, conversational-speech m-phoneme set models, and conversational-speech n-phoneme set models learned by the speech pattern model learning device or the speech pattern model learning method, and recognition means (or a recognition step) that recognizes input speech using the speech pattern model for the recognition target vocabulary so created. Consequently, careful utterances such as read speech can be recognized with high accuracy, and recognition accuracy can also be improved for fast, ambiguous speech such as conversational speech.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech pattern model learning device according to Embodiment 1 of the present invention.

FIG. 2 is a diagram showing an example of the contents of the extracted m-phoneme set notation memory of the speech pattern model learning device according to Embodiment 1 of the present invention.

FIG. 3 is a flowchart showing the procedure of a speech pattern model learning method according to Embodiment 1 of the present invention.

FIG. 4 is a flowchart showing the learning procedure for read-speech m-phoneme set models in Embodiment 1 of the present invention.

FIG. 5 is a flowchart showing the extraction procedure for extracting m-phoneme sets with a low recognition rate in Embodiment 1 of the present invention.

FIG. 6 is a flowchart showing the learning procedure for conversational-speech m-phoneme set models in Embodiment 1 of the present invention.

FIG. 7 is a flowchart showing the extraction procedure, in the speech pattern model learning device according to Embodiment 2 of the present invention, for extracting m-phoneme sets that have at least a predetermined number of tokens and a low recognition rate.

FIG. 8 is a block diagram showing the configuration of a speech pattern model learning device according to Embodiment 3 of the present invention.

FIG. 9 is a diagram showing an example of the 5-phoneme set table held in the conversational speech learning data memory of the speech pattern model learning device according to Embodiment 3 of the present invention.

FIG. 10 is a diagram showing an example of the 5-phoneme set notations, assigned together with 3-phoneme set notations, held in the conversational speech learning data memory of the speech pattern model learning device according to Embodiment 3 of the present invention.

FIG. 11 is a diagram showing an example of the contents of the extracted n-phoneme set notation memory of the speech pattern model learning device according to Embodiment 3 of the present invention.

FIG. 12 is a flowchart showing the procedure of a speech pattern model learning method according to Embodiment 3 of the present invention.

FIG. 13 is a flowchart showing the extraction procedure for extracting n-phoneme sets with a low recognition rate in Embodiment 3 of the present invention.

FIG. 14 is a flowchart showing the learning procedure for conversational-speech n-phoneme set models in Embodiment 3 of the present invention.

FIG. 15 is a flowchart showing the extraction procedure, in the speech pattern model learning device according to Embodiment 4 of the present invention, for extracting n-phoneme sets that have at least a predetermined number of tokens and a low recognition rate.

FIG. 16 is a block diagram showing the configuration of a speech recognition device according to Embodiment 5 of the present invention.

FIG. 17 is a diagram showing an example of the contents of the recognition target vocabulary memory of the speech recognition device according to Embodiment 5 of the present invention.

FIG. 18 is a diagram showing the series connection model for the recognition target vocabulary item /yoyakuonegaisimasu/.

FIG. 19 is a diagram showing the parallel connection model, created by the speech recognition device according to Embodiment 5 of the present invention, in which conversational-speech m-phoneme set models are connected in parallel to the series connection model of FIG. 18.

FIG. 20 is a diagram showing the recognition target vocabulary model, created by the speech recognition device according to Embodiment 5 of the present invention, in which conversational-speech m-phoneme set models and conversational-speech n-phoneme set models are connected in parallel to the series connection model of FIG. 18.

FIG. 21 is a flowchart showing the procedure for creating recognition target vocabulary models in the speech recognition method according to Embodiment 5 of the present invention.

FIG. 22 is a flowchart showing the details of the speech recognition procedure in the speech recognition method according to Embodiment 5 of the present invention.

FIG. 23 is a block diagram showing the configuration of an example of a conventional speech pattern model learning device.

FIG. 24 is a diagram showing an example of the 3-phoneme set table held in the learning data memory of the conventional speech pattern model learning device.

FIG. 25 is a diagram showing an example of the 3-phoneme set notations of tokens held in the learning data memory of the conventional speech pattern model learning device.

FIG. 26 is a diagram showing a five-state left-to-right model, which is an example of the structure of a 3-phoneme set model.

[Explanation of symbols]

3: model learning unit (model learning means); 6: read speech learning data memory; 8, 80: conversational speech learning data memory; 10: m-phoneme set extraction unit (m-phoneme set extraction means); 12: extracted m-phoneme set notation memory; 14: read-speech m-phoneme set model memory; 16: conversational-speech m-phoneme set model memory; 17: n-phoneme set extraction unit (n-phoneme set extraction means); 19: extracted n-phoneme set notation memory; 21: conversational-speech n-phoneme set model memory; 24: acoustic analysis unit; 26: recognition target vocabulary memory; 28: recognition target vocabulary model creation unit (recognition target vocabulary model creation means); 30: model learning unit (conversational-speech m-phoneme set model learning means, conversational-speech n-phoneme set model learning means); 31: recognition target vocabulary model memory; 32: recognition unit (recognition means).


Claims

[Claims]

1. A speech pattern model learning device comprising: m-phoneme set extraction means for recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it, and for extracting the m-phoneme sets whose recognition rate is at or below a predetermined threshold; and model learning means for learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each m-phoneme set extracted by the m-phoneme set extraction means.

2. The speech pattern model learning device according to claim 1, wherein the m-phoneme set extraction means selects, from the conversational speech learning data, the m-phoneme sets for which the number of data items having the same m-phoneme set notation is at or above a predetermined number, recognizes the selected m-phoneme sets using the read-speech m-phoneme set models, and extracts a selected m-phoneme set if its recognition rate is at or below the predetermined threshold.

3. A speech pattern model learning device comprising: m-phoneme set extraction means for recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it, and for extracting the m-phoneme sets whose recognition rate is at or below a first predetermined threshold; conversational-speech m-phoneme set model learning means for learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each m-phoneme set extracted by the m-phoneme set extraction means; n-phoneme set extraction means for recognizing, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, each n-phoneme set contained in the conversational speech learning data, an n-phoneme set being a phoneme that, with n > m, takes into account differences in phonemes over a longer range than an m-phoneme set, and for extracting the n-phoneme sets whose recognition rate is at or below a second predetermined threshold; and conversational-speech n-phoneme set model learning means for learning a conversational-speech n-phoneme set model, using the conversational speech learning data, for each n-phoneme set extracted by the n-phoneme set extraction means.

4. The speech pattern model learning device according to claim 3, wherein the n-phoneme set extraction means selects, from the conversational speech learning data, the n-phoneme sets for which the number of data items having the same n-phoneme set notation is at or above a predetermined number, recognizes the selected n-phoneme sets using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and extracts a selected n-phoneme set if its recognition rate is at or below the second predetermined threshold.

5. A speech recognition device comprising: recognition target vocabulary model creation means for creating a speech pattern model for a recognition target vocabulary by connecting in parallel the read-speech m-phoneme set models, conversational-speech m-phoneme set models, and conversational-speech n-phoneme set models learned by the speech pattern model learning device according to claim 3; and recognition means for recognizing input speech using the speech pattern model for the recognition target vocabulary created by the recognition target vocabulary model creation means.

6. A speech pattern model learning method comprising: recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it; extracting the m-phoneme sets whose recognition rate is at or below a predetermined threshold; and learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each extracted m-phoneme set.

7. The speech pattern model learning method according to claim 6, wherein, when the m-phoneme sets are extracted, the m-phoneme sets for which the number of data items having the same m-phoneme set notation is at or above a predetermined number are selected from the conversational speech learning data, the selected m-phoneme sets are recognized using the read-speech m-phoneme set models, and a selected m-phoneme set is extracted if its recognition rate is at or below the predetermined threshold.

8. A speech pattern model learning method comprising: recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it; extracting the m-phoneme sets whose recognition rate is at or below a first predetermined threshold; learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each extracted m-phoneme set; recognizing, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, each n-phoneme set contained in the conversational speech learning data, an n-phoneme set being a phoneme that, with n > m, takes into account differences in phonemes over a longer range than an m-phoneme set; extracting the n-phoneme sets whose recognition rate is at or below a second predetermined threshold; and learning a conversational-speech n-phoneme set model, using the conversational speech learning data, for each extracted n-phoneme set.

9. The speech pattern model learning method according to claim 8, wherein, when the n-phoneme sets are extracted, the n-phoneme sets for which the number of data items having the same n-phoneme set notation is at or above a predetermined number are selected from the conversational speech learning data, the selected n-phoneme sets are recognized using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and a selected n-phoneme set is extracted if its recognition rate is at or below the second predetermined threshold.

10. A speech recognition method comprising: creating a speech pattern model for a recognition target vocabulary by connecting in parallel the read-speech m-phoneme set models, conversational-speech m-phoneme set models, and conversational-speech n-phoneme set models learned by the speech pattern model learning method according to claim 8; and recognizing input speech using the created speech pattern model for the recognition target vocabulary.

11. A computer-readable recording medium recording a speech pattern model learning program comprising: an m-phoneme set extraction step of recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it, and of extracting the m-phoneme sets whose recognition rate is at or below a predetermined threshold; and a conversational-speech m-phoneme set model learning step of learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each m-phoneme set extracted in the m-phoneme set extraction step.

12. The computer-readable recording medium recording a speech pattern model learning program according to claim 11, wherein the m-phoneme set extraction step selects, from the conversational speech learning data, the m-phoneme sets for which the number of data items having the same m-phoneme set notation is at or above a predetermined number, recognizes the selected m-phoneme sets using the read-speech m-phoneme set models, and extracts a selected m-phoneme set if its recognition rate is at or below the predetermined threshold.

13. A computer-readable recording medium recording a speech pattern model learning program comprising: an m-phoneme set extraction step of recognizing each m-phoneme set contained in conversational speech learning data, obtained by acoustic analysis of person-to-person conversational speech, using read-speech m-phoneme set models learned from speech of text read aloud for m-phoneme sets, an m-phoneme set being a phoneme that takes into account differences in the (m-1)/2 phonemes preceding and following it, and of extracting the m-phoneme sets whose recognition rate is at or below a first predetermined threshold; a conversational-speech m-phoneme set model learning step of learning a conversational-speech m-phoneme set model, using the conversational speech learning data, for each m-phoneme set extracted in the m-phoneme set extraction step; an n-phoneme set extraction step of recognizing, using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, each n-phoneme set contained in the conversational speech learning data, an n-phoneme set being a phoneme that, with n > m, takes into account differences in phonemes over a longer range than an m-phoneme set, and of extracting the n-phoneme sets whose recognition rate is at or below a second predetermined threshold; and a conversational-speech n-phoneme set model learning step of learning a conversational-speech n-phoneme set model, using the conversational speech learning data, for each n-phoneme set extracted in the n-phoneme set extraction step.

14. The computer-readable recording medium recording a speech pattern model learning program according to claim 13, wherein the n-phoneme set extraction step selects, from the conversational speech learning data, the n-phoneme sets for which the number of data items having the same n-phoneme set notation is at or above a predetermined number, recognizes the selected n-phoneme sets using the read-speech m-phoneme set models and the conversational-speech m-phoneme set models, and extracts a selected n-phoneme set if its recognition rate is at or below the second predetermined threshold.

15. A computer-readable recording medium recording a speech recognition program comprising: a recognition target vocabulary model creation step of creating a speech pattern model for a recognition target vocabulary by connecting in parallel the read-speech m-phoneme set models, conversational-speech m-phoneme set models, and conversational-speech n-phoneme set models learned by the speech pattern model learning method according to claim 8 or 9; and a recognition step of recognizing input speech using the speech pattern model for the recognition target vocabulary created in the recognition target vocabulary model creation step.