JPH09258769A

JPH09258769A - Speaker adaptation method and speaker adaptation device

Info

Publication number: JPH09258769A
Application number: JP6149796A
Authority: JP
Inventors: Yasunaga Miyazawa; 康永宮沢
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1996-03-18
Filing date: 1996-03-18
Publication date: 1997-10-03

Abstract

(57) [Abstract]

[Problem] Speaker adaptation that relies on an unspecified-speaker codebook created from the speech of a large number of unspecified speakers cannot adapt accurately to the input speaker.

[Solution] A plurality of speaker classes are set based on speech feature data obtained from a large number of unspecified speakers. Unspecified-speaker codebooks 23a and 23b are created from the speech feature data of the speakers belonging to each speaker class, and a centroid vector sequence for each speaker-adaptation word is obtained from the speech feature vector sequences of the speakers belonging to each class and stored in centroid vector sequence storage units 24a and 24b. The input speaker's speech feature vector sequence is then matched against the centroid vector sequences, the speaker class determination unit 22 determines which speaker class the input speaker's speech belongs to, and the corresponding unspecified-speaker codebook is selected. During speech recognition, the input speaker's speech is coded with the selected codebook and recognized using the corresponding speech model.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speaker adaptation method and a speaker adaptation device applied to speech recognition using vector quantization.

[0002]

[Prior Art] In speech recognition using vector quantization, performing speaker adaptation enables recognition with high accuracy. One conventional speaker adaptation method for this kind of recognition works as follows: when the input speaker's speech is input, the input speaker's feature vector sequence is converted into the feature vector sequence of a standard speaker, based on a codebook created from a single standard speaker, and then output.

[0003] This is explained with reference to FIG. 16. FIG. 16(a) shows the speech feature vector sequence of the input speaker, FIG. 16(b) shows an input-speaker codebook created in advance from the unspecified-speaker codebook described below, and FIG. 16(c) shows the unspecified-speaker codebook created from the speech feature data of a large number of unspecified speakers.

[0004] The input speaker's speech is normally converted into a digital signal by an A/D converter and then frequency-analyzed, and is output as feature vectors of about 10 dimensions that represent the frequency characteristics of the speech waveform signal (LPC cepstrum coefficients are typical). To simplify the explanation, five-dimensional feature vectors are shown here. Likewise, an unspecified-speaker codebook normally has a size such as 256 or 512, but a size of 3 is used here. The entries of the input-speaker codebook and the unspecified-speaker codebook are associated with each other in advance: for example, entry A of the input-speaker codebook corresponds to A' of the unspecified-speaker codebook, entry B to B', and entry C to C'.

[0005] Suppose that the feature vector sequence (a), (b), (c), ... of the input speaker's speech shown in FIG. 16(a) is input. A distance calculation determines which entry of the input-speaker codebook shown in FIG. 16(b) is closest to each of these feature vectors. For example, input vector (a), (3, 2, 0, 0, 0), is closest to entry A (3, 2, 0, 0, 0) of the input-speaker codebook; input vector (b), (2, 1, 1, 1, 1), is closest to entry B (1, 1, 1, 1, 1); input vector (c), (1, 2, 1, 1, 1), is closest to entry B (1, 1, 1, 1, 1); input vector (d), (0, 0, 2, 2, 2), is closest to entry C (0, 0, 0, 2, 2); and input vector (e), (0, 0, 0, 2, 3), is closest to entry C (0, 0, 0, 2, 2).
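The nearest-entry search of paragraph [0005] can be reproduced with the numbers given for FIG. 16. The following is a minimal sketch, not the patent's implementation: the text only says "distance calculation", so squared Euclidean distance is an assumption, and the names `squared_distance` and `quantize` are hypothetical.

```python
# Input-speaker codebook of FIG. 16(b), size 3, five-dimensional entries.
INPUT_SPEAKER_CODEBOOK = {
    "A": (3, 2, 0, 0, 0),
    "B": (1, 1, 1, 1, 1),
    "C": (0, 0, 0, 2, 2),
}

# Input feature vectors (a) to (e) of FIG. 16(a).
INPUT_VECTORS = [
    (3, 2, 0, 0, 0),  # (a)
    (2, 1, 1, 1, 1),  # (b)
    (1, 2, 1, 1, 1),  # (c)
    (0, 0, 2, 2, 2),  # (d)
    (0, 0, 0, 2, 3),  # (e)
]


def squared_distance(u, v):
    """Squared Euclidean distance (assumed metric, not stated in the text)."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def quantize(vector, codebook):
    """Return the label of the codebook entry closest to `vector`."""
    return min(codebook, key=lambda label: squared_distance(vector, codebook[label]))


codes = [quantize(v, INPUT_SPEAKER_CODEBOOK) for v in INPUT_VECTORS]
print(codes)  # ['A', 'B', 'B', 'C', 'C']
```

With this metric the search yields the same assignment as the text: vector (d), for instance, is at squared distance 4 from C, 5 from B, and 25 from A.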

[0006] In this way, the closest feature vector is selected for each individual frame of input speech by referring to the input-speaker codebook. Considering only input vectors (a) to (e), the resulting sequence of code vectors is A, B, B, C, C.

[0007] Because the input-speaker codebook is associated with the unspecified-speaker codebook so that A corresponds to A', B to B', and C to C', the input speech in this case is converted into the entries A', B', B', C', C' of the unspecified-speaker codebook. Here, entry A' of the unspecified-speaker codebook is (3, 3, 0, 0, 0), entry B' is (2, 2, 2, 1, 1), and entry C' is (0, 0, 0, 3, 3).
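The conversion in paragraph [0007] amounts to a table lookup. As a small illustration using the values from the text, the code sequence A, B, B, C, C can be mapped to the associated unspecified-speaker entries like this; the dictionary and function names are hypothetical.

```python
# Unspecified-speaker codebook of FIG. 16(c), with the values given in the text.
UNSPECIFIED_SPEAKER_CODEBOOK = {
    "A'": (3, 3, 0, 0, 0),
    "B'": (2, 2, 2, 1, 1),
    "C'": (0, 0, 0, 3, 3),
}

# Pre-established association between the two codebooks.
CORRESPONDENCE = {"A": "A'", "B": "B'", "C": "C'"}


def convert(codes):
    """Replace each input-speaker code by its unspecified-speaker vector."""
    return [UNSPECIFIED_SPEAKER_CODEBOOK[CORRESPONDENCE[c]] for c in codes]


converted = convert(["A", "B", "B", "C", "C"])
for vector in converted:
    print(vector)
```

The converted sequence, not the raw input, is what gets passed on to recognition.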

[0008] In this way, the feature vector sequence of the input speech is converted into a feature vector sequence of the unspecified-speaker codebook, and the converted sequence is sent to the speech recognition processing unit.

[0009]

[Problem to Be Solved by the Invention] In the speaker adaptation method shown in FIG. 16, an input-speaker codebook is created from an unspecified-speaker codebook built from a large number of unspecified speakers, and speaker adaptation is performed with this input-speaker codebook. However, speaker adaptation with an input-speaker codebook derived from a large number of unspecified speakers cannot be expected to be very accurate, so speech recognition using the speaker-adapted code vectors does not achieve sufficient recognition performance. One conceivable remedy is to prepare a speech model and codebook for each of many speakers (for example, 500) and to adapt by selecting the speech model and codebook whose speech features are closest to those of the input speaker. This, however, requires preparing a large number of speech models and codebooks to cover the speech features of possible input speakers, which is unrealistic for a small, inexpensive device and of little practical use.

[0010] The object of the present invention is therefore to realize a speaker adaptation method and a speaker adaptation device that enable speech recognition at a high recognition rate. In the speaker adaptation processing that precedes speech recognition, a large number of unspecified speakers are divided into classes based on their speech feature data, and a codebook is created for each class from the speech feature data belonging to it. During speaker adaptation, the class to which the input speaker's speech belongs is determined and the codebook corresponding to the result is selected. During speech recognition, the input speaker's speech is coded with the selected codebook and recognized using the corresponding speech model.

[0011]

[Means for Solving the Problem] As set forth in claim 1, the speaker adaptation method of the present invention is characterized by: determining which speaker class the input speaker's speech belongs to, from the result of comparing the input speaker's speech feature data with data created from the speech feature vector sequences of the speakers belonging to each of a plurality of speaker classes, the classes having been formed based on speech feature data obtained from a large number of unspecified speakers; selecting, based on the determination, the speech model for speech recognition that corresponds to the determined speaker class; and, during speech recognition, recognizing the input speaker's speech using the selected speech model. Because the speaker class of the input speaker is determined from the input speaker's speech feature data and recognition uses the speech model corresponding to that class, highly accurate speech recognition can be expected with simple processing.

[0012] The invention of claim 2 is characterized by: creating, for each of a plurality of speaker classes formed based on speech feature data obtained from a large number of unspecified speakers, an unspecified-speaker codebook from the speech feature data of the speakers belonging to that class; determining which speaker class the input speaker's speech belongs to, from the result of comparing the input speaker's speech feature data with data created from the speech feature vector sequences of the speakers belonging to each class; selecting, based on the determination, the unspecified-speaker codebook and the speech model for speech recognition that correspond to the determined class; creating an input-speaker codebook from the selected unspecified-speaker codebook; and, during speech recognition, vector-quantizing the input speaker's speech using the created input-speaker codebook and the selected unspecified-speaker codebook, sending the result to the speech recognition unit, and recognizing the speech in the speech recognition unit using the selected speech model.

[0013] In the invention of claim 2, an input-speaker codebook is created using the unspecified-speaker codebook built from the speech of the speakers belonging to the selected speaker class. During speech recognition, the input speech is vector-quantized using the created input-speaker codebook and the selected unspecified-speaker codebook and sent to the speech recognition unit, which recognizes it using the selected speech model. This allows even more accurate speaker adaptation and recognition at a high recognition rate.

[0014] In claim 1 or 2, the processing that determines from the feature vector sequences which speaker class the input speaker's speech belongs to works as follows: a centroid vector sequence for each speaker-adaptation word is obtained from the speech feature vector sequences of the speakers belonging to each speaker class; for each speaker class, the distance between the centroid vector sequence of each word and the feature vector sequence of the input speaker's speech is computed by DP matching; and the speaker class of the input speaker's speech is determined from that distance.

[0015] This makes it possible to determine which speaker class the input speaker's speech belongs to with high accuracy using simple processing.
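The class determination by DP matching can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the Euclidean frame distance, the symmetric step pattern, the absence of path-length normalization, and the summation of distances over words are all assumptions, and the names `dp_matching_distance`, `classify_speaker`, `class_a`, and `class_b` are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dp_matching_distance(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Allowed predecessors: vertical, horizontal, and diagonal steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify_speaker(input_words, class_centroids):
    """Pick the class whose per-word centroid sequences are closest to the input.

    input_words:     {word: feature vector sequence of the input speaker}
    class_centroids: {class name: {word: centroid vector sequence}}
    """
    totals = {
        name: sum(dp_matching_distance(input_words[w], centroids[w]) for w in input_words)
        for name, centroids in class_centroids.items()
    }
    return min(totals, key=totals.get), totals


# Toy two-dimensional data for one adaptation word.
class_centroids = {
    "class_a": {"ohayou": [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]},
    "class_b": {"ohayou": [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]},
}
input_words = {"ohayou": [(1.0, 0.1), (1.0, 0.1), (2.0, 0.1), (3.0, 0.1)]}

best_class, totals = classify_speaker(input_words, class_centroids)
print(best_class)  # class_a
```

The input here is one frame longer than the centroid sequences; DP matching absorbs that difference by aligning the repeated first frame with the same centroid frame.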

[0016] Alternatively, in claim 1 or 2, the processing that determines from the feature vector sequences which speaker class the input speaker's speech belongs to may create a speech model for each speaker class by the DRNN method from the speech feature vector sequences of the speakers belonging to that class, output from the input speaker's speech feature vector sequence and the speech models a numerical value indicating the likelihood that a predetermined word is present, and determine the speaker class of the input speaker's speech based on that value.

[0017] With this approach, the speaker-adaptation words contained in the input speech can also be extracted by keyword spotting, and the speaker class of the input speaker's speech can be determined with high accuracy using simple processing.

[0018] Furthermore, in claim 1 or 2, the processing that determines from the feature vector sequences which speaker class the input speaker's speech belongs to may use both measures. A centroid vector sequence for each speaker-adaptation word is obtained from the speech feature vector sequences of the speakers belonging to each speaker class, and, for each speaker class, the distance between the centroid vector sequence of each word and the feature vector sequence of the input speaker's speech is computed by DP matching. In addition, a speech model for each speaker class is created by the DRNN method from the speech feature vector sequences of the speakers belonging to that class, and a numerical value indicating the likelihood that a predetermined word is present is obtained from the input speaker's speech feature vector sequence and the speech models. The speaker class of the input speaker's speech is then determined based on both the DP matching distance and the likelihood value.

[0019] This makes it possible to determine which speaker class the input speaker's speech belongs to with even higher accuracy using simple processing.

[0020] The invention of claim 6 is characterized in that, in any one of claims 1 to 5, the processing that classifies speakers into a plurality of speaker classes based on the speech feature data obtained from the large number of unspecified speakers operates on the per-word feature vector sequences obtained from each speaker. For each pair of speakers, a DP matching distance is computed for each word, and the sum of those distances is taken as the distance between the two speakers. The inter-speaker distance is obtained in this way for every pair among the unspecified speakers, and the speakers are divided into classes based on these inter-speaker distances.
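The inter-speaker distance of claim 6 (the sum over words of the per-word DP matching distances) can be sketched as below. This is only an illustration: the frame metric and step pattern of `dp_distance` are assumptions, the speaker and word names are hypothetical, and the rule that turns the distance table into classes is left out because this passage does not specify it.

```python
from itertools import combinations


def dp_distance(a, b):
    """DP-matching distance between two sequences of feature vectors."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the accumulated-cost table
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            d = sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5
            cur.append(d + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def speaker_distance(words_p, words_q):
    """Sum of per-word DP-matching distances between two speakers."""
    return sum(dp_distance(words_p[w], words_q[w]) for w in words_p)


def distance_table(speakers):
    """Symmetric table of inter-speaker distances for every speaker pair."""
    names = sorted(speakers)
    table = {(n, n): 0.0 for n in names}
    for p, q in combinations(names, 2):
        table[(p, q)] = table[(q, p)] = speaker_distance(speakers[p], speakers[q])
    return table


# Toy one-dimensional data: s1 and s2 differ only in timing, s3 differs in spectrum.
speakers = {
    "s1": {"w1": [(0.0,), (1.0,)], "w2": [(2.0,), (2.0,)]},
    "s2": {"w1": [(0.0,), (1.0,), (1.0,)], "w2": [(2.0,)]},
    "s3": {"w1": [(5.0,), (6.0,)], "w2": [(7.0,), (7.0,)]},
}

table = distance_table(speakers)
print(table[("s1", "s2")], table[("s1", "s3")])
```

In this toy data s1 and s2 end up at distance 0 because DP matching removes the timing difference, while s3 is far from both, so any reasonable grouping rule would separate s3 from the other two.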

[0021] This enables highly accurate speaker classification based on speech feature data. Performing speaker adaptation with an unspecified-speaker codebook created from the speakers belonging to a class obtained in this way enables accurate speaker adaptation.

[0022] As set forth in claim 7, the speaker adaptation device of the present invention comprises: feature data storage means for storing data created from the speech feature vector sequences of the speakers belonging to each of a plurality of speaker classes formed based on speech feature data obtained from a large number of unspecified speakers; input data storage means for storing the feature vector sequence of the input speaker's speech; and a speaker class determination processing unit that determines which speaker class the input speaker's speech belongs to, from the result of comparing the input speaker's speech feature data for a given word with the feature data for that word stored in the feature data storage means, and that selects, based on the determination, the speech model for speech recognition corresponding to the determined speaker class. During speech recognition, the input speaker's speech is vector-quantized using the selected speech model and then passed to the speech recognition unit.

[0023] Because the speaker class of the input speaker is determined from the input speaker's speech feature data and recognition uses the speech model corresponding to that class, highly accurate speech recognition can be expected with simple processing.

[0024] The invention of claim 8 comprises: an unspecified-speaker codebook created, for each of a plurality of speaker classes formed based on speech feature data obtained from a large number of unspecified speakers, from the speech feature data of the speakers belonging to that class; a speaker class determination processing unit that determines which speaker class the input speaker's speech belongs to, from the result of comparing the input speaker's speech feature data with data created from the speech feature vector sequences of the speakers belonging to each class, and that selects, based on the determination, the unspecified-speaker codebook and the speech model for speech recognition corresponding to the determined class; and an input-speaker codebook created from the unspecified-speaker codebook selected by the speaker class determination processing unit. During speech recognition, the input speaker's speech is vector-quantized using the created input-speaker codebook and the selected unspecified-speaker codebook and then sent to the speech recognition unit, which recognizes it using the selected speech model.

[0025] In the invention of claim 8, an input-speaker codebook is created using the unspecified-speaker codebook built from the speech of the speakers belonging to the selected speaker class. During speech recognition, the input speech is vector-quantized using the created input-speaker codebook and the selected unspecified-speaker codebook and sent to the speech recognition unit, which recognizes it using the selected speech model. This allows even more accurate speaker adaptation and recognition at a high recognition rate.

[0026] In claim 7 or 8, the means for determining from the feature vector sequences which speaker class the input speaker's speech belongs to has a storage unit that stores, for each speaker class, the centroid vector sequence of each speaker-adaptation word obtained from the speech feature vector sequences of the speakers belonging to that class. For each speaker class, the speaker class determination processing unit computes by DP matching the distance between the centroid vector sequence of each word and the feature vector sequence of the input speaker's speech, and determines the speaker class of the input speaker's speech from that distance.

[0027] This makes it possible to determine which speaker class the input speaker's speech belongs to with high accuracy using simple processing.

[0028] Alternatively, in claim 7 or 8, the means for determining from the feature vector sequences which speaker class the input speaker's speech belongs to may have speech models created for each speaker class by the DRNN method from the speech feature vector sequences of the speakers belonging to that class, and a word detection unit that outputs, from the input speaker's speech feature vector sequence and the speech models, a numerical value indicating the likelihood that a predetermined word is present. The speaker class determination processing unit then determines the speaker class of the input speaker's speech based on the likelihood value output by the word detection unit.

[0029] With this approach, the speaker-adaptation words contained in the input speech can also be extracted by keyword spotting, and the speaker class of the input speaker's speech can be determined with high accuracy using simple processing.

[0030] Furthermore, in claim 7 or 8, the means for determining from the feature vector sequences which speaker class the input speaker's speech belongs to may have all of the following: a storage unit that stores, for each speaker class, the centroid vector sequence of each speaker-adaptation word obtained from the speech feature vector sequences of the speakers belonging to that class; speech models created for each speaker class by the DRNN method from the speech feature vector sequences of the speakers belonging to that class; and a word detection unit that outputs, from the input speaker's speech feature vector sequence and the speech models, a numerical value indicating the likelihood that a predetermined word is present. The speaker class determination processing unit then determines the speaker class of the input speaker's speech based on both the distance, computed by DP matching for each speaker class, between the centroid vector sequence of each word and the feature vector sequence of the input speaker's speech, and the likelihood value output by the word detection unit.

[0031] This makes it possible to determine which speaker class the input speaker's speech belongs to with even higher accuracy using simple processing.

[0032] The invention of claim 12 is characterized in that, in any one of claims 7 to 11, the processing that classifies speakers into a plurality of speaker classes based on the speech feature data obtained from the large number of unspecified speakers operates on the per-word feature vector sequences obtained from each speaker. For each pair of speakers, a DP matching distance is computed for each word, and the sum of those distances is taken as the distance between the two speakers. The inter-speaker distance is obtained in this way for every pair among the unspecified speakers, and the speakers are divided into classes based on these inter-speaker distances.

[0033] This enables highly accurate speaker classification based on speech feature data. Performing speaker adaptation with an unspecified-speaker codebook created from the speakers belonging to a class obtained in this way enables accurate speaker adaptation.

[0034]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0035] (First Embodiment) FIG. 1 is a block diagram showing the schematic configuration of a speech recognition device to which the present invention is applied. The device consists broadly of a speech input unit 1, a speaker adaptation unit 2, and a speech recognition unit 3.

[0036] The speech input unit 1 comprises a microphone 11, an A/D converter 12 that A/D-converts the speech input from the microphone 11, a speech analysis unit 13, and other components. The speech analysis unit 13 uses an arithmetic unit to frequency-analyze the A/D-converted speech waveform signal at short time intervals, extracts feature vectors of several dimensions that represent the frequency characteristics (LPC cepstrum coefficients are typical), and outputs the time series of these feature vectors (hereinafter, the speech feature vector sequence).

[0037] The speaker adaptation unit 2 is the core of the present invention. It determines which speaker class the speech of the user (the input speaker) belongs to and, according to the result, selects the corresponding codebook (the unspecified-speaker codebook created from the speech feature data of the speakers belonging to the determined class). During speech recognition, it vector-quantizes the user's speech using the selected unspecified-speaker codebook and outputs the resulting code vectors to the speech recognition unit 3. The speaker adaptation unit 2 is described below.

【００３８】話者適応化部２は、ベクトル量子化部２
０、入力データ記憶部２１、話者クラス判定処理部２
２、予めクラス分けされた話者クラスごとの不特定話者
コードブック２３ａ，２３ｂ，・・・、予めクラス分け
された話者クラスごとの重心ベクトル列記憶部２４ａ，
２４ｂ，・・・などから構成されている。The speaker adaptation unit 2 includes a vector quantization unit 20, an input data storage unit 21, a speaker class determination processing unit 22, unspecified speaker codebooks 23a, 23b, ... for the respective pre-classified speaker classes, and centroid vector sequence storage units 24a, 24b, ... for the respective pre-classified speaker classes.

【００３９】前記入力データ記憶部２１は、ユーザが発
話する話者適応用の単語の特徴ベクトル列を記憶するも
のである。つまり、システム側からユーザに対して、話
者適応用の単語として、たとえば、「おはようと話して
下さい」、「こんにちはと話して下さい」などの指示を
出し、その指示に従って、ユーザが発話すると、その音
声はマイクロホン１１を通して入力され、Ａ／Ｄ変換部
１２でＡ／Ｄ変換されたのち、音声分析部１３で音声分
析されて、音声特徴ベクトル列が出力される。この各単
語ごとの音声特徴ベクトル列を蓄えておくものである。The input data storage unit 21 stores the feature vector sequences of the speaker adaptation words uttered by the user. That is, the system prompts the user with instructions such as "Please say 'ohayou' (good morning)" and "Please say 'konnichiwa' (hello)". When the user speaks as instructed, the voice is input through the microphone 11, A/D-converted by the A/D converter 12, and analyzed by the voice analysis unit 13, which outputs a voice feature vector sequence. The input data storage unit 21 stores this voice feature vector sequence for each word.

【００４０】また、前記不特定話者コードブック２３
ａ，２３ｂ，・・・は、多数の不特定話者から得られた
音声特徴データに基づいて分類された複数の話者クラス
ごとに、それぞれの話者クラスに属する話者の音声特徴
データをもとに作成されるもので、たとえば、不特定話
者コードブック２３ａは不特定多数の男性の音声特徴デ
ータを基に作成されたコードブック、不特定話者コード
ブック２３ｂは不特定多数の女性の音声特徴データを基
に作成されたコードブックである。The unspecified speaker codebooks 23a, 23b, ... are created for each of a plurality of speaker classes, which are classified based on voice feature data obtained from a large number of unspecified speakers; each codebook is built from the voice feature data of the speakers belonging to its class. For example, the unspecified speaker codebook 23a is created from the voice feature data of a large number of unspecified male speakers, and the unspecified speaker codebook 23b is created from the voice feature data of a large number of unspecified female speakers.

【００４１】なお、何種類かにクラス分けされる話者ク
ラスは、多数の話者の音声特徴データを基に、細かく分
ければより高性能なものとなるが、ここでは、説明を簡
単にするために、男性、女性の２種類に区分した例につ
いて説明する。また、ここでは、不特定話者コードブッ
ク２３ａを第１の不特定話者コードブックといい、不特
定話者コードブック２３ｂを第２の不特定話者コードブ
ックという。Performance improves as the speaker classes are divided more finely based on the voice feature data of many speakers, but to simplify the description, an example with two classes, male and female, is described here. In the following, the unspecified speaker codebook 23a is referred to as the first unspecified speaker codebook, and the unspecified speaker codebook 23b as the second unspecified speaker codebook.

【００４２】また、重心ベクトル列記憶部２４ａ，２４
ｂ，・・・は、前記同様、多数の不特定話者から得られ
た音声特徴データに基づいて分類された複数の話者クラ
スごとに、その話者クラスに属する話者がそれぞれの単
語について発話して得られた音声特徴データから求めら
れた重心ベクトル列を記憶するものであり、以下、これ
について説明する。Likewise, the centroid vector sequence storage units 24a, 24b, ... store, for each of the speaker classes classified based on voice feature data obtained from a large number of unspecified speakers, the centroid vector sequences computed from the voice feature data obtained when speakers belonging to that class utter each word. This is described below.

【００４３】ところで、重心ベクトル列というのは、あ
る単語を話者クラスごとに、多数の話者に発話させ、そ
の音声を短時間ごとに音声分析して得られた特徴ベクト
ル（たとえば、１０次元のＬＰＣケプストラム係数によ
る特徴ベクトル）を求め、各話者ごとの特徴ベクトルを
各時刻ごとに平均を取って得られたベクトル列である。
これを図２により説明する。図２は、たとえば、「おは
よう」という単語を、男性Ａ，Ｂ，Ｃ，Ｄの４人の話者
に発話させて特徴ベクトル列を得る例である。A centroid vector sequence is obtained as follows: for each speaker class, many speakers utter a given word; each utterance is analyzed at short time intervals to obtain feature vectors (for example, feature vectors of 10-dimensional LPC cepstrum coefficients); and the feature vectors of the speakers are averaged at each time point. This is explained with reference to FIG. 2, which shows an example in which four male speakers A, B, C, and D utter the word "ohayou" (good morning) to obtain feature vector sequences.

【００４４】このようなＡ，Ｂ，Ｃ，Ｄの４人の話者か
らの「おはよう」という単語に対する特徴ベクトル列の
重心ベクトル列は、以下のようにして求める。The centroid vector sequence of the feature vector sequences for the word "ohayou" uttered by the four speakers A, B, C, and D is obtained as follows.

【００４５】まず、図２において、たとえば、Ａの人が
発話した音声における時刻ｔ１の１０次元の特徴ベクト
ルをＣａ１、時刻ｔ２の１０次元の特徴ベクトルをＣａ
２、時刻ｔ３の１０次元の特徴ベクトルをＣａ３、・・
・というように表し、Ｂの人が発話した音声における時
刻ｔ１の１０次元の特徴ベクトルをＣｂ１、時刻ｔ２の
１０次元の特徴ベクトルをＣｂ２、時刻ｔ３の１０次元
の特徴ベクトルをＣｂ３、・・・というように表し、Ｃ
の人が発話した音声における時刻ｔ１の１０次元の特徴
ベクトルをＣｃ１、時刻ｔ２の１０次元の特徴ベクトル
をＣｃ２、時刻ｔ３の１０次元の特徴ベクトルをＣｃ
３、・・・というように表し、Ｄの人が発話した音声に
おける時刻ｔ１の１０次元の特徴ベクトルをＣｄ１、時
刻ｔ２の１０次元の特徴ベクトルをＣｄ２、時刻ｔ３の
１０次元の特徴ベクトルをＣｄ３、・・・というように
表すものとする。First, in FIG. 2, the 10-dimensional feature vectors of speaker A's utterance at times t1, t2, t3, ... are denoted Ca1, Ca2, Ca3, ...; those of speaker B's utterance are denoted Cb1, Cb2, Cb3, ...; those of speaker C's utterance are denoted Cc1, Cc2, Cc3, ...; and those of speaker D's utterance are denoted Cd1, Cd2, Cd3, ....

【００４６】このように、同じ「おはよう」という単語
を発話した場合でも、Ａ，Ｂ，Ｃ，Ｄの人の「おはよ
う」という単語に対する特徴ベクトル列は、それぞれの
人の個性によって時間的な長さや特徴ベクトルに違いが
生じる。Thus, even when the same word "ohayou" is uttered, the feature vector sequences of speakers A, B, C, and D differ in temporal length and in the feature vectors themselves, reflecting each speaker's individual characteristics.

【００４７】次に、この「おはよう」という単語に対す
るＡ，Ｂ，Ｃ，Ｄの人の特徴ベクトル列を、それぞれの
時刻ごとに重心ベクトルを求めるわけであるが、この重
心ベクトルを求めるに際して、それぞれの特徴ベクトル
列の時間的な長さを正規化、つまり、それぞれの特徴ベ
クトルの数を同一にする必要がある。これを行うために
どれか１つの特徴ベクトル列を基準ベクトル列として選
び、その基準ベクトル列とのＤＰマッチングを取ること
で正規化を行う。この正規化処理について図３を参照し
て説明する。Next, a centroid vector is computed at each time point from the feature vector sequences of speakers A, B, C, and D for the word "ohayou". Before this can be done, the temporal lengths of the feature vector sequences must be normalized, that is, every sequence must contain the same number of feature vectors. To do this, one of the feature vector sequences is selected as the reference vector sequence, and each of the other sequences is normalized by DP matching against it. This normalization process is described with reference to FIG. 3.

【００４８】図３は説明を分かり易くするために、Ａの
特徴ベクトル列のサンプリング時刻をｔ１，ｔ２，ｔ３
の３つとし、Ｂの特徴ベクトル列のサンプリング時刻を
ｔ１，ｔ２，ｔ３，ｔ４の４つとし、Ｃの特徴ベクトル
列のサンプリング時刻をｔ１，ｔ２，ｔ３，ｔ４，ｔ５
の５つとし、Ｄの特徴ベクトル列のサンプリング時刻を
ｔ１，ｔ２，ｔ３，ｔ４，ｔ５，ｔ６の６つとし、ここ
では、Ｂの特徴ベクトル列を基準のベクトル列とする。For ease of explanation, FIG. 3 assumes three sampling times t1, t2, t3 for the feature vector sequence of A; four sampling times t1, t2, t3, t4 for that of B; five sampling times t1, t2, t3, t4, t5 for that of C; and six sampling times t1, t2, t3, t4, t5, t6 for that of D. The feature vector sequence of B is used here as the reference vector sequence.

【００４９】そして、この基準となるＢの特徴ベクトル
列の時刻ｔ１，ｔ２，ｔ３，ｔ４における特徴ベクトル
Ｃｂ１，Ｃｂ２，Ｃｂ３，Ｃｂ４に対して、Ａの特徴ベ
クトル列のそれぞれの時刻における特徴ベクトルＣａ
１，Ｃａ２，Ｃa３、Ｃの特徴ベクトル列のそれぞれの
時刻における特徴ベクトルＣｃ１，Ｃｃ２，Ｃｃ３，Ｃ
ｃ４，Ｃｃ５、Ｄの特徴ベクトル列のそれぞれの時刻に
おける特徴ベクトルＣｄ１，Ｃｄ２，Ｃｄ３，Ｃｄ４，
Ｃｄ５，Ｃｄ６をＤＰマッチングにより対応付けする。The feature vectors Ca1, Ca2, Ca3 of the feature vector sequence of A, the feature vectors Cc1, Cc2, Cc3, Cc4, Cc5 of that of C, and the feature vectors Cd1, Cd2, Cd3, Cd4, Cd5, Cd6 of that of D are then each associated by DP matching with the feature vectors Cb1, Cb2, Cb3, Cb4 at times t1, t2, t3, t4 of the reference feature vector sequence of B.

【００５０】このように、基準となる特徴ベクトル列の
各時刻における特徴ベクトルに対して、その他の特徴ベ
クトル列の特徴ベクトルがＤＰマッチングにより対応付
けされることにより、特徴ベクトルの数を正規化するこ
とができる。By associating the feature vectors of the other sequences with the feature vector at each time of the reference sequence through DP matching in this way, the number of feature vectors can be normalized.

【００５１】つまり、基準となるＢの特徴ベクトル列と
Ａの特徴ベクトル列は、Ｃｂ１に対してはＣａ１が対応
付けされ、Ｃｂ２に対してはＣａ２が対応付けされ、Ｃ
ｂ３とＣｂ４に対してはそれぞれＣａ３が対応付けされ
るというような対応付けがなされ、また、Ｂの基準ベク
トル列とＣの特徴ベクトル列は、Ｃｂ１に対してはＣｃ
１が、Ｃｂ２に対してはＣｃ２が、Ｃｂ３に対してはＣ
ｃ３、Ｃｂ４に対してはＣｃ５がそれぞれ応付けられ
る。さらに、Ｂの基準ベクトル列とＤの特徴ベクトル列
は、Ｃｂ１に対してはＣｄ１が、Ｃｂ２に対してはＣｄ
２、Ｃｂ３に対してはＣｄ４が、Ｃｂ４に対してはＣｄ
６がそれぞれ対応付られる。That is, between the reference feature vector sequence of B and the feature vector sequence of A, Ca1 is associated with Cb1, Ca2 with Cb2, and Ca3 with both Cb3 and Cb4. Between the reference sequence of B and the feature vector sequence of C, Cc1 is associated with Cb1, Cc2 with Cb2, Cc3 with Cb3, and Cc5 with Cb4. Between the reference sequence of B and the feature vector sequence of D, Cd1 is associated with Cb1, Cd2 with Cb2, Cd4 with Cb3, and Cd6 with Cb4.
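
The length normalization described above can be sketched as follows. This is only an illustrative sketch, not the procedure claimed in the patent: the function names `euclid` and `dp_align`, the Euclidean frame distance, the three allowed path steps, and the rule of keeping the closest matched frame when several frames align to one reference frame are all assumptions, since the text does not specify them.

```python
import math


def euclid(u, v):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dp_align(ref, seq):
    """Align seq to ref by dynamic programming (DP matching).

    Returns (total_distance, mapping), where mapping[i] is the index of the
    frame in seq associated with reference frame ref[i].
    """
    n, m = len(ref), len(seq)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = euclid(ref[i], seq[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, prev = inf, None
            # Allowed predecessors: diagonal, previous ref frame, previous seq frame.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, prev = cost[pi][pj], (pi, pj)
            cost[i][j] = d + best
            back[i][j] = prev
    # Trace the optimal path back from the last cell.
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    # Assumption: keep the closest seq frame for each reference frame.
    mapping = [None] * n
    for i, j in reversed(path):
        if mapping[i] is None or euclid(ref[i], seq[j]) < euclid(ref[i], seq[mapping[i]]):
            mapping[i] = j
    return cost[n - 1][m - 1], mapping


# Toy 1-dimensional data: a 4-frame reference (like B) and a 3-frame sequence (like A).
reference = [[0.0], [1.0], [2.0], [3.0]]
shorter = [[0.0], [1.0], [2.0]]
total, mapping = dp_align(reference, shorter)
normalized = [shorter[j] for j in mapping]  # now has as many frames as the reference
```

With this toy data the last frame of the shorter sequence is matched to the last two reference frames, mirroring the case in which Ca3 is associated with both Cb3 and Cb4.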

【００５２】以上のようにして、基準となる特徴ベクト
ル列とそれ以外の特徴ベクトル列とを、ＤＰマッチング
により対応付けすることにより、特徴ベクトルの数の正
規化がなされる。そして、それぞれ対応づけられた特徴
ベクトルごとに重心ベクトルを求める。As described above, the number of feature vectors is normalized by associating the reference feature vector sequence with each of the other feature vector sequences by DP matching. A centroid vector is then obtained for each set of associated feature vectors.

【００５３】この重心ベクトルを求める手法はどのよう
な方法を用いてもよいが、ここでは、以下のようにして
重心ベクトルを求める。Any method may be used to obtain the centroid vectors; here they are obtained as follows.

【００５４】時刻ｔ１における特徴ベクトルＣａ１，Ｃ
ｂ１，Ｃｃ１，Ｃｄ１を構成するそれぞれの１０次元Ｌ
ＰＣケプストラム係数を、Ｃａ１＝（Ｃａ１０，Ｃａ１１，・・・，Ｃａ１９）Ｃｂ１＝（Ｃｂ１０，Ｃｂ１１，・・・，Ｃｂ１９）Ｃｃ１＝（Ｃｃ１０，Ｃｃ１１，・・・，Ｃｃ１９）Ｃｄ１＝（Ｃｄ１０，Ｃｄ１１，・・・，Ｃｄ１９）とすると、それぞれの次元毎の平均の値で構成される１
０次元のＬＰＣケプストラム係数を時刻ｔ１における重
心ベクトルとする。つまり、１次元目の平均値Ｃα１０
はＣα１０＝（Ｃａ１０＋Ｃｂ１０＋Ｃｃ１０＋Ｃｄ１
０）／４２次元目の平均値Ｃα１１は、Ｃα１１＝（Ｃａ１１＋Ｃｂ１１＋Ｃｃ１１＋Ｃｄ１
１）／４１０次元目の平均値Ｃα１９はＣα１９＝（Ｃａ１９＋Ｃｂ１０＋Ｃｃ１９＋Ｃｄ１
９）／４となる。このようにして求められた時刻ｔ１における１
０次元ＬＰＣケプストラム係数の平均（Ｃα１０，Ｃα
１１，・・・、Ｃα１９）を、時刻ｔ１における重心ベ
クトルとし、これをＣｓ１で表す。同様にして、時刻ｔ
２，ｔ３，・・・における重心ベクトルＣｓ２，Ｃｓ
３，・・・を求める。このようにして求められた重心ベ
クトルＣｓ１，Ｃｓ２，Ｃｓ３，・・・で構成される重
心ベクトル列を図３において一点鎖線で表し、求められ
た重心ベクトルＣｓ１，Ｃｓ２，Ｃｓ３，Ｃｓ４は、こ
の図では白丸で表している。Let the 10-dimensional LPC cepstrum coefficients constituting the feature vectors Ca1, Cb1, Cc1, Cd1 at time t1 be
Ca1 = (Ca10, Ca11, ..., Ca19)
Cb1 = (Cb10, Cb11, ..., Cb19)
Cc1 = (Cc10, Cc11, ..., Cc19)
Cd1 = (Cd10, Cd11, ..., Cd19)
Then the 10-dimensional vector of LPC cepstrum coefficients formed from the averages of each dimension is taken as the centroid vector at time t1. That is, the average of the first dimension, Cα10, is
Cα10 = (Ca10 + Cb10 + Cc10 + Cd10) / 4
the average of the second dimension, Cα11, is
Cα11 = (Ca11 + Cb11 + Cc11 + Cd11) / 4
and the average of the tenth dimension, Cα19, is
Cα19 = (Ca19 + Cb19 + Cc19 + Cd19) / 4
The averages (Cα10, Cα11, ..., Cα19) of the 10-dimensional LPC cepstrum coefficients obtained in this way at time t1 form the centroid vector at time t1, denoted Cs1. The centroid vectors Cs2, Cs3, ... at times t2, t3, ... are obtained in the same way. The centroid vector sequence composed of Cs1, Cs2, Cs3, ... is shown by a dash-dot line in FIG. 3, and the centroid vectors Cs1, Cs2, Cs3, Cs4 are shown as white circles in that figure.
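
The per-dimension averaging above can be sketched as follows. This is a minimal illustration under the assumption that the feature vector sequences have already been normalized to the same number of frames; the function name `centroid_sequence` and the toy values are hypothetical and do not come from the patent.

```python
def centroid_sequence(aligned):
    """Average length-normalized feature vector sequences frame by frame.

    aligned: one sequence per speaker; every sequence has the same number of
    frames and every frame has the same dimension.
    Returns the centroid vector sequence (one averaged vector per frame).
    """
    n_speakers = len(aligned)
    n_frames = len(aligned[0])
    centroids = []
    for t in range(n_frames):
        frames_at_t = [seq[t] for seq in aligned]
        # zip(*frames_at_t) groups the values of each dimension across speakers.
        centroids.append([sum(vals) / n_speakers for vals in zip(*frames_at_t)])
    return centroids


# Four speakers, one frame each, two dimensions (toy values).
speakers = [
    [[1.0, 2.0]],
    [[3.0, 4.0]],
    [[5.0, 6.0]],
    [[7.0, 8.0]],
]
cs = centroid_sequence(speakers)
```

For real data each frame would hold ten LPC cepstrum coefficients and each sequence would have as many frames as the reference sequence.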

【００５５】以上は、「おはよう」という単語に対する
男性Ａ，Ｂ，Ｃ，Ｄの重心ベクトル列を求める場合であ
るが、「おはよう」以外にも幾つかの話者適応用の単語
に対して同様に、多数の男性の重心ベクトルを求め、こ
れら幾つかの単語に対する重心ベクトル列を図１の重心
ベクトル列記憶部２４ａ（以下、第１の重心ベクトル列
記憶部２４ａという）に記憶させる。同様にして、多数
の女性が発話する幾つかの話者適応用の単語に対して、
重心ベクトルを求め、これら幾つかの単語に対するそれ
ぞれの重心ベクトル列を重心ベクトル列２４ｂ（以下、
第２の重心ベクトル列記憶部２４ｂという）に記憶して
おく。ここでは、男性と女性だけのクラス分けである
が、さらに細分化した場合には、それぞれのクラス毎に
求めた幾つかの単語に対する重心ベクトル列を対応する
重心ベクトル列記憶部に記憶させておく。The above describes obtaining the centroid vector sequence of male speakers A, B, C, and D for the word "ohayou". Centroid vector sequences are obtained in the same way from many male speakers for several other speaker adaptation words, and the centroid vector sequences for these words are stored in the centroid vector sequence storage unit 24a of FIG. 1 (hereinafter, the first centroid vector sequence storage unit 24a). Similarly, centroid vector sequences are obtained for several speaker adaptation words uttered by many female speakers and stored in the centroid vector sequence storage unit 24b (hereinafter, the second centroid vector sequence storage unit 24b). Only male and female classes are used here, but when the classes are subdivided further, the centroid vector sequences obtained for the words in each class are stored in the corresponding centroid vector sequence storage unit.

【００５６】また、前記話者クラス判定処理部２２は、
入力データ記憶部２１に取り込まれた入力音声特徴ベク
トルと第１、第２の重心ベクトル列記憶部２４ａ，２４
ｂの内容とをＤＰマッチングにより対応付けし、その結
果を基に、入力音声がどの話者クラスに属しているかを
判定し、それに応じた不特定話者コードブックを選択す
る。この場合、話者クラスは男性と女性の２種類の例で
あるから、不特定話者コードブックは、前記したよう
に、第１の不特定話者コードブック２３ａ、第２の不特
定話者コードブック２３ｂの２種類用意され、これら第
１、第２の不特定話者コードブックのいずれかが選択さ
れることになる。以下にこの話者クラス判定処理部２２
の処理について説明する。The speaker class determination processing unit 22 associates, by DP matching, the input voice feature vectors held in the input data storage unit 21 with the contents of the first and second centroid vector sequence storage units 24a and 24b, determines from the result which speaker class the input voice belongs to, and selects the corresponding unspecified speaker codebook. Since there are two speaker classes in this example, male and female, two unspecified speaker codebooks are prepared as described above, the first unspecified speaker codebook 23a and the second unspecified speaker codebook 23b, and one of them is selected. The processing of the speaker class determination processing unit 22 is described below.

【００５７】入力データ記憶部２１には、システム側か
らの指示に基づいて、ユーザの発した音声に対する特徴
ベクトル列が格納されている。たとえば、システム側か
ら「おはようと話して下さい」という指示により、ユー
ザが「おはよう」と発話し、続いて、「こんにちわと話
して下さい」という指示に対して、「こんにちわ」と発
話するというように、システム側からの指示に基づいて
ユーザが発話する。The input data storage unit 21 holds the feature vector sequences of the voice uttered by the user in response to instructions from the system. For example, the system instructs "Please say 'ohayou'" and the user utters "ohayou"; the system then instructs "Please say 'konnichiwa'" and the user utters "konnichiwa". The user thus speaks according to the instructions from the system.

【００５８】そして、ユーザの「おはよう」に対する特
徴ベクトル列と、第１の重心ベクトル列記憶部２４ａに
記憶された「おはよう」に対する重心ベクトル列とをＤ
Ｐマッチングにより対応付けし、両者間の距離を求め
る。同様に、ユーザの「おはよう」に対する特徴ベクト
ル列と、第２の重心ベクトル列記憶部２４ｂに記憶され
た「おはよう」に対する重心ベクトル列とをＤＰマッチ
ングにより対応付けし、両者間の距離を求める。Then, the user's feature vector sequence for "ohayou" is associated by DP matching with the centroid vector sequence for "ohayou" stored in the first centroid vector sequence storage unit 24a, and the distance between the two is obtained. Likewise, the user's feature vector sequence for "ohayou" is associated by DP matching with the centroid vector sequence for "ohayou" stored in the second centroid vector sequence storage unit 24b, and the distance between the two is obtained.

【００５９】同様にして、ユーザの「こんにちわ」に対
する特徴ベクトル列と、第１の重心ベクトル列記憶部２
４ａに記憶された「こんにちわ」に対する重心ベクトル
列とをＤＰマッチングにより対応付けし、両者間の距離
を求めるとともに、ユーザの「こんにちわ」に対する特
徴ベクトル列と、第２の重心ベクトル列記憶部２４ｂに
記憶された「こんにちは」に対する重心ベクトル列とを
ＤＰマッチングにより対応付けし、両者間の距離（ＤＰ
距離という）を求める。このようにして、話者適応用の
幾つかの単語ごとに、ユーザの特徴ベクトル列と、第１
の重心ベクトル列記憶部２４ａ、第２の重心ベクトル列
記憶部２４ｂに記憶された単語の重心ベクトル列とのＤ
Ｐ距離を求める。Similarly, the user's feature vector sequence for "konnichiwa" is associated by DP matching with the centroid vector sequence for "konnichiwa" stored in the first centroid vector sequence storage unit 24a, and the distance between the two is obtained; it is also associated by DP matching with the centroid vector sequence for "konnichiwa" stored in the second centroid vector sequence storage unit 24b, and the distance between the two (referred to as the DP distance) is obtained. In this way, for each of the speaker adaptation words, the DP distance between the user's feature vector sequence and the centroid vector sequence of that word stored in each of the first and second centroid vector sequence storage units 24a and 24b is obtained.

【００６０】ここで、話者適応用の単語を単語ｗ１，ｗ
２，ｗ３の３個とする。そして、単語ｗ１に対するユー
ザの特徴ベクトル列と第１の重心ベクトル列記憶部２４
ａに記憶されている単語ｗ１に対する重心ベクトル列の
ＤＰ距離がｄａ１、単語ｗ１に対するユーザの特徴ベク
トル列と、第２の重心ベクトル列記憶部２４ｂに記憶さ
れている単語ｗ１に対する重心ベクトル列のＤＰ距離が
ｄｂ１、単語ｗ２に対するユーザの特徴ベクトル列と第
１の重心ベクトル列記憶部２４ａに記憶されている単語
ｗ２に対する重心ベクトル列のＤＰ距離がｄａ２、単語
ｗ２に対するユーザの特徴ベクトル列と、第２の重心ベ
クトル列記憶部２４ｂに記憶されている単語ｗ２に対す
る重心ベクトル列のＤＰ距離がｄｂ２、単語ｗ３に対す
るユーザの特徴ベクトル列と第１の重心ベクトル列記憶
部２４ａに記憶されている単語ｗ３に対する重心ベクト
ル列のＤＰ距離がｄａ３、単語ｗ３に対するユーザの特
徴ベクトル列と、第２の重心ベクトル列記憶部２４ｂに
記憶されている単語ｗ３に対する重心ベクトル列のＤＰ
距離がｄｂ３であったとすれば、ユーザの音声特徴ベク
トル列と第１の重心ベクトル列記憶部２４ａに記憶され
ている単語ｗ１，ｗ２，ｗ３に対するそれぞれの重心ベ
クトル列の合計のＤＰ距離をＤａとすると、Ｄａは、Ｄａ＝ｄａ１＋ｄａ２＋ｄａ３・・・（１）で求められ、同様に、ユーザの音声特徴ベクトル列と第
２の重心ベクトル列記憶部２４ｂに記憶されている単語
ｗ１，ｗ２，ｗ３に対するそれぞれの重心ベクトル列の
合計のＤＰ距離をＤｂとすると、Ｄｂは、Ｄｂ＝ｄｂ１＋ｄｂ２＋ｄｂ３・・・（２）で求められる。Assume that there are three speaker adaptation words, w1, w2, and w3. Let da1, da2, and da3 be the DP distances between the user's feature vector sequences for w1, w2, and w3 and the centroid vector sequences for the same words stored in the first centroid vector sequence storage unit 24a, and let db1, db2, and db3 be the corresponding DP distances for the centroid vector sequences stored in the second centroid vector sequence storage unit 24b. The total DP distance Da between the user's voice feature vector sequences and the centroid vector sequences for w1, w2, and w3 stored in the first centroid vector sequence storage unit 24a is then
Da = da1 + da2 + da3 ... (1)
and, likewise, the total DP distance Db for the second centroid vector sequence storage unit 24b is
Db = db1 + db2 + db3 ... (2)

【００６１】以上のようにして求められたそれぞれのＤ
Ｐ距離の和Ｄａ、Ｄｂの大きさを比較し、小さい方を選
択する。たとえば、Ｄａ＞Ｄｂであったとすると、ユー
ザの音声は、この場合、女性であると判断し、これによ
り、多数の女性の音声を基に作成された第２の不特定話
者コードブック２３ｂを選択する。The sums of DP distances Da and Db obtained in this way are compared, and the class with the smaller sum is selected. For example, if Da > Db, the user's voice is judged to be female, and the second unspecified speaker codebook 23b, created from the voices of many female speakers, is selected.
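
The comparison of the summed DP distances can be sketched as follows. This is a minimal illustration: the function name `select_speaker_class`, the class labels, and the distance values are hypothetical, and the per-word DP distances are assumed to have been computed beforehand.

```python
def select_speaker_class(word_distances):
    """Pick the speaker class with the smallest total DP distance.

    word_distances maps a class label to the list of per-word DP distances
    (for example da1, da2, da3 for the first class).
    Returns (best_class, totals).
    """
    totals = {cls: sum(dists) for cls, dists in word_distances.items()}
    best_class = min(totals, key=totals.get)
    return best_class, totals


# Toy distances for three adaptation words w1, w2, w3 (hypothetical values).
distances = {
    "male": [4.0, 5.0, 6.0],    # da1, da2, da3 -> Da
    "female": [1.0, 2.0, 3.0],  # db1, db2, db3 -> Db
}
chosen, totals = select_speaker_class(distances)
```

Because the function works on a dictionary, the same sketch covers more than two speaker classes without change.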

【００６２】このようにして、話者クラス判定処理部２
２により、入力話者の音声がどの話者クラスに属するか
の判定を行い、話者クラスが判定されると、その話者ク
ラスの音声を基に作成された不特定話者コードブックが
選択される。In this way, the speaker class determination processing unit 22 determines which speaker class the input speaker's voice belongs to, and once the class is determined, the unspecified speaker codebook created from the voices of that speaker class is selected.

【００６３】これにより、入力話者の音声を認識する
際、その入力話者の音声に近い音声をもとに作成された
不特定話者コードブックを用いて入力話者の音声がベク
トル量子化され、特徴コードベクトルとして出力される
ため、音声認識部３で高精度な認識処理が行える。As a result, when the input speaker's voice is recognized, it is vector-quantized using an unspecified speaker codebook created from voices close to that speaker's voice and output as feature code vectors, so the voice recognition unit 3 can perform highly accurate recognition.
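
Vector quantization with the selected codebook can be sketched as follows. This is only a sketch under assumptions: the nearest-codeword rule with squared Euclidean distance is a common choice but is not stated in the text, and the function names `squared_distance` and `quantize` and the toy codebook are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(frames, codebook):
    """Replace each feature vector by the index of its nearest codeword."""
    codes = []
    for frame in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(frame, codebook[k]))
        codes.append(nearest)
    return codes


# Toy 2-dimensional codebook standing in for the selected codebook (e.g. 23b).
selected_codebook = [[0.0, 0.0], [10.0, 10.0]]
user_frames = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]
code_sequence = quantize(user_frames, selected_codebook)
```

The resulting code sequence is what would be passed on to the recognition stage in place of the raw feature vectors.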

【００６４】ところで、音声認識部３は、認識可能な単
語（登録単語）に対して、この場合、多数の男性の音声
を基に作成された音声特徴データが記憶されている音声
特徴データ記憶部３１ａ（以下、第１の音声モデル記憶
部３１ａという）、同じく多数の女性の音声を基に作成
された音声特徴データが記憶されている音声特徴データ
記憶部３１ｂ（以下、第２の音声モデル記憶部３１ｂと
いう）、これらいずれかの音声モデル記憶部の内容と前
記量子化された特徴コードベクトルとから、単語検出デ
ータを出力する単語検出部３２、この単語検出部３２か
らの検出データを基に入力音声に対する認識データを出
力する音声認識処理部３３等から構成されている。な
お、前記第１の音声モデル記憶部３１ａおよび第２の音
声モデル記憶部３１ｂは、話者クラス判定部２２の判定
結果に基づいて、いずれかが選択されるもので、話者適
応時において、入力話者の音声が女性話者に属するもの
であると判定された場合は、第２の音声モデル記憶部３
１ｂが選択され、入力話者の音声が男性であると判定さ
れた場合は、第１の音声モデル記憶部３１ａが選択され
るようになっている。The voice recognition unit 3 includes, for the recognizable words (registered words): a voice feature data storage unit 31a (hereinafter, the first voice model storage unit 31a) storing voice feature data created from the voices of many male speakers; a voice feature data storage unit 31b (hereinafter, the second voice model storage unit 31b) storing voice feature data created from the voices of many female speakers; a word detection unit 32 that outputs word detection data from the contents of one of these voice model storage units and the quantized feature code vectors; and a voice recognition processing unit 33 that outputs recognition data for the input voice based on the detection data from the word detection unit 32. One of the first voice model storage unit 31a and the second voice model storage unit 31b is selected according to the determination result of the speaker class determination unit 22: if, during speaker adaptation, the input speaker's voice is determined to belong to the female class, the second voice model storage unit 31b is selected; if it is determined to be male, the first voice model storage unit 31a is selected.

【００６５】このように、話者適応処理において、入力
話者の音声が、男性話者クラスに属するか女性話者クラ
スに属するかの判定を行い、音声認識時においては、そ
のクラス分けの結果に基づいた音声モデルを用いて音声
認識を行う。この場合、入力話者の音声は女性話者クラ
スに属すると判定され、第２の不特定話者コードブック
２３ｂを用いて特徴コードベクトルに変換されて出力さ
れ、音声認識部３では、第２の音声モデル記憶部３１ｂ
の内容を用いて音声認識されるので、高精度な音声認識
が行える。As described above, the speaker adaptation process determines whether the input speaker's voice belongs to the male or the female speaker class, and voice recognition is then performed using the voice model corresponding to that classification result. In this example, the input speaker's voice is determined to belong to the female speaker class, is converted into feature code vectors using the second unspecified speaker codebook 23b, and is recognized in the voice recognition unit 3 using the contents of the second voice model storage unit 31b, so highly accurate voice recognition can be performed.

【００６６】つまり、入力音声がどの話者クラスに属す
るかの判定を行い、その話者クラスに属する話者の音声
を用いて作成された不特定話者コードブックを選択し
て、コードベクトルに変換し、音声認識部３では、判定
された話者クラスに対応した音声モデルを用いて音声認
識処理を行うという一連の処理は、話者適応を行ったの
ち音声認識を行うのと等価な処理がなされることにな
る。In other words, the series of steps (determining which speaker class the input voice belongs to, selecting the unspecified speaker codebook created from the voices of speakers in that class, converting the input into code vectors, and performing recognition in the voice recognition unit 3 with the voice model corresponding to the determined class) is equivalent to performing speaker adaptation followed by voice recognition.

【００６７】以上の説明は、話者適応用の幾つかの単語
に対する入力話者音声の特徴ベクトルと重心ベクトルと
のＤＰ距離を用いて話者クラスを判定する例であった
が、これに限らず、本出願人がすでに特許出願したＤＲ
ＮＮ（ダイナミックリカレントニューラルネット
ワーク）方式による音声認識技術（この技術に関して
は、本出願人が特開平６ー４０９７、特開平６ー１１９
４７６などにより、すでに特許出願済みである。）によ
り得られた値（この値については後述する）をもとに話
者クラスを判定するようにしてもよい。The above describes an example in which the speaker class is determined using the DP distances between the input speaker's feature vectors and the centroid vectors for several speaker adaptation words. The invention is not limited to this; the speaker class may instead be determined from values (described later) obtained by the speech recognition technique based on the DRNN (Dynamic Recurrent Neural Network) method, for which the present applicant has already filed patent applications such as Japanese Patent Laid-Open Nos. 6-4097 and 6-119476.

【００６８】このＤＲＮＮ方式による音声認識技術につ
いて図４を参照しながら簡単に説明する。The speech recognition technique based on the DRNN method will be briefly described with reference to FIG.

【００６９】このＤＲＮＮ方式による音声認識技術は、
たとえば、「おはようございます。今日はいいお天気だ
ね」といった連続音声の中から予め登録されている認識
可能な単語（この場合、「おはよう」、「今日」、「天
気」など）をキーワードとして、これらキーワードとな
る単語が入力音声中のどの部分にどれくらいの確かさで
存在するかを示す値を得て、その確からしさを示す値を
基に前記したような連続的な音声を理解することが可能
なものである。In the DRNN-based speech recognition technique, recognizable words registered in advance serve as keywords. Given continuous speech such as "Ohayou gozaimasu. Kyou wa ii otenki da ne" ("Good morning. Nice weather today, isn't it"), in which the keywords would be "ohayou" (good morning), "kyou" (today), "tenki" (weather), and so on, the technique obtains values indicating where in the input speech each keyword occurs and with what certainty, and uses these certainty values to understand such continuous speech.

【００７０】この実施の形態では、話者クラス選択を行
う際、システム側からの指示に基づいて、ユーザが話者
適応用の単語を幾つか発話する。今、システム側からの
指示に基づいて、たとえば、ユーザが、「おはよう」と
発話し、続いて、システム側からの指示に基づいて、た
とえば、ユーザが、「こんにちわ」と発話して、図４
（ａ）に示すような音声信号が出力されたとする。これ
ら「おはよう」、「こんんちわ」は予め登録されたキー
ワードとなる単語であるとする。In this embodiment, when the speaker class is selected, the user utters several speaker adaptation words in response to instructions from the system. Suppose that, following the system's instructions, the user utters "ohayou" and then "konnichiwa", producing a voice signal such as that shown in FIG. 4(a). Assume that "ohayou" and "konnichiwa" are both keywords registered in advance.

【００７１】そして、予め登録されたキーワードとなる
単語をたとえば１０単語としたとき、これら１０単語
（これを、単語１、単語２、単語３、・・・とする）に
対応して各単語を検出するための信号が出力されてい
て、その検出信号の値などの情報から、入力音声中にど
の程度の確かさで登録単語が存在するかを検出する。つ
まり、「おはよう」という単語（単語１）が入力音声中
に存在したときに、その「おはよう」という信号を待っ
ている検出信号が、同図（ｂ）の如く、ユーザの発話し
た入力音声の「おはよう」の部分で立ち上がる。同様
に、「こんにちわ」という単語（単語２）が入力音声中
に存在したときに、その「こんにちわ」という信号を待
っている検出信号が、同図（ｃ）の如く、入力音声の
「こんにちわ」の部分で立ち上がる。同図（ｂ），
（ｃ）において、0.9あるいは0.8といった数値は、確か
らしさ（近似度）を示す数値であり、0.9や0.8といった
高い数値であれば、その単語は入力音声の中に、高い確
からしさで存在するということができる。つまり、「お
はよう」という登録単語は、同図（ｂ）に示すように、
入力音声信号の時間軸上のｗ１の部分に0.9という確か
らしさで存在し、「こんにちわ」という登録単語は、同
図（ｃ）に示すように、入力音声信号の時間軸上のｗ２
の部分に0.8という確からしさで存在することがわか
る。Suppose there are, for example, 10 pre-registered keywords (word 1, word 2, word 3, ...). A detection signal is output for each of these words, and information such as the value of each detection signal indicates with what certainty the registered word is present in the input speech. That is, when the word "ohayou" (word 1) is present in the input speech, the detection signal waiting for "ohayou" rises at the "ohayou" portion of the user's input speech, as shown in FIG. 4(b). Similarly, when the word "konnichiwa" (word 2) is present, the detection signal waiting for "konnichiwa" rises at the "konnichiwa" portion of the input speech, as shown in FIG. 4(c). The values 0.9 and 0.8 in FIG. 4(b) and (c) indicate certainty (degree of approximation); a high value such as 0.9 or 0.8 means that the word is present in the input speech with high certainty. Thus the registered word "ohayou" is present at portion w1 on the time axis of the input speech signal with a certainty of 0.9, as shown in FIG. 4(b), and the registered word "konnichiwa" is present at portion w2 with a certainty of 0.8, as shown in FIG. 4(c).

【００７２】この確からしさを示す数値（近似度）を用
いて、ユーザの音声がどの話者クラスに属するかの判定
を行うことができる。これについて図面を参照しながら
説明する。It is possible to determine which speaker class the user's voice belongs to by using the numerical value (approximation degree) indicating the certainty. This will be described with reference to the drawings.

【００７３】図５は全体的な構成を示す図で、図１と同
一部分には同一符号が付されている。図５が図１と異な
るのは、話者クラスの判定をＤＲＮＮの出力、つまり、
前記した確からしさを示す数値（近似度）を得て、この
近似度を基に、入力音声がどの話者クラスに属している
かの判定を行う点にある。したがって、話者適応化部２
には、前記近似度を出力するための手段として、ＤＲＮ
Ｎ方式による単語検出部２５と、不特定多数の男性の音
声を基に作成されたＤＲＮＮ音声モデルデータが記憶さ
れる音声モデルデータ記憶部２６ａ（以下第１のＤＲＮ
Ｎ音声モデル記憶部２６ａという）、不特定多数の女性
の音声を基に作成されたＤＲＮＮ音声モデルデータが記
憶される音声モデルデータ記憶部２６ｂ（以下第２のＤ
ＲＮＮ音声モデル記憶部２６ｂという）が設けられる。
なお、この場合も、前記同様、説明を簡単にするため、
話者クラスを男性と女性の２つに区分した例で説明す
る。FIG. 5 shows the overall configuration; parts identical to those in FIG. 1 carry the same reference numerals. FIG. 5 differs from FIG. 1 in that the speaker class is determined from the DRNN output, that is, from the certainty value (degree of approximation) described above. Accordingly, as means for outputting the degree of approximation, the speaker adaptation unit 2 is provided with a DRNN-based word detection unit 25, a voice model data storage unit 26a (hereinafter, the first DRNN voice model storage unit 26a) storing DRNN voice model data created from the voices of many unspecified male speakers, and a voice model data storage unit 26b (hereinafter, the second DRNN voice model storage unit 26b) storing DRNN voice model data created from the voices of many unspecified female speakers. Here too, to simplify the description, the speaker classes are divided into two, male and female.

【００７４】このような構成において、システム側か
ら、たとえば、「おはようと話して下さい」という指示
が出され、ユーザが「おはよう」と発話し、続いて、
「こんにちわと話して下さい」という指示に対して、
「こんにちわ」と発話するというように、システム側か
らの指示に基づいてユーザが発話すると、その音声は音
声分析されて特徴ベクトル列として出力されたのち、入
力データ記憶部２１に蓄えられる。In this configuration, the system issues an instruction such as "Please say 'ohayou'" and the user utters "ohayou"; then, in response to "Please say 'konnichiwa'", the user utters "konnichiwa". Each utterance made in response to the system's instructions is analyzed, output as a feature vector sequence, and stored in the input data storage unit 21.

【００７５】単語検出部２５は、入力音声中にどの程度
の確かさで対応する単語が存在するかを検出する。つま
り、「おはよう」という単語が入力音声中に存在したと
きに、その「おはよう」という信号を待っている検出信
号が、前述したように、図４（ｂ）の如く、入力音声の
「おはよう」の部分で立ち上がる。同様に、「こんにち
わ」という単語が入力音声中に存在したときに、その
「こんにちわ」という信号を待っている検出信号が、同
図（ｃ）の如く、入力音声の「こんにちわ」の部分で立
ち上がる。The word detection unit 25 detects with what certainty the corresponding word is present in the input speech. That is, when the word "ohayou" is present in the input speech, the detection signal waiting for "ohayou" rises at the "ohayou" portion of the input speech, as shown in FIG. 4(b) and described above. Similarly, when "konnichiwa" is present, the detection signal waiting for "konnichiwa" rises at the "konnichiwa" portion of the input speech, as shown in FIG. 4(c).

【００７６】ここでは、入力音声に対して、第１のＤＲ
ＮＮ音声モデル記憶部２６ａおよび第２のＤＲＮＮ音声
モデル記憶部２６ｂを用いて検出信号を得る。たとえ
ば、ユーザの「おはよう」という音声に対して、第１の
ＤＲＮＮ音声モデル記憶部２６ａを用いた場合、その
「おはよう」を待っている検出信号が図６（ａ）のよう
に立ち上がり、第２の音声モデル記憶部２６ｂを用いた
場合、その「おはよう」を待っている検出信号が図６
（ｂ）のように立ち上がったとする。また、ユーザの
「こんにちわ」という音声に対して、第１の音声モデル
記憶部２６ａを用いた場合、その「こんにちわ」を待っ
ている検出信号が図６（ｃ）のように立ち上がり、第２
の音声モデル記憶部２６ｂを用いた場合、その「こんに
ちわ」を待っている検出信号が図６（ｄ）のように立ち
上がったとする。
Here, detection signals are obtained for the input voice using the first DRNN voice model storage unit 26a and the second DRNN voice model storage unit 26b. For example, suppose that for the user's utterance "ohayou" (good morning), the detection signal waiting for "ohayou" rises as shown in FIG. 6(a) when the first DRNN voice model storage unit 26a is used, and as shown in FIG. 6(b) when the second voice model storage unit 26b is used. Likewise, suppose that for the user's utterance "konnichiwa" (hello), the detection signal waiting for "konnichiwa" rises as shown in FIG. 6(c) when the first voice model storage unit 26a is used, and as shown in FIG. 6(d) when the second voice model storage unit 26b is used.

【００７７】このような検出信号の立ち上がり部分（Ｄ
ＲＮＮ出力）にしきい値ｔｈを設定し、このしきい値ｔ
ｈより大きい部分（図中、斜線を施した部分）の面積を
求め、その面積の大きさを基に、話者クラス判定部２２
によりユーザの音声がどの話者クラスに属するかを判定
する。
A threshold th is set on the rising portion of each detection signal (the DRNN output), and the area of the portion exceeding th (the hatched portion in the figure) is computed. Based on the size of that area, the speaker class determination unit 22 determines which speaker class the user's voice belongs to.

【００７８】いま、話者適応用の単語を単語ｗ１，ｗ
２，ｗ３の３個とする。そして、単語ｗ１に対する第１
の音声モデル記憶部２６ａを用いた場合の検出信号の立
ち上がり部分（ＤＲＮＮ出力）のしきい値以上の面積を
Ｓａ１、単語ｗ２に対する第１の音声モデル記憶部２６
ａを用いた場合のＤＲＮＮ出力のしきい値以上の面積を
Ｓａ２、単語ｗ３に対する第１の音声モデル記憶部２６
ａを用いた場合のＤＲＮＮ出力のしきい値以上の面積を
Ｓａ３とすると、第１のＤＲＮＮ音声モデル記憶部２６
ａを用いた場合の単語ｗ１，ｗ２，ｗ３におけるＤＲＮ
Ｎ出力のしきい値以上の合計の面積Ｓａは、Ｓａ＝Ｓａ１＋Ｓａ２＋Ｓａ３・・・（３）で求められる。
Now let the words for speaker adaptation be the three words w1, w2, and w3. Let Sa1, Sa2, and Sa3 be the areas of the rising portion of the detection signal (the DRNN output) at or above the threshold for the words w1, w2, and w3, respectively, when the first voice model storage unit 26a is used. The total area Sa at or above the threshold of the DRNN output over the words w1, w2, and w3 when the first DRNN voice model storage unit 26a is used is then given by
Sa = Sa1 + Sa2 + Sa3 ... (3)

【００７９】また、単語ｗ１に対する第２のＤＲＮＮ音
声モデル記憶部２６ｂを用いた場合のＤＲＮＮ出力のし
きい値以上の面積をＳｂ１、単語ｗ２に対する第２のＤ
ＲＮＮ音声モデル記憶部２６ｂを用いた場合のＤＲＮＮ
出力のしきい値以上の面積をＳｂ２、単語ｗ３に対する
第２のＤＲＮＮ音声モデル記憶部２６ｂを用いた場合の
ＤＲＮＮ出力のしきい値以上の面積をＳｂ３とすると、
第２のＤＲＮＮ音声モデル記憶部２６ｂを用いた場合の
単語ｗ１，ｗ２，ｗ３におけるＤＲＮＮ出力のしきい値
以上の合計の面積Ｓｂは、Ｓｂ＝Ｓｂ１＋Ｓｂ２＋Ｓｂ３・・・（４）で求められる。
Similarly, let Sb1, Sb2, and Sb3 be the areas of the DRNN output at or above the threshold for the words w1, w2, and w3, respectively, when the second DRNN voice model storage unit 26b is used. The total area Sb at or above the threshold of the DRNN output over the words w1, w2, and w3 when the second DRNN voice model storage unit 26b is used is then given by
Sb = Sb1 + Sb2 + Sb3 ... (4)

【００８０】このように求められた面積ＳａとＳｂの大
きさを比較して、話者クラスを判定する。つまり、Ｓａ
＜Ｓｂであれば、入力音声は女性話者クラスに属する音
声であるとの判定がなされ、これにより、多数の女性の
音声を基に作成されたコードブック（第２の不特定話者
コードブック）２３ｂを選択する。
The areas Sa and Sb obtained in this way are compared to determine the speaker class. That is, if Sa < Sb, the input voice is judged to belong to the female speaker class, and the codebook created from the voices of many women (the second unspecified speaker codebook 23b) is selected.
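
The area comparison of equations (3) and (4) can be illustrated with a minimal Python sketch. It assumes that the DRNN output is available as a list of per-frame values and that the "area" is the region between the output curve and the threshold line; the text does not specify either point, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def area_above_threshold(drnn_output: Sequence[float], th: float) -> float:
    """Area between the DRNN output curve and the threshold th (assumed definition)."""
    # Frames at or below the threshold contribute nothing.
    return sum(max(0.0, y - th) for y in drnn_output)


def select_speaker_class(
    outputs_by_class: Dict[str, Sequence[Sequence[float]]], th: float
) -> str:
    """Pick the class whose summed area over all adaptation words is largest.

    outputs_by_class maps a class name to one DRNN output sequence per word
    (w1, w2, w3, ...), so each total corresponds to Sa or Sb in the text.
    """
    totals = {
        name: sum(area_above_threshold(out, th) for out in word_outputs)
        for name, word_outputs in outputs_by_class.items()
    }
    return max(totals, key=totals.get)
```

With two classes this reduces to the rule in the text: the female class is chosen when Sa < Sb.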

【００８１】このようにして、話者クラス判定処理部２
２により、入力話者の音声がどの話者クラスに属するか
の判定を行い、話者クラスが判定されると、その話者ク
ラスの音声を基に作成された不特定話者コードブックが
選択される。これにより、入力話者の音声を認識する
際、その入力話者の音声に近い音声をもとに作成された
コードブックを用いて入力話者の音声がベクトル量子化
され、特徴コードベクトルとして出力されるため、音声
認識部３で高精度な認識処理が行える。
In this way, the speaker class determination processing unit 22 determines which speaker class the input speaker's voice belongs to, and once the speaker class is determined, the unspecified speaker codebook created from the voices of that class is selected. Consequently, when the input speaker's voice is recognized, it is vector-quantized using a codebook created from voices close to that of the input speaker and output as feature code vectors, so the voice recognition unit 3 can perform highly accurate recognition.

【００８２】ところで、音声認識部３は、選択された話
者クラスに対応した音声モデルを用いて音声認識を行
う。つまり、前記したように、この場合、入力話者は女
性話者クラスに属すると判定されているので、第２の音
声モデル記憶部３１ｂを用い、単語検出部３２より近似
度を得て、その近似度を基に音声認識処理部３３により
音声認識を行う。なお、この場合、音声認識部３の認識
手段としてＤＲＮＮ方式を用いれば、図５に示した第
１、第２の音声モデル記憶部２６ａ，２６ｂ（第１、第
２のＤＲＮＮ音声モデル記憶部３１ａ，３１ｂ）、単語
検出部２５（単語検出部３２）などは、音声認識部３と
話者適応化部２で共用することができる。
The voice recognition unit 3 performs voice recognition using the voice model corresponding to the selected speaker class. That is, since the input speaker has been determined to belong to the female speaker class in this case, as described above, the second voice model storage unit 31b is used, the word detection unit 32 obtains degrees of approximation, and the voice recognition processing unit 33 performs voice recognition based on them. If the DRNN method is used as the recognition means of the voice recognition unit 3, the first and second voice model storage units 26a and 26b shown in FIG. 5 (the first and second DRNN voice model storage units 31a and 31b), the word detection unit 25 (word detection unit 32), and so on can be shared between the voice recognition unit 3 and the speaker adaptation unit 2.

【００８３】このように、話者適応処理において、入力
話者の音声が、男性話者クラスに属するか女性話者クラ
スに属するかの判定をＤＲＮＮ出力を用いて行い、音声
認識時においては、そのクラス分けの結果に基づいた音
声モデルを用いて音声認識を行う。この場合、入力話者
の音声は女性話者クラスに属すると判定され、不特定多
数の女性話者の音声を用いて作成された不特定話者コー
ドブックを用いて特徴コードベクトルに変換されて出力
され、音声認識部３では、多数の女性話者の音声を用い
て作成された音声モデルを用いて音声認識されるので、
高精度な音声認識が行える。
As described above, in the speaker adaptation processing, the DRNN output is used to determine whether the input speaker's voice belongs to the male or the female speaker class, and at recognition time, voice recognition is performed using the voice model chosen according to that classification. In this case, the input speaker's voice is determined to belong to the female speaker class. It is converted into feature code vectors using the unspecified speaker codebook created from the voices of a large number of unspecified female speakers, and the voice recognition unit 3 recognizes it using the voice model created from the voices of many female speakers, so highly accurate voice recognition can be performed.

【００８４】なお、以上の説明では、話者クラスとして
男性話者と女性話者の２つの話者クラスに分けた場合で
あったが、話者クラスとして、さらに細分化すれば、よ
り一層、高精度な認識が可能となる。
In the above description, the speakers were divided into two speaker classes, male and female; subdividing the speaker classes further enables even more accurate recognition.

【００８５】また、話者クラスの判定を行う際、最初に
説明したＤＰ距離と、次に説明したＤＲＮＮ出力の両方
を用いれば、より高精度な話者クラス選択を行うことが
できる。この場合、前記（１）式で求めたＤＰ距離の和
Ｄａに（３）式で求めたＤＲＮＮ出力のしきい値以上の
面積の和Ｓａをプラスするともに、（２）式で求めたＤ
Ｐ距離の和Ｄｂに（４）式で求めたＤＲＮＮ出力のしき
い値以上の面積の和Ｓｂをプラスして、その値をもと
に、話者クラスを判定するようにする。ただし、ＤＰ距
離の和Ｄａ，Ｄｂは数値が小さい方を採用し、ＤＲＮＮ
出力のしきい値以上の面積の和Ｓａ，Ｓｂは数値が大き
い方を採用するので、ＤＰ距離を主として考えた場合
は、Ｓａ，Ｓｂはその逆数を取り、Ｓａ，Ｓｂにそれぞ
れ適当な重み付けをしてＤａ，Ｄｂにプラスする。
When the speaker class is determined, using both the DP distance described first and the DRNN output described next allows a more accurate speaker class selection. In this case, the sum Sa of the above-threshold DRNN output areas obtained by equation (3) is added to the sum Da of the DP distances obtained by equation (1), the sum Sb obtained by equation (4) is added to the sum Db of the DP distances obtained by equation (2), and the speaker class is determined from the resulting values. However, a smaller value is better for the DP distance sums Da and Db, whereas a larger value is better for the area sums Sa and Sb. Therefore, when the DP distance is taken as the primary measure, the reciprocals of Sa and Sb are taken, weighted appropriately, and added to Da and Db, respectively.
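
As a minimal sketch of this combination with the DP distance as the primary measure: the weight value and the small constant `eps` that guards against division by zero are assumptions, not values given in the text, and the function names are hypothetical.

```python
def combined_score(
    dp_distance_sum: float,
    drnn_area_sum: float,
    weight: float = 1.0,
    eps: float = 1e-9,
) -> float:
    """Score where smaller is better.

    The DP distance sum (Da or Db) is used as is; the DRNN area sum (Sa or Sb)
    is inverted and weighted so that a large area lowers the score.
    """
    return dp_distance_sum + weight / (drnn_area_sum + eps)


def select_class_combined(
    da: float, sa: float, db: float, sb: float, weight: float = 1.0
) -> str:
    """Return "a" or "b" for the class with the smaller combined score."""
    score_a = combined_score(da, sa, weight)
    score_b = combined_score(db, sb, weight)
    return "a" if score_a <= score_b else "b"
```

How the weight is chosen is left open by the text; in practice it would need to balance the scales of the DP distances and the area reciprocals.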

【００８６】このような処理を行うことにより、入力話
者の音声に対して、より高精度な話者クラスの判定が可
能となり、そのあとの音声認識をより一層高精度に行う
ことができる。
Such processing enables a more accurate determination of the speaker class for the input speaker's voice, so that the subsequent voice recognition can be performed with even higher accuracy.

【００８７】なお、図１に示した例では、音声認識を行
う場合、入力話者の音声特徴ベクトル列を、ベクトル量
子化部２０によりコード化したのち、音声認識部３に渡
すようにしたが、この第１の実施の形態の場合は、必ず
しも量子化する必要はなく、音声分析された特徴ベクト
ル列をそのまま、音声認識部３に送るようにしてもよ
い。この場合、話者クラスに対応した不特定話者コード
ブック２３ａ，２３ｂは不要であり、話者クラスを判定
すると、判定された話者クラスに対応した音声認識用の
音声モデルを選択し、音声認識時においては、選択され
た音声モデルを用いて音声認識を行う。つまり、入力話
者がどの話者クラスに属するかを判定し、その判定結果
に基づいて、その話者クラスに属する音声モデルを選択
し、その選択された音声モデルを用いて認識処理するも
のである。このようにしても音声認識率を大幅に改善す
ることができる。
In the example shown in FIG. 1, the voice feature vector sequence of the input speaker is coded by the vector quantization unit 20 and then passed to the voice recognition unit 3 for recognition. In this first embodiment, however, quantization is not strictly necessary, and the analyzed feature vector sequence may be sent to the voice recognition unit 3 as it is. In that case, the unspecified speaker codebooks 23a and 23b corresponding to the speaker classes are unnecessary: once the speaker class is determined, the voice model for recognition corresponding to that class is selected, and voice recognition is performed using the selected voice model. In other words, the speaker class to which the input speaker belongs is determined, the voice model belonging to that class is selected according to the result, and recognition is performed with the selected model. This alone can greatly improve the voice recognition rate.

【００８８】（第２の実施の形態）以上説明した第１の
実施の形態では、入力話者の音声に対して話者クラスの
判定を行い、その判定結果に対応したコードブック（選
択された話者クラスに属する不特定の話者から作成され
た不特定話者コードブック）を選択して、音声認識時に
は、その選択された不特定話者コードブックを用いて入
力話者の音声をコード化して、音声認識部に送り、音声
認識部では、選択された話者クラスに対応した音声モデ
ルを用いて音声認識を行う処理について説明したが、こ
の第２の実施の形態では、話者クラスの判定結果により
選択された不特定話者コードブックを用いて入力話者コ
ードブックを作成し、音声認識時には、この入力話者コ
ードブックを用いて話者適応した後、音声認識を行うよ
うにするものである。以下に詳細に説明する。
(Second Embodiment) In the first embodiment described above, the speaker class of the input speaker's voice is determined, and the codebook corresponding to the determination result (the unspecified speaker codebook created from unspecified speakers belonging to the selected speaker class) is selected. At recognition time, the input speaker's voice is coded using the selected unspecified speaker codebook and sent to the voice recognition unit, which performs voice recognition using the voice model corresponding to the selected speaker class. In this second embodiment, an input speaker codebook is created using the unspecified speaker codebook selected according to the speaker class determination result, and at recognition time, voice recognition is performed after speaker adaptation using this input speaker codebook. This is described in detail below.

【００８９】図７は第２の実施の形態の全体的な構成を
説明するものであり、図１と同一部分には同一符号が付
されている。なお、この第２の実施の形態は、ＤＰ距離
を用いた話者クラス判定の場合について説明するが、Ｄ
ＲＮＮ出力を用いた場合にも同様に適応することができ
る。
FIG. 7 illustrates the overall configuration of the second embodiment; parts identical to those in FIG. 1 are given the same reference numerals. The second embodiment is described for speaker class determination using the DP distance, but it can be applied in the same way when the DRNN output is used.

【００９０】図７において、２７は入力話者変換コード
ブックであり、この入力話者コードブック２７は、話者
クラス判定結果に基づいて選択された不特定話者コード
ブック（この場合、第１、第２の不特定話者コードブッ
ク２３ａ，２３ｂのいずれか）を用いて作成される。こ
こでは、第１の実施の形態で説明した話者クラス判定処
理により第２の不特定話者コードブック２３ｂが選択さ
れた場合について説明する。
In FIG. 7, reference numeral 27 denotes an input speaker conversion codebook. This input speaker codebook 27 is created using the unspecified speaker codebook selected according to the speaker class determination result (in this case, either the first or the second unspecified speaker codebook 23a, 23b). Here, the case where the second unspecified speaker codebook 23b has been selected by the speaker class determination processing described in the first embodiment is described.

【００９１】話者適応を行うに際して、システム側か
ら、話者適応用の単語として、たとえば、「おはよう」
と話して下さいというような指示がなされ、ユーザがそ
の指示にしたがって、「おはよう」と発話すると、その
音声はマイクロホン１１を通して入力され、Ａ／Ｄ変換
部１２でＡ／Ｄ変換されたのち、音声分析部１３から周
波数の特徴を表す音声特徴ベクトル列を出力する。そし
て、その音声分析された特徴ベクトルは入力データ記憶
部２１に、一旦、記憶される。同様に、次の話者適応用
の単語として、たとえば、「こんにちわ」と話して下さ
いというような指示がなされ、ユーザがその指示にした
がって、「こんにちわ」と言うと、その音声を音声分析
されて得られた音声特徴ベクトル列が出力され、その特
徴ベクトル列は入力データ記憶部２１に記憶される。こ
のようにして、幾つかの話者適応用の単語の特徴ベクト
ル列が記憶される。
When speaker adaptation is performed, the system issues an instruction such as "Please say 'ohayou' (good morning)" for a speaker adaptation word. When the user utters "ohayou" according to the instruction, the voice is input through the microphone 11 and A/D converted by the A/D conversion unit 12, after which the voice analysis unit 13 outputs a voice feature vector sequence representing its frequency characteristics. The analyzed feature vectors are temporarily stored in the input data storage unit 21. Similarly, for the next adaptation word, an instruction such as "Please say 'konnichiwa' (hello)" is issued; when the user says "konnichiwa" according to the instruction, the voice feature vector sequence obtained by analyzing that voice is output and stored in the input data storage unit 21. In this way, the feature vector sequences of several speaker adaptation words are stored.

【００９２】入力話者コードブック２７の作成は、選択
された第１、第２の不特定話者コードブック２３ａ，２
３ｂのいずれか（この場合、第２の不特定話者コードブ
ック２３ｂ）、第１、第２の重心ベクトル列記憶部２４
ａ，２４ｂのいずれか（この場合、第２の重心ベクトル
列記憶部２４ｂ）、入力データ記憶部２１のそれぞれの
データを用いて行う。以下、この処理を図８を参照しな
がら説明する。
The input speaker codebook 27 is created using the data of the selected one of the first and second unspecified speaker codebooks 23a, 23b (in this case, the second unspecified speaker codebook 23b), the corresponding one of the first and second centroid vector sequence storage units 24a, 24b (in this case, the second centroid vector sequence storage unit 24b), and the input data storage unit 21. This processing is described below with reference to FIG. 8.

【００９３】図８は選択された第２の不特定話者コード
ブック２３ｂを表し、ここでは、そのサイズを２５６と
し、白丸で示す２５６個のコードベクトルで構成されて
いる。そして、これらのコードベクトルをＣｋ１，Ｃｋ
２，Ｃｋ３，・・・，Ｃｋ２５６で表し、実際には、２
５６個のコードベクトルで構成されるが、図８ではこの
コードベクトルはＣｋ１，Ｃｋ２，・・・，Ｃｋ９のみ
が図示されている。このコードベクトルは、たとえば、
２００単語程度の単語数をそれぞれの単語ごとに２００
人程度の不特定の女性に話してもらったとき得られる特
徴ベクトル数、つまり、１つの単語につき２５個程度の
特徴ベクトル数が有るとすると、１００万個程度の特徴
ベクトルが得られるが、それをベクトル量子化して２５
６個の代表のコードベクトルにまとめたものである。
FIG. 8 shows the selected second unspecified speaker codebook 23b. Its size is taken here to be 256, and it consists of 256 code vectors indicated by white circles. These code vectors are denoted Ck1, Ck2, Ck3, ..., Ck256; although the codebook actually consists of 256 code vectors, FIG. 8 shows only Ck1, Ck2, ..., Ck9. The code vectors are obtained, for example, as follows: if about 200 words are each spoken by about 200 unspecified women and each word yields about 25 feature vectors, about one million feature vectors are obtained (200 x 200 x 25), and these are vector-quantized into 256 representative code vectors.

【００９４】このような第２の不特定話者コードブック
２３ｂに対して、たとえば、前記のように求められた
「おはよう」に対する重心ベクトル列（ここでは、図
中、黒丸で示し、重心ベクトルＣｓ１，Ｃｓ２，・・
・，Ｃｓ７で構成されているものとする）をベクトル量
子化する。つまり、「おはよう」に対する重心ベクトル
列をＣｋ１，Ｃｋ２，・・・，Ｃｋ２５６のコードベク
トルでベクトル量子化すると、重心ベクトル列の１番目
と２番目の重心ベクトルＣｓ１，Ｃｓ２はコードベクト
ルＣｋ１と対応づけられ、３番目の重心ベクトルＣｓ３
はコードベクトルＣｋ３と対応づけられ、４番目の重心
ベクトルＣｓ４はコードベクトルＣｋ４と対応づけら
れ、５番目、６番目、７番目の重心ベクトルＣｓ５，Ｃ
ｓ６，Ｃｓ７はそれぞれコードベクトルＣｋ５と対応づ
けられる、これにより、「おはよう」の重心ベクトル列
は、Ｃｋ１，Ｃｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５，Ｃｋ
５，Ｃｋ５のコードベクトル列に置き換えられることに
なる。
The centroid vector sequence for, say, "ohayou" (good morning) obtained as described above (shown by black circles in the figure and assumed here to consist of the centroid vectors Cs1, Cs2, ..., Cs7) is vector-quantized with this second unspecified speaker codebook 23b. That is, when the centroid vector sequence for "ohayou" is vector-quantized with the code vectors Ck1, Ck2, ..., Ck256, the first and second centroid vectors Cs1 and Cs2 are associated with the code vector Ck1, the third centroid vector Cs3 with Ck3, the fourth centroid vector Cs4 with Ck4, and the fifth, sixth, and seventh centroid vectors Cs5, Cs6, and Cs7 with Ck5. The centroid vector sequence of "ohayou" is thereby replaced with the code vector sequence Ck1, Ck1, Ck3, Ck4, Ck5, Ck5, Ck5.
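
The quantization step above can be sketched as a nearest-neighbour search over the codebook. This is only an illustration: the squared Euclidean distance is an assumption (the text does not name the distance measure), and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_sequence(sequence: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each vector by the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in sequence
    ]
```

Applied to a centroid vector sequence Cs1, ..., Cs7, the returned indices play the role of the centroid code vector sequence (Ck1, Ck1, Ck3, ... in the example of the text).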

【００９５】次に、入力データ記憶部２１に記憶されて
いるユーザからの「おはよう」の特徴ベクトル列を、前
記量子化された「おはよう」の重心ベクトル列（重心コ
ードベクトル列という）に対してＤＰマッチングにより
対応付けを行う。
Next, the feature vector sequence of the user's "ohayou" (good morning) stored in the input data storage unit 21 is associated, by DP matching, with the quantized centroid vector sequence for "ohayou" (referred to as the centroid code vector sequence).

【００９６】これを図９に示す。なお、図９において
は、説明を分かり易くするため、第２の不特定話者コー
ドブック２３ｂの内容は、「おはよう」の重心コードベ
クトルＣｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５のみを示し、他
のコードベクトルは図示を省略している。
This is shown in FIG. 9. For ease of explanation, FIG. 9 shows only the centroid code vectors Ck1, Ck3, Ck4, and Ck5 of "ohayou" (good morning) as the contents of the second unspecified speaker codebook 23b; the other code vectors are omitted.

【００９７】ここで、ユーザからの「おはよう」が入力
されると、その「おはよう」の特徴ベクトル列（入力話
者特徴ベクトル列）と前記重心コードベクトル列Ｃｋ
１，Ｃｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５，Ｃｋ５，Ｃｋ５
とをＤＰマッチングにより対応づける。前記入力話者特
徴ベクトル列のそれぞれの特徴ベクトルＣｉ１，Ｃｉ
２，Ｃｉ３，Ｃｉ４，Ｃｉ５，Ｃｉ６が図９に示すよう
な位置であるとすれば、前記重心コードベクトル列Ｃｋ
１，Ｃｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５，Ｃｋ５，Ｃｋ５
とのＤＰマッチングをとると、この場合、入力話者特徴
ベクトルＣｉ１，Ｃｉ２はそれぞれ重心コードベクトル
Ｃｋ１に対応づけられ、入力話者特徴ベクトルＣｉ３は
重心コードベクトルＣｋ３に対応づけられ、入力話者特
徴ベクトルＣｉ４，Ｃｉ５はそれぞれ重心コードベクト
ルＣｋ４に対応づけられ、入力話者特徴ベクトルＣｉ６
は重心コードベクトルＣｋ５に対応づけられる。
When the user's "ohayou" (good morning) is input, its feature vector sequence (the input speaker feature vector sequence) is associated with the centroid code vector sequence Ck1, Ck1, Ck3, Ck4, Ck5, Ck5, Ck5 by DP matching. If the feature vectors Ci1, Ci2, Ci3, Ci4, Ci5, and Ci6 of the input speaker feature vector sequence are located as shown in FIG. 9, DP matching against the centroid code vector sequence associates, in this case, the input speaker feature vectors Ci1 and Ci2 with the centroid code vector Ck1, Ci3 with Ck3, Ci4 and Ci5 with Ck4, and Ci6 with Ck5.
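
The association by DP matching can be illustrated with a standard dynamic time warping routine. This is a sketch under assumptions: the text does not give the local path constraints or the distance measure, so symmetric steps (diagonal, vertical, horizontal) and the squared Euclidean distance are used here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_match(
    inputs: Sequence[Vector], reference: Sequence[Vector]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two sequences end to end.

    Returns the accumulated distance and the alignment path as
    (input index, reference index) pairs, both 0-based.
    """
    n, m = len(inputs), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = squared_distance(inputs[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Trace back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Here `inputs` would hold Ci1, ..., Ci6 and `reference` the centroid code vector sequence; the path then says which input vectors are associated with which centroid code vector.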

【００９８】このようにして、入力話者特徴ベクトル列
と、前記重心コードベクトル列との対応付けがなされる
と、次に、対応づけられたベクトル間の差分ベクトル
（入力話者特徴ベクトル−重心コードベクトル）を求め
る。この場合、入力話者特徴ベクトルＣｉ１，Ｃｉ２は
それぞれＣｋ１に対応づけられているので、差分ベクト
ルＶ１は、入力話者特徴ベクトルＣｉ１，Ｃｉ２の平均
を取って、Ｖ１＝（Ｃｉ１＋Ｃｉ２）／２−Ｃｋ１で求められ、同様に、入力話者特徴ベクトルＣｉ３はＣ
ｋ３に対応づけられられているので、差分ベクトルＶ３
は、Ｖ３＝Ｃｉ３−Ｃｋ３で求められ、同様に、入力話者特徴ベクトルＣｉ４，Ｃ
ｉ５はそれぞれＣｋ４に対応づけられているので、差分
ベクトルＶ４は、入力話者特徴ベクトルＣｉ４，Ｃｉ５
の平均を取って、Ｖ４＝（Ｃｉ４＋Ｃｉ５）／２−Ｃｋ４で求められ、同様に、入力話者特徴ベクトルＣｉ６はＣ
ｋ５に対応づけられているので、差分ベクトルＶ５は、Ｖ５＝Ｃｉ６−Ｃｋ５で求められる。すなわち、重心コードベクトル列の各重
心コードベクトルＣｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５は、
入力話者特徴ベクトル列に対し、前記のように求められ
たＶ１，Ｖ３，Ｖ４，Ｖ５の差分ベクトルを有している
ということである。
Once the input speaker feature vector sequence has been associated with the centroid code vector sequence in this way, the difference vector between the associated vectors (input speaker feature vector minus centroid code vector) is obtained. Since the input speaker feature vectors Ci1 and Ci2 are both associated with Ck1, the difference vector V1 is obtained from the average of Ci1 and Ci2:
V1 = (Ci1 + Ci2) / 2 - Ck1
Since the input speaker feature vector Ci3 is associated with Ck3, the difference vector V3 is
V3 = Ci3 - Ck3
Since the input speaker feature vectors Ci4 and Ci5 are both associated with Ck4, the difference vector V4 is obtained from the average of Ci4 and Ci5:
V4 = (Ci4 + Ci5) / 2 - Ck4
Since the input speaker feature vector Ci6 is associated with Ck5, the difference vector V5 is
V5 = Ci6 - Ck5
In other words, the centroid code vectors Ck1, Ck3, Ck4, and Ck5 of the centroid code vector sequence differ from the input speaker feature vector sequence by the difference vectors V1, V3, V4, and V5 obtained above.

【００９９】このようにして、差分ベクトルＶ１，Ｖ
３，Ｖ４，Ｖ５が求められると、次に、この差分ベクト
ルを用いて、入力話者の「おはよう」に対するコードベ
クトルを求め、それを入力話者コードブック２７にマッ
ピングする。
Once the difference vectors V1, V3, V4, and V5 have been obtained in this way, they are used to obtain the code vectors for the input speaker's "ohayou" (good morning), which are then mapped into the input speaker codebook 27.

【０１００】ここで、求めるコードベクトルをＣｔｘで
表す（このｘはコードベクトルの番号を表し、ここでは
１，３，４，５の数値を取る）と、Ｃｔ１＝Ｃｋ１＋Ｖ１Ｃｔ３＝Ｃｋ３＋Ｖ３Ｃｔ４＝Ｃｋ４＋Ｖ４Ｃｔ５＝Ｃｋ５＋Ｖ５となる。
Denoting the code vector to be obtained by Ctx (where x is the code vector number and here takes the values 1, 3, 4, and 5), we have
Ct1 = Ck1 + V1
Ct3 = Ck3 + V3
Ct4 = Ck4 + V4
Ct5 = Ck5 + V5
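
The difference vectors and the converted code vectors can be computed as in the following minimal sketch. It assumes the alignment is given as (input index, code vector number) pairs, for example taken from a DP matching path; the function names and the dictionary representation of the codebook are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def difference_vectors(
    inputs: Sequence[Vector],
    alignment: Sequence[Tuple[int, int]],
    codebook: Dict[int, Vector],
) -> Dict[int, List[float]]:
    """V_x = (mean of input vectors associated with Ck_x) - Ck_x."""
    grouped = defaultdict(list)
    for input_idx, code_no in alignment:
        grouped[code_no].append(inputs[input_idx])
    diffs = {}
    for code_no, vectors in grouped.items():
        dim = len(codebook[code_no])
        mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        diffs[code_no] = [m - c for m, c in zip(mean, codebook[code_no])]
    return diffs


def convert_code_vectors(
    codebook: Dict[int, Vector], diffs: Dict[int, Sequence[float]]
) -> Dict[int, List[float]]:
    """Ct_x = Ck_x + V_x for every code vector that has a difference vector."""
    return {k: [c + v for c, v in zip(codebook[k], diffs[k])] for k in diffs}
```

For learned code vectors the result equals the mean of the associated input vectors; keeping the difference vectors explicit matters because they are reused when unlearned code vectors are interpolated.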

【０１０１】これらＣｔ１，Ｃｔ３，Ｃｔ４，Ｃｔ５
は、第２の不特定話者コードブック２３ｂにおける「お
はよう」の重心コードベクトルＣｋ１，Ｃｋ３，Ｃｋ
４，Ｃｋ５と入力話者の特徴ベクトル列とを対応付け
し、その差分ベクトルＶ１，Ｖ３，Ｖ４，Ｖ５を、第２
の不特定話者コードブック２３ｂの重心コードベクトル
Ｃｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５にプラスして得られた
コードベクトルであり、図１０に示すように、第２の不
特定話者コードブック２３ｂのコードベクトルが差分ベ
クトルにより、入力話者コードブック２７のコードベク
トルに変換される。
These code vectors Ct1, Ct3, Ct4, and Ct5 are obtained by associating the centroid code vectors Ck1, Ck3, Ck4, and Ck5 of "ohayou" (good morning) in the second unspecified speaker codebook 23b with the feature vector sequence of the input speaker and adding the resulting difference vectors V1, V3, V4, and V5 to those centroid code vectors. As shown in FIG. 10, the code vectors of the second unspecified speaker codebook 23b are converted by the difference vectors into code vectors of the input speaker codebook 27.

【０１０２】ただし、この場合、「おはよう」という１
つの話者適応用の単語のみについて考えているので、４
つのコードベクトルＣｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５の
みが変換されたコードベクトルとして求められたことに
なるが、その他の話者適応用の単語について同様の処理
を行うことにより、それに対する入力話者コードベクト
ルが作成される。
In this case, however, only the single speaker adaptation word "ohayou" (good morning) is considered, so only the four code vectors Ck1, Ck3, Ck4, and Ck5 have been converted. Performing the same processing for the other speaker adaptation words creates the input speaker code vectors corresponding to them.

【０１０３】このようにして、第２の不特定話者コード
ブック２３ｂのコードベクトルが入力話者コードブック
２７のコードベクトルに変換されて入力話者コードブッ
クが作成されるが、第２の不特定話者コードブック２３
ｂ内に、たとえば、２５６個のコードベクトルがあると
すると、全てが変換されるものではなく、変換されない
コードベクトル（未学習コードベクトルという）も多く
存在する。この未学習コードベクトルを変換するための
処理（これを補間処理という）について以下に説明す
る。
In this way, code vectors of the second unspecified speaker codebook 23b are converted into code vectors of the input speaker codebook 27, and the input speaker codebook is created. However, if the second unspecified speaker codebook 23b contains, for example, 256 code vectors, not all of them are converted, and many code vectors remain unconverted (these are called unlearned code vectors). The processing for converting these unlearned code vectors (called interpolation processing) is described below.

【０１０４】ここでは、説明を簡略化するため、「おは
よう」という１つの話者適応用の単語のみについて考え
るものとし、この「おはよう」という単語に対して４つ
の重心コードベクトルＣｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５
が入力話者コードブック２７へのコードベクトルとして
変換され、そのほか変換すべきコードベクトル（未学習
コードベクトル）は図１０に示すように、Ｃｋ２，Ｃｋ
６，Ｃｋ７，Ｃｋ８，Ｃｋ９であるとする。
To simplify the explanation, only the single speaker adaptation word "ohayou" (good morning) is considered here. Assume that the four centroid code vectors Ck1, Ck3, Ck4, and Ck5 for this word have been converted into code vectors of the input speaker codebook 27, and that the remaining code vectors to be converted (the unlearned code vectors) are Ck2, Ck6, Ck7, Ck8, and Ck9, as shown in FIG. 10.

【０１０５】この未学習コードベクトルＣｋ２，Ｃｋ
６，Ｃｋ７，Ｃｋ８，Ｃｋ９のうち、今、Ｃｋ２を入力
話者コードブック２３へ変換するための補間処理につい
て図１１を参照しながら説明する。
Of the unlearned code vectors Ck2, Ck6, Ck7, Ck8, and Ck9, the interpolation processing for converting Ck2 into a code vector of the input speaker codebook 27 is now described with reference to FIG. 11.

【０１０６】図１１において、未学習コードベクトルＣ
ｋ２の周辺に存在する学習済みのコードベクトルのう
ち、３つのコードベクトルを選ぶ。この場合、未学習コ
ードベクトルＣｋ２の周辺には、学習済みのコードベク
トルとしてＣｋ１，Ｃｋ３，Ｃｋ４，Ｃｋ５の４つが存
在するが、このうち、コードベクトルＣｋ１，Ｃｋ４，
Ｃｋ５の３個がＣｋ２に近い距離に存在する学習済みの
コードベクトルであるとすると、これら近い距離の３つ
の学習済みコードベクトルを選択し、これらのコードベ
クトルＣｋ１，Ｃｋ４，Ｃｋ５に対応する前記差分ベク
トルＶ１，Ｖ４，Ｖ５を用いて、未学習コードベクトル
Ｃｋ２に対する差分ベクトルＶ２を決定する。このＶ２
は、Ｖ２＝μ２１・Ｖ１＋μ２４・Ｖ４＋μ２５・Ｖ５で求められる。この式において、μ２１、μ２４、μ２
５は重みを表す係数であり、μ２１はＣｋ２とＣｋ１の
距離に応じた重み、μ２４はＣｋ２とＣｋ４の距離に応
じた重み、μ２５はＣｋ２とＣｋ５の距離に応じた重み
であることを示し、それぞれの距離に応じて重みの大き
さが設定され、μ２１＋μ２４＋μ２５＝１となるよう
に設定される。このようにして、Ｃｋ２に対する差分ベ
クトルが決定され、その差分ベクトルＶ２を用い、Ｃｔ２＝Ｃｋ２＋Ｖ２により、未学習コードベクトルＣｋ２が入力話者コード
ブック２７のコードベクトルに変換される。
In FIG. 11, three code vectors are selected from the learned code vectors located around the unlearned code vector Ck2. In this case, the four learned code vectors Ck1, Ck3, Ck4, and Ck5 lie around Ck2. Assuming that Ck1, Ck4, and Ck5 are the three closest to Ck2, these three are selected, and the difference vectors V1, V4, and V5 corresponding to them are used to determine the difference vector V2 for the unlearned code vector Ck2:
V2 = μ21·V1 + μ24·V4 + μ25·V5
Here μ21, μ24, and μ25 are weighting coefficients: μ21 is a weight according to the distance between Ck2 and Ck1, μ24 a weight according to the distance between Ck2 and Ck4, and μ25 a weight according to the distance between Ck2 and Ck5. Their magnitudes are set according to the respective distances so that μ21 + μ24 + μ25 = 1. With the difference vector V2 determined in this way, the unlearned code vector Ck2 is converted into a code vector of the input speaker codebook 27 by
Ct2 = Ck2 + V2

【０１０７】同様にして、Ｃｋ２以外の未学習コードベ
クトルＣｋ６，Ｃｋ７，Ｃｋ８，Ｃｋ９のそれぞれの差
分ベクトルが求められ、それぞれの差分ベクトルを用い
て変換される。
In the same way, difference vectors are obtained for the other unlearned code vectors Ck6, Ck7, Ck8, and Ck9, and each is converted using its own difference vector.

【０１０８】なお、前記重み係数μを求める方法として
は、ファジー級数を用いる方法、あるいは２乗距離の逆
数に比例した距離を用いる方法などがある。２乗距離の
逆数に比例した距離を用いる方法の一例として、たとえ
ば、μ２１を例に取れば、 μ２１＝｛１／（ｄ２１×ｄ２１）｝／｛１／（ｄ２１
×ｄ２１）＋１／（ｄ２４×ｄ２４）＋１／（ｄ２５×
ｄ２５）｝により求める。ここで、ｄ２１はコードベクトルＣｋ２
とＣｋ１との距離、ｄ２４はコードベクトルＣｋ２とＣ
ｋ４との距離、ｄ２５はコードベクトルＣｋ２とＣｋ５
との距離を示している。なお、上式において、距離の２
乗分の１｛１／（ｄ２１×ｄ２１）｝を、｛１／（ｄ２
１×ｄ２１）＋１／（ｄ２４×ｄ２４）＋１／（ｄ２５
×ｄ２５）｝で割るのは、前記したように、μ２１＋μ
２４＋μ２５＝１となるμ２１を求めるためである。同
様にして、μ２４、μ２５も求められる。このような２
乗距離の逆数に比例した距離により重み係数を求める方
法は、計算量が少なく単純であるため、これを用いた方
が有利な場合が多い。
Methods for obtaining the weighting coefficients μ include a method using a fuzzy series and a method using a value proportional to the reciprocal of the squared distance. As an example of the latter, μ21 is obtained by
μ21 = {1/(d21×d21)} / {1/(d21×d21) + 1/(d24×d24) + 1/(d25×d25)}
Here, d21 is the distance between the code vectors Ck2 and Ck1, d24 the distance between Ck2 and Ck4, and d25 the distance between Ck2 and Ck5. In this equation, the reciprocal of the squared distance, 1/(d21×d21), is divided by {1/(d21×d21) + 1/(d24×d24) + 1/(d25×d25)} so that μ21 satisfies μ21 + μ24 + μ25 = 1, as described above. μ24 and μ25 are obtained in the same way. Because obtaining the weighting coefficients from the reciprocal of the squared distance requires little computation and is simple, this method is often advantageous.
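
The interpolation with inverse-squared-distance weights can be sketched as follows. The sketch assumes Euclidean distance and that the unlearned code vector does not coincide with any learned one (otherwise the reciprocal is undefined); the function names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inverse_square_weights(target: Vector, neighbours: Dict[int, Vector]) -> Dict[int, float]:
    """Weights proportional to 1/d^2, normalised so that they sum to 1."""
    inv = {k: 1.0 / squared_distance(target, v) for k, v in neighbours.items()}
    total = sum(inv.values())
    return {k: w / total for k, w in inv.items()}


def interpolate_difference(
    target: Vector,
    learned_codes: Dict[int, Vector],
    learned_diffs: Dict[int, Vector],
    n_neighbours: int = 3,
) -> List[float]:
    """Difference vector for an unlearned code vector from its nearest learned ones."""
    nearest = sorted(
        learned_codes, key=lambda k: squared_distance(target, learned_codes[k])
    )[:n_neighbours]
    weights = inverse_square_weights(target, {k: learned_codes[k] for k in nearest})
    return [
        sum(weights[k] * learned_diffs[k][d] for k in nearest)
        for d in range(len(target))
    ]
```

The converted code vector is then the unlearned code vector plus the returned difference vector, as in Ct2 = Ck2 + V2.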

【０１０９】以上のような処理により、入力話者コード
ブック２７を作成することができる。
The input speaker codebook 27 can be created by the processing described above.

【０１１０】このようにして、選択された話者クラスに
対応する不特定話者コードブックを用いて入力話者コー
ドブックを作成し、音声認識時においては、この入力話
者の音声をこの入力話者コードブックを用いてコード化
した後、音声認識を行うことで、より一層、高精度な音
声認識を行うことができる。
In this way, an input speaker codebook is created using the unspecified speaker codebook corresponding to the selected speaker class, and at recognition time, the input speaker's voice is coded using this input speaker codebook before voice recognition is performed, which enables even more accurate voice recognition.

【０１１１】ところで、以上の説明は、話者適応する場
合、システム側から、たとえば、「おはようと話して下
さい」、「こんにちわと話して下さい」というような指
示が出されて、その指示に従って、ユーザが「おはよ
う」、「こんにちわ」などと発話して得られた音声特徴
データを用いるため、ＤＰマッチング処理を行う場合、
各単語に対する音声特徴データ（特徴ベクトル列）の始
端・終端がわかっていて、その始端・終端のデータを基
にＤＰマッチング処理を行うことができたが、ユーザが
発話した内容の中から話者適応用の単語を検出して自動
的に話者適応していくことも可能である。以下に、これ
について説明する。
In the description so far, for speaker adaptation the system issues instructions such as "Please say 'ohayou' (good morning)" or "Please say 'konnichiwa' (hello)", and the voice feature data obtained when the user utters "ohayou", "konnichiwa", and so on in response are used. The start and end of the voice feature data (feature vector sequence) for each word are therefore known, and DP matching can be performed based on those endpoints. It is also possible, however, to detect speaker adaptation words in whatever the user utters and to adapt to the speaker automatically. This is described below.

【０１１２】このように、システム側からの指示のな
い、いわゆる“教師なし”において、話者の発話する音
声の中から、話者適応用の単語の音声区間を切り出す手
段として、本発明では、始終端フリーＤＰマッチング
と、前記したＤＲＮＮ出力を用いる。
For such so-called unsupervised operation, in which the system gives no instructions, the present invention uses start/end-free DP matching and the DRNN output described above as means for cutting out the voice section of a speaker adaptation word from the speaker's utterance.

【０１１３】この始終端フリーＤＰマッチングは、何ら
かの単語の標準データと入力音声データとを比較し、入
力音声データの中から、何らかの単語に対する音声区間
を切り出し、さらに、標準単語データと切り出された音
声区間とのＤＰマッチング距離を入力するもので、この
始終端フリーＤＰマッチング出力と、ＤＲＮＮ出力とを
組み合わせることで、ユーザが発話する内容の中から自
動的に話者適応用の単語を切り出すようにする。すなわ
ち、ユーザの発話する内容の中に、たとえば「おはよ
う」という単語が存在した場合、始終端フリーＤＰマッ
チングにより、その単語に対する音声区間を切り出すこ
とができ、同様に、その発話内容の中の「おはよう」が
存在すると、高い近似度を有するＤＲＮＮ出力が出力さ
れ、これら両方の出力から、ユーザの発話内容の中の或
る時刻に、「おはよう」という単語が存在しているらし
いということを高い確率で知ることができる。
Start/end-free DP matching compares the standard data of a given word with the input voice data, cuts out from the input voice data the voice section corresponding to that word, and yields the DP matching distance between the standard word data and the extracted section. By combining this start/end-free DP matching output with the DRNN output, speaker adaptation words are cut out automatically from what the user utters. That is, if the word "ohayou" (good morning), for example, is present in the user's utterance, start/end-free DP matching can cut out the voice section for that word; likewise, when "ohayou" is present in the utterance, a DRNN output with a high degree of approximation is produced. From these two outputs it can be known with high probability that the word "ohayou" appears to be present at a certain time in the user's utterance.
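
One common way to realise start/end-free matching is subsequence dynamic time warping, sketched below. This is an illustrative variant, not necessarily the formulation used in the patent: the squared Euclidean distance, the step pattern, and the lack of length normalisation are all assumptions, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def endpoint_free_match(
    template: Sequence[Vector], stream: Sequence[Vector]
) -> Tuple[float, int, int]:
    """Find the stream segment that best matches the template.

    Returns (distance, start, end) with 0-based, inclusive frame indices.
    """
    n, m = len(template), len(stream)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0   # free start: the match may begin at any frame
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = squared_distance(template[i - 1], stream[j - 1])
            candidates = [
                (cost[i - 1][j - 1], start[i - 1][j - 1]),  # diagonal step
                (cost[i][j - 1], start[i][j - 1]),          # stay on template frame
            ]
            if i > 1:
                candidates.append((cost[i - 1][j], start[i - 1][j]))  # stay on stream frame
            best_cost, best_start = min(candidates)
            cost[i][j] = d + best_cost
            start[i][j] = best_start
    # Free end: take the stream frame where the full template ends most cheaply.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], start[n][end], end - 1
```

A detection would then be accepted only when this distance is small and the DRNN output for the same word is high around the same time, which is the combination the text describes.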

【０１１４】このような手段を用いることにより、ユー
ザがシステムに対して発話して行くうちに、その発話内
容のなかから、話者適応用の単語が切り出され、切り出
された話者適応用の単語が入力データ記憶部２１に蓄え
られ、自動的に前記したような話者適応がなされる。し
たがって、使用している間に認識性能が自然に高くなる
という効果が得られる。
With such means, as the user keeps speaking to the system, speaker adaptation words are cut out from the utterances and stored in the input data storage unit 21, and the speaker adaptation described above is performed automatically. The recognition performance therefore improves naturally as the system is used.

【０１１５】なお、以上の第１、第２の実施の形態にお
いて、ユーザが話す幾つかの話者適応用の単語の特徴デ
ータを、一旦、入力データ記憶部２１に記憶させたの
ち、色々な処理を行うようにしている。この入力データ
記憶部２１はＲＡＭが用いられるが、装置全体をできる
だけ小型化しようとする場合、当然ながらＲＡＭの容量
も限られたものとなる。これに対処するため、入力デー
タを量子化してコード化したデータとして入力データ記
憶部２１に記憶させるようにしてもよい。特に、前記し
た教師なしの場合においては、単語データを数多く蓄え
ておく必要があることからデータ量を少なくするため、
コード化した方が有利である。In the first and second embodiments described above, the feature data of the several speaker adaptation words spoken by the user are first stored in the input data storage unit 21, and the various processing steps are then performed. RAM is used for the input data storage unit 21, but when the whole device is to be made as small as possible, the RAM capacity is naturally limited. To cope with this, the input data may be quantized and stored in the input data storage unit 21 as coded data. Coding is particularly advantageous in the unsupervised case described above, because a large amount of word data must be accumulated and the data volume therefore needs to be reduced.
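As a rough illustration of storing coded data instead of raw feature vectors, the sketch below replaces each feature vector with the index of its nearest code vector. The codebook contents and the squared-Euclidean distance are assumptions made for the example, not details taken from the patent.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest code vector."""
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

# Hypothetical 3-entry codebook and three input frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, 0.2], [3.5, 3.9], [0.9, 1.2]]
codes = quantize(frames, codebook)
```

Each frame is then stored as one small integer rather than a full vector, which is the memory saving the paragraph refers to.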

【０１１６】（第３の実施の形態）この第３の実施の形
態は、話者のクラス分けを行って、そのクラスごとの不
特定話者コードブック（前記第１、第２の実施の形態に
おいては、第１の不特定話者コードブック２３ａ、第２
の不特定話者コードブック２３ｂに相当する）を作成す
る処理についてである。なお、前記第１、第２の実施の
形態では、説明を分かり易くするため、便宜上、話者ク
ラスを男性と女性というように具体的な分け方をした
が、話者クラスは実際には、男性／女性というような明
確な分け方ではなく、以下に説明するように、不特定多
数の話者から作成されたコードブックの内容を音声の特
徴データを基に幾つかに分類して、分類されたものをそ
れぞれの話者クラスとするもので、分類された結果、男
性話者、女性話者、さらには、子どもというような話者
クラスに分けられるものである。したがって、たとえ
ば、２つの話者クラスにクラス分けした場合、結果的
に、女性話者と男性話者のクラスに分けられたというこ
とである。そして、入力音声がどの話者クラスに属する
かを判定し、その判定結果に対応する不特定話者コード
ブックを選択するものである。以下にそのクラス分けに
ついて説明する。(Third Embodiment) The third embodiment concerns the processing that classifies speakers and creates an unspecified speaker codebook for each class (corresponding to the first unspecified speaker codebook 23a and the second unspecified speaker codebook 23b in the first and second embodiments). In the first and second embodiments, the speaker classes were described, for ease of explanation, as a concrete division into male and female. In practice, however, the speaker classes are not an explicit male/female division. As explained below, the contents of a codebook created from a large number of unspecified speakers are divided into several groups based on the speech feature data, and each group becomes a speaker class; classes such as male speakers, female speakers, or children emerge as a result of that classification. Thus, when the speakers are divided into two classes, for example, the outcome happens to be a class of female speakers and a class of male speakers. It is then determined which speaker class the input speech belongs to, and the unspecified speaker codebook corresponding to the result is selected. The classification is described below.

【０１１７】図１２は、たとえば、「おはよう」、「こ
んにちわ」、「ただいま」という単語をＡ，Ｂ，Ｃとい
う３人が発話したときに得られた特徴ベクトル列であ
り、Ａ１はＡの話者の「おはよう」に対する特徴ベクト
ル列、Ａ２はＡの話者の「こんにちわ」に対する特徴ベ
クトル列、Ａ３はＡの話者の「ただいま」に対する特徴
ベクトル列、Ｂ１はＢの話者の「おはよう」に対する特
徴ベクトル列、Ｂ２はＢの話者の「こんにちわ」に対す
る特徴ベクトル列、Ｂ３はＢの話者の「ただいま」に対
する特徴ベクトル列、Ｃ１はＣの話者の「おはよう」に
対する特徴ベクトル列、Ｃ２はＣの話者の「こんにち
わ」に対する特徴ベクトル列、Ｃ３はＣの話者の「ただ
いま」に対する特徴ベクトル列を表している。FIG. 12 shows, as an example, the feature vector sequences obtained when three speakers A, B, and C utter the words "ohayou" ("good morning"), "konnichiwa" ("hello"), and "tadaima" ("I'm home"). A1, A2, and A3 are speaker A's feature vector sequences for "ohayou", "konnichiwa", and "tadaima", respectively; B1, B2, and B3 are the corresponding sequences for speaker B; and C1, C2, and C3 are those for speaker C.

【０１１８】この図１２では３人の話者がそれぞれ３つ
の単語を発話したときに得られる特徴ベクトルが示され
ているが、実際には、多数の不特定話者が幾つかの単語
について発話して得られた単語毎の特徴ベクトル列が多
数存在する。FIG. 12 shows the feature vector sequences obtained when three speakers each utter three words, but in practice there are many such per-word feature vector sequences, obtained from a large number of unspecified speakers uttering several words.

【０１１９】このような不特定多数の話者が幾つかの単
語毎に発話して得られた単語毎の特徴ベクトル列に対
し、或る単語に対するそれぞれの話者が発話して得られ
た特徴ベクトル列のＤＰ距離を求めてクラス分けを行
う。Classification is performed on these per-word feature vector sequences, obtained from a large number of unspecified speakers, by computing, for each word, the DP distance between the feature vector sequences of the individual speakers.

【０１２０】ここで、或る単語に対するそれぞれの話者
が発話して得られた特徴ベクトル列のＤＰ距離について
図１３を参照しながら説明する。Here, the DP distance of the feature vector sequence obtained by each speaker uttering a certain word will be described with reference to FIG.

【０１２１】図１３（ａ）は、話者Ａが「おはよう」と
発話したときの特徴ベクトル列Ａ１を基準として、話者
Ｂの「おはよう」の特徴ベクトル列Ｂ１とのＤＰマッチ
ングを取った場合のそれぞれの時刻における特徴ベクト
ル同志の対応関係を示すものであり、同図（ｂ）は、話
者Ｂが「おはよう」と発話したときの特徴ベクトル列Ｂ
１を基準として、話者Ａの「おはよう」の特徴ベクトル
列Ａ１とのＤＰマッチングを取った場合のそれぞれの時
刻における特徴ベクトル同志の対応関係を示すものであ
る。FIG. 13(a) shows the time-by-time correspondence between feature vectors when DP matching is performed against speaker B's feature vector sequence B1 for "ohayou", taking speaker A's feature vector sequence A1 for "ohayou" as the reference. FIG. 13(b) shows the correspondence when speaker B's sequence B1 is taken as the reference and DP matching is performed against speaker A's sequence A1.

【０１２２】このように、話者Ａと話者Ｂのどちらを基
準とするかで、ＤＰ距離が異なってくるので、話者Ａを
基準とした場合の長さで正規化されたＤＰ距離と、話者
Ｂを基準とした場合の長さで正規化されたＤＰ距離をそ
れぞれ求め、その平均値を求めて、それを或る単語に対
する話者Ａと話者Ｂの話者間距離とする。つまり、図１
３の例では、話者Ａと話者Ｂの「おはよう」という単語
に対するＤＰ距離ｄ１（Ａ１，Ｂ１）は、ｄ１（Ａ１，Ｂ１）＝｛（ｄＡ１＋ｄＡ２＋，・・・，
＋ｄＡ８）／８＋ｄＢ１＋ｄＢ２＋，・・・，＋ｄＢ
７）／７｝／２で求められる。この式において、ｄＡ１，ｄＡ２，・・
・，ｄＡ８は、話者Ａの特徴ベクトル列を基準とした場
合における話者Ｂの特徴ベクトル列とのそれぞれの時刻
対応（時刻ｔ１からｔ８と時刻ｔ１〜時刻ｔ７）の距離
を表し、ｄＢ１，ｄＢ２，・・・，ｄＢ８は、話者Ｂの
特徴ベクトル列を基準とした場合における話者Ａの特徴
ベクトル列のそれぞれの時刻対応（時刻ｔ１からｔ７と
時刻ｔ１〜時刻ｔ８）の距離を表している。As this shows, the DP distance differs depending on whether speaker A or speaker B is taken as the reference. Therefore, the length-normalized DP distance with speaker A as the reference and the length-normalized DP distance with speaker B as the reference are both computed, and their average is taken as the distance between speaker A and speaker B for that word. In the example of FIG. 13, the DP distance $d_1(A_1, B_1)$ between speaker A and speaker B for the word "ohayou" is

$$d_1(A_1, B_1) = \frac{1}{2}\left\{\frac{d_{A1} + d_{A2} + \cdots + d_{A8}}{8} + \frac{d_{B1} + d_{B2} + \cdots + d_{B7}}{7}\right\}$$

Here $d_{A1}, d_{A2}, \ldots, d_{A8}$ are the distances at the respective time correspondences (times t1 to t8 against times t1 to t7) when speaker A's feature vector sequence is the reference and speaker B's sequence is matched to it, and $d_{B1}, d_{B2}, \ldots, d_{B7}$ are the distances at the respective time correspondences (times t1 to t7 against times t1 to t8) when speaker B's feature vector sequence is the reference and speaker A's sequence is matched to it.
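The symmetric, length-normalized word distance can be sketched as follows. The patent does not spell out the DP path constraints here, so this example assumes that every reference frame is matched to exactly one frame of the other sequence, that the matched index may stay in place or advance by one or two frames, and that the local cost is Euclidean. The function names are illustrative only.

```python
import math

def directed_dp_distance(ref, other):
    """DP distance with `ref` as the reference, normalized by len(ref).

    Assumed constraint: one matched frame of `other` per reference frame,
    with the matched index advancing by 0, 1 or 2 at each step. Returns
    infinity when no path satisfies this constraint.
    """
    n, m = len(ref), len(other)
    cost = [[math.inf] * m for _ in range(n)]
    cost[0][0] = math.dist(ref[0], other[0])
    for i in range(1, n):
        for j in range(m):
            best_prev = min(cost[i - 1][max(j - 2, 0): j + 1])
            cost[i][j] = best_prev + math.dist(ref[i], other[j])
    return cost[n - 1][m - 1] / n

def word_distance(seq_a, seq_b):
    """Average of the two directed, length-normalized DP distances."""
    return 0.5 * (directed_dp_distance(seq_a, seq_b)
                  + directed_dp_distance(seq_b, seq_a))

# Toy one-dimensional sequences of different lengths.
seq_a = [[0.0], [1.0], [2.0], [3.0]]
seq_b = [[0.0], [1.0], [3.0]]
d = word_distance(seq_a, seq_b)
```

With these toy sequences the two directed distances are 0.25 and 0.0, which shows why the averaged form is used: the result no longer depends on which speaker is taken as the reference.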

【０１２３】以上は話者ＡとＢの「おはよう」という単
語に対するＤＰ距離であるが、同様にして、話者ＡとＢ
の他の単語についてのＤＰ距離を求める。The above is the DP distance between speakers A and B for the word "ohayou"; the DP distances between speakers A and B for the other words are obtained in the same way.

【０１２４】今、単語数がｎであるとすれば、話者Ａと
話者Ｂの各単語におけるＤＰ距離の和ｄＡＢは、ｄＡＢ＝ｄ１（Ａ１，Ｂ１）＋ｄ１（Ａ２，Ｂ２）＋，
・・・，＋ｄ１（Ａｎ，Ｂｎ）で表され、この各単語におけるＤＰ距離の和ｄＡＢを話
者Ａと話者Ｂの話者間距離という。ここで、ｄ１（Ａ
１，Ｂ１）は前記したように話者Ａと話者Ｂの「おはよ
う」という単語に対するＤＰ距離であり、ｄ１（Ａ２，
Ｂ２）、ｄ１（Ａｎ，Ｂｎ）はそれぞれの単語に対する
話者Ａと話者ＢのそれぞれのＤＰ距離を表している。If the number of words is n, the sum $d_{AB}$ of the DP distances between speaker A and speaker B over all words is

$$d_{AB} = d_1(A_1, B_1) + d_1(A_2, B_2) + \cdots + d_1(A_n, B_n)$$

and this sum is called the inter-speaker distance between speaker A and speaker B. Here $d_1(A_1, B_1)$ is, as described above, the DP distance between speakers A and B for the word "ohayou", and $d_1(A_2, B_2)$ through $d_1(A_n, B_n)$ are the DP distances between speakers A and B for the respective remaining words.
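The summation over words can be written compactly. In the sketch below each speaker is a dictionary from word to feature sequence, and `toy_word_distance` is only a stand-in for the symmetric DP distance between two utterances of the same word; both the data layout and the stand-in are assumptions for illustration.

```python
def inter_speaker_distance(words_a, words_b, word_distance):
    """Sum the per-word distances over the words uttered by both speakers."""
    return sum(word_distance(words_a[w], words_b[w]) for w in words_a)

def toy_word_distance(seq_a, seq_b):
    # Stand-in for the DP distance: mean absolute difference of
    # equal-length scalar sequences (not the patent's measure).
    return sum(abs(x - y) for x, y in zip(seq_a, seq_b)) / len(seq_a)

speaker_a = {"ohayou": [1.0, 2.0], "tadaima": [0.0, 0.0]}
speaker_b = {"ohayou": [2.0, 4.0], "tadaima": [1.0, 1.0]}
d_ab = inter_speaker_distance(speaker_a, speaker_b, toy_word_distance)
```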

【０１２５】このようにして、全ての話者同志の話者間
距離を求め、求められた話者同志の話者間距離を基に、
図１４のフローチャートで示すような処理により話者ク
ラス分けを行う。以下、このフローチャートに沿って処
理を説明する。The inter-speaker distances between all pairs of speakers are obtained in this way, and the speakers are then classified on the basis of these distances by the procedure shown in the flowchart of FIG. 14. The procedure is described below with reference to the flowchart.

【０１２６】ここでは、４つの話者クラスにクラス分け
する場合（話者クラス数Ｎ＝４）について説明する。ま
ず、Ｎ＝１とし（ステップｓ１）、このＮ＝１のクラス
（このＮ＝１というのは、クラス分けする前の全体の範
囲である）において、前記のように求められた各話者同
志の話者間距離のうち、最も話者間距離の遠い二人の話
者を選択する（ステップｓ２）。図１５は各話者を話者
間距離に基づいて、便宜的に点で表したもので、この場
合、話者間距離の最も遠い話者Ａと話者Ｚが選択された
ものとする。この図１４は説明の都合上、各話者を点で
表しているが、本来は、たとえば、数１００人というよ
うな不特定の話者が、それぞれ幾つかの単語について発
話して得られた単語毎の特徴ベクトル列が多数存在する
ものである。The following describes the case of dividing the speakers into four speaker classes (number of speaker classes N = 4). First, N is set to 1 (step s1); N = 1 denotes the whole population before any division. Within this single class, the two speakers with the largest inter-speaker distance, among the distances obtained as described above, are selected (step s2). FIG. 15 represents each speaker, for convenience, as a point placed according to the inter-speaker distances; here speakers A and Z, who are farthest apart, are assumed to have been selected. Although FIG. 15 shows each speaker as a point for ease of explanation, each speaker is in fact represented by many per-word feature vector sequences, obtained from, for example, several hundred unspecified speakers each uttering several words.
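Step s2 amounts to finding the pair with the largest entry in the table of inter-speaker distances. A minimal sketch, assuming the distances are already available as a nested dictionary (the speaker labels and values below are made up):

```python
from itertools import combinations

def farthest_pair(speakers, dist):
    """Return the pair of speakers with the largest inter-speaker distance."""
    return max(combinations(speakers, 2),
               key=lambda pair: dist[pair[0]][pair[1]])

# Hypothetical symmetric distance table for three speakers.
dist = {
    "A": {"A": 0.0, "B": 1.0, "Z": 9.0},
    "B": {"A": 1.0, "B": 0.0, "Z": 8.5},
    "Z": {"A": 9.0, "B": 8.5, "Z": 0.0},
}
seeds = farthest_pair(["A", "B", "Z"], dist)
```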

【０１２７】このようにして最も話者間距離の遠い二人
の話者Ａと話者Ｚが選択されると、次に、図１４のステ
ップｓ３により、この話者Ａおよび話者Ｚに対して、他
の話者Ｂ，Ｃ，Ｄ，・・・の代表ベクトル列のＤＰ距離
を求め、求められた距離を基に図１５の破線Ｌ１で示す
ように２分割し、２つのクラスＣＬ１１，ＣＬ２１を形
成する（第１回目の試行的なクラス分け）。次に、各ク
ラスにおいて、それぞれの単語ごとに重心ベクトル列を
求める（ステップｓ４）。ここでは今、２つのクラスＣ
Ｌ１１，ＣＬ２１であるから、クラスＣＬ１１，ＣＬ２
１において、それぞれの単語ごとに重心ベクトル列を求
める。この重心ベクトル列の求め方は前述した通りであ
り、たとえば、「おはよう」という単語を例にとれば、
クラスＣＬ１１，ＣＬ２１ごとに、それぞれのクラスに
属する話者の「おはよう」という特徴ベクトル列を用い
て重心ベクトル列を求める。Once the two most distant speakers A and Z have been selected, step s3 of FIG. 14 computes the DP distances from the representative vector sequences of the other speakers B, C, D, ... to speaker A and to speaker Z. Based on these distances, the speakers are divided into two as indicated by the broken line L1 in FIG. 15, forming two classes CL11 and CL21 (the first trial classification). Next, a centroid vector sequence is obtained for each word in each class (step s4). Since there are currently two classes, CL11 and CL21, a centroid vector sequence is obtained for each word in each of them. The centroid vector sequence is obtained as described earlier; taking the word "ohayou" as an example, the centroid vector sequence for each of CL11 and CL21 is computed from the "ohayou" feature vector sequences of the speakers belonging to that class.

【０１２８】次に、収束したか否かの判定（ステップｓ
５）を行うが、これは、ステップｓ３，ｓ４を何回か繰
り返すことにより、ある一定の結果に達したか否かを見
るものであり、１回の処理で収束することはなく、ある
一定の結果に達するまでステップｓ３，ｓ４を何回か繰
り返す。ただし、２回目の処理におけるステップｓ３
は、前回ステップｓ４で求めた重心ベクトル列を用い、
各クラスにおいて単語ごとの重心ベクトル列とそのクラ
ス内に存在する全ての話者の特徴ベクトル列とのＤＰ距
離を求め、そのＤＰ距離に基づいて、破線Ｌ２で示すよ
うに２分割し、２つのクラスＣＬ１２，ＣＬ２２を形成
する（第２回目の試行的なクラス分け）。Next, convergence is checked (step s5). This check determines whether repeating steps s3 and s4 has reached a stable result. The procedure does not converge after a single pass, so steps s3 and s4 are repeated until a stable result is reached. In the second pass, however, step s3 uses the centroid vector sequences obtained in the previous step s4: in each class, the DP distances between the per-word centroid vector sequences and the feature vector sequences of all speakers in that class are computed, and based on these DP distances the speakers are divided into two as indicated by the broken line L2, forming two classes CL12 and CL22 (the second trial classification).

【０１２９】このように、第２回目の試行的なクラス分
けを行うと、新たに形成されたクラスＣＬ１２，ＣＬ２
２に属する話者は第１回目のときとは多少異なったもの
となる。そして、各クラスにおいて、それぞれの単語ご
とに重心ベクトル列を求める（ステップｓ４）。ここで
は今、第２回目の試行的なクラス分けにより、２つのク
ラスＣＬ１２，ＣＬ２２が形成されたので、クラスＣＬ
１２，ＣＬ２２において、それぞれのクラスごとにそれ
ぞれの単語ごとの重心ベクトル列を求める。When the second trial classification is performed in this way, the speakers belonging to the newly formed classes CL12 and CL22 differ somewhat from those of the first pass. A centroid vector sequence is then obtained for each word in each class (step s4). Since the second trial classification has formed the two classes CL12 and CL22, the per-word centroid vector sequences are computed for each of these classes.

【０１３０】そして再び、収束したか否かを判定する
が、この収束したか否かの判定は次のようにして行う。Then, again, it is judged whether or not it has converged. The judgment as to whether or not it has converged is made as follows.

【０１３１】つまり、現在が第ｎ回目の試行的なクラス
分けの段階であるとすれば、この第ｎ回目のクラス分け
において形成された各クラス毎に、各単語の重心ベクト
ル列とそのクラスに存在する全話者の各単語の特徴ベク
トル列とのＤＰ距離の和を求める。そして、前回のクラ
ス分け（第ｎ−１回目の試行的なクラス分け）において
求めた各クラス毎の各単語の重心ベクトル列と、そのク
ラスに存在する全話者の各単語の特徴ベクトル列とのＤ
Ｐ距離の和の差分をとり、その差分が殆ど無くなったと
きを収束したと判断する。Suppose the current pass is the n-th trial classification. For each class formed in the n-th classification, the sum of the DP distances between the centroid vector sequence of each word and the feature vector sequences of that word for all speakers in the class is computed. The difference between this sum and the corresponding sum from the previous, (n-1)-th trial classification is then taken, and the procedure is judged to have converged when that difference has become almost zero.
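The convergence test compares the total DP distance of the current pass with that of the previous pass. The patent only says the difference should have "almost disappeared", so the relative tolerance `rel_tol` below is an assumed, illustrative criterion.

```python
def has_converged(prev_total, curr_total, rel_tol=1e-3):
    """True when the summed distance changed by at most rel_tol (relative)."""
    if prev_total is None:          # first pass: nothing to compare against
        return False
    return abs(prev_total - curr_total) <= rel_tol * max(prev_total, 1e-12)

first_pass = has_converged(None, 10.0)
still_moving = has_converged(10.0, 9.0)
settled = has_converged(10.0, 9.999)
```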

【０１３２】このようにして、たとえば、２回の処理に
より収束したとすると、その収束したときの話者クラス
が話者クラスとなり、この場合、破線Ｌ２により形成さ
れる話者クラスＣＬ１２，ＣＬ２２の２つの話者クラス
が作成されることになる（話者クラス数Ｎ＝２）。For example, if the procedure converges after two passes, the classes at convergence become the speaker classes; in this case, the two speaker classes CL12 and CL22 formed by the broken line L2 are created (number of speaker classes N = 2).

【０１３３】続いて、図１４のステップｓ６において、
話者クラス数が設定した話者クラス数（この場合、Ｎ＝
４としている）となったかどうかを判定し、設定した話
者クラス数となっていれば処理を終了するが、この場
合、設定した話者クラス数となっていないので、ステッ
プｓ２に処理が戻る。Next, step s6 of FIG. 14 checks whether the number of speaker classes has reached the preset number (N = 4 in this case). If it has, the processing ends; here it has not, so processing returns to step s2.

【０１３４】このステップｓ２における処理は、各クラ
ス毎に最も遠い話者間距離を有する２人の話者を選択す
る処理であり、この場合は、話者クラスＣＬ１２，ＣＬ
２２それぞれについて、最も遠い話者間距離を有する二
人の話者を選択する処理をおこなう。そして、前記同
様、ステップｓ３，ｓ４を行い、収束するまでその処理
を続ける。これにより、話者クラスＣＬ１２，ＣＬ２２
がそれぞれ２分割され、４つの話者クラスが作成される
ことになる。Step s2 selects, within each class, the two speakers with the largest inter-speaker distance; in this case, the two most distant speakers are selected in each of the speaker classes CL12 and CL22. Steps s3 and s4 are then performed as before and repeated until convergence. As a result, CL12 and CL22 are each divided into two, and four speaker classes are created.
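The whole loop of steps s1 to s6 can be sketched under a strong simplification: each speaker is reduced to a single fixed-length vector, the Euclidean distance stands in for the inter-speaker DP distance, and the component-wise mean stands in for the per-word centroid vector sequences. The tolerance, iteration cap, and function names are likewise assumptions, so this illustrates only the control flow, not the patent's exact computation.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean (stand-in for the centroid vector sequences)."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def split_class(points, tol=1e-9, max_iter=100):
    """Split one class in two (steps s2 to s5)."""
    # s2: the two mutually farthest members become the initial references.
    ref_a, ref_b = max(combinations(points, 2), key=lambda p: math.dist(*p))
    part_a, part_b, prev_total = list(points), [], math.inf
    for _ in range(max_iter):
        # s3: assign every member to the nearer reference.
        new_a = [p for p in points if math.dist(p, ref_a) <= math.dist(p, ref_b)]
        new_b = [p for p in points if math.dist(p, ref_a) > math.dist(p, ref_b)]
        if not new_a or not new_b:
            break                       # degenerate split: keep the last one
        part_a, part_b = new_a, new_b
        # s4: recompute the reference of each trial class.
        ref_a, ref_b = centroid(part_a), centroid(part_b)
        # s5: stop when the total distance no longer changes.
        total = (sum(math.dist(p, ref_a) for p in part_a)
                 + sum(math.dist(p, ref_b) for p in part_b))
        if abs(prev_total - total) <= tol:
            break
        prev_total = total
    return part_a, part_b

def classify(points, n_classes):
    """Repeat the split on every class until n_classes exist (s1, s6)."""
    classes = [list(points)]            # s1: start from one class
    while len(classes) < n_classes:     # s6: target number reached?
        next_classes = []
        for members in classes:
            if len(members) < 2:
                next_classes.append(members)
                continue
            next_classes.extend(part for part in split_class(members) if part)
        if len(next_classes) == len(classes):
            break                       # nothing could be split further
        classes = next_classes
    return classes

speakers = [(0.0,), (1.0,), (10.0,), (11.0,)]
two_classes = classify(speakers, 2)
four_classes = classify(speakers, 4)
```

Because every existing class is split on each round, this version yields 2, 4, 8, ... classes, matching the description of the procedure.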

【０１３５】以上のようにして話者クラスが作成される
が、たとえば、最初に作成された話者クラスＣＬ１２，
ＣＬ２２は、分割された結果を見れば、たとえば、ＣＬ
１２がほぼ男性話者に近い音声特徴データを有するクラ
ス、ＣＬ２２がほぼ女性話者に近い音声特徴データを有
するクラスというような話者クラスとなり、それぞれの
話者クラスがさらに２分割された話者クラスは、それぞ
れのクラスが音声特徴データに基づいてさらに細分化さ
れた話者クラスとなるということである。Speaker classes are created as described above. Looking at the result of the division, the initially created classes CL12 and CL22 turn out to be, for example, a class whose speech feature data is close to that of male speakers (CL12) and a class whose speech feature data is close to that of female speakers (CL22). The classes obtained by dividing each of these in two are further subdivisions based on the speech feature data.

【０１３６】前記第１、第２の実施の形態では、このよ
うに作成された話者クラスＣＬ１２に属する話者の音声
データを基にして、不特定話者コードブック２３ａや音
声認識に用いられる音声モデルを作成し、話者クラスＣ
Ｌ２２に属する音声データを基にして不特定話者コード
ブック２３ｂや音声認識に用いられる音声モデルを作成
する。In the first and second embodiments, the unspecified speaker codebook 23a and the speech model used for speech recognition are created from the speech data of the speakers belonging to the speaker class CL12 created in this way, and the unspecified speaker codebook 23b and its speech model are created from the speech data of the speakers belonging to the speaker class CL22.

【０１３７】なお、以上の説明では、話者クラス数は、
２個（Ｎ＝２）、２の２乗個（Ｎ＝４）、２の３乗個
（ｎ＝８）、・・・という数となるが、分割されて作成
された話者クラスに属する話者数が話者クラス間で大幅
に違いがあるような場合、圧倒的に話者数の多いクラス
のみを分割するという処理を行えば、話者クラス数を３
個、５個、６個、７個というような任意の数に分けるこ
とも可能である。In the above description, the number of speaker classes takes the values 2 (N = 2), 2 squared (N = 4), 2 cubed (N = 8), and so on. However, when the number of speakers differs greatly between the classes produced by a division, dividing only the class with the overwhelmingly larger number of speakers makes it possible to obtain an arbitrary number of speaker classes, such as 3, 5, 6, or 7.
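One way to reach a class count that is not a power of two is to split only the largest class on each round. The sketch below takes the two-way split as a callback; `halve` is a placeholder splitter used only to make the example runnable, not the DP-distance-based split of the patent.

```python
def grow_classes(classes, target, split):
    """Split the largest class repeatedly until `target` classes exist."""
    classes = [list(c) for c in classes]
    while len(classes) < target:
        largest = max(classes, key=len)
        if len(largest) < 2:
            break                       # nothing left that can be split
        classes.remove(largest)
        classes.extend(split(largest))
    return classes

def halve(members):
    # Placeholder two-way split: first half / second half.
    mid = len(members) // 2
    return members[:mid], members[mid:]

three = grow_classes([[1, 2, 3, 4, 5, 6], [7, 8]], 3, halve)
```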

【０１３８】このように第３の実施の形態では、多数の
不特定話者の音声特徴データを用いて、任意の数の話者
クラスにクラス分けしている。これに対して、音声の特
徴データに関係なく、単に人為的に男性だけの音声を集
めて男性話者の不特定話者コードブック、女性だけの音
声を集めて女性話者の不特定話者コードブックを作成し
たり、あるいは、成人男性だけの音声を集めて成人男性
話者の不特定話者のコードブック、成人女性だけの音声
を集めて成人女性話者の不特定話者のコードブック、子
どもだけの音声を集めて子どもの不特定話者のコードブ
ックというようにそれぞれのコードブックを作成する方
法もあるが、音声特徴データに関係なくそれぞれの階層
の話者を集めてそれぞれのコードブックを作成した場
合、成人女性でも子どもに近い音声データの女性もあ
り、また、子どもでも成人男性に近い音声データを持つ
子供もあるため、音声特徴データに基づいた正確な話者
クラス分けはできない。しかし、本発明による話者クラ
ス分けは、多数の不特定話者を、その音声特徴データに
基づいてクラス分けしているので、音声特徴データに基
づく高精度な話者クラス分けが可能となる。As described above, the third embodiment uses the speech feature data of a large number of unspecified speakers to divide them into an arbitrary number of speaker classes. An alternative would be to group speakers by hand without regard to their speech feature data: for example, collecting only male voices to create an unspecified speaker codebook for male speakers and only female voices for female speakers, or creating separate codebooks from the voices of adult men, adult women, and children. However, when the codebooks are created from such groups irrespective of the speech feature data, accurate classification by speech features is not possible, because some adult women have speech data close to that of children, and some children have speech data close to that of adult men. In contrast, the speaker classification of the present invention divides a large number of unspecified speakers according to their speech feature data, so highly accurate speaker classification based on the speech feature data becomes possible.

【０１３９】なお、以上説明した本発明の処理を行うた
めのプログラムはフロッピィディスクなどの記憶媒体に
記憶させておくことができ、本発明は、この記憶媒体を
も含むものである。The program for performing the processing of the present invention described above can be stored in a storage medium such as a floppy disk, and the present invention also includes this storage medium.

【０１４０】[0140]

【発明の効果】以上説明したように、本発明によれば、
入力話者の音声がどの話者クラスに属するかを判定し、
その判定結果に基づいて、判定された話者クラスに対応
した音声モデルを選択し、音声認識時には入力話者の音
声をその選択された前記音声モデルを用いて音声認識す
るようにしたので、簡単な処理および構成で、高い認識
率での音声認識が可能となる。As described above, according to the present invention, it is determined which speaker class the input speaker's voice belongs to, the speech model corresponding to the determined speaker class is selected on the basis of that result, and at recognition time the input speaker's voice is recognized using the selected speech model. Speech recognition with a high recognition rate is therefore possible with simple processing and a simple configuration.

【０１４１】また、本発明は、前記選択された話者クラ
スに対応する不特定話者コードブックを用いて入力話者
コードブックを作成して、この入力話者コードブックを
用いて話者適応するようにしたので、より一層、確かな
話者適応が可能となり、また、音声認識部では、入力話
者コードブックにより話者適応されたコードベクトル
を、前記選択された話者クラスに対応する音声モデルを
用いて音声認識処理を行うことで、より一層、高い認識
率での音声認識が可能となる。Further, in the present invention, an input speaker codebook is created using the unspecified speaker codebook corresponding to the selected speaker class, and speaker adaptation is performed using this input speaker codebook, which makes the adaptation more reliable. In addition, the speech recognition unit performs recognition on the code vectors adapted by the input speaker codebook, using the speech model corresponding to the selected speaker class, which makes an even higher recognition rate possible.

【０１４２】そして、入力話者の音声がどの話者クラス
に属するかの判定を、ＤＰ距離あるいはＤＲＮＮ出力を
用い、さらにその両方を組み合わせて行うことで、正確
なクラス判定が可能となり、特に、ＤＲＮＮ出力を用い
ることにより、入力音声の中に含まれる話者適応用の単
語をキーワードスポッティング処理により切り出すこと
も可能となり、入力話者の音声がどの話者クラスに属す
るかの判定を簡単な処理で高精度に行うことができる。Determining which speaker class the input speaker's voice belongs to by using the DP distance or the DRNN output, or a combination of both, enables accurate class determination. In particular, using the DRNN output also makes it possible to cut out the speaker adaptation words contained in the input speech by keyword spotting, so that the speaker class of the input speaker can be determined accurately with simple processing.

【０１４３】また、本発明における話者のクラス分けを
行う処理は、不特定多数の話者から得られたそれぞれの
話者における複数の単語ごとの特徴ベクトル列に対し、
各話者間でそれぞれの単語ごとにＤＰマッチング距離を
求め、その距離の和を当該話者間の距離とし、前記不特
定多数のそれぞれの話者間でそれぞれの話者間距離を求
め、この話者間距離を基にクラス分けを行うので、音声
特徴データに基づく高精度な話者クラス分けが可能とな
り、このようにクラス分けされた話者クラスに属する話
者から作成された不特定話者コードブックを用いて話者
適応することにより、的確な話者適応が可能となり、ま
た、クラス分けされた話者クラスに対応して作成された
音声モデルを用いて音声認識することにより、高い認識
率が得られる。Further, the speaker classification of the present invention takes the per-word feature vector sequences obtained from a large number of unspecified speakers, computes the DP matching distance for each word between each pair of speakers, takes the sum of these distances as the distance between the two speakers, and classifies the speakers on the basis of the inter-speaker distances obtained for all pairs. Highly accurate speaker classification based on speech feature data is therefore possible. Performing speaker adaptation with an unspecified speaker codebook created from the speakers of such a class enables accurate adaptation, and performing recognition with a speech model created for that class yields a high recognition rate.

[Brief description of drawings]

【図１】本発明の第１の実施の形態の説明する構成図。FIG. 1 is a configuration diagram illustrating a first embodiment of the present invention.

【図２】何人かの不特定話者による或る単語に対する特
徴ベクトル列から重心ベクトル列を求める例を説明する
図。FIG. 2 is a diagram for explaining an example of obtaining a center of gravity vector sequence from a feature vector sequence for a certain word by some unspecified speakers.

【図３】重心ベクトルを求める際に特徴ベクトル列の時
間的な長さを正規化する処理を説明する図。FIG. 3 is a diagram illustrating a process of normalizing the temporal length of a feature vector sequence when obtaining a center-of-gravity vector.

【図４】ＤＲＮＮ方式による単語検出処理を説明する
図。FIG. 4 is a diagram illustrating a word detection process according to the DRNN method.

【図５】第１の実施の形態において、話者クラス判定に
ＤＲＮＮ出力を用いる場合の構成図。FIG. 5 is a configuration diagram when a DRNN output is used for speaker class determination in the first embodiment.

【図６】ＤＲＮＮ出力による話者クラス判定を説明する
ための図。FIG. 6 is a diagram for explaining speaker class determination based on DRNN output.

【図７】本発明の第２の実施の形態を説明する構成図。FIG. 7 is a configuration diagram illustrating a second embodiment of the present invention.

【図８】第２の実施の形態において、選択された不特定
話者コードブック内のコードベクトルと重心ベクトルと
の対応付けを行い重心ベクトルを量子化する処理を説明
する図。FIG. 8 is a diagram illustrating a process of associating a code vector in a selected unspecified speaker codebook with a centroid vector and quantizing the centroid vector in the second embodiment.

【図９】量子化された重心ベクトル（重心コードベクト
ル）と入力話者特徴ベクトルとの対応付けを説明する
図。FIG. 9 is a diagram for explaining correspondence between a quantized centroid vector (centroid code vector) and an input speaker feature vector.

【図１０】差分ベクトルを用いて不特定話者コードブッ
クの学習済みコードベクトルを入力話者コードブックに
変換する処理を説明する図。FIG. 10 is a diagram illustrating a process of converting a learned code vector of an unspecified speaker codebook into an input speaker codebook using a difference vector.

【図１１】未学習コードベクトルの補間処理を説明する
図。FIG. 11 is a diagram for explaining interpolation processing of an unlearned code vector.

【図１２】第３の実施の形態を説明するための幾つかの
単語を何人かの話者が発話したときに得られた特徴ベク
トル列を示す図。FIG. 12 is a diagram showing a feature vector sequence obtained when some speakers speak some words for explaining the third embodiment.

【図１３】話者間距離を求めるために或る単語について
話者間のＤＰ距離を求める場合の説明図。FIG. 13 is an explanatory diagram for obtaining a DP distance between speakers for a certain word in order to obtain a distance between speakers.

【図１４】第３の実施の形態における処理を説明するフ
ローチャート。FIG. 14 is a flowchart illustrating processing according to the third embodiment.

【図１５】クラス分け処理を説明する図。FIG. 15 is a diagram illustrating classification processing.

【図１６】従来から行われている一般的なベクトル量子
化処理を説明する図。FIG. 16 is a diagram illustrating a general vector quantization process that has been conventionally performed.

[Explanation of symbols]

１音声入力部２話者適応化部３音声認識部１１マイクロホン１２Ａ／Ｄ変換部１３音声分析部２０ベクトル量子化部２１入力データ記憶部２２話者クラス判定処理部２３ａ第１の不特定話者コードブック２３ｂ第２の不特定話者コードブック２４ａ第１の重心ベクトル記憶部２４ｂ第２の重心ベクトル記憶部２５ＤＲＮＮ方式による単語検出部２６ａ第１のＤＲＮＮ音声モデル記憶部２６ｂ第２のＤＲＮＮ音声モデル記憶部２７入力話者コードブック３１ａ第１の音声モデル記憶部３１ｂ第２の音声モデル記憶部３２単語検出部３３音声認識処理部Ｃａ１，Ｃａ２，・・・Ａの話者の音声特徴ベクトルＣｂ１，Ｃｂ２，・・・Ｂの話者の音声特徴ベクトルＣｃ１，Ｃｃ２，・・・Ｃの話者の音声特徴ベクトルＣｄ１，Ｃｄ２，・・・Ｄの話者の音声特徴ベクトルＣｓ１，Ｃｓ２，・・・重心ベクトルＣｋ１，Ｃｋ２，・・・不特定話者コードブックのコ
ードベクトルＣｉ１，Ｃｉ２，・・・入力話者の音声特徴ベクトルＶ１，Ｖ２，・・・差分ベクトルＣｔ１，Ｃｔ２，・・・差分ベクトルを用いて変換さ
れたコードベクトルＣＬ１２，ＣＬ２２クラス分けされた話者クラス
1: voice input unit
2: speaker adaptation unit
3: voice recognition unit
11: microphone
12: A/D conversion unit
13: voice analysis unit
20: vector quantization unit
21: input data storage unit
22: speaker class determination processing unit
23a: first unspecified speaker codebook
23b: second unspecified speaker codebook
24a: first centroid vector storage unit
24b: second centroid vector storage unit
25: word detection unit using the DRNN method
26a: first DRNN speech model storage unit
26b: second DRNN speech model storage unit
27: input speaker codebook
31a: first speech model storage unit
31b: second speech model storage unit
32: word detection unit
33: speech recognition processing unit
Ca1, Ca2, ...: speech feature vectors of speaker A
Cb1, Cb2, ...: speech feature vectors of speaker B
Cc1, Cc2, ...: speech feature vectors of speaker C
Cd1, Cd2, ...: speech feature vectors of speaker D
Cs1, Cs2, ...: centroid vectors
Ck1, Ck2, ...: code vectors of the unspecified speaker codebook
Ci1, Ci2, ...: speech feature vectors of the input speaker
V1, V2, ...: difference vectors
Ct1, Ct2, ...: code vectors converted using the difference vectors
CL12, CL22: classified speaker classes

Claims

[Claims]

1. A speaker adaptation method, characterized in that data created from the speech feature vector sequences of the speakers belonging to each of a plurality of speaker classes, the classes having been formed on the basis of speech feature data obtained from a large number of unspecified speakers, is compared with the speech feature data of an input speaker; which speaker class the input speaker's voice belongs to is determined from the result of the comparison; a speech model for speech recognition corresponding to the determined speaker class is selected on the basis of the determination; and, at recognition time, the input speaker's voice is recognized using the selected speech model.

2. A plurality of speaker classes classified based on voice feature data obtained from an unspecified number of speakers,
Create an unspecified speaker codebook based on the voice feature data of the speakers belonging to each speaker class, and input the data created based on the voice feature vector sequence of the speakers belonging to each speaker class. Based on the result obtained by comparing with the voice feature data of the speaker, it is determined which speaker class the voice of the input speaker belongs to, and based on the determination result, it corresponds to the determined speaker class. A voice model for speech recognition corresponding to the unspecified speaker codebook and its speaker class is selected, an input speaker codebook is created based on the selected unspecified speaker codebook, and speech recognition is performed. Sometimes the voice of the input speaker is vector-quantized using the created input speaker codebook and the selected unspecified speaker codebook, and then passed to the voice recognition unit, where the voice recognition unit selects the selected voice. Speaker adaptation method, characterized by recognition processing using the speech model.

3. The voice of the input speaker is obtained from the result obtained by comparing the data produced based on the voice feature vector sequence of the speaker belonging to each speaker class with the voice feature data of the input speaker. The process of determining which speaker class
Obtaining the centroid vector sequence for each word for speaker adaptation from the speech feature vector sequence of the speaker belonging to each speaker class, the centroid vector sequence of each word and the input speaker's speech for each speaker class 3. The speaker adaptation method according to claim 1, wherein the distance to the feature vector sequence is obtained by DP matching, and which speaker class the voice of the input speaker belongs to is determined based on the distance.

4. The speaker adaptation method according to claim 1 or 2, wherein the process of determining which speaker class the input speaker's voice belongs to, from the result obtained by comparing the data created from the speech feature vector sequences of the speakers belonging to each speaker class with the speech feature data of the input speaker, comprises: creating a speech model for each speaker class, by the method called Dynamic Recurrent Neural Networks (DRNN), from the speech feature vector sequences of the speakers belonging to that class; outputting, from the speech feature vector sequence of the input speaker and the speech models, a numerical value indicating the likelihood that a predetermined word is present; and determining which speaker class the input speaker's voice belongs to on the basis of that value.

5. The voice of the input speaker is obtained based on the result obtained by comparing the data produced based on the voice feature vector sequence of the speaker belonging to each speaker class with the voice feature data of the input speaker. The process of determining which speaker class
Obtaining the centroid vector sequence for each word for speaker adaptation from the speech feature vector sequence of the speaker belonging to each speaker class, the centroid vector sequence of each word and the input speaker's speech for each speaker class The distance to the feature vector sequence is obtained by DP matching, and a voice model corresponding to each speaker class by the DRNN method is created based on the voice feature vector sequence of the speaker belonging to each speaker class, and the input speaker From the voice feature vector sequence and the voice model, a numerical value indicating the certainty of the existence of a predetermined word is obtained, and the calculated DP matching distance and the numerical value indicating the certainty of the existence of the predetermined word are used as the basis. 3. The speaker adaptation method according to claim 1, further comprising determining which speaker class the voice of the input speaker belongs to.

6. The speaker adaptation method according to claim 1, wherein the process of classifying into a plurality of speaker classes based on the voice feature data obtained from the large number of unspecified speakers comprises: obtaining, for the feature vector sequences of a plurality of words of each speaker obtained from the large number of unspecified speakers, the DP matching distance for each word between speakers; taking the sum of these distances as the inter-speaker distance; obtaining the inter-speaker distances among the large number of unspecified speakers; and performing the classification based on the inter-speaker distances.
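
The inter-speaker distance of claim 6 can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names, the Euclidean frame distance, the symmetric DP step pattern, and the two-class split seeded by the farthest pair of speakers are all assumptions chosen for illustration. The claim itself only requires that the per-word DP matching distances be summed and that the classification be based on the resulting inter-speaker distances.

```python
import math

def dp_matching_distance(a, b):
    """DP matching (dynamic time warping) distance between two feature vector sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance when aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumption)
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def inter_speaker_distance(words_a, words_b):
    """Sum of the per-word DP matching distances between two speakers."""
    return sum(dp_matching_distance(words_a[w], words_b[w]) for w in words_a)

def distance_matrix(speakers):
    """Inter-speaker distance for every ordered pair of speakers."""
    names = sorted(speakers)
    return {(p, q): inter_speaker_distance(speakers[p], speakers[q])
            for p in names for q in names}

def split_into_two_classes(speakers):
    """Hypothetical classification rule: seed two classes with the farthest
    pair of speakers and assign every speaker to the nearer seed."""
    dist = distance_matrix(speakers)
    seed_a, seed_b = max(dist, key=dist.get)
    classes = {seed_a: [], seed_b: []}
    for name in sorted(speakers):
        nearer = seed_a if dist[(name, seed_a)] <= dist[(name, seed_b)] else seed_b
        classes[nearer].append(name)
    return classes
```

Here `speakers` is assumed to map a speaker name to a dictionary from word to a list of feature vectors (tuples of floats), with every speaker having uttered the same words.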

7. A speaker adaptation device comprising: feature data storing means for storing data created based on the voice feature vector sequences of the speakers belonging to each of a plurality of speaker classes classified based on voice feature data obtained from a large number of unspecified speakers; input data storing means for storing the feature vector sequence of the voice of an input speaker; and a speaker class determination processing unit that determines which speaker class the voice of the input speaker belongs to, based on the result of comparing the voice feature data for a certain word of the input speaker with the feature data for that word stored in the feature data storing means, and that selects, based on the determination result, a voice model for speech recognition corresponding to the determined speaker class; wherein, at the time of speech recognition, the voice of the input speaker is vector-quantized using the selected voice model and then passed to a speech recognition unit.

8. A speaker adaptation device comprising: a plurality of speaker classes classified based on voice feature data obtained from a large number of unspecified speakers; an unspecified speaker codebook created based on the voice feature data of the speakers belonging to each speaker class; a speaker class determination processing unit that determines which speaker class the voice of an input speaker belongs to, based on the result of comparing data created from the voice feature vector sequences of the speakers belonging to each speaker class with the voice feature data of the input speaker, and that selects, based on the determination result, the unspecified speaker codebook corresponding to the determined speaker class and a voice model for speech recognition corresponding to that speaker class; and an input speaker codebook created based on the unspecified speaker codebook selected by the speaker class determination processing unit; wherein, at the time of speech recognition, the voice of the input speaker is vector-quantized using the created input speaker codebook and the selected unspecified speaker codebook and then passed to a speech recognition unit, and the speech recognition unit performs recognition processing using the selected voice model.
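
As a rough illustration of the vector quantization step named in claim 8, the following Python sketch maps each feature vector of an utterance to the index of the nearest code vector in a single codebook. The function name, the Euclidean distance, and the data layout are assumptions. The claim does not spell out how the input speaker codebook and the selected unspecified speaker codebook are used together, so the sketch deliberately shows only quantization against one codebook.

```python
import math

def vector_quantize(frames, codebook):
    """Return, for each feature vector in `frames`, the index of the
    nearest code vector in `codebook` (Euclidean distance, an assumption)."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))
        for frame in frames
    ]
```

The resulting index sequence is what would be handed to the recognition stage in place of the raw feature vectors.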

9. The speaker adaptation device according to claim 7, wherein the process of determining which speaker class the voice of the input speaker belongs to, based on the result of comparing data created from the voice feature vector sequences of the speakers belonging to each speaker class with the voice feature data of the input speaker, comprises: obtaining a centroid vector sequence for each speaker adaptation word from the voice feature vector sequences of the speakers belonging to each speaker class; obtaining, for each speaker class, the distance between the centroid vector sequence of each word and the feature vector sequence of the voice of the input speaker by DP matching; and determining which speaker class the voice of the input speaker belongs to based on the distance.
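
The determination described in claim 9 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the function names, the Euclidean frame distance, summing the DP matching distances over all adaptation words, and choosing the class with the smallest total are assumptions, since the claim only states that the class is determined based on the distance.

```python
import math

def dp_matching_distance(ref, inp):
    """DP matching (dynamic time warping) distance between a centroid
    vector sequence `ref` and an input feature vector sequence `inp`."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance when aligning ref[:i] with inp[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(ref[i - 1], inp[j - 1])  # Euclidean (assumption)
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def determine_speaker_class(class_centroids, input_words):
    """Pick the speaker class whose per-word centroid vector sequences are
    closest, in total DP matching distance, to the input speaker's utterances.

    class_centroids: {class_name: {word: [feature vectors]}}
    input_words:     {word: [feature vectors]} for the adaptation words
    """
    totals = {
        cls: sum(dp_matching_distance(centroids[w], input_words[w]) for w in input_words)
        for cls, centroids in class_centroids.items()
    }
    best = min(totals, key=totals.get)  # smallest total distance (assumption)
    return best, totals
```

In a device along the lines of the claims, the returned class name would then be used to select the corresponding codebook and voice model.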

10. The speaker adaptation device according to claim 7 or 8, wherein the process of determining which speaker class the voice of the input speaker belongs to, based on the result of comparing data created from the voice feature vector sequences of the speakers belonging to each speaker class with the voice feature data of the input speaker, comprises: creating a voice model corresponding to each speaker class by the DRNN method, based on the voice feature vector sequences of the speakers belonging to each speaker class; outputting, from the voice feature vector sequence of the input speaker and the voice model, a numerical value indicating the certainty of the existence of a predetermined word; and determining which speaker class the voice of the input speaker belongs to based on the numerical value indicating the certainty.

11. The speaker adaptation device according to claim 7, wherein the process of determining which speaker class the voice of the input speaker belongs to, based on the result of comparing data created from the voice feature vector sequences of the speakers belonging to each speaker class with the voice feature data of the input speaker, comprises: obtaining a centroid vector sequence for each speaker adaptation word from the voice feature vector sequences of the speakers belonging to each speaker class; obtaining, for each speaker class, the distance between the centroid vector sequence of each word and the feature vector sequence of the voice of the input speaker by DP matching; creating a voice model corresponding to each speaker class by the DRNN method, based on the voice feature vector sequences of the speakers belonging to each speaker class; obtaining, from the voice feature vector sequence of the input speaker and the voice model, a numerical value indicating the certainty of the existence of a predetermined word; and determining which speaker class the voice of the input speaker belongs to based on the obtained DP matching distance and the numerical value indicating the certainty of the existence of the predetermined word.

12. The speaker adaptation device according to any one of claims 7 to 11, wherein the process of classifying into a plurality of speaker classes based on the voice feature data obtained from the large number of unspecified speakers comprises: obtaining, for the feature vector sequences of a plurality of words of each speaker obtained from the large number of unspecified speakers, the DP matching distance for each word between speakers; taking the sum of these distances as the inter-speaker distance; obtaining the inter-speaker distances among the large number of unspecified speakers; and performing the classification based on the inter-speaker distances.