JP2003044078A

JP2003044078A - Voice recognizing device using uttering speed normalization analysis

Info

Publication number: JP2003044078A
Application number: JP2001229310A
Authority: JP
Inventors: Riyouko Imai; 亮子今井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2001-07-30
Filing date: 2001-07-30
Publication date: 2003-02-14
Anticipated expiration: 2021-07-30
Also published as: CN1399941A; JP4666129B2; CN1236728C

Abstract

PROBLEM TO BE SOLVED: To improve the recognition rate by controlling the analysis method according to the utterance speed of the input speech, so that features suited to that utterance speed are extracted. SOLUTION: For speech input from speech input means 1 and 6, utterance speed output means 2 and 7 output the utterance speed of the input speech; analysis control parameter calculations 3 and 8 determine an analysis control parameter from that utterance speed; and feature extraction means 4 and 9 extract the features of the input speech according to that analysis control parameter.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to speech recognition, and in particular to speech recognition that uses utterance speed normalization in speech analysis.

[0002]

[Prior Art] A method conventionally used in speech recognition to reduce the effect of variations in utterance speed and improve the recognition rate is described below, taking HMM-based speech recognition as an example.

[0003] First, to examine the properties of speech, the speech waveform is analyzed to extract features related to the frequency spectrum. Specifically, a time window is applied to the speech waveform and a discrete Fourier transform is computed with a fast Fourier transform algorithm to obtain the short-time spectrum. The speech segment cut out by the time window is called a frame, and the period at which this segment is shifted is called the frame period (interval). In general, the time window is about 20 ms long and the frame period (interval) is constant.
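The framing step in paragraph [0003] can be sketched in a few lines of Python. This is an illustrative sketch only: the 10 ms frame period, the Hamming window, the function names, and the naive DFT (standing in for an FFT) are assumptions, since the text specifies only a window of about 20 ms and a constant frame period.

```python
import cmath
import math


def frame_signal(samples, sample_rate, window_ms=20.0, period_ms=10.0):
    """Cut a waveform into Hamming-windowed frames at a constant frame period.

    window_ms follows the "about 20 ms" in the text; period_ms is an assumed value.
    """
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * period_ms / 1000.0))
    hamming = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([samples[start + n] * hamming[n] for n in range(win)])
    return frames


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitude for bins 0..N/2; an FFT yields the same values faster."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]
```

With a constant frame period, the number of frames per phoneme depends directly on how fast the phoneme is spoken, which is the limitation the invention addresses later in the text.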

[0004] The short-time spectrum of speech can be decomposed into a component that changes slowly with frequency (the spectral envelope) and a component that changes finely with frequency (the spectral fine structure). Examining these captures features contained in the speech, such as the resonance and anti-resonance characteristics of the vocal tract and the periodicity of the sound source. In general, the feature vectors used for speech recognition have about 40 dimensions.

[0005] In HMM speech recognition, the time series of speech feature vectors is regarded as varying stochastically, and each word or phoneme is represented by a stochastic state transition model (HMM) for recognition. FIG. 8 shows an example of an HMM. As shown in FIG. 8, an HMM is represented by states and the transitions between them. Each arc is assigned an output (occurrence) probability and a transition probability, and these probabilities express the symbols corresponding to the speech and their temporal variation probabilistically. To prepare an HMM with a good recognition rate, "training" is performed: using training data that pairs a statistically sufficient amount of speech feature vector time series with the corresponding symbol sequences, the probability parameters of the HMM are estimated so that it generates those symbol sequences with the highest probability.

[0006] At recognition time, as many HMMs are prepared as there are words or phonemes to be recognized. For every HMM, the probability that it outputs the symbol sequence of the input speech is calculated, and the word or phoneme of the HMM with the highest probability is taken as the recognition result.
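The recognition rule in paragraph [0006] can be illustrated with the forward algorithm for a discrete HMM. This is a sketch under assumptions: output probabilities are attached to states here for brevity (the text assigns them to arcs), and the function names and the `(initial, transition, emission)` model layout are hypothetical.

```python
def forward_probability(symbols, initial, transition, emission):
    """Probability that a discrete HMM outputs `symbols` (forward algorithm).

    initial[s]       : probability of starting in state s
    transition[p][s] : probability of moving from state p to state s
    emission[s][k]   : probability of outputting symbol k in state s (assumed layout)
    """
    n_states = len(initial)
    alpha = [initial[s] * emission[s][symbols[0]] for s in range(n_states)]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(symbols, models):
    """Return the label of the HMM that outputs `symbols` with the highest probability."""
    return max(models, key=lambda label: forward_probability(symbols, *models[label]))
```

Here `models` maps each word or phoneme label to its `(initial, transition, emission)` tuple, so one HMM is scored per candidate, as the paragraph describes.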

[0007] Because the probability parameters of an HMM are estimated from training data, recognition performance was poor when the utterance speed of the input speech was far from the distribution of utterance speeds in the training data. To address this, a method has been proposed that measures the utterance speed of the input speech and improves recognition performance by restricting the duration of the speech symbol sequence based on that utterance speed.

[0008] An example of a conventional speech recognition device is described in Japanese Unexamined Patent Publication No. H2-113298. Its configuration is shown in FIG. 9. In FIG. 9, the utterance speed detection unit 101 detects the utterance speed 103 from the peaks of the speech power outline spectrum. The speech symbolization unit 102 analyzes the input speech and outputs a symbol sequence 104 encoded by vector quantization. The HMM-based speech recognition unit 105 takes as input the utterance speed 103 output by the utterance speed detection unit 101 and the symbol sequence 104 output by the speech symbolization unit 102. While using the utterance speed to restrict the duration of the HMM states, it calculates the probability of the symbol sequence 104 for each word HMM in the word HMM database 106 and outputs the word with the highest probability as the recognition result 108.

[0009]

[Problems to Be Solved by the Invention] However, the conventional device described above cannot extract features in a way that suits the utterance speed of the speech. This is because conventional speech recognition devices use a constant frame interval, or an integer multiple of it, for speech analysis.

[0010] Accordingly, a technical object of the present invention is to provide a speech recognition device that can extract features appropriately for the utterance speed of the input speech.

[0011] Another technical object of the present invention is to provide a speech recognition device that can reduce processing time.

[0012]

[Means for Solving the Problems] According to the present invention, there is provided a speech recognition device characterized in that at least one of the training speech and the recognition speech is analyzed with an analysis frame interval that is expanded or contracted based on utterance speed.

[0013] Further, according to the present invention, there is provided the above speech recognition device, characterized in that the training speech is analyzed with an analysis frame interval expanded or contracted based on utterance speed, and the recognition speech is analyzed.

[0014] Further, according to the present invention, there is provided the above speech recognition device, characterized in that the recognition speech is analyzed with an analysis frame interval expanded or contracted based on utterance speed, and the training speech is analyzed.

[0015] Further, according to the present invention, there is provided the above speech recognition device, characterized in that both the training speech and the recognition speech are analyzed with an analysis frame interval expanded or contracted based on utterance speed.

[0016] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the utterance speeds of both the training speech and the recognition speech are used to calculate the analysis frame interval with which the recognition speech is analyzed when that interval is expanded or contracted based on utterance speed.

[0017] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the utterance speed of the recognition speech alone is used to calculate the analysis frame interval with which the recognition speech is analyzed when that interval is expanded or contracted based on utterance speed.

[0018] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the input speech is analyzed to extract features, an utterance speed is calculated by time-aligning the extracted features with standard patterns prepared in advance, and the input speech is analyzed with an analysis frame interval expanded or contracted based on that utterance speed.

[0019] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that a speaker is asked to utter a prescribed sentence prepared in advance, an utterance speed is calculated from the number of syllables in that sentence and the utterance duration, and the input speech is analyzed with an analysis frame interval expanded or contracted based on that utterance speed.

[0020] Further, according to the present invention, there is provided the above speech recognition device, characterized in that, when the recognition speech is analyzed with an analysis frame interval expanded or contracted based on utterance speed, an analysis frame interval calculated from the utterance speed of the immediately preceding utterance is used.

[0021] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the average utterance speed of each speaker is used to calculate the analysis frame interval with which the input speech is analyzed when that interval is expanded or contracted based on utterance speed.

[0022] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the average utterance speed of each utterance is used to calculate the analysis frame interval with which the input speech is analyzed when that interval is expanded or contracted based on utterance speed.

[0023] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that, in calculating the analysis frame interval with which the input speech is analyzed when that interval is expanded or contracted based on utterance speed, the utterance speed of vowels is used for vowels and the utterance speed of consonants is used for consonants.

[0024] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that the analysis frame interval with which the input speech is analyzed, when that interval is expanded or contracted based on utterance speed, is calculated separately for stationary parts and non-stationary parts.

[0025] Further, according to the present invention, there is provided any one of the above speech recognition devices, characterized in that whether the input speech is analyzed with an analysis frame interval expanded or contracted based on utterance speed is controlled by a threshold on the utterance speed.

[0026]

[Embodiments of the Invention] Next, embodiments of the present invention are described in detail with reference to the drawings.

[0027] FIG. 1 is a block diagram showing the configuration of a speech recognition device according to the first embodiment of the present invention. Referring to FIG. 1, the speech recognition device 10 according to the first embodiment includes: training speech input means 1 for inputting the speech used for acoustic model training; utterance speed output means 2 for the training speech; analysis control parameter calculation (functional unit) 3, which calculates the analysis frame interval based on the utterance speed; feature extraction means 4; recognition speech input means 6; utterance speed output means 7 for the recognition speech; analysis control parameter calculation (functional unit) 8, which calculates the analysis frame interval based on the utterance speed; feature extraction means 9; acoustic model training means 5; recognition means 11; and recognition result output means 12.

[0028] The analysis control parameter calculations 3 and 8 take the outputs of the utterance speed output means 2 and 7, respectively, as input and calculate an analysis frame interval based on the utterance speed. This interval is the analysis control parameter with which the feature extraction means 4 and 9 extract features suited to the utterance speed. The feature extraction means 4 and 9 take the analysis control parameter and the input speech as input, analyze the input speech according to the analysis control parameter, and extract features.

[0029] The acoustic model training means 5 estimates the parameters of the acoustic model using the training speech features output by the feature extraction means 4, and creates the patterns that are referenced in the acoustic processing part of speech recognition.

[0030] The recognition means 11 performs matching using as input the acoustic model created by the acoustic model training means 5, the recognition speech features output by the feature extraction means 9, and so on. The most probable candidate is output as the recognition result by the recognition result output means 12.

[0031] The operation and effects of the first embodiment of the present invention are now described.

[0032] In this embodiment, the utterance speed of the input speech is used to calculate the analysis control parameter, and analysis is performed according to that parameter, so that features suited to the utterance speed can be extracted. The recognition rate improves because the recognition means receives an acoustic model trained on training speech features obtained in this way, together with recognition speech features obtained in the same way.

[0033] Next, a second embodiment of the present invention is described in detail with reference to the drawings.

[0034] FIG. 2 is a block diagram showing the configuration of a speech recognition device according to the second embodiment of the present invention.

[0035] Referring to FIG. 2, the speech recognition device 20 according to the second embodiment differs from the recognition speech processing unit of the first embodiment shown in FIG. 1 in that the analysis control parameter calculation unit 8, which is part of the recognition speech processing unit, also takes as input the output of the utterance speed output means 2 for the training speech.

[0036] Next, the operation and effects of the second embodiment of the present invention are described.

[0037] In the second embodiment, in addition to the effects of the first embodiment, feature extraction for the recognition speech uses information from the training speech as well as from the recognition speech. The recognition speech can therefore be analyzed in a way that also matches the conditions under which the acoustic model was created.

[0038] Next, a third embodiment of the present invention is described in detail with reference to the drawings.

[0039] FIG. 3 is a block diagram showing the configuration of a speech recognition device according to the third embodiment of the present invention.

[0040] Referring to FIG. 3, the speech recognition device 30 according to the third embodiment differs in that its training speech processing unit does not contain the utterance speed output means 2 and the analysis control parameter calculation 3 that form part of the training speech processing unit in the first embodiment shown in FIG. 1. As a result, the feature extraction means 4 for the training speech performs conventional analysis.

[0041] Next, the effects of the third embodiment of the present invention are described.

[0042] In the third embodiment, compared with the first embodiment, the processing in the training speech processing unit is the same as in the conventional method. Utterance speed measurement and analysis parameter calculation are not needed there, so processing time can be reduced.

[0043] Next, a fourth embodiment of the present invention is described in detail with reference to the drawings.

[0044] FIG. 4 is a block diagram showing the configuration of a speech recognition device according to the fourth embodiment of the present invention.

[0045] Referring to FIG. 4, the speech recognition device 40 according to the fourth embodiment differs in that its training speech processing unit does not contain the analysis control parameter calculation 3 that forms part of the training speech processing unit in the second embodiment shown in FIG. 2. As a result, the feature extraction means 4 for the training speech performs conventional analysis.

[0046] Next, the operation and effects of the fourth embodiment of the present invention are described.

[0047] In the fourth embodiment, compared with the second embodiment, the training speech processing unit does not need to calculate analysis parameters, so processing time can be reduced.

[0048] Next, a fifth embodiment of the present invention is described in detail with reference to the drawings.

[0049] FIG. 5 is a block diagram showing the configuration of a speech recognition device according to the fifth embodiment of the present invention.

[0050] Referring to FIG. 5, the speech recognition device 50 according to the fifth embodiment differs in that its recognition speech processing unit does not contain the utterance speed output means 7 and the analysis control parameter calculation 8 that form part of the recognition speech processing unit in the first embodiment shown in FIG. 1. As a result, the feature extraction means 9 for the recognition speech performs conventional analysis.

[0051] Next, the operation and effects of the fifth embodiment of the present invention are described.

[0052] In the fifth embodiment, compared with the first embodiment, the processing in the recognition speech processing unit, which consists of the recognition speech input means 6 and the feature extraction means 9, is the same as in the conventional method. Utterance speed measurement and analysis parameter calculation are not needed there, so processing time can be reduced.

[0053] Next, a sixth embodiment of the present invention is described in detail with reference to the drawings.

[0054] FIG. 6 is a block diagram showing the configuration of a speech recognition device according to the sixth embodiment of the present invention.

[0055] Referring to FIG. 6, the speech recognition device 60 according to the sixth embodiment differs in that its recognition speech processing unit has parameter storage/readout means 13 in addition to the configuration of the recognition speech processing unit in the first embodiment shown in FIG. 1.

[0056] The parameter storage/readout means 13 reads out the analysis control parameter that was calculated from the immediately preceding utterance and held, and stores the analysis control parameter calculated from the current utterance for use in feature extraction of the next utterance.

[0057] The feature extraction means 9 takes as input the analysis control parameter read out by the parameter storage/readout means 13, which was calculated from the immediately preceding utterance, and the current recognition speech input to the recognition speech input means 6. It extracts features of the recognition speech according to that analysis control parameter.
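The read-then-store behaviour of the parameter storage/readout means 13 can be sketched as a one-slot cache. This is a minimal illustration: the class and function names are hypothetical, and the default interval used before any utterance has been seen is an assumption, since the text does not say what the first utterance uses.

```python
class ParameterStore:
    """One-slot store for the analysis control parameter (frame interval in ms)."""

    def __init__(self, default_interval_ms):
        # Assumed fallback for the very first utterance.
        self._interval_ms = default_interval_ms

    def read(self):
        return self._interval_ms

    def store(self, interval_ms):
        self._interval_ms = interval_ms


def process_utterance(store, compute_interval_ms, utterance):
    """Return the interval to use for this utterance and cache one for the next.

    compute_interval_ms is a caller-supplied function (hypothetical) that derives
    an interval from the current utterance's speed.
    """
    interval_for_now = store.read()               # computed from the preceding utterance
    store.store(compute_interval_ms(utterance))   # held for the next utterance
    return interval_for_now
```

Because feature extraction of the current utterance only needs the stored value, it does not have to wait for the speed measurement of that same utterance.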

[0058] Next, the operation and effects of the sixth embodiment of the present invention are described.

[0059] In the sixth embodiment, features of the current recognition speech are extracted using the analysis control parameter calculated from the immediately preceding utterance. This reduces the processing time required for utterance speed measurement and analysis control parameter calculation.

[0060] Next, a specific example of the present invention is described with reference to FIG. 7(a) and FIG. 7(b). This specific example corresponds to the second embodiment of the present invention.

[0061] With this method, the utterance speed of the training speech and the recognition speech is measured, and features suited to the utterance speed can be extracted while the analysis control parameter is changed freely. Two cases are taken as examples here: the case where, as shown in FIG. 7(a), the utterance speed output means uses preliminary feature extraction means 21 and utterance speed calculation means 22, and the case where, as shown in FIG. 7(b), it uses prescribed sentence utterance duration measurement (functional unit) 25 and utterance speed calculation means 26.

[0062] First, the method of extracting features from the training speech is described. The training speech input from a microphone by the training speech input means 1 is passed to both the utterance speed output means 2 and the feature extraction means 4.

[0063] As a first example, the case where the utterance speed output means 2 consists of the preliminary feature extraction means 21 and the utterance speed calculation means 22, as shown in FIG. 7(a), is described.

[0064] The preliminary feature extraction means 21 extracts the features used to measure the utterance speed of the input speech. It does so by applying conventional analysis to the input speech. The utterance speed calculation means 22 time-aligns the features extracted by the preliminary feature extraction means 21 with the HMMs of the individual phonemes, which are prepared in advance, and calculates and outputs the utterance speed of the input speech. Examples of the unit over which the utterance speed is calculated include one utterance, a vowel part, a consonant part, a stationary part, and a non-stationary part.
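Once the time alignment has produced phoneme boundaries, per-unit utterance speeds follow from simple counting. The sketch below assumes the alignment is already available as `(phoneme, start_sec, end_sec)` tuples, measures speed in phonemes per second, and uses a five-vowel set; all three are illustrative assumptions, not details given in the text.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory for illustration


def utterance_speeds(alignment):
    """Compute utterance speeds from a phoneme-level time alignment.

    alignment: list of (phoneme, start_sec, end_sec) tuples.
    Returns phonemes per second for the whole utterance, vowels, and consonants.
    """
    def rate(segments):
        total = sum(end - start for _, start, end in segments)
        return len(segments) / total if total > 0 else 0.0

    vowels = [seg for seg in alignment if seg[0] in VOWELS]
    consonants = [seg for seg in alignment if seg[0] not in VOWELS]
    return {
        "utterance": rate(alignment),
        "vowel": rate(vowels),
        "consonant": rate(consonants),
    }
```

Stationary and non-stationary parts could be handled the same way with a different segment label.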

[0065] As another example, the case where the utterance speed output means 2 consists of the prescribed sentence utterance duration measurement (functional unit) 25 and the utterance speed calculation means 26, as shown in FIG. 7(b), is described. The prescribed sentence utterance duration measurement 25 takes as input speech in which a prescribed sentence prepared in advance is uttered, and measures the time the utterance takes. The utterance speed calculation means 26 calculates and outputs the utterance speed of the input speech from the number of phonemes in the prescribed sentence and the utterance duration measured by the prescribed sentence utterance duration measurement 25.
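The prescribed-sentence variant reduces to a division. In this sketch the duration is taken from the sample count of the recorded utterance, and the result is in phonemes per second; the function name, the unit, and the use of the raw recording length (without trimming silence) are assumptions.

```python
def speed_from_prescribed_sentence(phoneme_count, n_samples, sample_rate):
    """Utterance speed (phonemes per second) for a known sentence.

    phoneme_count : number of phonemes in the prescribed sentence (known in advance)
    n_samples     : length of the recorded utterance in samples
    sample_rate   : sampling rate in Hz
    """
    if sample_rate <= 0 or n_samples <= 0:
        raise ValueError("n_samples and sample_rate must be positive")
    duration_sec = n_samples / sample_rate
    return phoneme_count / duration_sec
```

Because the sentence content is fixed, no recognition or alignment pass is needed to obtain the count.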

[0066] Referring again to FIG. 2, the analysis control parameter calculation 3 takes as input the utterance speed output by the utterance speed output means 2 and calculates an analysis frame interval based on that utterance speed. This interval is the analysis control parameter with which the feature extraction means 4 extracts features suited to the utterance speed of the input speech. An example of calculating the analysis frame interval is given in Equation 1 below.

[0067]

[Equation 1]

Here, I is the analysis frame interval used by the feature extraction means 4; Io is the constant analysis frame interval used in ordinary analysis; C is a coefficient calculated from the utterance speed output by the utterance speed output means 2, for example the average utterance speed over all utterances; and R is the utterance speed of the unit being normalized, for example one utterance, a vowel part, a consonant part, a stationary part, or a non-stationary part. s acts as a switch between analysis methods. If s is 0, I equals Io and the analysis is the same as the conventional method. If s is 1, I is the analysis frame interval normalized by the utterance speed, and utterance speed normalization analysis is performed. So that the method does not reduce to the conventional analysis, s is never 0 throughout all utterances. By evaluating Equation 1 for the unit being normalized, an analysis frame interval can be obtained that uses the vowel utterance speed for vowels, the consonant utterance speed for consonants, the stationary-part utterance speed for stationary parts, and the non-stationary-part utterance speed for non-stationary parts. The value of s may, for example, be set to 1 throughout all utterances so that utterance speed normalization analysis is applied to every utterance. Alternatively, a threshold on utterance speed may be set: if the utterance speed of the input utterance is faster than the threshold, s is set to 1 and utterance speed normalization analysis is performed; if it is slower, s is set to 0 and conventional analysis is performed.
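The body of Equation 1 is not reproduced in this text, so its exact form is unknown. The sketch below uses one form that is consistent with the stated definitions, namely I = Io when s = 0 and I = Io * C / R when s = 1. This form is an assumption, as are the function names and the handling of an utterance speed exactly equal to the threshold.

```python
def switch_value(utterance_speed, threshold=None):
    """Choose s: always 1 when no threshold is given, otherwise 1 only for fast speech.

    The text covers only "faster" and "slower" than the threshold; treating
    equality as 0 is an assumption.
    """
    if threshold is None:
        return 1
    return 1 if utterance_speed > threshold else 0


def analysis_frame_interval(base_interval_ms, c, r, s):
    """Assumed form of Equation 1.

    base_interval_ms : Io, the constant interval of ordinary analysis
    c                : C, e.g. the average utterance speed over all utterances
    r                : R, the utterance speed of the unit being normalized
    s                : 0 for conventional analysis, 1 for normalized analysis
    """
    if s not in (0, 1):
        raise ValueError("s is a 0/1 switch")
    if s == 0:
        return base_interval_ms
    if r <= 0:
        raise ValueError("r must be positive")
    return base_interval_ms * c / r
```

Under this assumed form, a unit spoken faster than average (R greater than C) gets a shorter interval, so it is covered by roughly as many frames as the same unit spoken at average speed.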

【００６８】特徴抽出手段４は、分析パラメータ計算
（機能部）３の出力した分析制御パラメータと学習用音
声入力手段１で入力された学習用音声を入力とし、分析
制御パラメータに従って学習用音声の特徴抽出を行う。The feature extraction means 4 takes as input the analysis control parameter output by the analysis parameter calculation (functional unit) 3 and the learning voice input through the learning voice input means 1, and extracts features from the learning voice according to the analysis control parameter.
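To illustrate what extracting features according to the analysis control parameter could mean in practice, the sketch below cuts a sampled waveform into frames whose spacing is the supplied analysis frame interval rather than a fixed value. The 20 ms default window, the function name, and the use of raw sample frames in place of spectral features are assumptions made for illustration only.

```python
def split_into_frames(samples, sample_rate_hz, frame_interval_ms, window_ms=20.0):
    """Cut `samples` into windows of `window_ms`, advancing by `frame_interval_ms`."""
    # Convert durations in milliseconds to sample counts.
    window = int(round(sample_rate_hz * window_ms / 1000.0))
    hop = max(1, int(round(sample_rate_hz * frame_interval_ms / 1000.0)))
    frames = []
    start = 0
    # Keep only complete windows; a trailing partial window is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames
```

A shorter frame interval yields more frames for the same stretch of audio, which is how a change in the interval changes the time resolution of the extracted features.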

【００６９】次に、認識用音声の特徴抽出の方法を述べ
る。認識用音声入力手段６は学習用音声入力手段１と同
様、発声速度出力手段７は発声速度出力手段２と同様、
特徴抽出手段９は特徴抽出手段４と同様な動作を行う。
分析制御パラメータ計算（機能部）８では、上記数１式
のＣに発声速度出力手段２が出力した学習用音声の発声
速度から計算された係数、例えば、全発声の平均発声速
度、Ｒに発声速度出力手段７が出力した認識用音声の発
声速度、例えば、一発声、母音部分、子音部分、定常
部、非定常部の発声速度を用いる。Next, the method of extracting features from the recognition voice is described. The recognition voice input means 6 operates in the same way as the learning voice input means 1, the utterance speed output means 7 in the same way as the utterance speed output means 2, and the feature extraction means 9 in the same way as the feature extraction means 4. In the analysis control parameter calculation (functional unit) 8, C in Equation 1 above is set to a coefficient calculated from the utterance speed of the learning voice output by the utterance speed output means 2, for example the average utterance speed over all utterances, and R is set to the utterance speed of the recognition voice output by the utterance speed output means 7, for example the utterance speed of one utterance, a vowel part, a consonant part, a stationary part, or a non-stationary part.
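The roles of C and R for the recognition voice can be sketched as follows. The sketch assumes the form I = Io * (C / R) ** s for Equation 1, which this text does not reproduce; the speed values (imagined as syllables per second), the per-unit dictionary, and all names are hypothetical.

```python
from statistics import mean


def reference_coefficient(learning_speeds):
    """C: here, the average utterance speed over all learning utterances."""
    return mean(learning_speeds)


def intervals_for_units(i0, c, unit_speeds, s=1):
    """Frame interval per normalization unit, using the assumed I = Io * (C / R) ** s."""
    return {unit: i0 * (c / r) ** s for unit, r in unit_speeds.items()}


# Made-up speeds standing in for the output of utterance speed output means 2.
learning_speeds = [6.0, 8.0, 10.0]
# Made-up per-unit speeds standing in for the output of utterance speed output means 7.
recognition_units = {"vowel": 10.0, "consonant": 16.0}

c = reference_coefficient(learning_speeds)
intervals = intervals_for_units(10.0, c, recognition_units)
```

Because C comes from the learning voice and R from the recognition voice, the resulting interval depends on how the recognition voice's speed compares with the speed of the material the acoustic model was trained on.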

【００７０】特徴抽出手段９は、分析パラメータ計算
（機能部）８の出力した分析制御パラメータと認識用音
声入力手段６で入力された認識用音声を入力とし、分析
制御パラメータに従って認識用音声の特徴抽出を行う。The feature extraction means 9 takes as input the analysis control parameter output by the analysis parameter calculation (functional unit) 8 and the recognition voice input through the recognition voice input means 6, and extracts features from the recognition voice according to the analysis control parameter.

【００７１】音響モデル学習手段５では、特徴抽出手段
４が出力した特徴量を入力とし、音響モデルのパラメー
タを推定し、音声認識処理のうち音響処理で参照するパ
ターンを作成する。The acoustic model learning means 5 takes as input the feature quantities output by the feature extraction means 4, estimates the parameters of the acoustic model, and creates the patterns that are referenced during the acoustic processing stage of speech recognition.

【００７２】認識手段１１は、音響モデル学習手段５で
作成された音響モデルと特徴抽出手段９が出力した認識
用音声の特徴量等を入力としてマッチングを行い、最も
確からしい正解候補を認識結果出力手段１２で認識結果
として出力する。The recognition means 11 performs matching using as input the acoustic model created by the acoustic model learning means 5, the feature quantities of the recognition voice output by the feature extraction means 9, and the like, and the recognition result output means 12 outputs the most probable candidate as the recognition result.

【００７３】[0073]

【発明の効果】以上説明したように、本発明によれば、
入力音声の発声速度から計算した分析制御パラメータを
用いることにより、発声速度に即した適切な特徴抽出を
するために、発声速度正規化分析を行うことで、音響モ
デルの認識性能および全体の認識性能が向上する音声認
識装置を提供することができる。[Effects of the Invention] As described above, according to the present invention, utterance-speed-normalized analysis is performed using an analysis control parameter calculated from the utterance speed of the input voice, so that features appropriate to the utterance speed are extracted. This makes it possible to provide a voice recognition device in which the recognition performance of the acoustic model and the overall recognition performance are improved.

【００７４】また、本発明によれば、直前発声を用いて
計算された分析制御パラメータの保持・読み出しを行っ
て現在の認識用音声の特徴抽出を行うため、現在の発声
に対して、直前の発声の情報を用いて逐次的に特徴抽出
を行うことができるため、処理時間を削減できる音声認
識装置を提供することができる。Further, according to the present invention, the analysis control parameter calculated from the immediately preceding utterance is held and read out in order to extract features from the current recognition voice. Features can therefore be extracted sequentially for the current utterance using information from the preceding utterance, which makes it possible to provide a voice recognition device with reduced processing time.
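The hold-and-read behavior described here can be sketched as a small store that returns the interval computed from the preceding utterance while the current one is being analyzed. The class name, the fallback to the conventional interval for the first utterance, and the assumed form I = Io * C / R are illustrative assumptions, not taken from the source.

```python
class FrameIntervalStore:
    """Holds the frame interval computed from the immediately preceding utterance."""

    def __init__(self, default_interval):
        # Assumption: before any utterance has been seen, use the conventional interval.
        self._default = default_interval
        self._held = default_interval

    def read(self):
        """Interval to use when analyzing the current utterance."""
        return self._held

    def hold(self, c, utterance_speed):
        """Store the interval derived from the utterance that has just ended."""
        self._held = self._default * c / utterance_speed


store = FrameIntervalStore(default_interval=10.0)
used = []
for speed in [8.0, 10.0, 5.0]:      # made-up utterance speeds; C is fixed at 8.0
    used.append(store.read())        # analyze the current utterance with the held value
    store.hold(c=8.0, utterance_speed=speed)
```

In this arrangement the current utterance never waits for its own speed to be measured before analysis starts, which is consistent with the reduction in processing time that the paragraph claims.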

[Brief description of drawings]

【図１】本発明の第１の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 1 is a block diagram showing a configuration of a voice recognition device according to a first embodiment of the present invention.

【図２】本発明の第２の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 2 is a block diagram showing a configuration of a voice recognition device according to a second embodiment of the present invention.

【図３】本発明の第３の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 3 is a block diagram showing a configuration of a voice recognition device according to a third embodiment of the present invention.

【図４】本発明の第４の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 4 is a block diagram showing a configuration of a voice recognition device according to a fourth embodiment of the present invention.

【図５】本発明の第５の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 5 is a block diagram showing a configuration of a voice recognition device according to a fifth embodiment of the present invention.

【図６】本発明の第６の実施の形態による音声認識装置
の構成を示すブロック図である。FIG. 6 is a block diagram showing a configuration of a voice recognition device according to a sixth embodiment of the present invention.

【図７】（ａ）及び（ｂ）は本発明の第２の実施の形態
における発声速度出力手段の具体的な構成例を夫々示す
ブロック図である。FIGS. 7(a) and 7(b) are block diagrams each showing a specific configuration example of the utterance speed output means in the second embodiment of the present invention.

【図８】従来のＨＭＭの一例を示す図である。FIG. 8 is a diagram showing an example of a conventional HMM.

【図９】従来例における音声認識装置の構成を示すブロ
ック図である。FIG. 9 is a block diagram showing a configuration of a voice recognition device in a conventional example.

[Explanation of symbols]

１学習用音声入力手段２発声速度出力手段３分析制御パラメータ計算４特徴抽出手段５音響モデル学習手段６認識用音声入力手段７発声速度出力手段８分析制御パラメータ計算９特徴抽出手段１０，２０，３０，４０，５０，６０音声認識装置２１予備特徴抽出手段２２発声速度計算手段２５規定文発声時間長測定２６発声速度計算手段１００入力端子１０１発声速度検出部１０２音声記号化部１０３発声速度１０４記号系列１０５ＨＭＭ法に基づく音声認識部１０６単語ＨＭＭデータベース１０７ＨＭＭパラメータ１０８認識結果
1 Learning voice input means
2 Utterance speed output means
3 Analysis control parameter calculation
4 Feature extraction means
5 Acoustic model learning means
6 Recognition voice input means
7 Utterance speed output means
8 Analysis control parameter calculation
9 Feature extraction means
10, 20, 30, 40, 50, 60 Voice recognition device
21 Preliminary feature extraction means
22 Utterance speed calculation means
25 Prescribed sentence utterance duration measurement
26 Utterance speed calculation means
100 Input terminal
101 Utterance speed detection unit
102 Speech symbolization unit
103 Utterance speed
104 Symbol sequence
105 HMM-based speech recognition unit
106 Word HMM database
107 HMM parameters
108 Recognition result

Claims

[Claims]

1. A voice recognition device, wherein at least one of a learning voice and a recognition voice is analyzed by expanding or contracting an analysis frame interval based on an utterance speed.

2. The voice recognition device according to claim 1, wherein the learning voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed, and the recognition voice is analyzed.

3. The voice recognition device according to claim 1, wherein the recognition voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed, and the learning voice is analyzed.

4. The voice recognition device according to claim 1, wherein both the learning voice and the recognition voice are analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

5. The voice recognition device according to claim 3, wherein the utterance speed of the learning voice and the utterance speed of the recognition voice are used to calculate the analysis frame interval used when the recognition voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

6. The voice recognition device according to claim 3, wherein only the utterance speed of the recognition voice is used to calculate the analysis frame interval used when the recognition voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

7. The voice recognition device according to claim 1, wherein an input voice is analyzed to extract features, and the input voice is analyzed by expanding or contracting the analysis frame interval based on an utterance speed calculated by performing time-axis alignment between the extracted feature quantities and a standard pattern prepared in advance.

8. The voice recognition device according to claim 1, wherein a speaker utters a prescribed sentence prepared in advance, and an input voice is analyzed by expanding or contracting the analysis frame interval based on an utterance speed calculated from the number of syllables contained in the sentence and the utterance duration.

9. The voice recognition device according to claim 3, wherein an analysis frame interval calculated from the utterance speed of the immediately preceding utterance is used when the recognition voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

10. The voice recognition device according to claim 1, wherein an average utterance speed for each speaker is used to calculate the analysis frame interval used when an input voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

11. The voice recognition device according to claim 1, wherein an average utterance speed for each utterance is used to calculate the analysis frame interval used when an input voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed.

12. The voice recognition device according to claim 1, wherein, in calculating the analysis frame interval used when an input voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed, the utterance speed of vowels is used for vowels and the utterance speed of consonants is used for consonants.

13. The voice recognition device according to claim 1, wherein the analysis frame interval used when an input voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed is calculated separately for stationary parts and non-stationary parts.

14. The voice recognition device according to claim 1, wherein whether or not an input voice is analyzed by expanding or contracting the analysis frame interval based on the utterance speed is controlled using a threshold on the utterance speed.