
JPH04121794A - Speech recognizing method - Google Patents

Speech recognizing method

Info

Publication number
JPH04121794A
JPH04121794A
Authority
JP
Japan
Prior art keywords
frames
time series
difference values
series pattern
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP24341290A
Other languages
Japanese (ja)
Inventor
Kazuhiko Okashita
和彦 岡下
Shingo Nishimura
新吾 西村
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sekisui Chemical Co Ltd
Original Assignee
Sekisui Chemical Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sekisui Chemical Co Ltd filed Critical Sekisui Chemical Co Ltd
Priority to JP24341290A priority Critical patent/JPH04121794A/en
Publication of JPH04121794A publication Critical patent/JPH04121794A/en
Pending legal-status Critical Current


Abstract

PURPOSE: To obtain tolerance to spectrum distortion and a high recognition rate by removing the feature parameters of frames whose power is below a threshold, then finding inter-frame difference values of the feature parameters, generating a time-series pattern of the difference values, and calculating the similarity between each speech item and a standard pattern with a statistical distance measure. CONSTITUTION: 1. A speech sample is input to a speech input part 11. 2. A feature extraction part 12 obtains frequency characteristics, frame by frame, through a band-pass filter. 3. A power decision part 13 compares the effective power of each frame's frequency characteristics with the threshold θ and removes the feature parameters of frames below θ. 4. A difference value generation part 14 finds the inter-frame difference values, and a time-series pattern generation part 15 generates the time-series pattern of the difference values. 5. A similarity calculation part 17 calculates, with a statistical distance measure, the similarity between the time-series pattern of the difference values and the standard patterns of the respective speech items stored in a dictionary part 16. 6. A decision part 18 selects the candidate with the highest similarity as the recognition result. Consequently, speech recognition free of the influence of distortion and noise is possible regardless of whether the frame power is large or small.
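The six numbered steps above can be sketched end to end as follows. Everything in this sketch is illustrative rather than taken from the patent: the FFT band-energy computation stands in for the 16-channel band-pass filter bank, the negated Euclidean distance stands in for the unspecified statistical distance measure, and all function names and array layouts are assumptions.

```python
import numpy as np

def diff_pattern(signal, sample_rate, threshold, num_channels=16, frame_ms=12.8):
    """Steps 1-4: frame-wise features, low-power frame removal, and
    inter-frame difference values. FFT band energies approximate the
    patent's 16-channel band-pass filter bank (an assumption)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Group the spectrum bins into num_channels contiguous bands.
    bands = np.array_split(np.arange(spectrum.shape[1]), num_channels)
    features = np.stack([spectrum[:, b].sum(axis=1) for b in bands], axis=1)
    rms = np.sqrt(np.mean(features ** 2, axis=1))      # effective power per frame
    return np.diff(features[rms >= threshold], axis=0) # inter-frame differences

def recognize(pattern, dictionary):
    """Steps 5-6: score the pattern against each word's standard pattern
    (negated Euclidean distance as a stand-in similarity) and pick the best."""
    scores = {w: -np.linalg.norm(pattern - ref) for w, ref in dictionary.items()}
    return max(scores, key=scores.get)
```

In use, `diff_pattern` would be run once per vocabulary word on known speech to fill the dictionary, and again on unknown speech at recognition time.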

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application] The present invention relates to a speech recognition method suitable for recognizing words from input speech at online terminals such as electric locks and IC card terminals.

[Prior Art] A speech recognition method has previously been proposed, as described in Japanese Patent Application Laid-Open No. 1-260490. In that method, feature parameters of the input speech are computed frame by frame over frames of a predetermined length, difference values of the feature parameters are taken between frames, a time-series pattern of the difference values is created, and the similarity between this time-series pattern and the standard pattern of each speech item is calculated with a statistical distance measure to perform recognition.
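A minimal sketch of this prior-art baseline: difference values are taken between all frames, with no power check. The `(num_frames, num_channels)` array layout and the function name are illustrative assumptions, not from the cited publication.

```python
import numpy as np

def prior_art_pattern(features):
    """Inter-frame difference values over ALL frames, regardless of power.
    `features` is a (num_frames, num_channels) array of frame-wise
    feature parameters (an assumed layout)."""
    return np.diff(features, axis=0)  # row t holds features[t+1] - features[t]
```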

[Problems to Be Solved by the Invention] In the prior art, however, the feature parameters of every frame of the input speech are used as-is, regardless of the power of those frames.

Information in low-power frames, however, is easily affected by transmission-system distortion and stationary noise, so similarity judgments based on it are unreliable.

Furthermore, since the difference values are taken in the frequency domain, difference values between low-power frames are treated on a par with those between high-power frames, which strongly affects the recognition rate.

That is, in the prior art, information from low-power frames, for which similarity judgment is unreliable, exerts a large influence on the recognition rate, making it difficult to achieve a high recognition rate.

An object of the present invention is to provide a speech recognition method that is robust to stationary spectral distortion and achieves a high recognition rate.

[Means for Solving the Problem] The invention of claim 1 computes feature parameters of the input speech frame by frame over frames of a predetermined length; when the effective value of a frame's power is smaller than an arbitrary threshold, the feature parameters of that frame are excluded. Difference values of the feature parameters are then taken between the remaining frames, a time-series pattern of the difference values is created, and the similarity between this time-series pattern and the standard pattern of each speech item is calculated with a statistical distance measure to perform speech recognition.
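The claim-1 processing can be sketched as follows, assuming frame-wise features are held in a `(num_frames, num_channels)` NumPy array; the array layout and function name are illustrative:

```python
import numpy as np

def claim1_pattern(features, threshold):
    """Claim 1 sketch: drop frames whose effective (RMS) power is below
    `threshold`, then take inter-frame differences of the survivors."""
    rms = np.sqrt(np.mean(features ** 2, axis=1))   # effective power per frame
    return np.diff(features[rms >= threshold], axis=0)
```

Note that once a low-power frame is dropped, the difference is taken between its surviving neighbors, so the low-power frame contributes nothing to the pattern.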

The invention of claim 2 likewise computes feature parameters of the input speech frame by frame over frames of a predetermined length; when the effective value of a frame's power is smaller than an arbitrary threshold, the feature parameters of that frame are weighted so that their influence is reduced. Difference values of the feature parameters are then taken between frames, a time-series pattern of the difference values is created, and the similarity between this time-series pattern and the standard pattern of each speech item is calculated with a statistical distance measure to perform speech recognition.
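The claim-2 variant can be sketched in the same style: instead of excluding low-power frames, it down-weights them. The weight value 0.1 is an illustrative choice, not a value fixed by the patent.

```python
import numpy as np

def claim2_pattern(features, threshold, low_power_weight=0.1):
    """Claim 2 sketch: down-weight low-power frames so they contribute
    less to the inter-frame difference pattern (weight is an assumption)."""
    rms = np.sqrt(np.mean(features ** 2, axis=1))       # effective power per frame
    w = np.where(rms < threshold, low_power_weight, 1.0)
    return np.diff(features * w[:, None], axis=0)       # differences of weighted frames
```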

[Operation] According to the present invention, the feature parameters of low-power frames, which are easily affected by transmission-system distortion and stationary noise and yield unreliable similarity judgments, are excluded or weighted so that their influence is reduced; the inter-frame difference values of the feature parameters are then computed, and speech recognition is performed on the basis of these difference values.

That is, compared with a method that uses the difference values between all frames regardless of power, speech recognition is performed with the effects of transmission-system distortion and stationary noise removed. A speech recognition method that is robust to stationary spectral distortion and achieves a high recognition rate can therefore be obtained.

[Embodiment] Fig. 1 is a schematic diagram showing a speech recognition system according to one embodiment of the present invention.

The speech recognition system 10 comprises a speech input section 11, a feature extraction section 12, a power determination section 13, a difference value creation section 14, a time-series pattern creation section 15, a dictionary section (standard pattern storage section) 16, a similarity calculation section 17, and a decision section 18.

The dictionary creation procedure and the recognition procedure using the speech recognition system 10 are described below.

(A) A speech sample is taken in at the speech input section 11.

The recognition vocabulary was the names of the 47 prefectures, spoken by one specific speaker.

(B) Dictionary creation

(1) The known input speech waveform of each vocabulary word is passed through a 16-channel band-pass filter in the feature extraction section 12, and frequency characteristics are obtained for each frame (12.8 ms).

(2) The power determination section 13 compares the experimentally determined threshold θ with the effective value of the power of each frame's frequency characteristics, and excludes the feature parameters of frames whose effective power is smaller than θ.

(3) The difference value creation section 14 computes the difference values of the feature parameters between frames, and the time-series pattern creation section 15 creates a time-series pattern of those difference values. The time-series pattern so created is stored in the dictionary section 16 to form the dictionary.

(C) Recognition

(1) Stationary noise is added to the unknown input speech waveform of each vocabulary word, and the result is passed through the 16-channel band-pass filter in the feature extraction section 12 to obtain frequency characteristics for each frame (12.8 ms).

(2) The power determination section 13 compares the experimentally determined threshold θ with each frame's effective power and excludes the feature parameters of frames whose effective power is smaller than θ.

(3) The difference value creation section 14 computes the inter-frame difference values, and the time-series pattern creation section 15 creates their time-series pattern.

(4) The similarity calculation section 17 uses a statistical distance measure to calculate the similarity between the time-series pattern created in (3) and the standard pattern of each speech item stored in the dictionary section 16.

(5) The decision section 18 takes as the recognition result the word with the highest similarity in (4).
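Steps (4) and (5) can be sketched as follows. The patent only says "statistical distance measure" without naming one, so the diagonal-covariance Mahalanobis-style distance used here is an assumption, as are the function names and the `(mean, var)` form of the stored standard patterns.

```python
import numpy as np

def similarity(pattern, mean, var):
    """Negated diagonal-covariance Mahalanobis distance between a difference
    time-series pattern and a word's standard pattern (mean, var).
    Higher (less negative) means more similar."""
    d = (pattern - mean).ravel()
    return -float(np.sqrt(np.sum(d ** 2 / var.ravel())))

def decide(pattern, dictionary):
    """Pick the word whose standard pattern gives the highest similarity."""
    return max(dictionary, key=lambda w: similarity(pattern, *dictionary[w]))
```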

Experimental results for the conventional method and for the method of the present invention, as implemented in the speech recognition system 10 above, are described next.

(Conventional method) Experiment: the inter-frame difference values of the feature parameters (band-pass filter outputs) were used, and similarity was computed with a statistical distance measure.

One specific speaker was used, and the recognition vocabulary was the 47 prefecture names.

Result: the recognition rate was 93.2%.

(Method of the present invention) Experiment: the feature parameters (band-pass filter outputs) of low-power frames were excluded, the inter-frame difference values were used as input, and recognition was performed with the statistical distance measure.

One specific speaker was used, and the recognition vocabulary was the 47 prefecture names.

Result: the recognition rate was 95.3%.

In practicing the present invention, the dictionary creation and recognition stages (steps (2) of (B) and (2) of (C) above) may, instead of excluding the feature parameters of low-power frames, weight them so that their influence is reduced.

The speech recognition system 10 described above has the following effects.

According to the above embodiment, the feature parameters of low-power frames, which are easily affected by transmission-system distortion and stationary noise and yield unreliable similarity judgments, are excluded or weighted so that their influence is reduced; the inter-frame difference values of the feature parameters are then computed, and speech recognition is performed on the basis of these difference values.

That is, compared with a method that uses the difference values between all frames regardless of power, speech recognition is performed with the effects of transmission-system distortion and stationary noise removed. A speech recognition method that is robust to stationary spectral distortion and achieves a high recognition rate is thereby obtained.

[Effects of the Invention] As described above, the present invention provides a speech recognition method that is robust to stationary spectral distortion and achieves a high recognition rate.

[Brief Description of the Drawing]

Fig. 1 is a schematic diagram showing a speech recognition system according to one embodiment of the present invention. 10: speech recognition system; 11: speech input section; 12: feature extraction section; 13: power determination section; 14: difference value creation section; 15: time-series pattern creation section; 16: dictionary section; 17: similarity calculation section; 18: decision section. Patent applicant: Sekisui Chemical Co., Ltd. Representative: 廣 1) 馨

Claims (2)

[Claims]

(1) A speech recognition method in which feature parameters of input speech are computed frame by frame over frames of a predetermined length; when the effective value of a frame's power is smaller than an arbitrary threshold, the feature parameters of that frame are excluded; difference values of the feature parameters between frames are then computed; a time-series pattern of the difference values is created; and the similarity between this time-series pattern and the standard pattern of each speech item is calculated with a statistical distance measure to perform speech recognition.
(2) A speech recognition method in which feature parameters of input speech are computed frame by frame over frames of a predetermined length; when the effective value of a frame's power is smaller than an arbitrary threshold, the feature parameters of that frame are weighted so that their influence is reduced; difference values of the feature parameters between frames are then computed; a time-series pattern of the difference values is created; and the similarity between this time-series pattern and the standard pattern of each speech item is calculated with a statistical distance measure to perform speech recognition.
JP24341290A 1990-09-12 1990-09-12 Speech recognizing method Pending JPH04121794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP24341290A JPH04121794A (en) 1990-09-12 1990-09-12 Speech recognizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP24341290A JPH04121794A (en) 1990-09-12 1990-09-12 Speech recognizing method

Publications (1)

Publication Number Publication Date
JPH04121794A true JPH04121794A (en) 1992-04-22

Family

ID=17103483

Family Applications (1)

Application Number Title Priority Date Filing Date
JP24341290A Pending JPH04121794A (en) 1990-09-12 1990-09-12 Speech recognizing method

Country Status (1)

Country Link
JP (1) JPH04121794A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0675962A (en) * 1992-05-01 1994-03-18 Internatl Business Mach Corp <Ibm> Method and device for automatic detection/processing for vacant multimedia data object


Similar Documents

Publication Publication Date Title
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
KR100463657B1 (en) Apparatus and method of voice region detection
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
JPH04121794A (en) Speech recognizing method
CN113948088A (en) Voice recognition method and device based on waveform simulation
CN115862636B (en) Internet man-machine verification method based on voice recognition technology
JP2968976B2 (en) Voice recognition device
Coy et al. Soft harmonic masks for recognising speech in the presence of a competing speaker.
JPH04121799A (en) Speech recognizing method
JPH04163600A (en) Method of speaker recognition
Weißkirchen et al. Utilizing computer vision algorithms to detect and describe local features in images for emotion recognition from speech
JPH03122699A (en) Noise removing device and voice recognition device using same device
JPH0465399B2 (en)
JPH03230200A (en) Voice recognizing method
Zão et al. Noise robust speaker verification based on the MFCC and pH features fusion and multicondition training
JPS62211698A (en) Detection of voice section
JPH0415699A (en) Speaker recognition system
CN113571054A (en) Speech recognition signal preprocessing method, device, equipment and computer storage medium
JPH04163599A (en) Method of speaker recognition
JPH02302799A (en) Speech recognition system
JPS59181396A (en) Rematching speech recognition method
JPH0558560B2 (en)
JPS59124388A (en) Word speech recognition processing method
JPS62293299A (en) Voice recognition
JPH02273799A (en) Speaker recognition system