JPS6228479B2

JPS6228479B2 -

Info

Publication number: JPS6228479B2
Application number: JP53159417A
Authority: JP
Inventors: Eru Moshiaa Suteiibun; Jii Baaraa Rorensu
Original assignee: Dialog Systems Inc
Current assignee: Dialog Systems Inc
Priority date: 1978-04-27
Filing date: 1978-12-26
Publication date: 1987-06-20
Also published as: FR2424589B1; GB1603929A; JPS54142910A; FR2424589A1; GB1603928A; GB1603927A; GB1603926A

Description

[Detailed description of the invention]

本発明はスピーチ認識方法に関し、詳しくいう
と、連続的音声信号中の１つまたはそれ以上のキ
ーワードを実時間で認識するための方法に関す
る。適当に処理された未知の隔絶された音声信号
を、１つまたはそれ以上の既知のキーワードの前
以つて用意された表示と比較することによつて隔
絶された発声を認識する種々のスピーチ認識シス
テムが従来より提案されている。これと関連し
て、用語「キーワード」はつながりのある一群の
音素および音を意味するのに使用され、例えばシ
ラブル（音節）、ワード（語）、フレーズ（句）、
等の一部分である。多くのシステムはある程度の
成功を得たにすぎないが、１つのシステムは商業
上の応用において隔絶されたキーワードを認識す
るのに首尾よく使用されている。このシステムは
米国特許第4038503号に記載された方法に従つて
実質的に動作し、そして未知の音声信号データの
境界がこの認識システムによつて測定されたとき
に沈黙あるいはバツクグラウンドノイズであるな
らば、制限されたボキヤブラリーのキーワードの
１つを認識するための上首尾の方法を提供する。
このシステムは未知の音声信号が生じる期間が十
分に定められかつ単一の発声を含むという推定に
依存する。キーワード境界が先験的に知られていないまた
はマークされていない、例えば連続する会話的ス
ピーチのような、連続音声信号においては、（隔
絶されたワードが連続スピーチ信号の一面であ
る）、到来音声データを区分するために、すなわ
ち、音素、シラブル、ワード、センテンス
（文）、等のような言語単位の境界を、キーワード
認識プロセスの開始前に、決定するために、いく
つかの方法が提案されている。しかしながら、こ
れら従来の連続スピーチシステムは、満足なセグ
メント化プロセスが見出されていないので、単に
限られた成功しか得ていない。その上、実質的な
問題が存在する。例えば、限られたボキヤブラリ
ーのみが低い誤りアラーム率で首尾一貫して認識
され得ること、認識の精度が異なる話し手の音声
特性間の差に非常に敏感であること、ならびにこ
れらシステムが、例えば通常の電話通信装置を通
じて送信される音声信号において代表的には生じ
るような、分析されている音声信号における歪み
に非常に敏感であることである。従つて、連続ス
ピーチがオブザーバーによつて容易に識別でき、
理解できても、連続音声信号中の限られたボキヤ
ブラリーのキーワードでさえ機械で認識すること
はいまだ大きな成功を収めていない。連続スピーチにおけるキーワードを認識するの
に有効であるスピーチ分析システムが本出願人の
米国特許出願「連続スピーチ認識方法」に記載さ
れている。このシステムは各キーワードが１つま
たはそれ以上の目的（ターゲツト）パターンの命
令されたシーケンスよりなるテンプレート
（template）によつて特徴付けられ、かつ各目的
パターンが時間的に離れた複数の短期間キーワー
ドパワースペクトルを表わす方法を使用する。同
時に、目的パターンはキーワードにおけるすべて
の重要な音響上の事象をカバーする。上記米国特
許出願に記載された発明は、複数の等しい継続時
間のサンプリング期間のそれぞれにおいて音声信
号の短期間パワースペクトルを決定する一組のパ
ラメータを繰返し測定し、それによつて連続的な
時間命令されたシーケンスの短期間の音声パワー
スペクトルフレームを発生する段階と、短期間パ
ワースペクトルフレームのシーケンスから１つの
第１のフレームと少なくとも１つの後で生じるフ
レームを繰返し選択してマルチフレームスペクト
ルパターンを形成する段階とからなる周波数分析
方法を特徴としている。この方法はさらに、好ま
しくは尤度統計量を使用して、上記のように形成
された各マルチフレームパターンを各キーワード
テンプレートの各第１の日的パターンと比較する
段階と、各マルチフレームパターンがキーワード
テンプレートの第１の目的パターンの１つに対応
するか否かを決定する段階とを特徴とする。決定
段階に従つて、可能性のある候補キーワードの第
１の目的パターンに対応する各マルチフレームパ
ターンごとに、この方法は後で生じるフレームを
選択して後で生じるマルチフレームパターンを形
成することを特徴とする。次に、この方法は、同
様の態様で、後のマルチフレームパターンが可能
性のある候補キーワードの引続く目的パターンに
それぞれ対応するか否かを決定する段階、ならび
に選択されたシーケンスのマルチフレームパター
ンが選択されたキーワードテンプレートと称され
るキーワードテンプレートの目的パターンにそれ
ぞれ対応するときに候補キーワードを識別する段
階を特徴とする。上記米国特許出願に記載された方法はたとえ従
来技術のシステムより連続スピーチ中のキーワー
ドを認識するのに相当に有効であるけれども、こ
の方法でも所望の目的を達成し得ない。従つて、本発明の主な目的は連続する、マーク
のない音声信号中のキーワードを認識するのに改
善された有効性を有するスピーチ認識方法であ
る。本発明の他の目的は与えられたシステムの有効
性、特にシステムの誤りアラームに対する識別
力、を改善するように適合された方法である。本発明の他の目的は異なる話し手、従つて異な
る音声特性、に十分に等しく応答する方法であ
り、信頼できる方法であり、実時間で動作する方
法であり、現存する認識方法に適合し得る方法で
ある。本発明の他の目的は未知の音声入力信号データ
の位相および振幅歪みに比較的不感知である方法
であり、未知の音声入力信号の有節発音（調音）
速度の変化に比較的不感知である方法であり、未
知の入力信号のデイメンシヨナリテイ（次元性）
を減ずる方法である。本発明は音声入力信号中の少なくとも１つのあ
らかじめ定められたキーワードを認識するための
スピーチ分析システムに関する。各キーワードは
命令されたシーケンスの１つまたはそれ以上の目
的パターンよりなるテンプレートによつて特徴付
けられている。各目的パターンは少なくとも１つ
の短時間キーワードパワースペクトル、またはフ
レームを表わす。同時に、目的パターンはキーワ
ードにおけるすべての重要な音響上の事象をカバ
ーする。本発明は、それぞれが１つまたはそれ以
上のフレームよりなるパターンのシーケンスを選
択する段階と、選択されたパターンのシーケンス
が選択されたキーワードテンプレートと称される
キーワードテンプレートの目的パターンのシーケ
ンスにそれぞれ対応するときに候補キーワードを
識別する段階と、誤りアラーム率を改善するため
にポストデシジヨン（事後判断）処理方法を適用
する段階とを含む周波数分析方法を特徴としてい
る。ポストデシジヨン処理方法は、一面では、候補
キーワード目的パターンに対応する選択されたパ
ターン間の時間間隔を正規化する段階と、候補キ
ーワードに対する正規化された時間間隔が韻律学
的試験によつて課されるタイミング基準に合わね
ばならない韻律学的試験を正規化された時間間隔
に適用する段階とを特徴とする。この試験が満足
されない場合には、候補キーワードは例示の実施
例においては認識されたキーワードとして受け入
れられない。好ましい実施例においては、タイミ
ング基準は尤度統計量関数を正規化間隔に適用
し、この尤度統計量があらかじめ定められた最小
値を越えた場合に候補ワードを受け入れることか
らなる。第２の実施例においては、この基準は一
定のあらかじめ定められた間隔制限を各正規化間
隔に適用し、正規化間隔がこの一定の制限内に入
る場合にのみ、候補ワードを受け入れることから
なる。本発明のポストデシジヨン処理方法の他の面に
おいては、尤度統計量関数を候補ワードに対応す
る選択されたパターンのシーケンスに適用してこ
れらパターンのそれぞれに対する良さの指数を決
定する段階と、これらパターンに対する良さの指
数を累算する段階と、累算された良さの指数があ
らかじめ定められた最小値を越えた場合に候補ワ
ードを受け入れる段階とを特徴としている。好ましい一面においては、本発明は連続する境
界のない音声入力信号中の少なくとも１つのあら
かじめ定められたキーワードを認識するためのス
ピーチ分析システムに関する。各キーワードは１
つまたはそれ以上の目的パターンの命令されたシ
ーケンスよりなるテンプレートによつて特徴付け
られている。各目的パターンは少なくとも１つの
短期間キーワードパワースペクトルを表わす。同
時に、目的パターンはキーワードにおけるすべて
の重要な音響上の事象をカバーする。本発明は複
数の等しい継続時間のサンプリング期間のそれぞ
れ内の音声信号の短時間パワースペクトルを決定
する一組のパラメータを繰返し測定し、それによ
つて連続的な時間命令されたシーケンスの短期間
の音声パワースペクトルフレームを発生する段階
と、短期間パワースペクトルフレームのシーケン
スから１つの第１のフレームと少なくとも１つの
後で生じるフレームを繰返し選択してマルチフレ
ームスペクトルパターンを形成する段階とからな
る分析方法を特徴としている。この方法はさら
に、好ましくは尤度統計量を使用して、上記のよ
うに形成された各マルチフレームパターンを各キ
ーワードテンプレートの各第１の目的パターンと
比較する段階と、各マルチフレームパターンがキ
ーワードテンプレートの第１の目的パターンの１
つに対応するか否かを決定する段階とを特徴とす
る。決定段階に従つて、可能性のある候補キーワ
ードの第１の目的パターンに対応する各マルチフ
レームパターンごとに、この方法は後で生じるフ
レームを選択して後で生じるマルチフレームパタ
ーンを形成することを特徴とする。次に、この方
法は、同様の態様で、後のマルチフレームパター
ンが可能性のある候補キーワードの引続く目的パ
ターンにそれぞれ対応するか否かを決定する段階
と、選択されたマルチフレームパターンのシーケ
ンスが選択されたキーワードテンプレートと称さ
れるキーワードテンプレートの目的パターンにそ
れぞれ対応するときに候補キーワードを識別する
段階とを特徴とする。この方法の誤りアラーム率
を改善するポストデシジヨン処理方法段階をさら
に特徴としている。一面において、ポストデシジ
ヨン処理方法は選択された候補キーワードに対応
するマルチフレームパターン間の時間間隔を正規
化する段階と、韻律学的試験を正規化された時間
間隔に適用され、候補キーワードに対する正規化
された時間間隔が韻律学的試験によつて課される
タイミング基準に合わねばならないようにする段
階とを特徴とする。この試験が満足されない場合
には、例示の実施例においては、候補キーワード
は認識されたキーワードとして受け入れられな
い。本発明のポストデシジヨン処理方法の他の面に
おいては、尤度比検定法を候補ワードに対応する
選択されたマルチフレームパターンのシーケンス
に適用してこれらパターンのそれぞれに対する良
さの指数を決定する段階と、これらパターンに対
する良さの指数を累算する段階と、累算された良
さの指数があらかじめ定められた最小値を越えた
場合に候補ワードを受け入れる段階とを特徴とし
ている。また、本発明は複数の等しい継続時間のサンプ
リング期間のそれぞれにおいて音声信号の短期間
パワースペクトルを決定する一組のパラメータを
繰返し測定し、それによつて連続的な時間命令さ
れたシーケンスの短期間の音声パワースペクトル
フレームを発生する段階と、急アタツク緩デイケ
イ（フアーストアタツク・スローデイケイ）ピー
ク検出関数によつてスペクトルフレームに対応す
るピークスペクトルを繰返し発生し、そして各フ
レームごとに、各周波数帯域の振幅を対応するピ
ークスペクトルにおける対応する強度値によつて
割算する段階と、フレームパターンのシーケンス
を選択する段階と、選択されたシーケンスのフレ
ームパターンが選択されたキーワードテンプルと
称されるキーワードテンプルの目的パターンにそ
れぞれ対応するときに候補キーワードを識別する
段階とからなる周波数分析方法を特徴とするもの
である。好ましくは、本発明はさらに、ピークスペクト
ル周波数帯域のそれぞれの値を、これら周波数帯
域に対する到来する新しいスペクトル値の最大値
および１より小さい値を有する一定デイケイ係数
倍された前のピークスペクトル値から選択する段
階を特徴とする。本発明の他の面においては、データストリーム
において、統計上の分布を有する認識エレメント
ｘ_iのベクトルによつて特徴付けられた少なくと
も１つの目的パターンを識別するためのパターン
認識方法を特徴とするものである。この面は目的
パターンの複数のデザイン設定パターンサンプル
ｘから共分散行列Ｋおよび期待値ベクトルを決
定する段階と、共分散行列Ｋから固有値ｖ_iを有
する複数の固有ベクトルｅ_i（ただし、ｖ_iｖ_i+
₁）を計算する段階とからなる分析方法を特徴と
する。この方法はさらに、データストリームから
未知のパターンｙを選択し、各パターンｙを型式
（W₁，W₂，……，Ｗ_p，Ｒ）を有する新しいベク
トルＷに変換することを特徴とする。ここで、Ｗ
_i＝ｅ_i（ｙ−）、ｐはパターンｙのエレメントの
数より小さい正の整数定数、Ｒは再構成誤差統計
量であり、かつ The present invention relates to a speech recognition method, and more particularly to a method for recognizing one or more keywords in a continuous audio signal in real time. Various speech recognition systems that recognize isolated utterances by comparing a suitably processed unknown isolated speech signal with a pre-prepared representation of one or more known keywords. has been proposed so far. In this context, the term "keyword" is used to mean a group of connected phonemes and sounds, such as syllables, words, phrases,
It is a part of etc. Although many systems have met with only moderate success, one system has been successfully used to recognize isolated keywords in commercial applications. This system operates substantially according to the method described in U.S. Pat. No. 4,038,503, and if the boundaries of unknown audio signal data are silence or background noise as measured by this recognition system. For example, it provides a successful method for recognizing one of the keywords in a limited vocabulary.
This system relies on the assumption that the period during which the unknown speech signal occurs is well defined and contains a single utterance. In a continuous speech signal, such as continuous conversational speech, where the keyword boundaries are not known or marked a priori (the isolated words are an aspect of the continuous speech signal), the incoming speech Several methods have been proposed to partition the data, i.e. to determine the boundaries of linguistic units such as phonemes, syllables, words, sentences, etc., before the start of the keyword recognition process. ing. However, these conventional continuous speech systems have had only limited success because no satisfactory segmentation process has been found. Moreover, there are substantive problems. For example, only limited vocabularies can be consistently recognized with low false alarm rates, that recognition accuracy is very sensitive to differences between the speech characteristics of different speakers, and that these systems It is very sensitive to distortions in the audio signal being analyzed, such as typically occurs in audio signals transmitted through telephone communication equipment. Therefore, continuous speech can be easily identified by an observer and
Even if they can be understood, machine recognition of even limited vocabulary keywords in continuous speech signals has not yet achieved great success. A speech analysis system that is effective in recognizing keywords in continuous speech is described in the applicant's US patent application entitled ``Continuous Speech Recognition Method''. The system is characterized by a template, where each keyword consists of a directed sequence of one or more target patterns, and where each target pattern consists of a plurality of temporally separated short-term keywords. Use a method to represent the power spectrum. At the same time, the target pattern covers all important acoustic events in the keyword. The invention described in the above-referenced U.S. patent application repeatedly measures a set of parameters that determine the short-term power spectrum of an audio signal in each of a plurality of equal duration sampling periods, thereby providing successive time commands. generating a sequence of short-term audio power spectral frames; and repeatedly selecting one first frame and at least one subsequent frame from the sequence of short-term power spectral frames to form a multi-frame spectral pattern. It is characterized by a frequency analysis method consisting of steps. The method further includes the step of comparing each multi-frame pattern formed as described above to each first daily pattern of each keyword template, preferably using a likelihood statistic; and determining whether the keyword template corresponds to one of the first target patterns. According to the decision step, for each multi-frame pattern corresponding to a first target pattern of possible candidate keywords, the method selects later-occurring frames to form the later-occurring multi-frame pattern. Features. The method then includes, in a similar manner, determining whether each subsequent multi-frame pattern corresponds to a subsequent target pattern of possible candidate keywords, as well as the multi-frame pattern of the selected sequence. identifying candidate keywords when each corresponds to a target pattern of a keyword template, referred to as a selected keyword template. Although the method described in the above-mentioned US patent application is significantly more effective than prior art systems in recognizing keywords in continuous speech, this method still fails to achieve the desired objective. Therefore, the main object of the present invention is a speech recognition method with improved effectiveness in recognizing keywords in continuous, unmarked speech signals. Another object of the invention is a method adapted to improve the effectiveness of a given system, especially the system's discrimination against false alarms. Another object of the invention is a method that responds sufficiently equally to different speakers and therefore to different speech characteristics, is reliable, operates in real time, and is compatible with existing recognition methods. It is. Another object of the present invention is a method which is relatively insensitive to phase and amplitude distortions of unknown audio input signal data, and which method is relatively insensitive to phase and amplitude distortion of unknown audio input signal data.
It is a method that is relatively insensitive to changes in velocity and the dimensionality of the unknown input signal.
This is a method of reducing The present invention relates to a speech analysis system for recognizing at least one predetermined keyword in an audio input signal. Each keyword is characterized by a template consisting of one or more target patterns of commanded sequences. Each target pattern represents at least one short-term keyword power spectrum, or frame. At the same time, the target pattern covers all important acoustic events in the keyword. The present invention includes the steps of selecting a sequence of patterns, each of which consists of one or more frames, and the selected sequence of patterns corresponding respectively to a sequence of target patterns of a keyword template, referred to as a selected keyword template. The method features a frequency analysis method that includes identifying candidate keywords when performing a search, and applying a post-decision processing method to improve false alarm rates. The post-decision processing method, in one aspect, includes the steps of normalizing the time intervals between the selected patterns corresponding to the candidate keyword target pattern, and the normalized time intervals for the candidate keywords being imposed by a prosodic test. applying a prosodic test to the normalized time interval, which must meet a timing criterion determined by the normalized time interval. If this test is not satisfied, the candidate keyword is not accepted as a recognized keyword in the example embodiment. In a preferred embodiment, the timing criterion consists of applying a likelihood statistic function to a normalized interval and accepting a candidate word if this likelihood statistic exceeds a predetermined minimum value. In a second embodiment, this criterion consists of applying certain predetermined interval limits to each normalized interval and accepting a candidate word only if the normalized interval falls within this certain limit. . Other aspects of the post-decision processing method of the present invention include applying a likelihood statistic function to the sequence of selected patterns corresponding to the candidate word to determine a goodness index for each of the patterns; The method is characterized by accumulating goodness indexes for these patterns and accepting candidate words if the accumulated goodness index exceeds a predetermined minimum value. In one preferred aspect, the present invention relates to a speech analysis system for recognizing at least one predetermined keyword in a continuous, unbounded audio input signal. Each keyword is 1
It is characterized by a template consisting of a directed sequence of one or more target patterns. Each target pattern represents at least one short term keyword power spectrum. At the same time, the target pattern covers all important acoustic events in the keyword. The present invention repeatedly measures a set of parameters that determine the short-term power spectrum of an audio signal within each of a plurality of equal duration sampling periods, thereby determining the short-term power spectrum of the audio signal in a continuous time-commanded sequence. An analysis method comprising the steps of: generating a power spectral frame; and repeatedly selecting one first frame and at least one subsequent frame from a sequence of short term power spectral frames to form a multi-frame spectral pattern. It is a feature. The method further includes the step of comparing each multi-frame pattern formed as described above to each first target pattern of each keyword template, preferably using a likelihood statistic; 1 of the first objective pattern of the template
The method is characterized by a step of determining whether or not it corresponds to the above. According to the decision step, for each multi-frame pattern corresponding to a first target pattern of possible candidate keywords, the method selects later-occurring frames to form the later-occurring multi-frame pattern. Features. The method then includes, in a similar manner, determining whether each subsequent multi-frame pattern corresponds to a subsequent target pattern of possible candidate keywords, and the sequence of selected multi-frame patterns. identifying candidate keywords when each corresponds to a target pattern of a keyword template, referred to as a selected keyword template. The method further features a post-decision processing method step that improves the false alarm rate. In one aspect, the post-decision processing method includes the steps of normalizing the time intervals between multi-frame patterns corresponding to selected candidate keywords, and applying a prosodic test to the normalized time intervals to determine the normalization of the candidate keywords. and ensuring that the standardized time intervals have to meet timing criteria imposed by a prosodic test. If this test is not satisfied, in the illustrated embodiment, the candidate keyword is not accepted as a recognized keyword. In another aspect of the post-decision processing method of the present invention, applying a likelihood ratio test method to the selected sequence of multi-frame patterns corresponding to the candidate word to determine a goodness index for each of these patterns. and accumulating goodness indexes for these patterns; and accepting a candidate word if the accumulated goodness index exceeds a predetermined minimum value. The present invention also repeatedly measures a set of parameters that determine the short-term power spectrum of an audio signal in each of a plurality of equal duration sampling periods, thereby determining the short-term power spectrum of a continuous time-commanded sequence. generating an audio power spectrum frame; repeatedly generating a peak spectrum corresponding to the spectrum frame by a fast attack/slow decay peak detection function; dividing the amplitude of by the corresponding intensity value in the corresponding peak spectrum; selecting a sequence of frame patterns; and selecting a keyword temple in which the frame pattern of the selected sequence is referred to as a selected keyword temple identifying candidate keywords when they respectively correspond to target patterns. Preferably, the invention further comprises selecting the values of each of the peak spectral frequency bands from the maximum value of the incoming new spectral values for these frequency bands and the previous peak spectral value multiplied by a constant decay factor having a value less than one. It is characterized by the stages of Another aspect of the invention features a pattern recognition method for identifying in a data stream at least one target pattern characterized by a vector of recognition elements x _i having a statistical distribution. It is. This aspect consists of a step of determining a covariance matrix K and an expectation vector from a plurality of design setting pattern samples x of the target pattern, and a step of determining a covariance matrix K and an expectation _{vector from a plurality of design setting pattern samples x of the target pattern, and a step of determining a plurality of eigenvectors e i} ₍ where v _i v _i+
₁ ) The analysis method is characterized by a step of calculating. The method is further characterized in selecting unknown patterns y from the data stream and converting each pattern y into a new vector W having the form (W ₁ , W ₂ , . . . , W _p , R). Here, W
_i = e _i (y-), p is a positive integer constant less than the number of elements in pattern y, R is the reconstruction error statistic, and

【式】に等しい。尤度統計量関数をベクトルＷに適用すること
によつてパターンｙが任意の１つの目的パターン
と同一であるか否かが決定される。本発明の特定の面においては、この方法は次式
の一方または他方に従つて尤度統計量を計算する
段階をさらに特徴とする。または L″＝−１／２［（Ｒ−Ｒ）^２／ｖａｒ（Ｒ）＋ｌ_ova
r（Ｒ）］ここで上部にバーをつけた変数はサンプル平均
であり、var（）は不偏サンプル分散である。本発明の他の目的、特徴、および利点は添付図
面を参照しての次の好ましい実施例についての記
載から明らかとなろう。以下本発明の好ましい実施例につき、対応する
参照符号が全図中対応する部分を指示する添付図
面を参照して詳細に説明する。この中で記載する特定の好ましい実施例におい
ては、スピーチの認識は、一般にはスピーチであ
る到来音声データ信号のあるアナログおよびデイ
ジタル理を行なうための特別に構成された電子シ
ステム、ならびにある他のデータ換算段階および
数値測定を行なうため本発明に従つてプログラム
される汎用デイジタルコンピユータの両方を含む
全体の装置によつて実行される。このシステムの
ハードウエア部分とソフトウエア部分間のタスク
の分割は適度の費用で実時間でスピーチの認識を
達成し得る全体のシステムを得るように行なわれ
る。しかしながら、この特定のシステムにおける
ハードウエアにおいて実行されているタスクのあ
るものはソフトウエアにおいても十分に実行でき
るものであり、またこの例におけるソフトウエア
プログラミングによつて実行されているタスクの
あるものは本発明の異なる実施例における専用回
路によつても実行できるということを理解すべき
である。前記したように、本発明の一面は連続スピーチ
信号が例えば電話線によつて歪まされても、これ
ら連続スピーチ信号中のキーワードを認識する装
置の提供である。従つて、第１図を特に参照する
と、１０で指示されたラインを通じての音声入力
信号は炭素電話送信器によつて発生され、随意の
距離を結ぶ電話線、あるいは複数の交換機を結ぶ
電話線を通じて受信される音声信号とみなせる。
本発明の代表的適用は、従つて、電話システムを
通じて受信される未知の源からの音声データ中の
キーワードを認識することである。他方、入力信
号は例えば無線電気通信リンクから、商用放送ス
テーシヨンから、あるいは私設通信リンクから取
られた任意の音声データ信号、例えば音声入力信
号でよい。上記記載から明らかとなるように、本方法およ
び装置は音または音素のシーケンス、あるいは他
の認識可能な表示（符号）を含むスピーチ信号の
認識に関する。この中での記載、ならびに特許請
求の範囲において、「キーワード」、「目的パター
ンのシーケンス」、「テンプレートパターン」、あ
るいは「キーワードテンプレート」という用語が
用いられているが、これら４つの用語は総称的
で、同意であるとみなされる。これは、本方法お
よび装置が検出可能である認識可能なシーケンス
の音声音、あるいはそれらの表示を表現する好都
合な方法である。これら用語は単一の音素、シラ
ブル、あるいは音から一連のワード（文法的意味
において）ならびに単一のワードにいたるすべて
のものを包含するように広く、一般的に解釈され
るべきである。アナログ―デイジタル（Ａ／Ｄ）変換器１３は
ライン１０を介しての到来アナログ音声信号デー
タを受信し、この到来データの信号振幅をデイジ
タル形式に変換する。例示のＡ／Ｄ変換器は入力
信号データを12ビツトの２進表示に変換するよう
に設計されており、この変換は毎秒8000変換の速
度で生じる。Ａ／Ｄ変換器１３はその出力をライ
ン１５を介して自動相関器１７に供給する。自動
相関器１７はデイジタル入力信号を処理して短期
間自動相関関数を毎秒100回発生し、その出力
を、指示されたように、ライン１９を介して供給
する。各自動相関関数は32の値またはチヤンネル
を含み、各値は30ビツトの解像度に計算される。
自動相関器は第２図を参照して後で詳細に記載す
る。ライン１９を通じての自動相関関数はフーリエ
変換装置２１によつてフーリエ変換され、対応す
る短期間ウインドパワースペクトルをライン２３
上に得る。スペクトルは自動相関関数と同じ繰返
し速度で、すなわち毎秒100回、発生され、各短
期間パワースペクトルは16ビツトの解像度をそれ
ぞれ有する31の数値項を有する。理解されるよう
に、スペクトルにおける31項のそれぞれはある周
波数帯域内の信号パワーを表わす。また、フーリ
エ変換装置はにせの隣接帯域の応答を減ずるため
にハミングまたは類似のウインド関数を含むこと
が好ましい。例示の実施例においては、フーリエ変換ならび
にその後の処理段階は、適当にプログラムされた
汎用デイジタルコンピユータの制御のもとで、本
方法に従つて繰返し必要とされる算術演算を早め
るための周辺配列プロセツサを使用して実行され
る。使用された特定のコンビユータは米国、デイ
ジタル・エクイプメント・コーポレーシヨンによ
つて製造されたモデルPDP―11である。使用され
た特定の配列プロセツサは米国特許願第841390号
に記載されている。第３図を参照して後述するプ
ログラミングはこれら商業的に入手し得るデイジ
タル処理ユニツトの能力および特性に実質的に基
づく。短期間ウインドパワースペクトルは、２５で指
示されるように、周波数応答等化され、この等化は後で詳細に記載するように各周波数
帯域またはチヤンネルに生じるピーク振幅の関数
として実行される。周波数応答等化スペクトルは
毎秒当り100の速度でライン２６に発生され、各
スペクトルは16ビツトの精度に評価される31の数
値項を持つ。到来音声データの最終評価を容易に
するため、ライン２６の周波数応答等価ウインド
スペクトルは、３５で指示するように、到来スペ
クトルに非直線振幅変換を課する振幅変換を受け
る。この変換は後で詳細に記載するが、未知の到
来音声信号が基準ボキヤブラリーのキーワードと
合致し得る精度を改善するということをここでは
言及しておく。例示の実施例においては、この変
換は周波数応答等化ウインドスペクトルの全部に
ついて基準ボキヤブラリーのキーワードを表わす
キーワードテンプレートとこのスペクトルを比較
する前の時間に実行される。ライン３８上の振幅変換された等化短期間スペ
クトルは、次に、４０においてキーワードテンプ
レートと比較される。４２で指示されたキーワー
ドテンプレートは、変換等化スペクトルが比較さ
れ得るスペクトルパターンにおける基準ボキヤブ
ラリーのキーワードを表わす。かくして候補キー
ワードが比較の近似性に従つて選択される。例示
の実施例においては、選択プロセスは総体的に不
適当なパターンシーケンスを拒絶しながら、ミス
するキーワードの尤度を最小にするように設計さ
れている。候補ワード（および対応する到来デー
タに関する累算された統計量）が４６においてポ
ストデシジヨン処理のためライン４４を通じて供
給され、誤りアラーム率を減ずる。最終判断は４
８で指示されている。韻律学的マスクおよび／ま
たは音響レベル尤度比検定法の使用を含むポスト
デシジヨン処理は後で詳細に記載するように正し
い検出と誤りアラーム間の識別力を改善する。プレプロセツサ第２図に示す装置において、その固有の平均化
をともなう自動相関関数がライン１０を通じての
到来アナログ音声データ、一般的には音声信号か
らアナログ―デイジタル変換器１３によつて発生
されるデイジタルデータストリームについてデイ
ジタルに実行される。変換器１３はライン１５を
通じてデイジタル入力信号を提供する。デイジタ
ル処理機能、ならびに入力アナログ―デイジタル
変換はクロツク発振器５１の制御下で時間調整さ
れる。クロツク発振器は毎秒256000パルスの基本
タイミング信号を提供し、この信号は分周器５２
に供給され、毎秒8000パルスの第２のタイミング
信号を得る。このより遅いタイミング信号はアナ
ログ―デイジタル変換器１３ならびにラツチレジ
スタ５３を制御する。ラツチレジスタ５３は次の
変換が完了するまですぐ前の変換の12ビツトの結
果を保持する。自動相関積はレジスタ５３に含まれる数と３２
ワードシフトレジスタ５８の出力とを乗算するデ
イジタルマルチプライヤ５６によつて発生され
る。シフトレジスタ５８は再循環モードで動作さ
れ、早い方のクロツク周波数によつて駆動され、
その結果シフトレジスタデータの１つの完全な循
環が各アナログ―デイジタル変換ごとに達成され
る。シフトレジスタ５８に対する入力は各完全な
循環サイクル中に一度、レジスタ５３から取られ
る。デイジタルマルチプライヤ５６に対する１つ
の入力はラツチレジスタ５３から直接取られ、他
方マルチプライヤ５６に対する他方の入力はマル
チプレクサ５９を介してシフトレジスタの現出力
から取られる（ただし、後記する１つの例外があ
る）。これら乗算は高い方のクロツク周波数で実
行される。かくして、Ａ／Ｄ変換から得られた各値は先行
する３１の変換値のそれぞれと乗算される。この
分野の技術者によつて理解されるように、それに
よつて発生された信号は時間遅延された入力信号
それ自身と３２の異なる時間増分（そのうちの１
つは零遅延である）とを乗算することと等価であ
る。零遅延相関、すなわち信号のパワー、を生じ
させるため、マルチプレクサ５９はラツチレジス
タ５３の現在値を、各新しい値がシフトレジスタ
中に導入されるときにそれ自身と乗算させる。こ
のタイミング関数は６０で指示されている。同じくこの分野の技術者には理解されるよう
に、単一の変換からの積ならびにその３１のプレ
デセツサは適度なサンプリング期間にわたるエネ
ルギ分布またはスペクトルを適切には表わさな
い。従つて、第２図の装置はこれら組の積の平均
化したものを提供する。平均化を行なう累算プロセスは加算器６５と相
互接続されて一組の３２の累算器を形成する３２
ワードシフトレジスタ６３によつて提供される。
従つて、各ワードはデイジタルマルチプライヤか
らの対応する増分に加算された後再循環され得
る。循環ループは1/N割算回路６９によつて制御
されるゲート６７を通る。1/N割算回路６９は低
周波数のクロツク信号によつて駆動される。割算
回路６９は累算される、従つてシフトレジスタ６
３が読出される前に平均化される瞬時自動相関関
数の数を決定する係数で低周波数のクロツクを割
算する。例示の例においては、８０のサンプルが読出さ
れる前に累算される。換言すれば、1/N割算回路
６９のＮは８０に等しい。８０の変換サンプルが
このようにして相関され、累算された後、割算回
路６９はコンビユータ割込み回路７１をライン７
２を通じてトリガする。このときに、シフトレジ
スタ６３の内容は適当なインターフエース回路７
３を介してコンビユータメモリに連続的に読み込
まれ、レジスタ中の３２の連続するワードは命令
された順序でインターフエース回路７３を介して
コンピユータに与えられる。この分野の技術者に
は理解されるように、周辺ユニツト、すなわち自
動相関器プレプロセツサからコンピユータへのこ
のデータの転送は代表的には直接メモリアクセス
手続きによつて実行できる。毎秒8000サンプルの
初期サンプリング速度で８０のサンプルの平均化
に基づいて、100の平均化自動相関関数が毎秒当
りコンピユータに提供されることは理解できる。シフトレジスタの内容がコンピユータに読出さ
れている間、ゲート６７は閉じており、従つてシ
フトレジスタ中のワードのそれぞれは有効に零に
リセツトされ、累算プロセスが再び開始できるよ
うにする。第２図に示す装置の動作は、数学的項で表わす
と、次のようになる。アナログ―デイジタル変換
器が時系列Ｓ（ｔ）を発生すると仮定すると（こ
こで、ｔ＝０、To，2To，……，またToはサン
プリング期間（例示の実施例では1/8000秒）であ
る）、第２図の例示のデイジタル相関回路は、始
動時のあいまいさを無視すると、次式(1)の自動相
関関数を計算するものとみなし得る。ここでｊ＝０，１，２，……31；ｔ＝80To，
160To，……，80nTo，…… これら自動相関関数は第１図のライン１９上の相
関出力に対応する。第３図を参照すると、デイジタル相関器は10ミ
リ秒当り１つの完全な自動相関関数の速度で一連
のデータブロツクをコンビユータに送信するよう
に連続的に動作する。これは７７で指示されてい
る。データの各ブロツクは対応する部分時間から
発生される自動相関関数を表わす。上記したよう
に、例示の自動相関関数は毎秒当り100の３２ワ
ード関数の速度でコンビユータに与えられる。例示の実施例においては、自動相関関数データ
の処理は適当にプログラムされた専用デイジタル
コンピユータによつて実行される。コンピユータ
プログラムによつて提供される機能を含むフロー
チヤートが第３図に示されている。しかしなが
ら、段階のいくつかはソフトウエアではなくてハ
ードウエアによつても実行でき、同様に第２図の
装置によつて実行される機能のあるものは第３図
のフローチヤートの対応する改訂によつてソフト
ウエアでも実行できるということを指摘してお
く。第２図のデイジタル相関器は瞬時基準で発生さ
れる自動相関関数のある時間平均化を実行するけ
れど、コンピユータに読出された平均自動相関関
数はサンプルの規則正しい処理および評価を干渉
する可能性のある若干の異常な不連続性または非
一様性を依然として含み得る。従つて、データの
各ブロツク、すなわち、各自動相関関数Ψ（ｊ，
ｔ）は初めに時間に関して平滑化される。これは
第３図のフローチヤートに７９で指示されてい
る。好ましい平滑化プロセスは平滑自動相関出力
Ψｓ（ｊ，ｔ）が次式(2)によつて与えられるもの
である。 Ψｓ（ｊ，ｔ）＝C₀Ψ（ｊ，ｔ）， C₁Ψ（ｊ，ｔ−Ｔ）＋C₂Ψ（ｊ，ｔ＋
Ｔ） (2) ここで、Ψ（ｊ，ｔ）は式(1)で定義された非平
滑入力自動相関であり、Ψｓ（ｊ，ｔ）は平滑自
動相関出力、ｊは遅延時間を示し、ｔは実時間を
示し、Ｔは連続的に発生される自動相関関数間の
時間期間（好ましい実施例においては0.01秒に等
しい）を示す。重みつき関数C₀，C₁，C₂は、他
の値も選択できるけれど、例外の実施例では1/
２，1/4，1/4であるように選択されることが好ま
しい。例えば20ヘルツの周波数カツトオフをもつ
ガウスのインパルス応答を近似する平滑化関数
は、例えば、コンピユータソフトウエアで実行で
きる。しかしながら、試験の結果、実施するのが
容易な例示の平滑化関数が満足な結果を与えるこ
とが分つた。指示されるように、平滑化関数は遅
延の各値ｊごとに別個に適用される。８１で指示されるように、コサインフーリエ変
換が各時間平滑された自動相関関数Ψｓ（ｊ，
ｔ）に対し適用され、３１の点パワースペクトル
を発生させる。このパワースペクトルは次式(3)の
ように定められる。ここで、Ｓ（ｆ，ｔ）は時間ｔにおけるｆヘル
ツ（Hz）を中心とする帯域におけるスペクトルエ
ネルギであり、Ｗ（ｊ）＝1/2（１＋cos２πｊ／６３）
はサイドロープを減少させるハミングウインド関数
であり、Ψｓ（ｊ，ｔ）は遅延ｊおよび時間ｔに
おける平滑自動相関関数であり、そしてｆ＝30＋1000（0.0552m＋0.438）・〓Hz (4) ｍ＝１，２，……，31 これは「メル（mel）」スケールのピツチで等
しく離間された周波数である。理解できるよう
に、これは約300〜3500Hzの代表的通信チヤンネ
ルの帯域幅における周波数に対する本質的なピツ
チ（メルスケール）周波数―軸間隔に対応する。
同じく理解できるように、各スペクトル内の各点
または値は対応する帯域の周波数を表わす。この
フーリエ変換は通常のコンビユータハードウエア
内で完全に実行できるけれど、このプロセスは外
部ハードウエアマルチプライヤまたは高速フーリ
エ変換（FFT）周辺装置が使用される場合に
は、相当に早くできる。かかるモジユールの構成
および動作はこの分野では周知であるので、ここ
では詳細に記載しない。スペクトルのそれぞれが
上に定義した好ましいハミングウインド重みつき
関数Ｗ（ｊ）に従つて周波数平滑される周波数平
滑化機能がハードウエア高速フーリエ変換周辺装
置に有益に組み込まれる。これはハードウエアフ
ーリエ変換インプリメンテーシヨンに対応するブ
ロツク８５においてブロツク８３で指示されてい
る。連続する平滑パワースペクトルが高速フーリエ
変換周辺装置８５から受信されると、通信チヤン
ネル等化関数が、後記するように、周辺装置８５
からの各到来ウインドパワースペクトルごとに
（一般的には異なる）ピークパワースペクトルを
決定し、それに応じて高速フーリエ変換装置の出
力を変更することによつて、得られる。到来ウイ
ンドパワースペクトルＳ（ｆ，ｔ）に対応する各
新しく発生されたピーク振幅スペクトルｙ（ｆ，
ｔ）（ｆはスペクトルの複数の周波数帯域にわた
つてインデツクスされる）はスペクトルチヤンネ
ルまたは帯域のそれぞれに対する急アタツク、緩
デイケイ、ピーク検出関数の結果である。ウイン
ドパワースペクトルは対応するピーク振幅スペク
トルのそれぞれの期間に関して正規化される。こ
れは８７で指示されている。例示の実施例によれば、新しいウインドスペク
トルを受信する前に決定される「古い」ピーク振
幅スペクトルｙ（ｆ，ｔ−Ｔ）の値は周波数帯域
―周波数帯域基準で新しい到来スペクトルＳ
（ｆ，ｔ）と比較される。新しいピークスペクト
ルｙ（ｆ，ｔ）は次の規則に従つて発生される。
「古い」ピーク振幅スペクトルの各帯域における
パワー振幅は一定の分数、例えば例示の例では
５１１／５１２、が乗算される。これはピーク検出関数
の緩デイケイ部分に対応する。到来スペクトルＳ
（ｆ，ｔ）の周波数帯域ｆにおけるパワー振幅が
減衰されたピーク振幅スペクトルの対応する周波
数帯域におけるパワー振幅より大きい場合には、
その（それらの）周波数帯域に対する減衰された
ピーク振幅スペクトル値は到来ウインドスペクト
ルの対応する帯域のスペクトル値と置換される。
これはピーク検出関数の急アタツク部分に対応す
る。数学的には、ピーク検出関数は次式(5)のよう
に表現できる。ｙ（ｆ，ｔ）＝max｛ｙ（ｆ，ｔ−Ｔ）・（１−Ｅ），Ｓ（ｆ，ｔ）｝ (5) ここで、ｆは周波数帯域のそれぞれにわたつて
インデツクスされ、ｙ（ｆ，ｔ）は結果のピーク
スペクトルであり、ｙ（ｆ，ｔ−Ｔ）は「古い」
すなわち前のピークスペクトルであり、Ｓ（ｆ，
ｔ）は新しい到来パワースペクトルであり、Ｅは
デイケイパラメータである。ピークスペクトルが
発生された後、結果のピーク振幅スペクトルは、
８９において、新しく発生されたピークスペクト
ルの隣接する周波数に対応するピーク値の各周波
数帯域ピーク値を平均化することによつて、周波
数平滑される。平均値に寄与する周波数の全体の
帯域の幅はフオルマント周波数間の代表的周波数
分離にほぼ等しい。スピーチ認識分野の技術者に
は理解されるように、この分離は1000Hz程度であ
る。この特定の方法で平均化することによつて、
スペクトル中の有用な情報、すなわち、フオルマ
ント共鳴を表わす局部変動、が保持され、他方周
波数スペクトルにおける全体的または総体的強勢
は抑制される。結果の平滑ピーク振幅スペクトル
ｙ（ｆ，ｔ）は、今受信したパワースペクトルＳ
（ｆ，ｔ）を、到来平滑スペクトルＳ（ｆ，ｔ）
の各周波数帯域の振幅値を、平滑ピークスペクト
ルｙ（ｆ，ｔ）の対応する周波数帯域値で割算す
ることによつて、正規化し、周波数等化するのに
使用される。数学的には、これは次式(6)に対応す
る。 Sn（ｆ，ｔ）＝Ｓ（ｆ，ｔ）／ｙ（ｆ，ｔ） (6) ここで、Sn（ｆ，ｔ）はピーク正規化平滑パ
ワースペクトルであり、ｆは周波数帯域のそれぞ
れにわたつてインデツクスされる。この段階は９
１で指示されている。かくして、到来音声信号の
周波数内容の変化を強調し、かつ一般化された長
期間の周波数強勢または歪みを抑制する周波数等
化された正規化された短期間パワースペクトルの
シーケンスが生じる。この周波数補償方法は電話
線のような周波数を歪ませる通信リンクを通じて
送信されるスピーチ信号の認識に、補償の基準が
全体の信号または各それぞれの周波数帯域におけ
る平均パワーレベルである通常の周波数補償シス
テムに比較して、非常に有益であることが分つ
た。引続くスペクトルが種々に処理され、等化され
たけれど、到来音声信号を表わすデータは依然と
して毎秒当り100の速度で生じるスペクトルから
なるということを指摘することは有益である。９１で指示された正規化、周波数等化スペクト
ルはスペクトル振幅値の非直線スケーリングを行
なう９３で指示された振幅変換を受ける。個々の
等化、正規化スペクトルをSn（ｆ，ｔ）（式(6)か
ら）と表わすと、ただしｆはスペクトルの異なる
周波数帯域をインデツクスし、ｔは実時間を示
す、非直線スケールのスペクトルｘ（ｆ，ｔ）は
次式（7A）で示す直線分数関数である。ｘ（ｆ，ｔ）＝Ｓｎ（ｆ，ｔ）−Ａ／Ｓｎ（ｆ，ｔ）
＋Ａ（7A）ここでＡは次のように定義されるスペクトル
Sn（ｆ，ｔ）の平均値である。ここでｆ_bはパワースペクトルの周波数帯域に
わたつてインデツクスする。このスケーリング関数は短期間平均Ａから大き
く逸脱するスペクトル強度に対するソフトスレシ
ホールドおよび漸次の飽和効果を生じさせる。数
学的には、平均に近い強度に対しては、関数はほ
ぼ直線であり、平均から離れた強度に対してはほ
ぼ対数的であり、強度の極端な値においては実質
的に一定である。対数スケールに関して、関数ｘ
（ｆ，ｔ）は零に関して対称的であり、この関数
は聴覚神経刺激速度関数を思わせるスレシホール
ドおよび飽和態様を呈する。実際には、全体の認
識システムはスペクトル振幅の直線または対数ス
ケーリングで実行するよりはこの特定の非直線ス
ケーリング関数で実行する方が非常に良好であ
る。かくして、振幅変換され、周波数応答等化さ
れ、正規化された短時間パワースペクトルｘ
（ｆ，ｔ）のシーケンスが発生される。ここでｔ
は、．01，．02，．03，．04，…秒に等しく、またｆ
＝１、…、31（発生されたパワースペクトルの周
波数帯域に対応する）。32ワードが各スペクトル
に対して提供され、またＡ（式7B）の値、すな
わちスペクトル値の平均値、は32ワードで記憶さ
れる。振幅変換された短期間パワースペクトル
は、95で指示されるように、例示の実施例では
256の32ワードスペクトルに対する記憶容量を有
する先入れ先出し（FIFO）循環メモリに記憶さ
れる。かくして、2.56秒の音声入力信号が分析の
ために利用できる。この記憶容量は分析および評
価のために異なる実時間においてスペクトルを選
択するのに必要な融通性を有する、従つて分析が
必要であるときに時間的に前進および後退する能
力を有する認識システムを提供する。かくして、最新の2.56秒の間の振幅変換された
パワースペクトルは循環メモリに記憶され、必要
なときに利用できる。動作において、例示の実施
例では、各振幅変換されたパワースペクトルが
2.56秒の間記憶される。従つて時間t₁において循
環メモリに入るスペクトルは、時間t₁＋2.56に対
応して新しい振幅変換されたスペクトルが記憶さ
れるので、2.56秒後でメモリから失なわれるまた
はシフトされる。循環メモリを通る変換され等化された短期間パ
ワースペクトルは、好ましくは実時間で、既知の
ボキヤブラリーのキーワードと比較され、連続す
る音声データ中のこれらキーワードを検出または
選択する。各ボキヤブラリーキーワードは複数の
非重複マルチフレーム（好ましくは３つのスペク
トル）の目的パターン（デザイン設定パターンと
称される）に形成された複数の処理されたパワー
スペクトルを統計的に表わすテンプレートパター
ンによつて表わされる。これらパターンはキーワ
ードの重要な音響的事象を最良に表わすように選
択されることが好ましい。デザイン設定パターンを形成するスペクトルは
第３図に示すようにライン１０上の連続する未知
のスピーチ入力を処理するために上記した同じシ
ステムを使用して種々の文脈で話されたキーワー
ドに対して発生される。従つて、ボキヤブラリー中の各キーワードはそ
れと関連して、短期間パワースペクトルのある領
域において第ｉ番目のキーワードの１つの指示を
表わす一般的に複数のシーケンスのデザイン設定
パターンＰ（ｉ）_１，Ｐ（ｉ）_２，……を有す
る。各キーワードに対するデザイン設定パターン
の収集は目的パターンが発生される統計的基準を
形成する。本発明の例示の実施例においては、デザイン設
定パターンＰ（ｉ）ｊはそれぞれ直列順序に配列
された３つの選択された短期間パワースペクトル
からなる96のエレメント配列とみなし得る。パタ
ーンを形成するパワースペクトルは好ましくは時
間領域平滑化によるにせの相関をさけるために少
なくとも30ミリ秒離間されるべきである。本発明
の他の実施例においては、他のサンプリング手法
がスペクトルを選択するために実施できる。しか
しながら好ましい手法は一定継続時間、好ましく
は30ミリ秒離間されたスペクトルを選択し、キー
ワードを形成する時間期間中ずつと非重複デザイ
ン設定パターンを離間させることである。従つ
て、第１のデサイン設定パターンP₁はキーワード
の始めに近い部分に対応し、第２のパターンP₂は
時間的に遅れた部分に対応し、以下同様であり、
そしてこれらパターンP₁，P₂，……は到来音声デ
ータが合わされる直列のすなわちシーケンスの目
的パターン、すなわちキーワードテンプレート、
に対する統計的基準を形成する。目的パターン
t₁，t₂，……，は、Ｐ（ｉ）ｊが後で定義される
選択されたマルチフレームパターンと目的パター
ンとの間に尤度統計量を発生させるようにする独
立のガウスの変数より構成されると仮定すると、
それぞれ統計的データよりなる。従つて、目的パ
ターンはエントリがデザイン設定パターン配列エ
ントリの対応する収集に対する平均の標準偏差お
よび領域正規化フアクタよりなる配列からなる。
より正確な尤度統計量については後記する。実質的にすべてのキーワードが１つ以上の文脈
上のおよび／または領域の発音を有し、従つて１
つ以上のデザイン設定パターンの「スペル」を有
することはこの分野の技術者には明らかである。
従つて、上記したパターン化スペルP₁，P₂，……
を有するキーワードは現実にはＰ（ｉ）_１，Ｐ
（ｉ）_２，……ｉ＝１，２，……，Ｍとして一般
的に表現できる。ここで、Ｐ（ｉ）ｊのそれぞれ
はデザイン設定パターンの第ｊ番目のクラスの可
能な代りの記述であり、キーワードに対して合計
でＭの異なるスペクトルが存在する。従つて、最も一般的意味において、目的パター
ンt₁，t₂，……ｔ_i，……はそれぞれデザイン設定
パターンの第ｉ番目のグループまたはクラスに対
する複数の代りの統計的スペルを表わす。この中
で記載する例示の実施例においては、用語「目的
パターン」が最も一般的意味において使用され、
従つて各目的パターンは１つ以上の差しつかえの
ない代りの「統計的スペル」を有し得る。記憶されたスペクトルの処理到来連続音声データを表わす95における記憶さ
れたスペクトルは次の方法に従つてボキヤブラリ
ーのキーワードを表わす96で指示された目的パタ
ーンの記憶されたテンプレートと比較される。各
引続く変換された周波数応答等化スペクトルはマ
ルチフレームパターンの第１のスペクトル部分、
ここでは96エレメントベクトルに対応する３つの
スペクトルパターン、として取扱われる。例示の
実施例においては、パターンの第２および第３の
スペクトル部分は30および60ミリ秒遅れて（実時
間で）生じるスペクトルに対応する。97で指示さ
れた結果のパターンにおいて、第１の選択された
スペクトルはベクトルの第１の32のエレメントを
形成し、第２の選択されたスペクトルはベクトル
の第２の32のエレメントを形成し、第３の選択さ
れたスペクトルはベクトルの第３の32のエレメン
トを形成する。好ましくは、このように形成された各マルチフ
レームパターンは、相互相関を減じ、デイメンシ
ヨナリテイを減少させ、かつ目的パターンクラス
間の分離を強めるため、次の方法に従つて変換さ
れる。これは99で指示されている。例示の実施例
における変換パターンは、変換パターンが目的パ
ターンと合致する可能性の尺度を計算する100で
指示された統計的尤度計算に対する入力として供
給される。パターン変換まずパターン変換を考察し、行列表示を使用す
ると、各マルチフレームパターンは96×１列ベク
トルｘ＝（x₁＋x₂，……，x₉₆）によつて表わすこ
とができる。ここで、x₁，x₂，……，x₃₂はパタ
ーンの第１のスペクトルフレームのエレメントｘ
（ｆ，t₁）であり、x₃₃，x₃₄，……，x₆₄はパターン
の第２のスペクトルフレームのエレメントｘ
（ｆ，t₂）であり、x₆₅，x₆₆，……，x₉₆は第３のス
ペクトルフレームのエレメントｘ（ｆ，t₃）であ
る。実験的にベクトルｘのエレメントｘ_iの大部
分はそれらの平均値のまわりに対称的に集合する
確率分布を有することが観察され、従つてガウス
の確率密度関数が特定の目的パターンに対応する
デザイン設定パターンの特定の収集からのサンプ
ルにわたつている各ｘ_iの分布にぴつたりと合
う。しかしながら、多くの対のエレメントｘ_i，
ｘ_jは非常に相関付けられており、従つてｘのエ
レメントが相互に独立で、相関付けられていない
という仮定は是認されない。その上、マルチフレ
ームパターンにおける異なるフレームから生じる
エレメント間の相関は入力スピーチ信号における
フオルマント共鳴の運動の方向に関する情報を運
び、この情報はたとえフオルマント共鳴の平均周
波数が例えば話し手から話し手へで変化しても、
相対的に一定にとどまる。周知知のように、フオ
ルマント共鳴周波数の運動の方向は人間のスピー
チ認識にとつて重要な手掛りである。周知のように、ｘのエレメント中の相互相関の
影響は多変量ガウス対数尤度統計量を使用するこ
とによつて考慮できる。 −Ｌ＝１／２（ｘ−）K^-1（ｘ−）^t＋１／２ln‖
Ｋ‖ （8A）ここではｘのサンプル平均であり、Ｋは次式
（8B）によつて定義されるｘのすべての対のエレ
メント間のサンプル共分散の行列であり、‖Ｋ‖
は行列Ｋの行列式を表わす。Ｋ_ij＝（ｘ_i−_i）（ｘ_j−_j）（8B）共分散行列Ｋは周知の方法によつて固有ベクトル
表示に分解できる。Ｋ＝EVE^t （8C）ここでＥはＫの固有ベクトルｅ_iの行列であ
り、ＶはＫの固有値ｖ_iの対角線行列である。こ
れら量は次の関係によつて定義される。 Ke_i ^t＝ｖ_iｅ_i ^t （8D）行列Ｅによる乗算はベクトルｘが表わされる96
次元の空間における厳密な回転に対応する。今、
変換ベクトルｗが式（8E）と定義されるなら
ば、ｗ＝Ｅ（ｘ−）^t （8E）尤度統計量は次式（8F）のように書きかえるこ
とができる。各固有値ｖ_iは固有ベクトルｅ_iの方向において
測定されたランダムベクトルｘの統計的分散であ
る。パラメータＫ_ijおよび_iは、例示の実施例にお
いては、指示された統計的関係のそれぞれごと
に、複数の観察されたデザイン設定サンプルにわ
たつて、形成されたマルチフレームパターンを平
均化することによつて、決定される。この手続き
はＫ_ijおよびｘ_iの期待値の統計的推定を形成す
る。しかしながら、推定される独立のパラメータ
の数は（96の平均値）＋96×９７／２＝4656の共分散である。１つの目的パターンに対して数百以上のデ
ザイン設定パターンサンプルを収集することは実
行不可能であるから、統計的パラメータ当りの達
成し得る数のサンプル観察は明らかに非常に少な
い。不十分なサンブルの大きさの影響はパラメー
タ推定における偶然変動が推定されるパラメータ
に匹敵するということである。これら比較的大き
な変動は式（8F）に基づくデシジヨンプロセツ
サの分類精度に関して強力な統計的偏りを誘導
し、従つてプロセツサは高精度でそれ自身のデザ
イン設定パターンからサンプルを分類することが
できるかも知れないが、未知のデータサンプルに
ついて測定した性能は非常に貧弱である。推定される統計的パラメータの数を減じること
によつて小さなサンプル偏りの影響が減少するこ
とは周知である。このため、次の方法が通常使用
され、統計的ランダムベクトルの次元性を減じて
いる。上に定義した固有ベクトルｅ_iはそれらの
関連する固有値ｖ_iの階数を減ずることによつて
ランクされた固有ベクトルｅ^rのランクされた行
列Ｅ^rを形成するようにランクされ、その結果ｅ^ｒ _１
は最大分散ｖ^ｒ _ｉおよびｖ^ｒ _ｉ＋１ｖ^ｒ _１の方向であ
る。ベ
クトルｘ−は式（8E）におけるようにベクト
ルｗに変換されるが、（ランクされた行列Ｅ^rを使
用して）、しかしｗの第１のｐエレメントだけが
パターンベクトルｘを表わすために使用される。
ときどき「主成分分析」と称されるこの表現にお
いて、推定される統計的パラメータの有効数は
4656の代りに96pの程度である。パターンを分類
するため、尤度統計量Ｌは、合計することが１か
ら96までの代りに１からｐまでであることを除
き、式（8F）におけるように計算される。主成
分分析法を実際のデータに適用すると、プロセツ
サの分類精度は精度が最大であるｐの臨界値まで
はｐが増加するにつれて増大し、その後精度は上
記した貧弱な性能がｐ＝96で観察されるまでｐが
増加するにつれて減少することが観察される。
（第４図のグラフ(a)およびグラフ(b)を参照）。主成分分析法によつて達成される最大分類精度
は小さなサンプル統計的偏りの影響によつて依然
として制限され、必要な成分の数、または次元の
数はデータを表わすのに実際に必要であると予期
するものより非常に多い。さらに、デザイン設定
パターンサンプルに対する性能は、広範囲のｐに
わたつて、未知のサンプルに対する性能より実際
にずつと悪い。後者の２つの影響の源は、変換ベクトルｗのｐ
成分によつてサンブル空間を表わすことにより残
りの96のｐ成分が尤度統計量に寄与しないという
ことにある。パターンサンプルの大部分が見出さ
れる領域が記載されたが、しかし殆んどサンプル
が生じない領域は記載しない。後者の領域は確率
分布の尾部に対応し、従つて異なる目的パターン
クラス間のオーバーラツプする領域に対応する。
従つて、従来技術の方法は最もむずかしい分類判
断を行なうのにまさに必要な情報を除いている。
不幸にも、これらオーバーラツプ領域は高次元性
を有し、従つて上の議論を逆にし、例えば分散ｖ
_iが最大でなくて最小であるｗの少数の成分を使
用することは実行不可能である。本発明によれば、使用されない成分ｗ_p+1，…
…，w₉₆の影響は次の態様における再構成統計量
Ｒによつて推定される。Ｌに対する式（式8F）
から消えた項は成分ｗ_iの平方を含み、それぞれ
はその分散ｖ_iに従つて重み付けされている。こ
れらすべての分散は一定パラメータｃによつて近
似でき、このパラメータｃはくくり出すことがで
きる。従つて、右側の合計はベクトルw′のユークリツド的ノル
ム（長さ）の丁度平方である。ここで、 w′＝（ｗ_p+1，…，w₉₆）（8H）ベクトルｗ^pを次式で定義する。ｗ^p＝（w₁，…，ｗ_p）（8I）しかるときは、何故ならば、ベクトルｗ，w′およびｗ^pは直角
三角形を形成するように翻訳できるからである。
固有ベクトル行列Ｅは直交変換をもたらし、従つ
てｗの長さはｘ−の長さと同じである。それ
故、ｗのすべての成分を計算する必要はない。使
用されない成分の対数尤度関数Ｌに与える影響を
推定する、追求される統計量は次のようになる。これは観察されたベクトルｘ−と、ｘ−を
Ｋの第１のｐの固有ベクトルｅ_iの一次結合とし
て再構成しようとすることによつて得られるベク
トル間の差の長さである。従つて、Ｒは再構成誤
差統計量の性格を有する。尤度関数においてＲを
使用するため、変換ベクトル成分の組にＲが単に
加えられ、独立のガウス成分を有すると仮定され
る新しいランダムベクトル（w₁，w₂，…，ｗ_p，
Ｒ）を生じさせる。この仮定のもとで、新しい尤
度統計量は次式のようになる。ここでＭ＝１／２（Ｒ−Ｒ）^２／ｖａｒ（Ｒ）＋１／２lnv
ar（Ｒ）（8M）バーを付した変数はサンプル平均であり、var
（）は偏りのないサンプル分散を表す。式8Lに
おいて_iの値は零であるべきであり、また、var
（ｗ_i）はｖ_iに等しい筈である。しかしながら、固
有ベクトルは無限の算術精度で計算または適用す
ることができないから、サンプル平均および分散
を変換後再測定し、算術丸め誤差によつて生じる
システムの統計的偏りを減じるのが最良である。
この注意は式（8F）にも当てはまる。同じ最大尤度判断プロセツサにおける尤度統計
量L′の測定された性能は第４図にグラフ(c)および
(d)としてプロツトされている。ｐが増加するにつ
れ、分類精度は最大値に達するが、今回は次元の
非常に小さい数ｐにおいて達することが分る。そ
の上、達成された最大精度は単に再構成誤差Ｒが
ない点だけで相違する統計量Ｌに対するものより
著るしく高い。再構成誤差統計量Ｒの効力の別の試験として、
同じ実験が再び繰返されたが、しかしこのときは
使用された尤度関係は単に次式であつた。 L″＝−Ｍ（8N）すなわち、このときはサンプルデータの大部分
が存在する領域が無視され、比較的少ないサンプ
ルが見出される領域が描写された。得られた最大
精度（第４図のグラフ(e)および(f)）は統計量L′に
対するものと殆んど同じ高さであり、そして最大
値は次元の一層少ない数、ｐ＝３、において生じ
る。この結果は、Ｋの第１のｐの固有ベクトルの
空間に存在する任意のデータサンプルが目的パタ
ーンクラスに属するものと容認でき、かつその空
間内の詳細な確率推定を行なうことによつて得ら
れる利益は殆んどまたは全くないということを意
味すると解釈できる。統計的尤度計算形成されたマルチフレームパターンｘに対応す
る変換データｗ_iは入力として統計的尤度計算に
供給される。上記したように、このプロセツサ
は、連続的に与えられ、変換されたマルチフレー
ムパターンによつて表わされる未知の入力スピー
チが機械のボキヤブラリーのキーワードテンプレ
ートの目的パターンのそれぞれと合致する確率の
尺度を計算する。代表的には、目的パターンの各
データは僅かに非対称の確率密度を有するが、し
かしそれにも拘わらず平均値_iおよび分散var
（ｗ_i）を有する正規分布によつて統計的に十分に
近似される。ここでｉは第ｋ番目の目的パターン
のエレメントの逐次指示である。このプロセスの
最も簡単な実現はｉおよびｋの異なる値と関連し
たデータが相関付けられてなく、従つて目的パタ
ーンｋに属するデータｘに対する総合確率密度が
次式である（対数的に）と仮定する。対数は単調関数であるから、この統計量はキー
ワードテンプレートの任意の１つの目的パターン
と合致する確率がある他のボキヤブラリーの目的
パターンと合致する確率より高いかまたは低いか
を決定するのに、あるいは特定のパターンと合致
する確率があらかじめ定められた最小レベルを越
えるか否かを決定するのに、十分である。各入力
マルチフレームパターンはボキヤブラリーのキー
ワードテンプレートの目的パターンの全部に対し
て計算されたその統計的尤度Ｌ（ｔ｜ｋ）を有す
る。結果の尤度統計量Ｌ（ｔ｜ｋ）は時間ｔにお
けるｋと称される目的パターンの発生の相対的尤
度と解釈される。この分野の技術者には十分に理解されるよう
に、これら尤度統計量のランク付けは、単一の目
的パターンから実行され得る限り、スピーチ認識
を構成する。これら尤度統計量は実行されるべき
最終の関数に依存して、全体のシステムにおいて
種々の態様で使用できる。候補キーワードの選択本発明の好ましい実施例によれば、任意の第１
の目的パターンに関するマルチフレームパターン
の尤度統計量があらかじめ定められたスレシホー
ルドを越えるならば、比較が101，103で指示され
ており、到来データは、第１に、指示された第１
の目的パターンに対応する尤度統計量に対する局
部最大値を決定するために、第２に、選択された
可能性のある候補キーワードの他のパターンに対
応する他のマルチフレームパターンが存在するか
否かを決定するために、さらに調査される。これ
は105で指示されている。従つて、新たに形成さ
れたマルチスペクトルフレームをすべての第１の
目的パターンに対して繰返し試験するプロセスは
中断され、第１のマルチフレームパターンの後で
生じる、統計的尤度の意味において可能性のある
候補キーワードの次の（第２の）目的パターンに
最も対応するパターンに対し、探索が始まる。第２の目的パターンに対応する第２のマルチフ
レームパターンがプリセツト時間ウインド内に検
出されない場合には、探索シーケンスは終了し、
可能性のある候補キーワードを識別した第１のマ
ルチフレームパターンの終了直後の時間に認識プ
ロセスが再び始まる。かくして、第１のマルチフ
レームパターンが必要なスレシホールドより大き
い尤度スコアをもたらした後、タイミングウイン
ドが提供され、この時間内に選択された可能性の
ある候補キーワードに順次対応する次の目的パタ
ーンと合致するパターンが現われねばならない。タイミングウインドは、例えば特定の可能性の
ある候補キーワードの音声上のセグメントの継続
時間に依存して、変化してもよい。このプロセスは、(1)マルチフレームパターンが
キーワードテンプレートの目的パターンの全部に
対する到来データ中に識別されるまで、あるいは
(2)目的パターンが許容されたタイミングウインド
内で生じる任意のパターンと関連し得なくなるま
で、続く。探索が状態(2)によつて終了される場合
には、新しい第１のスペクトルフレームに対する
探索が第１の前に識別されたマルチフレームパタ
ーンの終了に続く次のスペクトルにおいて、上記
したように、新たに始まる。この処理レベルにおいて、目標は目的パターン
に対応する可能なマルチフレームパターンを連結
し、候補ワードを形成することである。（これは
107で指示されている。）それ故、検出スレシホー
ルドはゆるく設定され、その結果正しいマルチフ
レームパターンが拒絶されるということ、そし
て、ここでは、この音響的処理レベルにおいて、
正して検出と誤りアラーム間の識別が主として、
複数のパターン事象が一緒に検出されねばならな
い要件によつて得られること、は極めて可能性が
ない。ポストデシジヨン処理音響的レベルにおける処理は到来音声信号が終
了するまでこの態様で継続する。しかしながら、
たとえキーワードが上記した尤度確率試験を使用
して識別された後でさえ、追加のポストデシジヨ
ン処理試験（109で指示された）が使用され、正
しい検出の確率をできるだけ高く維持しながら不
適正なキーワードを選択する尤度を減少させる
（すなわち、誤りアラーム率を減少させる）こと
が好ましい。この理由のため、音響的レベルプロ
セツサの出力、すなわち、連結プロセスによつて
選択された候補ワード、が韻律学的相対タイミン
グウインドのマスクおよび／またはすべての目的
パターンクラスに関係する音響的レベルプロセツ
サからの情報を使用する尤度比検定法によつてさ
らにフイルタされる。韻律学的マスク上記したように、尤度統計量の決定中、能動目
的パターンに対する尤度統計量の局部ピーク値を
有するマルチフレームパターンの発生の時間が見
出され、好ましい実施例においては候補キーワー
ドの複数の連続する目的パターンに対応する選択
されたパターンのそれぞれごとに記録される。各
候補キーワードに対するこれら時間pt₁，pt₂，…
…，pt_oは分析され、そのキーワードに対するあ
らかじめ定められた韻律学的マスクに従つて評価
され、引続くパターン尤度ピーク間の時間間隔が
あらかじめ定められた基準に合致するか否かを決
定する。この方法によれば、尤度統計量のピーク
値の時間と時間との間の経過時間、すなわちｉ＝
２，３，…，ｎに対してはpt_i−pt_i-1、が各経過
時間期間をpt_o−pt₁で割算することによつて初め
に正規化される。結果の正規化された期間は候補
キーワードに対する韻律学的マスク、すなわち、
許容範囲の正規化期間長のシーケンス、と比較さ
れ、これら期間長が選択された範囲内に入る場合
には、候補ワードが受け入れられる。例示の実施例においては、韻律学的マスクタイ
ミングウインドはできるだけ多数の異なる話し手
によつて話されたサンプルキーワードに対する経
過時間を測定することによつて決定される。韻律
学的パターンは、次に、各韻律学的マスク（各キ
ーワードに対応する）に対する平均および標準偏
差がキーワードデザイン設定パターンサンプルか
ら引き出される統計的計算を使用して、統計的サ
ンプルキーワード時間と比較される。その後、尤
度統計量が候補キーワードに関して受け入れるべ
きか否かを、従つて候補キーワードに関して最終
決定を与えるべきか否かを、決定するために計算
される。この尤度統計量は事象のタイミングに関
するものであり、目的パターンに関するマルチフ
レームパターンに適用される尤度統計量と混同す
るべきではない。本発明の他の実施例においては、正規化期間の
範囲がゆるく設定されるが、しかし変えられない
ように固定される。この実施例においては、正規
化時間期間が固定のウインド境界内に入る場合に
のみ、候補キーワードが受け入れられる。従つ
て、候補キーワードは、正規化時間のそれぞれが
設定限界内に入る場合にのみ、受け入れられる。ワードレベル尤度比検定法本発明の好ましい実施例においては、各候補ワ
ードがまた、キーワードを受け入れるべき最終決
定がなされる前に尤度比検定法に従つて、試験さ
れる。この尤度比検定法は候補キーワードと識別
された、選択されたマルチフレームパターンのシ
ーケンスについて良さの指数を合算することによ
りなる。各マルチフレームパターンに対する良さ
の指数の和である累算された良さの指数は判断ス
レシホールド値と比較される。検出されたマルチフレームパターンに対する良
さの指数はキーワードボキヤブラリーにおける任
意の目的パターンに対する最良の対数尤度統計量
と目的パターンに対する選択を可能にされたパタ
ーンに対する最良の一般的スコアとの間の差であ
る。従つて、最良スコアの目的パターンが追求さ
れるパターンに対する正当な代替物である場合に
は、良さの指数は値零である。しかしながら、最
良のスコアが選択された候補ワード目的パターン
に対する代替物のリスト中にない目的パターンに
対応するならば（与えられた目的パターンがアク
セント、等に依存していくつかの統計的スペルを
持つ可能性がある）、良さの指数は最良のスコア
と代替物のリスト中に現われたもののうちの最良
のものとの差である。判断スレシホールドはミス
をした検出と誤りアラーム率との間に最良のバラ
ンスを得るように最適に設定される。ワードレベル尤度比検定法を数学的見地から考
察すると、ランダムマルチフレームパターンｘが
生じる確率は、入力スピーチが目的パターンクラ
スｋに対応すると仮定すると、ｐ（ｘ｜ｋ）に等
しく、「ｋに関するｘの確率」と読む。第ｋ番目
の基準パターンに対する入力ｘの対数尤度統計量
はＬ（ｘ｜ｋ）であり、式(9)によつて定義された
ようにlnp（ｘ，ｋ）に等しい。検出されたマル
チフレームパターンがｎのあらかじめ定められた
目的パターンクラスのグループのうちの１つによ
つてもたらされねばならないと仮定し、またこれ
らクラスが等しい頻度で生じるかあるいはｎの可
能な選択が等しく有効であると考えられると仮定
すると、任意の場合において事象ｘを観察する確
率は、相対発生頻度の意味において、次式によつ
て定められる確率密度の和である。これら発生について、与えられたクラスに帰因
する割合、ｐ（ｋ｜ｘ）は次の通りである。あるいは対数的には、デシジヨンプロセツサがｘを与え、ある理由の
ためにクラスｋを選ぶならば、上式（11A）また
は（11B）はその選択が正しいという確率を与え
る。上記式はベイズ（Bayes）の規則の結果であ
る。ｐ（ｘ，ｋ）＝ｐ（ｘ｜ｋ）ｐ（ｋ）＝ｐ（ｋ｜ｘ）ｐ（ｘ）ここでｐ（ｋ）は定数１／ｎと考えられる。１つのクラスだけが、例えばクラスｍだけが非
常に有望であると仮定すると、式（10）は次式に
よつて近似できる。また、次式が成立つ。 β（ｋ，ｍ，ｘ）＝Ｌ（ｘ｜ｋ） −Ｌ（ｘ｜ｍ）〓lnp（ｋ｜ｘ）（13）第ｋ番目のクラスが最も有望なものであるなら
ば、関数βはその最大値零をとるということを注
意すべきである。推定された独立のマルチフレー
ムパターンの組につついて合計すると、βの累算
された値は検出されたワードが誤りアラームでな
いという確率を推定する。従つて、βの累算値に
関する判断スレシホールドは検出および誤りアラ
ームの確率間の差引勘定に直接関係し、尤度比検
定法の基準である。βの累算値は候補キーワード
の良さの指数に対応する。このスピーチ認識方法を使用する実現されたシ
ステムについて簡単に記載する。前記したように、本発明の現在好ましい実施例
は、第２図のプレプロセツサによつて実行された
もの以上の信号およびデータ操作が米国特許願第
841390号に記載されたもののような専用プロセツ
サとの組合せで動作するデイジタル・エクイプメ
ント・コーポレーシヨンPDP―11コンピユータで
実施されかつ制御されるように構成されている。第３図のフローチヤートに関連して記載した機
能を提供する詳細なプログラムは特に必要である
とは思われないので添付しない。このプログラム
プリントアウトはデイジタル・エクイプメント・
コーポレーシヨンによつて提供されたそのPDP―
11コンピユータについてのマクロ―11（MACRO
―11）およびフオートラン言語、ならびに専用プ
ロセツサの機械語である。上記したことから、本発明のいくつかの目的は
達成され、他の有益な結果が得られたことは理解
できる。この中で記載した連続スピーチ認識方法は隔絶
されたスピーチ認識を特別の応用として含むとい
うことは理解されよう。加算、減算、削除および
記載した好ましい実施例の他の変更を含むこの中
で記載した連続スピーチ方法の他の応用はこの分
野の技術者には明らかであり、特許請求の範囲の
範囲内にあるものである。Equal to [expression]. By applying a likelihood statistic function to vector W, it is determined whether pattern y is the same as any one target pattern. In certain aspects of the invention, the method is further characterized by calculating a likelihood statistic according to one or other of the following equations. or L″=-1/2[(R-R) ² /var(R)+l _o va
r(R)] Here, the variable with a bar at the top is the sample mean, and var( ) is the unbiased sample variance. Other objects, features, and advantages of the invention will become apparent from the following description of preferred embodiments, taken in conjunction with the accompanying drawings. BRIEF DESCRIPTION OF THE DRAWINGS Preferred embodiments of the invention will now be described in detail with reference to the accompanying drawings in which corresponding reference characters indicate corresponding parts throughout the drawings. In certain preferred embodiments described herein, speech recognition involves specially configured electronic systems for performing certain analog and digital processing on incoming audio data signals, typically speech, and certain other data signals. The entire apparatus includes both a general purpose digital computer programmed in accordance with the present invention to perform the conversion steps and the numerical measurements. The division of tasks between the hardware and software parts of this system is done in such a way as to obtain an overall system capable of achieving speech recognition in real time at a moderate cost. However, some of the tasks performed in hardware in this particular system could also be performed satisfactorily in software, and some of the tasks performed by software programming in this example are It should be understood that different embodiments of the invention can also be implemented with dedicated circuitry. As mentioned above, one aspect of the present invention is the provision of an apparatus for recognizing keywords in continuous speech signals even though the signals are distorted by, for example, telephone lines. Thus, with particular reference to FIG. 1, the voice input signal over the line designated 10 is generated by a carbon telephone transmitter and transmitted over a telephone line spanning any distance, or over a telephone line connecting multiple exchanges. It can be regarded as a received audio signal.
A typical application of the invention is therefore the recognition of keywords in voice data from unknown sources received over a telephone system. On the other hand, the input signal may be any audio data signal, such as an audio input signal, taken, for example, from a wireless telecommunications link, from a commercial broadcast station, or from a private communications link. As will be apparent from the above description, the present method and apparatus relate to the recognition of speech signals containing sequences of sounds or phonemes or other recognizable markings. The terms "keyword,""sequence of target patterns,""templatepattern," or "keyword template" are used herein and in the claims, but these four terms are used collectively. and is considered to be consent. This is a convenient way of representing recognizable sequences of audio sounds, or representations thereof, that are detectable by the present method and apparatus. These terms should be interpreted broadly and generally to include everything from a single phoneme, syllable, or sound to a sequence of words (in the grammatical sense) as well as a single word. Analog-to-digital (A/D) converter 13 receives incoming analog audio signal data on line 10 and converts the signal amplitude of the incoming data to digital form. The exemplary A/D converter is designed to convert input signal data to a 12-bit binary representation, and this conversion occurs at a rate of 8000 conversions per second. A/D converter 13 provides its output via line 15 to autocorrelator 17. Autocorrelator 17 processes the digital input signal to generate short term autocorrelation functions 100 times per second and provides its output over line 19 as directed. Each autocorrelation function contains 32 values or channels, and each value is computed to a resolution of 30 bits.
The autocorrelator will be described in detail below with reference to FIG. The autocorrelation function through line 19 is Fourier transformed by Fourier transform device 21 and the corresponding short-term wind power spectrum is converted into line 23.
Get on top. The spectra are generated at the same repetition rate as the autocorrelation function, ie, 100 times per second, and each short-term power spectrum has 31 numerical terms, each with a resolution of 16 bits. As will be appreciated, each of the 31 terms in the spectrum represents signal power within a certain frequency band. The Fourier transform device also preferably includes a Hamming or similar window function to reduce spurious adjacent band responses. In the illustrated embodiment, the Fourier transform as well as subsequent processing steps are carried out under the control of a suitably programmed general purpose digital computer using a peripheral array processor to speed up the arithmetic operations repeatedly required in accordance with the method. is executed using The particular combinator used was a model PDP-11 manufactured by Digital Equipment Corporation, USA. The specific sequence processor used is described in US Patent Application No. 841,390. The programming described below with reference to FIG. 3 is substantially based on the capabilities and characteristics of these commercially available digital processing units. The short-term wind power spectrum is frequency response equalized, as indicated at 25, and this equalization is performed as a function of the peak amplitude occurring in each frequency band or channel as described in detail below. Frequency response equalized spectra are generated on line 26 at a rate of 100 per second, each spectrum having 31 numerical terms evaluated to 16 bit accuracy. To facilitate the final evaluation of the incoming audio data, the frequency response equivalent window spectrum in line 26 is subjected to an amplitude transformation, as indicated at 35, which imposes a non-linear amplitude transformation on the incoming spectrum. This transformation will be described in detail later, but it is worth mentioning here that it improves the accuracy with which the unknown incoming speech signal can be matched with the keywords of the reference vocabulary. In the exemplary embodiment, this transformation is performed on the entire frequency response equalization wind spectrum at a time prior to comparing the spectrum with keyword templates representing keywords in the reference vocabulary. The amplitude transformed equalized short term spectrum on line 38 is then compared to the keyword template at 40. The keyword template designated 42 represents the keywords of the reference vocabulary in the spectral patterns to which the transformed equalized spectra can be compared. Candidate keywords are thus selected according to the closeness of the comparison. In an exemplary embodiment, the selection process is designed to minimize the likelihood of missing keywords while rejecting grossly inappropriate pattern sequences. Candidate words (and corresponding accumulated statistics regarding the incoming data) are provided at 46 over line 44 for post-decision processing to reduce the false alarm rate. The final verdict is 4
8 is indicated. Post-decision processing, including the use of prosodic masks and/or sound level likelihood ratio tests, improves discrimination between correct detections and false alarms, as described in detail below. Preprocessor In the apparatus shown in FIG. 2, an autocorrelation function with its inherent averaging is applied to the incoming analog audio data over line 10, generally the digital data generated by an analog-to-digital converter 13 from the audio signal. Performed digitally on the stream. Converter 13 provides a digital input signal over line 15. The digital processing functions as well as the input analog-to-digital conversion are timed under the control of clock oscillator 51. The clock oscillator provides a basic timing signal of 256,000 pulses per second, which is passed through frequency divider 52.
to obtain a second timing signal of 8000 pulses per second. This slower timing signal controls analog-to-digital converter 13 as well as latch register 53. Latch register 53 holds the 12-bit result of the immediately previous conversion until the next conversion is completed. The automatic correlation product is the number contained in register 53 and 32
is generated by digital multiplier 56 which multiplies the output of word shift register 58. Shift register 58 is operated in recirculation mode and is driven by the earlier clock frequency;
As a result, one complete rotation of shift register data is achieved for each analog-to-digital conversion. The input to shift register 58 is taken from register 53 once during each complete cycle. One input to digital multiplier 56 is taken directly from latch register 53, while the other input to multiplier 56 is taken from the current output of the shift register via multiplexer 59 (with one exception, described below). . These multiplications are performed at the higher clock frequency. Thus, each value obtained from the A/D conversion is multiplied by each of the 31 preceding conversion values. As will be understood by those skilled in the art, the signal thereby generated is the time-delayed input signal itself plus 32 different time increments (of which 1
(one has zero delay). To create a zero delay correlation, ie, the power of the signal, multiplexer 59 causes the current value of latch register 53 to be multiplied by itself as each new value is introduced into the shift register. This timing function is designated at 60. As also understood by those skilled in the art, the product from a single transform and its 31 predecessors does not adequately represent the energy distribution or spectrum over a reasonable sampling period. The apparatus of FIG. 2 therefore provides an averaged version of the products of these sets. The accumulation process that performs the averaging is interconnected with adders 65 to form a set of 32 accumulators 32
Provided by word shift register 63.
Thus, each word can be recirculated after being added to the corresponding increment from the digital multiplier. The circular loop passes through a gate 67 controlled by a 1/N divider circuit 69. The 1/N divider circuit 69 is driven by a low frequency clock signal. The divider circuit 69 is accumulated and therefore the shift register 6
Divide the low frequency clock by a factor that determines the number of instantaneous autocorrelation functions that are averaged before 3 is read out. In the illustrated example, 80 samples are accumulated before being read. In other words, N of the 1/N divider circuit 69 is equal to 80. After the 80 transform samples have been correlated and accumulated in this manner, the divider circuit 69 sends the computer interrupt circuit 71 to line 7.
Trigger through 2. At this time, the contents of the shift register 63 are transferred to the appropriate interface circuit 7.
The 32 consecutive words in the registers are presented to the computer via the interface circuit 73 in the ordered order. As will be understood by those skilled in the art, the transfer of this data from the peripheral unit, ie, the autocorrelator preprocessor, to the computer can typically be accomplished by direct memory access procedures. It can be seen that based on averaging 80 samples with an initial sampling rate of 8000 samples per second, 100 averaged autocorrelation functions are provided to the computer per second. While the contents of the shift register are being read to the computer, gate 67 is closed, so each of the words in the shift register is effectively reset to zero, allowing the accumulation process to begin again. The operation of the apparatus shown in FIG. 2 can be expressed in mathematical terms as follows. Assuming that an analog-to-digital converter generates a time series S(t), where t = 0, To, 2To, ..., and To is the sampling period (1/8000 seconds in the example embodiment) ), the exemplary digital correlation circuit of FIG. 2, ignoring start-up ambiguities, can be thought of as computing the autocorrelation function of equation (1): Here j = 0, 1, 2, ...31; t = 80To,
160To, . . . , 80nTo, . . . These autocorrelation functions correspond to the correlation outputs on line 19 in FIG. Referring to FIG. 3, the digital correlator operates continuously to transmit a series of data blocks to the computer at a rate of one complete autocorrelation function every 10 milliseconds. This is indicated by 77. Each block of data represents an autocorrelation function generated from a corresponding partial time. As mentioned above, the exemplary autocorrelation function is provided to the computer at a rate of 100 32 word functions per second. In the exemplary embodiment, processing of the autocorrelation function data is performed by a suitably programmed dedicated digital computer. A flowchart containing the functions provided by the computer program is shown in FIG. However, some of the steps can also be performed by hardware rather than software, and similarly some of the functions performed by the apparatus of FIG. I would like to point out that this can also be done using software. Although the digital correlator of Figure 2 performs time averaging with an autocorrelation function generated on an instantaneous basis, the average autocorrelation function read out to a computer can interfere with the orderly processing and evaluation of samples. It may still contain some unusual discontinuities or non-uniformities. Therefore, each block of data, i.e. each autocorrelation function Ψ(j,
t) is first smoothed in time. This is indicated at 79 in the flowchart of FIG. A preferred smoothing process is one in which the smoothed autocorrelation output Ψs(j,t) is given by the following equation (2). Ψs(j, t)=C ₀ Ψ(j, t), C ₁ Ψ(j, t-T)+C ₂ Ψ(j, t+
T) (2) Here, Ψ(j, t) is the unsmooth input autocorrelation defined by equation (1), Ψs(j, t) is the smoothed autocorrelation output, and j is the delay time, t indicates real time and T indicates the time period (equal to 0.01 seconds in the preferred embodiment) between successively generated autocorrelation functions. Although other values can be selected for the weighting functions C ₀ , C ₁ , and C ₂ , in the exceptional embodiment, 1/
Preferably, it is selected to be 2, 1/4, 1/4. A smoothing function that approximates a Gaussian impulse response with a frequency cutoff of, for example, 20 Hertz can be implemented in computer software, for example. However, testing has shown that the exemplary smoothing function, which is easy to implement, provides satisfactory results. As indicated, the smoothing function is applied separately for each value j of delay. 81, the cosine Fourier transform is applied to each time smoothed autocorrelation function Ψs(j,
t) to generate a 31 point power spectrum. This power spectrum is determined as shown in the following equation (3). Here, S(f,t) is the spectral energy in the band centered at f hertz (Hz) at time t, and W(j)=1/2(1+cos2πj/63)
is the Hamming window function that reduces the sidelobes, Ψs(j,t) is the smooth autocorrelation function at delay j and time t, and f=30+1000(0.0552m+0.438) 〓Hz (4) m= 1, 2, ..., 31 These are frequencies equally spaced by pitches on the "mel" scale. As can be appreciated, this corresponds to an essential pitch (mel scale) frequency-to-axis spacing for frequencies in the bandwidth of typical communication channels of approximately 300-3500 Hz.
As can also be understood, each point or value within each spectrum represents a corresponding band of frequencies. Although this Fourier transform can be performed entirely within normal computer hardware, the process can be significantly faster if an external hardware multiplier or Fast Fourier Transform (FFT) peripheral is used. The construction and operation of such modules are well known in the art and will not be described in detail here. A frequency smoothing function is advantageously incorporated into the hardware fast Fourier transform peripheral in which each of the spectra is frequency smoothed according to the preferred Hamming window weighted function W(j) defined above. This is indicated by block 83 in block 85 corresponding to the hardware Fourier transform implementation. Once a continuous smooth power spectrum is received from fast Fourier transform peripheral 85, a communication channel equalization function is applied to peripheral 85, as described below.
is obtained by determining a (typically different) peak power spectrum for each incoming wind power spectrum from , and changing the output of the fast Fourier transform device accordingly. Each newly generated peak amplitude spectrum y(f, t) corresponding to the incoming wind power spectrum S(f, t)
t) (where f is indexed over multiple frequency bands of the spectrum) is the result of a fast attack, slow decay, peak detection function for each of the spectral channels or bands. The wind power spectrum is normalized with respect to each period of the corresponding peak amplitude spectrum. This is indicated by 87. According to an exemplary embodiment, the value of the "old" peak amplitude spectrum y(f,t-T), determined before receiving the new wind spectrum, is the value of the new arriving spectrum S on a frequency band-frequency band basis.
(f, t). A new peak spectrum y(f,t) is generated according to the following rules.
The power amplitude in each band of the "old" peak amplitude spectrum is multiplied by a fixed fraction, for example 511/512 in the illustrative example. This corresponds to the slow decay part of the peak detection function. Arrival spectrum S
If the power amplitude in the frequency band f of (f, t) is larger than the power amplitude in the corresponding frequency band of the attenuated peak amplitude spectrum, then
The attenuated peak amplitude spectral values for that (those) frequency bands are replaced with the spectral values of the corresponding bands of the incoming wind spectrum.
This corresponds to the sudden attack part of the peak detection function. Mathematically, the peak detection function can be expressed as the following equation (5). y(f,t)=max{y(f,t-T)・(1-E),S(f,t)} (5) where f is indexed over each of the frequency bands and y (f,t) is the resulting peak spectrum and y(f,t-T) is the "old"
That is, it is the previous peak spectrum, and S(f,
t) is the new incoming power spectrum and E is the Decay parameter. After the peak spectrum is generated, the resulting peak amplitude spectrum is
At 89, the newly generated peak spectrum is frequency smoothed by averaging each frequency band peak value of the peak values corresponding to adjacent frequencies. The width of the overall band of frequencies contributing to the average value is approximately equal to the representative frequency separation between the formant frequencies. As understood by those in the speech recognition field, this separation is on the order of 1000Hz. By averaging in this particular way,
Useful information in the spectrum, ie local fluctuations representing formant resonances, is preserved, while the global or global emphasis in the frequency spectrum is suppressed. The resulting smoothed peak amplitude spectrum y(f,t) is equal to the just received power spectrum S
(f, t) as the arriving smoothed spectrum S(f, t)
is used to normalize and frequency equalize the amplitude value of each frequency band of y(f,t) by dividing it by the corresponding frequency band value of the smoothed peak spectrum y(f,t). Mathematically, this corresponds to the following equation (6). Sn (f, t) = S (f, t) / y (f, t) (6) where Sn (f, t) is the peak normalized smoothed power spectrum and f is the peak normalized smooth power spectrum over each of the frequency bands. indexed. This stage is 9
1 is indicated. Thus, a sequence of frequency-equalized normalized short-term power spectra results that accentuates changes in the frequency content of the incoming speech signal and suppresses generalized long-term frequency emphasis or distortion. This frequency compensation method is used for the recognition of speech signals transmitted through frequency-distorting communication links, such as telephone lines, in conventional frequency compensation systems where the criterion for compensation is the average power level of the entire signal or each respective frequency band. It was found to be very beneficial compared to It is instructive to point out that although the subsequent spectra have been variously processed and equalized, the data representing the incoming speech signal still consists of spectra occurring at a rate of 100 per second. The normalized, frequency equalized spectrum indicated at 91 is subjected to an amplitude transformation indicated at 93 which performs a non-linear scaling of the spectral amplitude values. Let us denote each equalized and normalized spectrum as Sn(f, t) (from equation (6)), where f indexes different frequency bands of the spectrum and t is the spectrum on a non-linear scale, indicating real time. x(f, t) is a linear fractional function expressed by the following equation (7A). x (f, t) = Sn (f, t) - A/Sn (f, t)
+A (7A) where A is the spectrum defined as
This is the average value of Sn(f, t). Here, f _b indexes over the frequency band of the power spectrum. This scaling function creates a soft threshold and gradual saturation effect on spectral intensities that deviate significantly from the short-term average A. Mathematically, the function is approximately linear for intensities near the average, approximately logarithmic for intensities away from the average, and substantially constant at extreme values of intensity. With respect to the logarithmic scale, the function x
(f,t) is symmetric about zero, and the function exhibits threshold and saturation behavior reminiscent of the auditory nerve stimulation rate function. In fact, the entire recognition system performs much better with this particular non-linear scaling function than with linear or logarithmic scaling of the spectral amplitudes. Thus, the amplitude transformed, frequency response equalized and normalized short-term power spectrum x
A sequence of (f, t) is generated. Here t
teeth,. 01,． 02．． 03．． 04,...equal to seconds and f
=1,...,31 (corresponding to the frequency band of the generated power spectrum). 32 words are provided for each spectrum, and the value of A (Equation 7B), the average value of the spectral values, is stored in 32 words. The amplitude transformed short-term power spectrum is, in the illustrated example, as indicated at 95.
Stored in a first-in-first-out (FIFO) circular memory with storage capacity for 256 32-word spectra. Thus, 2.56 seconds of audio input signal is available for analysis. This storage capacity provides the recognition system with the necessary flexibility to select spectra at different real times for analysis and evaluation, thus having the ability to move forward and backward in time as analysis is required. do. Thus, the amplitude-converted power spectrum for the most recent 2.56 seconds is stored in circular memory and available when needed. In operation, in the illustrated embodiment, each amplitude transformed power spectrum is
Remembered for 2.56 seconds. Thus, the spectrum that enters the circular memory at time t ₁ is lost or shifted from memory after 2.56 seconds, since a new amplitude transformed spectrum is stored corresponding to time t ₁ +2.56. The transformed and equalized short-term power spectra passed through the circular memory are compared, preferably in real time, with keywords of a known vocabulary to detect or select these keywords in continuous audio data. Each vocabulary keyword is a template pattern that statistically represents a plurality of processed power spectra formed into a target pattern (referred to as a design setting pattern) of a plurality of non-overlapping multiframes (preferably three spectra). It is expressed as follows. Preferably, these patterns are selected to best represent the important acoustic events of the keyword. The spectra forming the design set patterns were generated for keywords spoken in various contexts using the same system described above to process successive unknown speech inputs on line 10 as shown in Figure 3. be done. Thus, each keyword in the vocabulary has associated with it a generally plurality of sequences of design set patterns P(i) ₁ , P representing one indication of the i-th keyword in some region of the short-term power spectrum. (i) It has ₂ ,... The collection of design patterns for each keyword forms the statistical basis from which target patterns are generated. In an exemplary embodiment of the invention, the design setting pattern P(i)j can be considered as a 96 element array, each consisting of three selected short-term power spectra arranged in serial order. The power spectra forming the pattern should preferably be spaced at least 30 milliseconds apart to avoid spurious correlations due to time domain smoothing. In other embodiments of the invention, other sampling techniques can be implemented to select spectra. However, a preferred approach is to select spectra that are spaced apart by a constant duration, preferably 30 milliseconds, and to space the non-overlapping design setting patterns apart by each period of time that forms the keyword. Therefore, the first design setting pattern _P1 corresponds to a part near the beginning of the keyword, the second pattern _P2 corresponds to a part delayed in time, and so on.
These patterns P ₁ , P ₂ , . . . are serial or sequence target patterns to which the incoming audio data is combined, i.e. keyword templates,
form a statistical standard for purpose pattern
t ₁ , t ₂ , ..., are independent Gaussian variables that allow P(i)j to generate a likelihood statistic between the selected multi-frame pattern and the target pattern, which will be defined later. Assuming that it consists of
Each consists of statistical data. Thus, the target pattern consists of an array whose entries consist of the standard deviation of the mean and the area normalization factor for the corresponding collection of design setting pattern array entries.
More accurate likelihood statistics will be described later. Virtually all keywords have one or more contextual and/or regional pronunciations, and thus one
It will be apparent to those skilled in the art that there is more than one "spell" of design configuration patterns.
Therefore, the above patterned spellings P ₁ , P ₂ ,...
In reality, the keyword with P(i) ₁ , P
(i) ₂ ,...i=1, 2,...,M can be generally expressed. Here, each P(i)j is a possible alternative description of the jth class of design setting patterns, and there are a total of M different spectra for the keyword. Thus, in the most general sense, the target patterns t ₁ , t ₂ , . . . t _i , . . . each represent a plurality of alternative statistical spellings for the ith group or class of design setting patterns. In the exemplary embodiments described herein, the term "target pattern" is used in its most general sense;
Thus each target pattern may have one or more acceptable alternative "statistical spellings." Processing of Stored Spectra The stored spectra at 95 representing incoming continuous speech data are compared with stored templates of target patterns designated at 96 representing keywords of the vocabulary according to the following method. Each subsequent transformed frequency response equalization spectrum is the first spectral portion of the multi-frame pattern;
Here, they are treated as three spectral patterns corresponding to 96 element vectors. In the exemplary embodiment, the second and third spectral portions of the pattern correspond to spectra occurring 30 and 60 milliseconds later (in real time). In the resulting pattern indicated at 97, the first selected spectrum forms the first 32 elements of the vector, the second selected spectrum forms the second 32 elements of the vector, The third selected spectrum forms the third 32 elements of the vector. Preferably, each multi-frame pattern thus formed is transformed according to the following method in order to reduce cross-correlation, reduce dimensionality, and enhance separation between target pattern classes. This is indicated in 99. The transformed pattern in the exemplary embodiment is provided as input to a statistical likelihood calculation, designated 100, that calculates a measure of the likelihood that the transformed pattern matches the target pattern. Pattern Transformation Considering pattern transformation first, using matrix representation, each multi-frame pattern can be represented by a 96×1 column vector x=(x ₁ +x ₂ , . . . , x ₉₆ ). where x ₁ , x ₂ , ..., x ₃₂ are the elements x of the first spectral frame of the pattern
(f, t ₁ ) and x ₃₃ , x ₃₄ , ..., x ₆₄ are the elements x of the second spectral frame of the pattern
(f, t ₂ ), and x ₆₅ , x ₆₆ , . . . , x ₉₆ are the elements x (f, t ₃ ) of the third spectral frame. It has been experimentally observed that most of the elements x _i of vector x have a probability distribution that clusters symmetrically around their mean value, and thus the Gaussian probability density function is designed to correspond to a particular desired pattern. The set pattern closely fits the distribution of each x _i over samples from a particular collection. However, many pairs of elements x _i ,
x _j are highly correlated, so the assumption that the elements of x are mutually independent and uncorrelated is not accepted. Moreover, the correlation between elements arising from different frames in a multi-frame pattern carries information about the direction of motion of the formant resonances in the input speech signal, and this information can be transmitted even if the mean frequency of the formant resonances varies, e.g. from speaker to speaker. too,
remains relatively constant. As is well known, the direction of motion of formant resonance frequencies is an important cue for human speech recognition. As is well known, the effects of cross-correlation among the elements of x can be accounted for by using the multivariate Gaussian log-likelihood statistic. -L=1/2(x-)K ^-1 (x-) ^t +1/2ln‖
K‖ (8A) where is the sample mean of x and K is the matrix of sample covariance between all pairs of elements of x defined by (8B), ‖K‖
represents the determinant of matrix K. K _ij =(x _i − _i )(x _j − _j ) (8B) The covariance matrix K can be decomposed into eigenvector representations by well-known methods. K=EVE ^t (8C) where E is a matrix of K's eigenvectors e _i and V is a diagonal matrix of K's eigenvalues v _i . These quantities are defined by the following relationships: Ke _i ^t = v _i e _i ^t (8D) Multiplication by matrix E represents vector x96
Corresponds to exact rotation in dimensional space. now,
If the transformation vector w is defined as equation (8E), then w=E(x−) ^t (8E) The likelihood statistic can be rewritten as shown in equation (8F) below. Each eigenvalue v _i is the statistical variance of the random vector x measured in the direction of the eigenvector e _i . The parameters K _ij and _i are determined in the illustrated embodiment by averaging the formed multi-frame pattern over multiple observed design setting samples for each of the indicated statistical relationships. Then, it will be decided. This procedure forms statistical estimates of the expected values of K _ij and x _i . However, the number of independent parameters estimated is (96 mean values) + 96 x 97/2 = 4656 covariances. The achievable number of sample observations per statistical parameter is clearly very small, since it is not feasible to collect more than a few hundred design setting pattern samples for one target pattern. The effect of insufficient sample size is that chance fluctuations in parameter estimates are comparable to the estimated parameters. These relatively large variations induce a strong statistical bias on the classification accuracy of the decision processor based on equation (8F), thus allowing the processor to classify samples from its own design-set patterns with high accuracy. However, the measured performance on unknown data samples is very poor. It is well known that reducing the number of estimated statistical parameters reduces the effects of small sampling biases. For this reason, the following method is commonly used to reduce the dimensionality of statistical random vectors. The eigenvectors e _i defined above are ranked so as to form a ranked matrix E ^r of ranked eigenvectors e ^r by reducing the rank of their associated eigenvalues v _i , so that e ^r ₁
are the directions of maximum variance v ^r _i and v ^r _i+1 v ^r ₁ . The vector x- is transformed into a vector w as in equation (8E) (using the ranked matrix E ^r ), but only the first p elements of w are used to represent the pattern vector x. be done.
In this expression, sometimes referred to as "principal component analysis," the effective number of statistical parameters to be estimated is
It's about 96p instead of 4656. To classify the pattern, the likelihood statistic L is computed as in equation (8F), except that it sums from 1 to p instead of from 1 to 96. When principal component analysis is applied to real data, the classification accuracy of the processor increases as p increases until a critical value of p where the accuracy is maximum, after which the poor performance described above is observed at p = 96. It is observed that p decreases as p increases until
(See graphs (a) and (b) in Figure 4). The maximum classification accuracy achieved by principal component analysis methods is still limited by the effects of small sample statistical bias, and the number of components, or dimensions, required is limited by the number actually needed to represent the data. Much more than expected. Furthermore, the performance for design set pattern samples is actually incrementally worse than the performance for unknown samples over a wide range of p. The latter two sources of influence are due to the transformation vector w p
By representing the sample space by components, the remaining 96 p components do not contribute to the likelihood statistics. The regions where the majority of pattern samples are found have been described, but the regions where few samples occur have not been described. The latter region corresponds to the tail of the probability distribution and thus to the region of overlap between different target pattern classes.
Thus, prior art methods exclude exactly the information needed to make the most difficult classification decisions.
Unfortunately, these overlap regions have high dimensionality, thus reversing the above discussion and e.g.
It is infeasible to use a small number of components of w where _i is not the maximum but is the minimum. According to the invention, the unused components w _p+1 ,...
..., w ₉₆ is estimated by the reconstruction statistic R in the following manner. Formula for L (Formula 8F)
The terms that disappear from include the squares of the components w _i , each weighted according to its variance v _i . All these variances can be approximated by a constant parameter c, which can be extracted. Therefore, The sum on the right is exactly the square of the Euclidean norm (length) of vector w′. Here, w′=(w _p+1 ,..., w ₉₆ ) (8H) Vector w ^p is defined by the following equation. w ^p = (w ₁ ,…, w _p ) (8I) In that case, This is because the vectors w, w' and w ^p can be translated to form a right triangle.
The eigenvector matrix E yields an orthogonal transformation, so the length of w is the same as the length of x-. Therefore, it is not necessary to calculate all components of w. The statistics sought to estimate the influence of unused components on the log-likelihood function L are as follows. This is the length of the difference between the observed vector x- and the vector obtained by trying to reconstruct x- as a linear combination of the first p eigenvectors e _i of K. Therefore, R has the characteristics of a reconstruction error statistic. To use R in the likelihood function, R is simply added to the set of transformed vector components and a new random vector (w ₁ , w ₂ , ..., w _p ,
R). Under this assumption, the new likelihood statistic is as follows. Here M=1/2 (R-R) ² /var(R)+1/2lnv
ar(R)(8M) Variables with bars are sample means, var
( ) represents unbiased sample variance. In equation 8L, the value of _i should be zero, and var
(w _i ) should be equal to v _i . However, since eigenvectors cannot be calculated or applied with infinite arithmetic precision, it is best to remeasure the sample mean and variance after transformation to reduce statistical bias in the system caused by arithmetic rounding errors.
This caution also applies to equation (8F). The measured performance of the likelihood statistic L′ in the same maximum likelihood decision processor is shown in graphs (c) and
It is plotted as (d). It can be seen that as p increases, the classification accuracy reaches a maximum value, but this time at a very small number of dimensions p. Moreover, the maximum accuracy achieved is significantly higher than for statistics L that differ only in the absence of reconstruction errors R. As another test of the efficacy of the reconstruction error statistic R,
The same experiment was repeated again, but this time the likelihood relation used was simply: L″=-M (8N) That is, in this case, the area where most of the sample data was present was ignored, and the area where relatively few samples were found was delineated. (e) and (f)) are almost as high as for the statistic L′, and the maximum value occurs in a smaller number of dimensions, p=3. This result shows that the first that any data sample in the space of eigenvectors of p can be accepted as belonging to the target pattern class, and that there is little or no benefit to be gained by making detailed probability estimates in that space. Statistical likelihood calculation The transformed data w _i corresponding to the formed multi-frame pattern x is fed as input to the statistical likelihood calculation.As mentioned above, this processor continuously , and computes a measure of the probability that the unknown input speech represented by the transformed multi-frame pattern matches each of the target patterns in the keyword template of the machine's vocabulary.Typically, each of the target patterns The data have a slightly asymmetric probability density, but nevertheless mean _i and variance var
It is statistically well approximated by a normal distribution with (w _i ). Here, i is the sequential instruction of the element of the k-th target pattern. The simplest implementation of this process assumes that the data associated with different values of i and k are uncorrelated, so that the overall probability density for data x belonging to target pattern k is (logarithmically) do. Since the logarithm is a monotonic function, this statistic can be used to determine whether the probability of matching any one target pattern in a keyword template is higher or lower than the probability of matching a target pattern in another vocabulary, or It is sufficient to determine whether the probability of matching a particular pattern exceeds a predetermined minimum level. Each input multi-frame pattern has its statistical likelihood L(t|k) calculated against all of the target patterns of keyword templates in the vocabulary. The resulting likelihood statistic L(t|k) is interpreted as the relative likelihood of the occurrence of the pattern of interest, called k, at time t. As is well understood by those skilled in the art, ranking of these likelihood statistics constitutes speech recognition insofar as it can be performed from a single objective pattern. These likelihood statistics can be used in different ways in the overall system depending on the final function to be performed. Selection of Candidate Keywords According to a preferred embodiment of the invention, any first
If the likelihood statistic of the multi-frame pattern with respect to the target pattern exceeds a predetermined threshold, the comparison is indicated at 101, 103, and the incoming data is first
In order to determine the local maximum for the likelihood statistic corresponding to the desired pattern of Further investigation will be conducted to determine whether This is indicated in 105. Therefore, the process of iteratively testing the newly formed multispectral frame against all first target patterns is interrupted, and the probability in the sense of statistical likelihood occurring after the first multiframe pattern is interrupted. A search begins for a pattern that most corresponds to the next (second) target pattern for a given candidate keyword. If the second multi-frame pattern corresponding to the second target pattern is not detected within the preset time window, the search sequence ends;
The recognition process begins again at a time immediately after the end of the first multi-frame pattern that identified potential candidate keywords. Thus, after the first multi-frame pattern yields a likelihood score greater than the required threshold, a timing window is provided and the next objective sequentially corresponds to candidate keywords that may have been selected within this time. A pattern must appear that matches the pattern. The timing window may vary depending on, for example, the duration of the phonetic segment of a particular potential candidate keyword. This process continues until (1) a multi-frame pattern is identified in the incoming data for all of the desired patterns of the keyword template, or
(2) Continue until the target pattern cannot be related to any pattern that occurs within the allowed timing window. If the search is terminated by state (2), then the search for a new first spectral frame follows the termination of the first previously identified multi-frame pattern in the next spectrum, as described above. Starting anew. At this level of processing, the goal is to concatenate possible multi-frame patterns corresponding to the target pattern to form candidate words. (this is
107 as directed. ) Therefore, the detection threshold is set loosely so that correct multiframe patterns are rejected, and here, at this acoustic processing level,
The discrimination between correct detection and false alarms is mainly
It is extremely unlikely that the requirement that multiple pattern events have to be detected together would result. Post-Decision Processing Processing at the acoustic level continues in this manner until the incoming audio signal is terminated. however,
Even after keywords have been identified using the likelihood probability test described above, additional post-decision processing tests (indicated at 109) are used to identify incorrect detections while keeping the probability of correct detection as high as possible. It is preferable to reduce the likelihood of selecting keywords that are false (ie, reduce the false alarm rate). For this reason, the output of the acoustic level processor, i.e. the candidate words selected by the concatenation process, is a mask of the prosodic relative timing window and/or the acoustic level processor associated with all target pattern classes. It is further filtered by a likelihood ratio test using information from the setter. Prosodic Masks As mentioned above, during the determination of the likelihood statistics, times of occurrence of multi-frame patterns with local peak values of the likelihood statistics for the active target pattern are found, and in the preferred embodiment candidate keywords. are recorded for each of the selected patterns corresponding to a plurality of consecutive target patterns. These times pt ₁ , pt ₂ ,... for each candidate keyword
..., p _o is analyzed and evaluated according to a predetermined prosodic mask for that keyword to determine whether the time interval between subsequent pattern likelihood peaks meets predetermined criteria. . According to this method, the elapsed time between the peak values of the likelihood statistics, i.e., i=
For 2, 3, . . . , n, p _i -pt _i-1 is first normalized by dividing each elapsed time period by p _o -pt ₁ . The resulting normalized duration is the prosodic mask for the candidate keyword, i.e.
The candidate word is compared to a sequence of normalized period lengths of acceptable ranges, and if the period lengths fall within the selected range, the candidate word is accepted. In an exemplary embodiment, the prosodic mask timing window is determined by measuring elapsed time for sample keywords spoken by as many different speakers as possible. The prosodic patterns are then compared to the statistical sample keyword times using statistical calculations where the mean and standard deviation for each prosodic mask (corresponding to each keyword) is derived from the keyword design setting pattern sample. be done. A likelihood statistic is then computed to determine whether to accept or not to accept the candidate keyword, and therefore whether to give a final decision regarding the candidate keyword. This likelihood statistic concerns the timing of events and should not be confused with the likelihood statistic applied to multi-frame patterns with respect to the pattern of interest. In other embodiments of the invention, the range of the normalization period is set loosely, but fixed so that it cannot be changed. In this example, candidate keywords are accepted only if the normalized time period falls within fixed window boundaries. Therefore, candidate keywords are accepted only if each of the normalized times falls within the set limits. Word Level Likelihood Ratio Test Method In a preferred embodiment of the invention, each candidate word is also tested according to a likelihood ratio test method before a final decision is made to accept the keyword. This likelihood ratio test method consists of summing goodness indices for selected sequences of multi-frame patterns that have been identified as candidate keywords. The accumulated goodness index, which is the sum of the goodness indexes for each multi-frame pattern, is compared to a decision threshold value. The goodness index for a detected multi-frame pattern is the difference between the best log-likelihood statistic for any desired pattern in the keyword vocabulary and the best common score for any pattern that is allowed to be selected for the desired pattern. It is. Therefore, the goodness index has a value of zero if the best scoring target pattern is a legitimate alternative to the pattern being pursued. However, if the best score corresponds to a target pattern that is not in the list of alternatives to the selected candidate word target pattern (a given target pattern has some statistical spelling depending on accent, etc. possible), the goodness index is the difference between the best score and the best one that appeared in the list of alternatives. Decision thresholds are optimally set to obtain the best balance between missed detection and false alarm rate. Considering the word-level likelihood ratio test method from a mathematical point of view, the probability that a random multi-frame pattern x occurs is equal to p(x|k), assuming that the input speech corresponds to the target pattern class k; Probability of x". The log-likelihood statistic of the input x for the kth reference pattern is L(x|k), which is equal to lnp(x,k) as defined by equation (9). Assume that the detected multi-frame pattern must be caused by one of a group of n predetermined target pattern classes, and that these classes occur with equal frequency or that there are n possible choices. The probability of observing an event x in any case, in the sense of relative frequency of occurrence, is the sum of the probability densities defined by: For these occurrences, the proportion attributable to a given class, p(k|x), is: Or logarithmically, If the decision processor is given x and chooses class k for some reason, then equation (11A) or (11B) above gives the probability that the choice is correct. The above equation is a result of Bayes' rule. p(x,k)=p(x|k)p(k)=p(k|x)p(x) Here, p(k) is considered to be a constant 1/n. Assuming that only one class, say class m, is very promising, equation (10) can be approximated by: Furthermore, the following formula holds true. β(k,m,x)=L(x|k) −L(x|m)〓lnp(k|x) (13) If the kth class is the most promising one, then the function β is It should be noted that the maximum value is zero. When summed over a set of estimated independent multi-frame patterns, the accumulated value of β estimates the probability that the detected word is not a false alarm. Therefore, the decision threshold for the accumulated value of β is directly related to the difference between the probability of detection and false alarm and is the criterion for the likelihood ratio test method. The cumulative value of β corresponds to the index of goodness of the candidate keyword. A realized system using this speech recognition method is briefly described. As noted above, the presently preferred embodiment of the present invention provides signal and data manipulation beyond that performed by the preprocessor of FIG.
It is configured to be implemented and controlled by a Digital Equipment Corporation PDP-11 computer operating in combination with a special purpose processor such as that described in No. 841,390. A detailed program for providing the functions described in connection with the flowchart of FIG. 3 is not attached because it is not considered particularly necessary. This program printout is for digital equipment.
The PDP provided by the Corporation -
11 Macros for computers - 11 (MACRO
-11), Fortran language, and machine language for dedicated processors. From the foregoing, it can be seen that several objects of the invention have been achieved and other beneficial results have been obtained. It will be appreciated that the continuous speech recognition method described herein includes isolated speech recognition as a special application. Other applications of the continuous speech method described herein, including addition, subtraction, deletion, and other modifications of the preferred embodiments described, will be apparent to those skilled in the art and are within the scope of the following claims. It is something.

[Brief explanation of the drawing]

第１図は本発明の実施に従つて実行される動作
のシーケンスを例示するフローチヤート、第２図
は第１図に例示された全体のプロセスにおけるあ
るプレプロセス動作を実行するための電子装置の
ブロツク図、第３図は第１図のプロセスにおける
ある手続きを実行するデイジタルコンピユータプ
ログラムの流れ図、第４図は異なる変換手続きを
使用する分類精度を示す特性図である。１３：アナログ―デイジタル変換器、１７：自
動相関器、２１：フーリエ変換装置、２５：周波
数応答等化、３５：振幅変換、４０：比較、４
２：キーワードテンプレート、４６：ポストデシ
ジヨン処理、５１：クロツク発振器、５２：分周
器、５３：ラツチレジスタ、５６：デイジタルマ
ルチプライヤ、５８：32ワードシフトレジスタ、
５９：マルチプレクサ、６３：32ワードシフトレ
ジスタ、６５：加算器、６７：ゲート、６９：１／Ｎ割算回路、７１：コンピユータ割込み回路、７
３：インターフエース回路、８５：高速フーリエ
変換周辺装置。 FIG. 1 is a flowchart illustrating the sequence of operations performed in accordance with the practice of the present invention; FIG. 2 is a flowchart of electronic equipment for performing certain preprocess operations in the overall process illustrated in FIG. 3 is a flowchart of a digital computer program implementing certain procedures in the process of FIG. 1; and FIG. 4 is a characteristic diagram illustrating classification accuracy using different conversion procedures. 13: Analog-digital converter, 17: Autocorrelator, 21: Fourier transform device, 25: Frequency response equalization, 35: Amplitude conversion, 40: Comparison, 4
2: Keyword template, 46: Post-decision processing, 51: Clock oscillator, 52: Frequency divider, 53: Latch register, 56: Digital multiplier, 58: 32 word shift register,
59: Multiplexer, 63: 32 word shift register, 65: Adder, 67: Gate, 69: 1/N division circuit, 71: Computer interrupt circuit, 7
3: Interface circuit, 85: Fast Fourier transform peripheral device.

Claims

[Claims] 1. At least 1 audio signal in each of the audio signals.
characterized by a template having two target patterns and each target pattern representing at least one short-term power spectrum;
A speech analysis system for recognizing at least one predetermined keyword, comprising: collecting a pattern identified as corresponding to a sequence of target patterns; and the collected sequence of patterns is a target pattern of the keyword template. identifying candidate keywords when each corresponding to a sequence; normalizing a time interval between likelihood peaks of said collected patterns corresponding to said candidate keyword; and performing a prosodic test at said normalized time. the normalized time interval for a candidate keyword must meet a timing criterion imposed by the prosodic test before the candidate keyword is accepted as a recognized keyword. A speech recognition method that avoids 2. Applying the prosodic test comprises: applying a fixed, predetermined interval limit to each normalized interval, and wherein the normalized interval is set before the candidate word is accepted. The speech recognition method according to claim 1, wherein the speech recognition method has to fall within the limits of . 3. applying the prosodic test comprises: applying a likelihood statistic function to the normalized interval; and accepting the candidate keyword if the likelihood statistic exceeds a predetermined minimum threshold. A speech recognition method according to claim 1, comprising the steps of: 4 applying a likelihood ratio test to the sequence of collected patterns corresponding to the candidate keywords and determining a goodness index for each pattern; 2. The speech recognition method of claim 1, further comprising the steps of: accumulating the index of goodness; and accepting the candidate word if the accumulated goodness index exceeds a predetermined minimum value. 5. said step of applying a likelihood ratio test method determines the best value, referred to as the best general score, of the log-likelihood statistic for each of said collected patterns for any of said target patterns; determining the best value, referred to as the best objective score, of the log-likelihood statistic for each of said collected patterns for the objective pattern that is a valid substitute for the corresponding objective pattern of the candidate keyword; and determining a goodness index for each selected pattern by generating an arithmetic difference between the best general score and the best objective score for each collected pattern. The speech recognition method described in Section 4. 6 each of at least 1 in the audio signal
characterized by a template having two target patterns and each target pattern representing at least one short-term power spectrum;
A speech analysis system for recognizing at least one predetermined keyword, comprising: collecting a pattern identified as corresponding to a sequence of target patterns; and the collected sequence of patterns is a target pattern of the keyword template. identifying candidate keywords as each corresponds to a sequence; and applying a likelihood ratio test to the collected sequence of patterns corresponding to the candidate keywords to determine a goodness index for each pattern; A speech comprising: applying a ratio test; accumulating the goodness index for the pattern; and accepting the candidate keyword if the accumulated goodness index exceeds a predetermined minimum value. Recognition method. 7. said step of applying a likelihood ratio test method: determining a best value, referred to as a best general score, of a log-likelihood statistic for each of said collected patterns for any of said target patterns; determining the best value, referred to as the best objective score, of the log-likelihood statistic for each of said collected patterns for the objective pattern that is a valid substitute for the corresponding objective pattern of the candidate keyword; and determining a goodness index for each selected pattern by generating an arithmetic difference between the best general score and the best objective score for each collected pattern. The speech recognition method described in Section 6. 8. The audio signal is spectrally analyzed to recognize at least one predetermined keyword in the continuous audio signal, each keyword representing a plurality of temporally spaced short-term power spectra. In a speech analysis system characterized by a template having a pattern of interest, iteratively evaluates a set of parameters to determine the short-term power spectrum of the speech signal within each of a plurality of equal duration sampling periods. generating a continuous time-directed sequence of short-term audio power spectrum frames; and repeatedly selecting a first frame and at least one subsequent frame from said sequence of frames to form a multi-frame pattern. and comparing each so-formed multi-frame pattern with a respective first target pattern of each keyword template, each of the multi-frame patterns corresponding to the first target pattern of a keyword template. and, according to the determining step, selecting a subsequent short-term power spectrum for each multi-frame pattern corresponding to the first target pattern of possible candidate keywords. forming a subsequent multi-frame pattern using the selected keyword template; determining whether the subsequent multi-frame pattern corresponds to each subsequent target pattern of the possible candidate keyword templates; identifying candidate keyword templates when each of the multi-frame patterns corresponds to a target pattern of the keyword template; and normalizing a time interval between likelihood peaks of the multi-frame patterns corresponding to the candidate keywords. applying a prosodic test to the normalized time interval, wherein the normalized time interval for a candidate keyword is subjected to the prosodic test before the candidate keyword is accepted as a recognized keyword. A speech recognition method that requires meeting timing standards imposed by the above method. 9. Applying the prosodic test comprises applying a fixed, predetermined interval limit to each normalized interval, and wherein the normalized interval 9. The speech recognition method according to claim 8, wherein the speech recognition method has to be within a time limit of . 10 applying the prosodic test comprises: applying a likelihood statistic function to the normalized interval; and accepting the candidate keyword if the likelihood statistic exceeds a predetermined minimum threshold. A speech recognition method according to claim 8, comprising the steps of: 11 applying a likelihood ratio test method to the sequence of multi-frame patterns corresponding to candidate keywords and determining a goodness index for each pattern; and accumulating the goodness index for the patterns. 9. The speech recognition method of claim 8, further comprising the steps of: calculating the candidate keyword; and accepting the candidate keyword if the accumulated goodness index exceeds a predetermined minimum value. 12 said step of applying a likelihood ratio test method comprises: determining the best value, referred to as the best general score, of the log-likelihood statistic for each of said selected multi-frame patterns for any of said target patterns; determining a log-likelihood statistic, referred to as the best objective score, for each of said selected multi-frame patterns for an objective pattern that is a valid substitute for the corresponding objective pattern of the candidate keyword;
determining a best value and a goodness index for each selected multi-frame pattern by generating an arithmetic difference between a corresponding best general score and a best objective score for each selected multi-frame pattern; 12. The speech recognition method according to claim 11, comprising the step of determining.