JPS59127099A

JPS59127099A - Improvement in continuous voice recognition

Info

Publication number: JPS59127099A
Application number: JP58000552A
Authority: JP
Inventors: ロレンス・ジヨ−ジ・バ−ラ−
Original assignee: Exxon Corp
Current assignee: Exxon Mobil Corp
Priority date: 1983-01-07
Filing date: 1983-01-07
Publication date: 1984-07-21

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to speech recognition methods and apparatus, and more particularly to methods and apparatus for recognizing keywords in continuous speech signals in real time.

Various speech recognition systems have been proposed in the past for recognizing isolated utterances by comparing a suitably processed, unknown, isolated speech signal with one or more previously prepared representations of known keywords. The term "keyword" is used herein to mean a connected group of phonemes and sounds, and may be, for example, a portion of a syllable, a word, a word string, or a phrase. While many systems have met with limited success, one system in particular has been used successfully in commercial applications to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Patent No. 4,038,503, granted July 26, 1977 and assigned to the assignee of this application. It provides a successful method for recognizing one of a restricted vocabulary of keywords, provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. The system relies on the presumption that the interval during which the unknown audio signal occurs is well defined and contains the utterance of only a single keyword.

In continuous speech signals, such as conversational speech, the keyword boundaries cannot be recognized a priori. Various methods have therefore been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units such as phonemes, syllables, words, and sentences before the keyword recognition process begins. These prior continuous speech systems, however, have had only limited success, in part because no satisfactory segmentation method has been found. Other substantial problems also exist. For example, only limited vocabularies can be consistently recognized with a low false alarm rate; recognition accuracy is highly sensitive to differences between the voice characteristics of different speakers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs in audio signals transmitted over ordinary telephone communication apparatus.

The continuous speech recognition methods described in U.S. Patent Nos. 4,227,176, 4,241,329, and 4,227,177 each describe commercially acceptable and effective procedures for recognizing keywords in continuous speech in real time. The general methods described in these patents are presently in commercial use and have been shown, both experimentally and in practical field testing, to provide high fidelity and a low error rate in a speaker-independent context. Nevertheless, even these techniques, which are at the forefront of present-day technology, have shortcomings in both the false alarm rate and speaker-independent performance.

The continuous speech recognition methods described in the above-identified U.S. patents are directed primarily to an "open vocabulary" context, in which one of a plurality of keywords in continuous speech is recognized or spotted. An "open vocabulary" method is one in which not all of the incoming vocabulary is known to the apparatus. In a particular application, a continuous word string can be recognized, in which case the result of the recognition process is the identity of each individual word element of the continuous word string. A continuous word string, as used herein, means a plurality of recognizable elements (a "closed vocabulary") bounded by silence. This is related to the commercial equipment described above for isolated-word applications, in which the boundaries are known a priori. Here, however, the boundaries, that is, the silence, are unknown and must be determined by the recognition system itself. In addition, the elements being examined are no longer single word elements but a plurality of elements strung together to form a word string.

In the past, various methods and apparatus have been suggested for recognizing continuous speech, but little attention has been paid to automatically training the apparatus to generate the parameters needed for accurate speech recognition.

Furthermore, the methods and apparatus for determining silence in prior apparatus, and the use of grammatical syntax in prior apparatus, while generally adequate for their needs, still leave much room for improvement.

Therefore, a principal object of the present invention is to provide a speech recognition method and apparatus that are effective in training the apparatus to generate new recognition patterns.

Particular objects of the invention are to provide such a method and apparatus that effectively recognize silence in unknown audio input signal data, employ grammatical syntax in the recognition process, respond equally well to different speakers and hence to different voice characteristics, are reliable with a low false alarm rate, and operate in real time.

The present invention relates to a speech analysis method and apparatus for recognizing at least one keyword in an audio signal.

In one particular aspect, the invention relates to a method for recognizing silence in an incoming audio signal. The method features generating at least first and second target templates representing alternative representations of silence, comparing the incoming audio signal with the first and second target templates, generating a value representing the result of the comparison, and determining, based at least on this value, whether silence has been detected.
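The text above does not say how the comparison value is computed, so the following is only a minimal sketch of the idea under stated assumptions: each frame and each silence template is taken to be a fixed-length numeric vector, the comparison is a Euclidean distance, and the decision is a fixed threshold. The function name `detect_silence` and all of its parameters are hypothetical.

```python
import numpy as np

def detect_silence(frame, templates, threshold):
    """Compare one frame with several alternative silence templates.

    Assumption: Euclidean distance is the comparison measure and a fixed
    threshold is the decision rule; neither is specified by the text.
    Returns (is_silence, value), where value is the smallest distance found.
    """
    frame = np.asarray(frame, dtype=float)
    distances = [
        float(np.linalg.norm(frame - np.asarray(template, dtype=float)))
        for template in templates
    ]
    value = min(distances)  # best match over the alternative templates
    return value < threshold, value

# Two hypothetical templates: a very quiet background and a noisier one.
quiet = np.zeros(31)
noisy = np.full(31, 2.0)
is_silence, value = detect_silence(np.full(31, 2.1), [quiet, noisy], threshold=1.0)
```

Using more than one template lets a frame count as silence if it matches any of the alternative descriptions, for example a quiet line and a noisy line.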

In another aspect, the invention relates to a method for recognizing silence in an audio signal. The method features generating a value representing the likelihood that the present incoming audio signal portion corresponds to a reference pattern representing silence, effectively altering this value according to a syntax-dependent determination, and determining from the effectively altered score whether the present signal portion corresponds to silence. The syntax-dependent determination represents the recognition of the immediately preceding portion of the audio signal according to the grammatical syntax.

In yet another aspect, the invention relates to a method for forming reference patterns that represent known keywords and are tailored to a speaker. The method features providing speaker-independent reference patterns representing the keywords, using these speaker-independent reference patterns to determine the boundaries of the keywords in audio signals spoken by the speaker, and training the speech analysis apparatus to that speaker using the boundaries determined by the apparatus for the keywords spoken by the speaker.

The invention further relates to a method for forming a reference pattern representing a previously unknown keyword. The method features providing speaker-independent reference patterns representing keywords known to the apparatus, using these speaker-independent reference patterns to determine the boundaries of the unknown keyword, and training the speech analysis apparatus, using the boundaries previously determined by the apparatus for the previously unknown keyword, to generate statistical data describing that keyword.

In yet another aspect, the invention relates to a speech recognition method in which the sequence of keywords being recognized is described by a grammatical syntax characterized by a plurality of connected decision nodes. The method features forming a sequence of numerical scores for recognizing keywords in the audio signal, employing dynamic programming, using the grammatical syntax to determine which scores form acceptable progressions in the recognition process, and reducing the number of otherwise acceptable progressions by collapsing (folding) the syntax decision nodes, whereby otherwise acceptable progressions are discarded according to the collapsed syntax.

The invention further relates to apparatus for carrying out the speech recognition methods described above.

Preferred embodiments of the invention are described below with reference to the drawings.

In the particular preferred embodiment described herein, speech recognition and training are performed by a system that includes specially constructed electronic apparatus for performing certain analog and digital processing of the incoming audio data signals, generally speech, and a general purpose digital computer programmed in accordance with the invention to perform certain other data reduction steps and numerical evaluations. The division of tasks between the hardware and software portions of the system has been made so as to obtain a system that can perform speech recognition in real time at low cost. However, some of the tasks performed in hardware in this particular system could well be performed in software, and some of the tasks performed by software programming in this embodiment could be performed by special purpose circuitry in other embodiments. In this latter connection, hardware and software implementations of the apparatus are described where available.

According to one aspect of the invention, apparatus is provided that recognizes keywords in a continuous speech signal even when the signal has been distorted, for example by a telephone line. Thus, referring in particular to FIG. 1, the voice input signal indicated at 10 may be considered to be a voice signal produced by a carbon transmitter and receiver over a telephone line encompassing any arbitrary distance and any number of switching exchanges. A typical application of the invention is therefore the recognition of continuous word strings in audio data supplied from an unknown source (a speaker-independent system) and received over the telephone system. On the other hand, the input signal may be any audio signal, for example a voice signal taken from a radio communication link such as a commercial broadcast station, a voice signal taken from a private communication link, or the voice input of an operator standing near the apparatus.

As will be apparent from the above description, the method and apparatus of the invention are concerned with the recognition of speech signals containing a sequence of sounds, phonemes, or other recognizable indicia. Reference is made herein to a "word," an "element," a "sequence of target patterns," a "template pattern," or an "element template"; these five terms are considered generic and equivalent. They are a convenient way of expressing a recognizable sequence of sounds, or representations thereof, which combine to constitute the keyword that the method and apparatus can detect and recognize. The terms should be construed broadly and generically to encompass anything from a single phoneme, syllable, or sound to a series of words (in the grammatical sense), as well as a single word.

An analog-to-digital (A/D) converter 13 receives the incoming analog audio signal data on line 10 and converts the signal amplitude of the data into digital form.

The illustrated A/D converter converts the input signal data to a 12-bit binary representation, with conversions occurring at a rate of 8,000 per second. In other embodiments, other sampling rates can be employed; for example, a 16 kHz rate can be used when a high quality signal is available. The A/D converter 13 applies its output over lines 15 to an autocorrelator 17. The autocorrelator 17 processes the digital input signals to generate a short-term autocorrelation function 100 times per second and applies its output, as indicated, over lines 19. Each autocorrelation function has 32 values or channels, each value being calculated to a 30-bit resolution. The autocorrelator is described in greater detail below in connection with FIG. 2. The autocorrelation functions on lines 19 are Fourier transformed by a Fourier transformation apparatus 21 to obtain the corresponding short-term windowed power spectra over lines 23.

The spectra are generated at the same repetition rate as the autocorrelation functions, that is, 100 per second, and each short-term power spectrum has 31 numerical terms, each with a resolution of 16 bits. As will be understood, each of the 31 terms in a spectrum represents the signal power within one frequency band. The Fourier transformation apparatus also preferably includes a Hanning or similar window function to reduce spurious adjacent-band responses.
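The step from a 32-lag autocorrelation function to a 31-term windowed power spectrum can be sketched as a windowed cosine transform. The following is a rough illustration only: the exact window shape and the placement of the 31 band frequencies are not given in the passage above, so the Hanning-style lag window and the evenly spaced bands below the Nyquist frequency are assumptions, as is the function name.

```python
import numpy as np

def power_spectrum_from_autocorrelation(a, sample_rate=8000.0, n_bands=31):
    """Cosine-transform an autocorrelation function into n_bands power terms.

    Assumptions (not taken from the text): a Hanning-style taper over the
    lags, and band frequencies evenly spaced from 0 Hz up to, but not
    including, the Nyquist frequency.
    """
    a = np.asarray(a, dtype=float)
    lags = np.arange(len(a))
    # Lag window: 1 at lag 0, falling smoothly towards 0 beyond the last lag.
    window = 0.5 * (1.0 + np.cos(np.pi * lags / len(a)))
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bands, endpoint=False)
    cos_table = np.cos(2.0 * np.pi * np.outer(freqs, lags) / sample_rate)
    weights = window.copy()
    weights[1:] *= 2.0  # the autocorrelation is even, so positive lags count twice
    return cos_table @ (weights * a)

# A unit impulse at lag 0 (white-noise-like input) gives a flat spectrum.
flat = power_spectrum_from_autocorrelation([1.0] + [0.0] * 31)
```

The taper on the lags is what limits leakage between neighbouring bands; without it, a strong component in one band would produce larger side-lobe responses in the others.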

In the illustrated embodiment, the Fourier transformation, as well as the subsequent processing steps, is preferably performed under the control of a suitably programmed general purpose digital computer, using a peripheral array processor to speed the arithmetic operations that the method requires repetitively. The particular computer employed is a model PDP-11 manufactured by the Digital Equipment Corporation of Massachusetts. The particular array processor employed is described in U.S. Patent No. 4,228,498, assigned to the assignee of this application. The programming described below in connection with FIG. 3 is substantially predicated upon the capabilities and characteristics of these available digital processing units.

The short-term windowed power spectra are frequency-response equalized, as indicated at 25.

This equalization is performed as a function of the peak amplitudes occurring in each frequency band or channel, as described in greater detail below. The frequency-response equalized spectra on lines 26 are generated at a rate of 100 per second, and each spectrum has 31 numerical terms evaluated to 16-bit accuracy. To facilitate the final evaluation of the incoming audio data, the frequency-response equalized and windowed spectra on lines 26 are subjected to an amplitude transformation, as indicated at 35, which imposes a non-linear amplitude transformation on the incoming spectra. This transformation is described in greater detail below, but it may be noted at this point that it improves the accuracy with which the unknown incoming audio signal can be matched with the target pattern templates of a reference vocabulary. In the illustrated embodiment, this transformation is performed on all of the frequency-equalized and windowed spectra at some time before the spectra are compared with the patterns representing the elements of the reference vocabulary.

The amplitude-transformed and equalized short-term spectra on lines 38 are then compared against the element templates at 40, as described below. The reference patterns, designated at 42, represent the elements of the reference vocabulary in a statistical fashion with which the transformed and equalized spectra can be compared. Each time "silence" is detected, a decision is made about the identity of the word string just received. This is indicated at 44. Candidate words are thus selected according to the closeness of the comparison, and in the illustrated embodiment the selection process is designed to minimize the likelihood of a missed keyword or a keyword substitution.

Referring to FIG. 1A, the speech recognition system of the invention employs a controller 45, which may be, for example, a general purpose digital computer such as a PDP-11. In the illustrated embodiment, the controller 45 receives preprocessed audio data from a preprocessor 46. The preprocessor is described in greater detail in connection with FIG. 2.

The preprocessor 46 receives audio input analog signals over a line 47 and supplies processed data over interface lines 48 to the control processor, that is, the controller.

Generally, the operating speed of the control processor, if it is a general purpose processor, is not fast enough to process the incoming data in real time. As a result, various special purpose hardware can be employed advantageously to increase the effective processing speed of element 45. In particular, a vector processing element 48a, such as that described in U.S. Patent No. 4,228,498, assigned to the assignee of this invention, provides significantly increased array processing capability by using a pipeline effect. In addition, as described in more detail in connection with FIGS. 4, 5, and 6, a likelihood function processor 48b can be used together with the vector processor to increase the operating speed of the apparatus by a further factor of ten.

In the preferred embodiment of the invention the control processor 45 is a digital computer. In another particular embodiment, described in connection with FIG. 10, a significant portion of the processing capability is implemented outside the control processor, in a sequential decoding processor 49. The structure of this processor is described in greater detail below in connection with FIG. 10.

Thus, the apparatus illustrated here for implementing speech recognition has great flexibility, both in its speed and in its ability to be implemented in hardware, in software, or in an advantageous combination of hardware and software.

The preprocessor is described next.

In the apparatus illustrated in FIG. 2, an autocorrelation function, with its intrinsic averaging, is performed digitally on the digital data stream generated by the analog-to-digital converter 13, which operates on the incoming analog audio data, generally a voice signal, supplied over line 10. The converter 13 provides a digital input signal over lines 15.

The digital processing functions, as well as the analog-to-digital conversion, are timed under the control of a clock oscillator 51. The clock oscillator provides a basic timing signal of 256,000 pulses per second, and this signal is applied to a frequency divider 52 to obtain a second timing signal of 8,000 pulses per second. The slower timing signal controls the analog-to-digital converter 13 together with a latch register 53, which holds the 12-bit result of the last conversion until the next conversion is completed.
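The relationship between the two clock rates can be checked with a line of arithmetic. The sketch below is only an illustration; `pulses_per_conversion` is a hypothetical helper name, not part of the described apparatus.

```python
def pulses_per_conversion(basic_clock_hz: int, sample_clock_hz: int) -> int:
    """Number of fast-clock pulses that elapse during one A/D conversion period."""
    if basic_clock_hz % sample_clock_hz != 0:
        raise ValueError("the divider assumes an integer ratio between the two clocks")
    return basic_clock_hz // sample_clock_hz

# Rates quoted in the text: 256,000 pulses/s divided down to 8,000 pulses/s.
ratio = pulses_per_conversion(256_000, 8_000)
print(ratio)
```

With the quoted rates the ratio is 32, which matches the 32-word shift register that is circulated once per conversion.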

The autocorrelation products are generated by a digital multiplier 56, which multiplies the number contained in register 53 by the output of a 32-word shift register 58. Shift register 58 is operated in a recirculating mode and is driven by the faster clock frequency, so that one complete circulation of the shift register data is accomplished for each analog-to-digital conversion. An input to shift register 58 is taken from register 53 once during each complete circulation cycle. One input to the digital multiplier 56 is taken directly from the latch register 53, while the other input is taken from the current output of the shift register through a multiplexer 59. The multiplications are performed at the higher clock frequency.

Thus, each value obtained from the A/D conversion is multiplied by each of the preceding 31 conversion values. As will be understood by those skilled in the art, the signals thereby generated are equivalent to multiplying the input signal by itself delayed in time by 32 different time increments (one of which is the zero delay). To produce the zero-delay correlation, that is, the product of the signal with itself, multiplexer 59 causes the current value of the latch register 53 to be multiplied by itself at the time each new value is being introduced into the shift register. This timing function is indicated at 60.
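The multiply stage described above can be imitated in software with a fixed-length history buffer. This is a minimal sketch rather than a model of the circuit timing: the `deque` stands in for the recirculating shift register, and the function name `lag_products` is a hypothetical choice.

```python
from collections import deque

N_LAGS = 32  # zero delay plus the 31 preceding conversion values

def lag_products(samples, n_lags=N_LAGS):
    """Yield, for each new sample, its products with itself and its predecessors.

    Element j of each yielded list is x[n] * x[n - j]; samples before the
    start of the input are treated as zero (an assumption about start-up).
    """
    history = deque([0] * n_lags, maxlen=n_lags)  # stands in for the shift register
    for x in samples:
        history.appendleft(x)  # newest value enters, oldest value drops out
        yield [x * past for past in history]

products = list(lag_products([1, 2, 3], n_lags=3))
# products == [[1, 0, 0], [4, 2, 0], [9, 6, 3]]
```

Each yielded list is one "instantaneous" set of products; on its own it says little about the spectrum, which is why the products are subsequently averaged.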

As will also be understood by those skilled in the art, the products from a single conversion, together with its 31 predecessors, are not fairly representative of the energy distribution or spectrum over a reasonable sampling interval.

Accordingly, the apparatus of FIG. 2 provides for averaging of these sets of products.

An accumulation process, which effects the averaging, is provided by a 32-word shift register 63, which is interconnected with an adder 65 to form a set of 32 accumulators. Thus, each word can be recirculated after having been added to the corresponding increment from the digital multiplier. The circulation loop passes through a gate 67, which is controlled by a divide-by-N divider circuit 69 driven by the low frequency clock signal. The divider 69 divides the lower frequency clock by a factor that determines the number of instantaneous autocorrelation functions that are accumulated, and thus averaged, before the shift register 63 is read out.

In the illustrated embodiment, 80 samples are accumulated before being read out. In other words, N for the divide-by-N divider circuit 69 is equal to 80. After 80 conversion samples have been correlated and accumulated, the divider circuit 69 triggers a computer interrupt circuit 71 over a line 72. At this time, the contents of the shift register 63 are successively read into the computer memory through a suitable interface circuit 73.

The 32 successive words in the register are presented in ordered sequence to the computer through the interface 73. As will be understood by those skilled in the art, this data transfer from a peripheral unit, the autocorrelator preprocessor, to the computer would typically be performed by a direct memory access procedure. Since 80 samples are averaged at an initial sampling rate of 8,000 samples per second, it will be seen that 100 averaged autocorrelation functions are provided to the computer every second.

While the shift register contents are being read out to the computer, the gate 67 is closed, so that each of the words in the shift register is effectively reset to zero, permitting the accumulation process to begin again.
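The accumulate, read out, and reset cycle can be sketched as follows. This is a software illustration under assumptions, not a description of the circuit: the history buffer starts at zero, the accumulators are plain 64-bit integers, and the name `accumulate_frames` is hypothetical.

```python
import numpy as np

N_LAGS = 32        # words in the accumulator shift register
N_ACCUMULATE = 80  # setting of the divide-by-N circuit in the illustrated embodiment

def accumulate_frames(samples, n_lags=N_LAGS, n_acc=N_ACCUMULATE):
    """Accumulate lag products and emit one frame every n_acc samples."""
    samples = np.asarray(samples, dtype=np.int64)
    history = np.zeros(n_lags, dtype=np.int64)  # recent samples, newest first
    acc = np.zeros(n_lags, dtype=np.int64)      # the set of accumulators
    frames = []
    for count, x in enumerate(samples, start=1):
        history = np.roll(history, 1)
        history[0] = x
        acc += x * history              # add this sample's lag products
        if count % n_acc == 0:
            frames.append(acc.copy())   # readout to the computer
            acc[:] = 0                  # accumulators cleared for the next frame
    return np.array(frames)

# One second of input at 8,000 samples per second.
frames = accumulate_frames(np.ones(8000, dtype=np.int64))
print(frames.shape)
```

One second of input yields an array of shape (100, 32), in line with the rate of 100 autocorrelation functions per second stated in the text.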

Expressed in mathematical terms, the operation of the apparatus shown in FIG. 2 can be described as follows.

Assuming that the analog-to-digital converter generates the time series $S(t)$, where $t = 0, T_0, 2T_0, \ldots$ and $T_0$ is the sampling interval (1/8000 second in the illustrated embodiment), the illustrated digital correlation circuitry of FIG. 2 may be considered, ignoring start-up ambiguities, to compute the autocorrelation function

$$a(j, t) = \sum_{k=0}^{79} S(t + kT_0)\, S\bigl(t + (k - j)T_0\bigr) \qquad (1)$$

where $j = 0, 1, 2, \ldots, 31$ and $t = 80T_0, 160T_0, \ldots, 80nT_0, \ldots$. These autocorrelation functions correspond to the correlation output on lines 19 of FIG. 1.
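Equation (1) can be evaluated directly for a stored signal. The sketch below assumes that the sum runs over the 80 accumulated samples ($k = 0, \ldots, 79$), that the delay index $j$ runs from 0 to 31, and that time is measured in whole sample periods; the function name `autocorrelation_eq1` and its parameters are hypothetical.

```python
import numpy as np

def autocorrelation_eq1(s, frame_index, n_lags=32, n_sum=80):
    """Evaluate a(j, t) for j = 0..n_lags-1 at t = frame_index * n_sum samples.

    frame_index must be at least 1 so that the delayed samples exist, and the
    signal must extend at least n_sum samples beyond t.
    """
    s = np.asarray(s, dtype=float)
    t = frame_index * n_sum
    if t - (n_lags - 1) < 0 or t + n_sum > len(s):
        raise ValueError("signal too short for the requested frame")
    k = np.arange(n_sum)
    return np.array([np.sum(s[t + k] * s[t + k - j]) for j in range(n_lags)])

# A 1 kHz cosine sampled at 8 kHz has a period of exactly 8 samples.
n = np.arange(400)
tone = np.cos(2.0 * np.pi * 1000.0 * n / 8000.0)
a = autocorrelation_eq1(tone, frame_index=1)
```

For this test tone the result is itself a cosine in the delay index, $a(j) = 40\cos(\pi j/4)$, which is a convenient check that the delay sign and summation limits are handled as intended.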

Referring to FIG. 3, the digital correlator operates continuously to transmit to the computer a series of data blocks at the rate of one complete autocorrelation function every 10 milliseconds. This is indicated at 77 in FIG. 3. Each block of data represents the autocorrelation function derived from a corresponding subinterval of time. As noted above, the illustrated autocorrelation functions are provided to the computer at the rate of 100 32-word functions per second. This analysis interval is referred to hereinafter as a "frame."

In the first illustrated embodiment, the processing of the autocorrelation function data is performed by an appropriately programmed special purpose digital computer. The flow chart, which includes the functions provided by the computer program, is shown in FIG. 3. It should be noted, however, that various of the steps could also be performed by hardware (as described below) rather than software, and that certain of the functions performed by the apparatus of FIG. 2 could also be performed in software by a corresponding revision of the flow chart of FIG. 3.

Although the digital correlator of FIG. 2 performs a time-averaging operation on the instantaneously generated autocorrelation functions, the average correlation functions read out to the computer contain some anomalous discontinuities or unevenness which might interfere with the orderly processing and evaluation of the samples.

Accordingly, each block of data, that is, each autocorrelation function a(j,t), is first smoothed with respect to time. This is indicated at 78 in the flow chart of FIG. 3. The preferred smoothing process is one in which the smoothed autocorrelation output a_s(j,t) is given by

$$a_s(j,t) = C_0\,a(j,t) + C_1\,a(j,t-T) + C_2\,a(j,t-2T) \qquad (2)$$

where a(j,t) is the unsmoothed input autocorrelation function defined in equation (1), a_s(j,t) is the smoothed autocorrelation output, j denotes the delay time, t denotes real time, and T denotes the time interval (frame) between successively generated autocorrelation functions, equal to 0.01 second in the preferred embodiment. The weighting constants C_0, C_1, C_2 are preferably chosen to be 1/4, 1/2, 1/4 in the illustrated embodiment, although other values could be chosen. For example, a smoothing function approximating a Gaussian impulse response with a cutoff frequency of 20 Hz could be implemented in the computer software. Experiments indicate, however, that the easily implemented smoothing function of equation (2) gives satisfactory results.

As noted above, the smoothing function is applied separately for each value j of delay.

The following analysis involves various operations on the short-term Fourier power spectrum of the speech signal; however, for reasons of hardware simplicity and processing speed, the transformation of the autocorrelation function to the frequency domain is carried out in eight-bit arithmetic in the illustrated embodiment. At the high end of the passband, near 3 kHz, the spectral power density decreases to a level at which resolution is insufficient in eight-bit quantities. Therefore, the frequency response of the system is tilted at a rising rate of 6 dB per octave. This is indicated at 79. This high-frequency emphasis is accomplished by taking the second derivative of the autocorrelation function with respect to its argument, that is, the time delay or lag. The derivative operation is

$$b(j,t) = -a(j+1,t) + 2a(j,t) - a(j-1,t) \qquad (3)$$

To evaluate the derivative for j = 0, it is assumed that a(-j,t) = a(+j,t), since the autocorrelation function is symmetric about zero. Also, there is no data for a(32,t), so the derivative at j = 31 is taken to be the same as the derivative at j = 30.

As shown in the flow chart of FIG. 3, the next step in the analysis procedure, after high-frequency emphasis, is to estimate the signal power in the current frame interval by finding the peak absolute value of the autocorrelation. The power estimate P(t) is

$$P(t) = \max_i\,|b(i,t)| \qquad (4)$$

To prepare the autocorrelation function for the eight-bit spectrum analysis, the smoothed autocorrelation function is block normalized with respect to P(t) (at 80), and the most significant eight bits of each normalized value are input to the spectrum analysis hardware.

The normalized and smoothed autocorrelation function is therefore

$$c(j,t) = 127\,b(j,t)/P(t) \qquad (5)$$

As indicated at 81, a cosine Fourier transform is then applied to each time-smoothed, frequency-emphasized, normalized autocorrelation function to generate a 31-point power spectrum. The matrix of cosine values is given by

$$S(i,j) = 126\,g(i)\cos\!\left(\frac{2\pi i}{8000}\,f(j)\right), \qquad j = 0, 1, 2, \ldots, 31 \qquad (6)$$

where S(i,j) is the spectral energy in a band centered at f(j) Hz at time t, g(i) = (1/2)(1 + cos(2πi/63)) is the (Hamming) window function envelope used to reduce side lobes, and

$$f(j) = 30 + 1000\,(0.0552\,j + 0.438)^{1/0.63}\ \text{Hz}, \qquad j = 0, 1, 2, \ldots, 31 \qquad (7)$$

are the analysis frequencies, equally spaced on the so-called "mel" curve of subjective musical pitch. As will be understood, this corresponds to a subjective pitch (mel scale) frequency-axis spacing for frequencies in the bandwidth of a typical communication channel of about 300 to 3500 Hz.

Since the spectrum analysis requires summation over lags from -31 to +31, only the positive values of the lag are needed if the autocorrelation is assumed to be symmetric about zero. To avoid counting the lag-zero term twice, however, the cosine matrix is adjusted so that

$$S(0,j) = 126/2 = 63 \quad \text{for all } j \qquad (8)$$

The computed power spectrum S'(j,t) is thus obtained by applying this matrix to the normalized autocorrelation function (equation (9)), where the j-th result corresponds to the frequency f(j). As will also be understood, each point or value within each spectrum represents a corresponding band of frequencies.
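
The analysis frequencies of equation (7) and the cosine matrix of equations (6) and (8) can be tabulated with a short Python sketch. All function names are hypothetical. The final `power_spectrum` step is an assumption: the exact form of equation (9) is not legible in this text, so the sketch simply applies the matrix to one normalized frame and takes the magnitude.

```python
import math

def analysis_frequency(j):
    """Equation (7): centre frequency in Hz of band j."""
    return 30.0 + 1000.0 * (0.0552 * j + 0.438) ** (1.0 / 0.63)

def window(i):
    """Window envelope g(i) = (1 + cos(2*pi*i/63)) / 2."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * i / 63.0))

def cosine_matrix(n=32, sample_rate=8000.0):
    """Equations (6) and (8): S(i, j) for lags i and bands j."""
    matrix = [
        [126.0 * window(i)
         * math.cos(2.0 * math.pi * i * analysis_frequency(j) / sample_rate)
         for j in range(n)]
        for i in range(n)
    ]
    matrix[0] = [63.0] * n  # equation (8): halve the lag-zero row
    return matrix

def power_spectrum(c, matrix):
    """ASSUMED form of the transform: |sum_i c(i) * S(i, j)| for each band j."""
    n = len(c)
    return [abs(sum(c[i] * matrix[i][j] for i in range(n))) for j in range(n)]
```

Evaluating `analysis_frequency` at j = 0 and j = 31 gives roughly 300 Hz and 3400 Hz, so the 32 bands span approximately the bandwidth of a telephone channel.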

Although this Fourier transform can be performed entirely within conventional computer hardware, the process may be speeded considerably if an external hardware multiplexer or fast Fourier transform (FFT) peripheral device is used. The construction and operation of such modules are well known in the art, however, and are not described in detail here. Advantageously, a frequency smoothing function is built into the hardware fast Fourier transform peripheral, in which case each spectrum is smoothed in frequency according to the preferred (Hamming) window weighting function g(i) defined above. This is carried out at 83 of block 85, which corresponds to the hardware implementation of the Fourier transform.

If the background noise is significant, an estimate of the power spectrum of the background noise should be subtracted from S'(j,t) at this stage.

The frame or frames selected to represent the noise must not contain any speech signal.

The optimum rule for selecting noise frame intervals will vary with the application. If the speaker is engaged in two-way communication with, for example, a machine controlled by the speech recognition apparatus, it is convenient to choose a frame arbitrarily in the interval immediately after the machine has finished speaking through its voice response unit. In less constrained situations, the noise frame may be found by choosing the frame of minimum amplitude during the past one or two seconds of audio input. As described in detail later, the use of a minimum-amplitude "silence" pattern, in fact two alternate "silence" patterns, provides clearly advantageous apparatus operation.

As successive smoothed power spectra are received from the fast Fourier transform peripheral 85, communication channel equalization is obtained by determining a (generally different) peak power spectrum envelope for the spectra from peripheral 85 and modifying the output of the fast Fourier transform apparatus accordingly, as described below. Each newly generated peak amplitude spectrum p(j,t), corresponding to and updated by an incoming windowed power spectrum S'(j,t), where j is indexed over the plural frequency bands of the spectrum, is the result of a fast attack, slow decay, peak detecting function for each of the spectrum channels or bands. The windowed power spectra are normalized with respect to the respective terms of the corresponding peak amplitude spectrum. This is indicated at 87.

In the illustrated embodiment, the values of the "old" peak amplitude spectrum p(j,t-T), determined prior to receiving a new windowed spectrum, are compared with the newly arrived spectrum S'(j,t) on a frequency band by frequency band basis. The new peak spectrum p(j,t) is then generated according to the following rules. The power amplitude in each band of the "old" peak amplitude spectrum is multiplied by a fixed fraction, for example 1023/1024 in this embodiment.

This corresponds to the slow decay portion of the peak detecting function.

If the power amplitude in a frequency band j of the incoming spectrum S'(j,t) is greater than the power amplitude in the corresponding frequency band of the decayed peak amplitude spectrum, then the decayed peak amplitude spectrum value for that frequency band (or bands) is replaced by the spectrum value of the corresponding band of the incoming windowed spectrum. This corresponds to the fast attack portion of the peak detecting function. Mathematically, the peak detecting function can be expressed as

$$p(j,t) = \max\{\,p(j,t-T)\,(1-E),\; P(t)\,S'(j,t)\,\}, \qquad j = 0, 1, \ldots, 31 \qquad (10)$$

where j is indexed over each of the frequency bands, p(j,t) is the resulting peak spectrum, p(j,t-T) is the "old" or previous peak spectrum, S'(j,t) is the newly arrived, partially processed power spectrum, P(t) is the power estimate at time t, and E is the decay parameter.

According to equation (10), the peak spectrum normally decays, in the absence of a higher-valued spectrum input, by a factor of 1 - E. Typically E equals 1/1024. It may be undesirable, however, to permit the peak spectrum to decay during intervals of silence, particularly when no rapid change in the communication channel or voice characteristics is expected. To define a silence frame, the same method is used as was employed to choose the background noise frames. The amplitudes (square root of P(t)) of the past 128 frames are examined and the minimum value is found. If the amplitude of the current frame is less than four times this minimum, the current frame is determined to be silence and the value zero is substituted for E in place of the value 1/1024.

After the peak spectrum is generated, the resulting peak amplitude spectrum p(j,t) is frequency smoothed (at 89) by averaging each frequency band peak value with the peak values corresponding to the adjacent frequencies of the newly generated peak spectrum. The width of the overall band of frequencies contributing to the average value is approximately equal to the typical frequency separation between formant frequencies. As will be understood by those skilled in the speech recognition art, this separation is of the order of 1000 Hz. By averaging in this particular way, the useful information in the spectra, that is, the local variations revealing formant resonances, is retained, whereas overall or gross emphasis in the frequency spectrum is suppressed. In the preferred embodiment, the peak spectrum is smoothed with respect to frequency by a moving average function covering seven adjacent frequency bands. The averaging function (equation (11)) forms the smoothed peak spectrum e(j,t) by combining the values p(k,t) of the seven bands centered on band j, weighted by a normalizing envelope h(j); at the ends of the passband, p(k,t) is taken to be zero for k less than 0 and k greater than 31.

The normalizing envelope h(j) takes into account the number of valid data elements actually summed. Thus h(0) = 7/4, h(1) = 7/5, h(2) = 7/6, h(3) = 1, ..., h(28) = 1, h(29) = 7/6, h(30) = 7/5, and h(31) = 7/4. The resulting smoothed peak amplitude spectrum e(j,t) is then employed to normalize and frequency equalize the just-received power spectrum S'(j,t), by dividing the amplitude value of each frequency band of the incoming smoothed spectrum S'(j,t) by the corresponding frequency band value of the smoothed peak spectrum e(j,t). Mathematically, this corresponds to

$$s_n(j,t) = \frac{S'(j,t)}{e(j,t)} \cdot 32767 \qquad (12)$$

where s_n(j,t) is the peak-normalized, smoothed power spectrum and j is indexed over each of the frequency bands. This step is indicated at 91. There results a sequence of frequency equalized and normalized short-term power spectra which emphasizes changes in the frequency content of the incoming speech signals while suppressing any generalized long-term frequency emphasis or distortion. This method of frequency compensation has been found to be highly advantageous in the recognition of speech signals transmitted over frequency-distorting communication links such as telephone lines, in comparison to the usual systems of frequency compensation in which the basis of compensation is the average power level, either in the whole signal or in each respective frequency band.

It should be noted that, although the successive spectra have been variously processed and equalized, the data representing the incoming audio signal still comprises spectra occurring at a rate of one hundred per second.

The spectra normalized and frequency equalized as indicated at 91 are then subjected, as also indicated at 91, to an amplitude transformation. This is done by applying a nonlinear scaling operation to the spectrum amplitude values.

Taking the individual equalized and normalized spectra as s_n(j,t) (from equation (12)), where j indexes the different frequency bands of the spectrum and t denotes real time, the nonlinearly scaled spectrum x(j,t) is defined by a linear fraction function of s_n(j,t) and A (equation (13)), where A is the average value of the spectrum s_n(j,t) over the frequency bands j of the power spectrum (equation (14)).

The 31st term of the spectrum is replaced by the logarithm of A, so that

$$x(31,t) = 16 \log A \qquad (15)$$

This scaling function (equation (13)) produces a soft threshold and gradual saturation effect for spectral intensities which deviate greatly from the short-term average A. Mathematically, it is approximately linear for intensities near the average, approximately logarithmic for intensities further from the average, and substantially constant for extreme values of intensity. On a logarithmic scale, the function x(j,t) is symmetric about zero and exhibits threshold and saturation behavior suggestive of an auditory nerve firing-rate function. In practice, the overall recognition system performs considerably better with this particular nonlinear scaling function than it does otherwise.

There is thus generated a sequence of amplitude transformed, frequency-response equalized, normalized, short-term power spectra x(j,t), where t = 0.01, 0.02, 0.03, 0.04, ... seconds and j = 0, ..., 30 (corresponding to the frequency bands of the generated power spectra). Thirty-two words are provided for each spectrum, and the value of A (equation (15)), the average value of the spectrum values, is stored as the 32nd word. The amplitude transformed, short-term power spectra, hereinafter referred to as "frames", are stored, as indicated at 95, in a first-in, first-out circulating memory having storage capacity, in the illustrated embodiment, for 256 thirty-two-word spectra. There is thus made available for analysis, in the illustrated embodiment, 2.56 seconds of the audio input signal.

This storage capacity provides the recognition system with the flexibility, if required, to select spectra at different real times for analysis and evaluation, and thus with the ability to go forward and backward in time as the analysis requires. Thus, the frames for the last 2.56 seconds are stored in the circulating memory and are available as needed. In operation, in the illustrated embodiment, each frame is stored for 2.56 seconds. A frame which enters the circulating memory at time t1 is therefore lost, or shifted out of the memory, 2.56 seconds later, as a new frame corresponding to time t1 + 2.56 seconds is stored.

The frames passing through the circulating memory are compared, preferably in real time, against a known vocabulary of words to determine and identify the input data in word groups called word strings. Each vocabulary word is represented by a template pattern statistically representing a plurality of processed power spectra formed into plural non-overlapping multi-frame (preferably three-frame) design set patterns. These patterns are preferably selected to best represent significant acoustical events of the vocabulary word, and are stored at 94.

The spectra forming the design set patterns are generated for words spoken in various contexts, using the same system described above for processing the continuous unknown speech input on line 10 as shown in FIG. 1.

Thus, each vocabulary word has associated with it a generally plural sequence of design set patterns P(i)_1, P(i)_2, ..., each of which gives, in the domain of short-term power spectra, one designation of the i-th keyword. The collection of design set patterns for each keyword forms the statistical basis from which the target patterns are generated.

In the illustrated embodiment of the invention, each design set pattern P(i)_j can be considered a 96-element array comprising three selected frames arranged in series. The frames forming a pattern should be spaced at least 30 milliseconds apart to avoid spurious correlation due to the time-domain smoothing. In other embodiments of the invention, other sampling strategies can be implemented for choosing the frames; the preferred strategy, however, is to select frames spaced by a constant time duration, preferably 30 milliseconds, and to space the non-overlapping design set patterns throughout the time interval defining the keyword. That is, a first design set pattern P_1 corresponds to a portion of the keyword near its beginning, a second pattern P_2 corresponds to a portion later in time, and so on, and the patterns P_1, P_2, ... form the statistical basis for the series or sequence of target patterns, the word template, against which the incoming audio data will be matched. The target patterns t_1, t_2, ... each comprise statistical data generated from the corresponding P(i)_j by assuming that the P(i)_j are comprised of independent Laplacian variables. This assumption enables likelihood statistics to be generated between the incoming data and the target patterns, as described below. Thus, the target patterns consist of an array whose entries comprise the mean, standard deviation, and area normalization factor for the corresponding collection of design set pattern array entries. A more refined likelihood statistic is described later.

It will be obvious to those skilled in the art that substantially all words have more than one contextual and/or regional pronunciation, and hence more than one "spelling" of design set patterns. Thus, a vocabulary word having the patterned spelling P_1, P_2, ... referred to above can in actuality be generally expressed as P(i)_1, P(i)_2, ..., i = 1, 2, ..., M.

Here each of the P(i)_j is a possible alternative description of the j-th class of design set patterns, there being a total of M different spellings for the word.

The target patterns t_1, t_2, ..., t_i, ..., in the most general sense, therefore each represent plural alternative statistical spellings for the i-th group or class of design set patterns. In the illustrated embodiment described herein, the term "target pattern" is thus used in the most general sense, and each target pattern may therefore have more than one permissible alternative "statistical spelling".

Preprocessing of the incoming unknown audio signals and of the audio forming the reference patterns is now complete. The processing of the stored spectra is described next.

A deeper study of the keyword recognition method of concatenating speech patterns into detected words, described in U.S. Pat. Nos. 4,241,329, 4,227,176, and 4,227,177, has shown that it is a special case of a more general and possibly superior recognition method.

Referring to FIG. 4, the word recognition search can be represented as the problem of finding an appropriate path through an abstract state space. In the figure, each circle represents a possible state, also designated a dwell time position or register, through which the decision-making process can pass. The space between the vertical dashed lines 120, 122 represents each of the hypothetical states through which the decision-making process can pass in determining whether a pattern matches or does not match a current phoneme. This space is divided into a required dwell time portion 124 and an optional dwell time portion 126. The required dwell time portion represents the minimum duration of the particular "current" phoneme or pattern.

Each circle within the optional or required dwell time portions represents one frame time of the continuum of formed frames and corresponds to the 0.01 second interval from frame to frame. Thus, each circle identifies a hypothesized current phonetic position in a word spelling and, together with the number of (0.01 second) frames hypothesized to have elapsed since the current phoneme began, corresponding to the number of earlier circles in that phoneme or target pattern, represents the present duration of the pattern. After a pattern (phoneme) has begun and the minimum dwell time interval has elapsed, there are several possible paths of advancing to the first node or position (circle) 128 of the next target pattern (phoneme). This depends upon when the decision to move to the next pattern (phoneme) of the spelling is made. These decision possibilities are represented in the figure by the several arrows leading to circle 128.

The beginning point of the next pattern (phoneme) is represented by circle 128. The transition to the next pattern can be made from any node or position in the optional dwell time of the current pattern (phoneme), or from the last node of the required dwell time interval.

The keyword recognition method described in U.S. Pat. Nos. 4,241,329, 4,227,176, and 4,227,177 makes the transition at the first such node for which the likelihood score relative to the next pattern (phoneme) is better than the likelihood score relative to the current pattern (phoneme); that is, at the point where a frame matches the next phoneme or pattern better than the current phoneme or pattern. The total word score, however, is the average pattern (phoneme) score per frame (that is, per node included in the path). The same definition of "total score" applied to a word score up to the current node can be used to decide when to make the transition; that is, whether to make the transition to the next pattern at a first opportunity, corresponding for example to transition indicating line 130, or at a later time, corresponding for example to transition indicating line 132. Optimally, one chooses that path into the next pattern (phoneme) for which the average score per node is best. Since the standard keyword method described in U.S. Pat. Nos. 4,241,329, 4,227,176, and 4,227,177 does not examine any of the potential paths after it has made the decision to move to the next pattern (phoneme), it will make only an approximately optimal decision as measured by average score per node.

したがうて本発明は、キーワード認識に平均スコア／節
法を採用する。問題は、追って詳細に説明されるワード
ストリング認識と関連して起こり、含まれる節点の数に
よりすべての部分的ワードスコアを標準化するか（これ
は計算上不効率的である）、あるいは累積値をバイアス
して明白な標準化を不必要としなければならない。クロ
ーズトポキャブラリ−タスクにおいて使用すべき自然バ
イアス値は、現在の分析時間で終了する最良のワードに
対する不標準化スコアである。したが２て、全節点にお
ける累積スコアは、つねに、同じ数の基本的パターンス
コアの総和となろう。さらに、スコアは、このバイアス
値により現在の分析節点で終わる最良のワードストリン
グのスコア□・１に変換される。Accordingly, the present invention adopts the average score per node method for keyword recognition. A problem arises in connection with word string recognition, described in detail later: either all partial word scores must be normalized by the number of nodes included (which is computationally inefficient), or the accumulated values must be biased so that explicit normalization is unnecessary. A natural bias value to use in the closed vocabulary task is the unnormalized score of the best word ending at the current analysis time. The accumulated scores at all nodes will then always be the sum of the same number of elementary pattern scores. Furthermore, this bias value transforms the score into the score of the best word string ending at the current analysis node.

平均スコア／節点による決定法は、米国特許第４．２２
８，４９８号に記載されるベクトルプルセッサで動的プ
寵グラミング技術を使用することにより効率的に実施で
きる。この態様でプログラム設定されるとき、処理速度
は、より多くの仮定試験が必要とされるとしても、米国
特許第４，２４１，３２９号、第４，２２７，１７６号
および＄４，２２７，１７７号に記載される標準的キー
ワード認識法よりもずっと速い。The average score per node decision method can be implemented efficiently by using dynamic programming techniques on the vector processor described in U.S. Pat. No. 4,228,498. When programmed in this manner, the processing speed is much faster than that of the standard keyword recognition method described in U.S. Pat. Nos. 4,241,329, 4,227,176 and 4,227,177, even though more hypothesis testing is required.

一般的にいって、ワードストリングを認識するためには
、プログラムに、各分析節点で終わる仮定するのに最良
の語案ワードの名前を記憶させる。Generally speaking, to recognize word strings, the program stores the name of the best hypothesized vocabulary word ending at each analysis node.

また、この最良のワードが始まった節点（時間）も記憶
する。ついで、発声の終りからパックトレーシングし、
記憶されたワードの名前に留意し１現在のワードの指示
された開始点に次の前述のワードを見つけることにより
、最良のワードストリングが発見される。It also stores the node (time) at which this best word began. The best word string is then found by backtracing from the end of the utterance, noting the stored word name, and finding the next preceding word at the indicated starting point of the current word.

飴粟ワードとしてサイレントを含ませると、ワードスト
リングに含まれるワードの数を特定することは不必要と
なる０ストリングを見つけるためのバックトラッキング
の動作は、サイレントワードが最良のスコアを有すると
きに実行され、そして先のサイレントが次に検出される
ときに終了する。かくして、話者が息を止める度にスト
リングが見出される。Including silence as a vocabulary word makes it unnecessary to specify the number of words contained in the word string. The backtracking operation that finds the string is executed whenever the silence word has the best score, and it terminates at the point where the preceding silence was detected. Thus, a string is found every time the speaker pauses for breath.

ここに記述されるワードストリング織別法は、個々のキ
ーワードの検出よりも抽出しベルが高い方法である。ワ
ードストリングスコアにより、発声中のすべての音声を
あるワードストリングに強性的に含ませるから、単純な
ワードスポツティング法よりも有利である。後者の方法
は、長いワード中に誤植のワードを検出することが多い
。The word string identification method described herein is one level of abstraction higher than the detection of individual keywords. Because the word string score forces every sound in the utterance to be included in some word of the string, it has an advantage over the simple word spotting approach, which frequently detects false words within longer words.

有利なことは、ワードストリングケースにタイミングパ
ターンが必要でないことである。これは、ワード連結器
が各ワード終了の仮定ごとにワード開始時間を出力する
からである。もつとも簡単なストリング連結器は、・こ
れらのワード開始時間が正しいことを仮定する。サイレ
ントの検出で、ワードストリングがいま終ったこと、お
よび（、最後のワードの開始点が先のワードの終了点で
（これもサイレントの場合もある）あることを仮定する
０通常、ストリングの各ワード対間には文脈に依存する
変換はないから、装置で、先行のワードの最良の終了点
を見つけるように各ワードの開始点の近傍を探索するこ
とができるようＫするのが好ましかろう。Advantageously, no timing patterns are needed for the word string case, because the word concatenator outputs a word starting time for each word ending hypothesis. The simplest string concatenator assumes that these word starting times are correct. On detecting silence, it assumes that the word string has just ended and that the starting point of the last word is the ending point of the preceding word (which may itself be silence). Since there is usually no context-dependent transition between each pair of words in the string, it may be preferable to allow the apparatus to search the neighborhood of each word starting point in order to find the best ending point of the preceding word.

次に、ハードフェアおよびソフトウェアの具体例を含む
方法および装置についてＷＰａＦＣ説明する。The method and apparatus, including hardware and software embodiments, are now described in more detail.

第３．図を参照して説明すると、まず、到来連続音声デ
ータを表わす９５で記憶されたスペクトルまたはフレー
ムは、下記の方法にしたがって語索のキーワードを表わ
す記憶されたターゲットパターンテンプレート（９６）
と比較される。Referring to FIG. 3, the spectra, or frames, stored at 95 and representing the incoming continuous speech data are first compared with the stored target pattern templates (96) representing the vocabulary keywords, according to the following method.

各１０ミリ秒のフレームに対して、記憶された基準パタ
ーンと比較のためのバター／は、現在のスヘクトルベク
トルｓ（ｊ、、　ｔ）、３フレーム前のスペクトルｓ　
（Ｊ、　ｔ、−ｏ、ｏ３）　、および６フレーム前のス
ペクトルｓＱ、　ｔ−０，０６）を防接させて下記の９
６要素パターンを形成することにより９７で形成される
〇上述のように、記憶された基準パターンは、認識される
べき種々の音声パターンクラスに属する先に集められた
９６要素パターンの平均値、標準偏差およびエリヤ標準
化ファクタより成る。比較は、入力音声が特定のり２ス
に属することを予測するｆＤｉｘ（ｊ、ｔ）の確率モデ
ルにより遂行される。For each 10 millisecond frame, a pattern for comparison with the stored reference patterns is formed at 97 by adjoining the current spectrum vector s(j, t), the spectrum s(j, t - 0.03) from three frames ago, and the spectrum s(j, t - 0.06) from six frames ago, to form a 96-element pattern x(j, t). As noted above, the stored reference patterns consist of the mean values, standard deviations, and area normalizing factors of previously collected 96-element patterns belonging to the various speech pattern classes to be recognized. The comparison is accomplished by a probability model of the values x(j, t) to be expected if the input speech belongs to a particular class.
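The adjoining of three spectra into one pattern can be sketched as follows. This is an illustration only: it assumes each spectrum frame has 32 elements (96 divided by 3), that the current frame comes first in the pattern, and that frames are held in a simple list; the function name is hypothetical.

```python
def make_pattern(frames, t):
    """Adjoin frame t, frame t-3 and frame t-6 (10 ms frames, so 0.03 s and
    0.06 s back) into one pattern vector."""
    if t < 6:
        raise ValueError("need at least six earlier frames")
    return list(frames[t]) + list(frames[t - 3]) + list(frames[t - 6])


# Ten dummy frames of 32 elements each; frame i is filled with the value i.
frames = [[float(i)] * 32 for i in range(10)]
pattern = make_pattern(frames, 9)
print(len(pattern), pattern[0], pattern[32], pattern[64])
```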

確率モデルについてはガウスの分布を利用できるが（例
えば上述の米国特許第４，２４１，３２９号、第４．２
２７，１７６号および第４，２２７，１７７号参照）、
ラプラス分布、すなわち１）（Ｘ）＝　（１／ｖ’２　ｓ’）　ｅｘｐ　−（Ｊ
２１ｘ−ｍ）／ｓ’　）（ここにｍは統計平均、Ｓは変
数ｘｆ）標準偏差である）は、計算が少なくてすみ、−
例えば米国特許第４，０３８，５０３号に記載される話
者に不依存性の隔絶ワード認識法におけるガウスの分布
とほとんど同様に機能することが分った。未知の入カッ
（ターンＩと第に番目の記憶基準パターン間の類似の程
度Ｌ　（ｘｌｋ）は、確率の対数に比例し、次の式で１
００で算出される。Gaussian distributions can be used for the probability model (see, e.g., U.S. Pat. Nos. 4,241,329, 4,227,176 and 4,227,177, noted above), but the Laplace distribution, $p(x) = \frac{1}{\sqrt{2}\,s'} \exp\left(-\frac{\sqrt{2}\,|x - m|}{s'}\right)$ (where m is the statistical mean and s' is the standard deviation of the variable x), requires less computation and has been found to perform nearly as well as the Gaussian distribution in, for example, the speaker-independent isolated word recognition method described in U.S. Pat. No. 4,038,503. The degree of similarity L(x|k) between an unknown input pattern x and the k-th stored reference pattern is proportional to the logarithm of the probability and is computed at 100 by the following formula.

ココテ、Ａｋ＝工’Ｚ　　Ｉｎ　８’１ｋ２１＝１一連のパターンの尤度スコアＬを結合して話されたワー
ドまたはフレーズの尤度スコアを形成するため、各フレ
ームに対するスコアＬ（ＸＩＸ）は−そのフレームに対
する全基準パターンの最良の（最小の）スコアを減する
ことにより調節される。すなわち、　Ｌ’（ｘｌｋ）　
＝　Ｌ　（ｘｌｋ）　−ｍｉｎ　Ｌ　（ｘｉ　ｉ　）　
　　（ＩＩしたがって、各フレームに対する最良の適合
パターンは、０のスコアを有するであろう。仮定された
一連のパターンに対する調節されたスコアは、フレーム
ごとに累積され、指示された一連のシーケンスを支持す
る決定が正しい決定となるような、確率に直接）ζ関係
づけられたシーケンススコアを得ることができる。Here, $A_k = \sum_{i=1}^{96} \ln s'_{ik}$. To combine the likelihood scores L of a sequence of patterns into the likelihood score of a spoken word or phrase, the score L(x|k) for each frame is adjusted by subtracting the best (smallest) score of all the reference patterns for that frame, that is, $L'(x|k) = L(x|k) - \min_i L(x|i)$ (18). The best-fitting pattern for each frame will therefore have a score of zero. The adjusted scores for a hypothesized sequence of patterns can be accumulated from frame to frame to obtain a sequence score that is related directly to the probability that a decision in favor of the indicated sequence would be the correct decision.

記憶された既知のパターンに対する未知の入カスベクト
ルパターンの比較は、ｋ番目のパターン対する下記の関
数を計算することにより遂行される。すなわち、$q = \sum_{i=1}^{96} s_{ik}\,|x_i - u_{ik}| + c_k$ (19)。ここに、$s_{ik}$ は $1/s'_{ik}$ に等しい。The comparison of the unknown input spectrum pattern against the stored known patterns is accomplished by computing the following function for the k-th pattern: $q = \sum_{i=1}^{96} s_{ik}\,|x_i - u_{ik}| + c_k$ (19), where $s_{ik}$ equals $1/s'_{ik}$.
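A minimal sketch of the score of equation (19) and of the per-frame adjustment (subtracting the best score so that the best-fitting pattern scores zero) might look as follows. It assumes that the constant c_k is simply the sum of ln s'_ik (the term called A_k in the text); the source does not state that relationship explicitly, and the function names are hypothetical. Two-element vectors stand in for the 96-element patterns.

```python
import math


def reference_constant(std):
    """Assumed constant term: sum over i of ln s'_ik."""
    return sum(math.log(s) for s in std)


def likelihood_score(x, mean, std):
    """q = sum_i s_ik * |x_i - u_ik| + c_k, with s_ik = 1 / s'_ik (smaller is better)."""
    weighted = sum(abs(xi - ui) / si for xi, ui, si in zip(x, mean, std))
    return weighted + reference_constant(std)


def adjusted_scores(x, references):
    """Subtract the best (smallest) score over all reference patterns."""
    raw = [likelihood_score(x, mean, std) for mean, std in references]
    best = min(raw)
    return [r - best for r in raw]


x = [1.0, 2.0]
references = [([1.0, 2.0], [1.0, 1.0]),   # (means, standard deviations)
              ([3.0, 2.0], [1.0, 1.0])]
print(adjusted_scores(x, references))
```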

通常のソフトウェアで実施される計算においては、代数
関数５ｌｘ−ｕｌ（式１９）を計算するために下記の命
令が実行されよう。In an ordinary software-implemented computation, the following instructions would be executed to compute the algebraic function s|x - u| (of equation 19).

１、　ズーＵを計算せよＺ　スーＵの符号を試験せよＡｘ−ｕが負ならば、絶対値を形成するように否定せよ４、　　ｓと乗算せよ＆　結果をアキュムレータに加えよ２０−ワード語集を有する代表的音声認識−システム忙
おいては１約２２２の異なる基準ノ（ターンが設けられ
よう０これを求めるに必要とさくれるスーーパテップの数は、間接動作を含まないと、５Ｘ９６Ｘ２２
２＝１０５６０ステツプであり、これが、実時間スペク
トルフレーム速度に遅れないようにするため、１０ミリ
秒以内で実行されなければならない。それゆえ、プロセ
ッサは、尤度関数をＪ度求めるためＫは、はぼ１１００
万／秒の命令を実行できなければならない。必須の速度
を考慮に入れて、米国特許第４，２２８，４９８号に開
示されるベクトルプロセッサシステムと適合する専用の
尤度Ｒ数ハードウエアモジヱール２００（第４図）が採
用される。
1. Compute x - u.
2. Test the sign of x - u.
3. If x - u is negative, negate it to form the absolute value.
4. Multiply by s.
5. Add the result into an accumulator.
In a typical speech recognition system having a 20-word vocabulary, there would be about 222 different reference patterns. The number of steps required to evaluate them is then 5 x 96 x 222 = 106,560 steps, not including overhead operations, and this must be done within 10 milliseconds in order to keep up with the real-time spectrum frame rate. The processor must therefore be able to execute nearly 11 million instructions per second just to evaluate the likelihood functions. In view of the necessary speed, a special purpose likelihood function hardware module 200 (FIG. 4) is employed that is compatible with the vector processor system disclosed in U.S. Pat. No. 4,228,498.

この専泪ハードクエアにおいては、上述の５つのステッ
プが、２組の変数５ｓＸｓｕととも同時に遂行されるか
ら、実際には、１つの命令を実行するのに要する時間で
１０の命令が遂行される。In this special purpose hardware, the five steps listed above are performed simultaneously with two sets of the variables s, x, u, so that in effect ten instructions are performed in the time it takes to execute one instruction.

基本的ベクトルプロセッサは８００万（命令）／秒の速
度で動作するから、尤度関数に対する有効計算速度は、
専用ハードウェア２００が採用されると約８０００万（
命令）７秒となる。Since the basic vector processor operates at a rate of 8 million instructions per second, the effective computation rate for the likelihood function becomes about 80 million instructions per second when the special purpose hardware 200 is employed.
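The throughput figures quoted in this passage can be checked with a few lines of arithmetic. The constant names below are only illustrative; the numbers (five steps per element, 96 elements, about 222 reference patterns, 10 ms frames, 8 million instructions per second, ten steps in parallel) are those given in the text.

```python
STEPS_PER_ELEMENT = 5        # subtract, test sign, negate, multiply, accumulate
ELEMENTS_PER_PATTERN = 96
REFERENCE_PATTERNS = 222     # typical for a 20-word vocabulary, per the text
FRAME_PERIOD_S = 0.010       # one spectrum frame every 10 ms

steps_per_frame = STEPS_PER_ELEMENT * ELEMENTS_PER_PATTERN * REFERENCE_PATTERNS
required_rate = steps_per_frame / FRAME_PERIOD_S       # steps per second

BASE_RATE = 8_000_000        # basic vector processor, instructions per second
PARALLEL_STEPS = 10          # five steps on each of two data sets at once
effective_rate = BASE_RATE * PARALLEL_STEPS

print(steps_per_frame)            # 106560
print(round(required_rate))       # about 10.7 million per second
print(effective_rate)             # 80000000
```

The product 5 x 96 x 222 evaluates to 106,560 steps per frame, which corresponds to the "nearly 11 million instructions per second" requirement and is comfortably below the 80 million effective rate.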

第５図を参照すると、ノ１−ドウエアモジュール２００
は、１０のステップの同時の実行を可能にするため、ハ
ードウェアによるパイプ処理および並列処理の組合せを
採用している。２つの同一の部分２０２，２０４は、各
々、独立の入力データ変数について５つの算術演算ステ
ップを遂行しており、結果はその出力に接続された加算
器２０６により結合される。加算器２０６からの加算値
の累積は、式（１９）の１〜９６の加算であり、そして
この値は、米国特許第４，２８８，４９８号に記載され
る標準的ベクトルプロセッサの演算ユニットで処理され
る。Referring to FIG. 5, the hardware module 200 employs a combination of hardware pipelining and parallel processing to provide the simultaneous execution of the ten steps. Two identical sections 202, 204 each perform the five arithmetic steps on independent input data arguments, and the two results are combined by an adder 206 connected to their outputs. The accumulation of the sums from adder 206 forms the summation from 1 to 96 of equation (19), and it is handled by the arithmetic unit of the standard vector processor described in U.S. Pat. No. 4,228,498.

動作において、パイプフィン結合レジスタは、以下の処
理段階における中間データを保持する。In operation, pipeline registers hold the intermediate data at the following stages of the processing.

１、　入力変数（クロック作動レジスタ２０８．２１０
．２１２．２１４．２１６．２１８）Ｌｘ−ｕの絶対値
（クロック作動レジスタ２２０．２２２　）五　乗算器の出力（クロック作動レジスタ２２４．２２
６）入力データがクロック作動レジスタ２０８〜２１８に保
持されると、Ｘ−ｕの大きさが、減算・絶対値回路によ
り決定される。第６図を参照すると、減算・絶対値回路
２２ｇ、２３０は、各々第１および第２の減算器（一方
はｘ−ｕを算出、他方はｕ−Ｘを算出）および正の結果
を選択するためのマルチプレクサ２３６を備えている。
1. Input arguments (clocked registers 208, 210, 212, 214, 216, 218)
2. Absolute value of x - u (clocked registers 220, 222)
3. Output of the multiplier (clocked registers 224, 226)
With the input data held in clocked registers 208 to 218, the magnitude of x - u is determined by the subtract and absolute value circuits. Referring to FIG. 6, the subtract and absolute value circuits 228, 230 each contain first and second subtracters (one computing x - u, the other computing u - x) and a multiplexer 236 for selecting the positive result.

レジスタ２０８．２１０から出る１１２３Ｂ、２４０上
の入力変数ＸおよびＵは、それぞれ−１２８〜＋１２７
の８ビツト数である。８ビツト減算器の差出力は９ビツ
トにオーバーフローすることがあるから（例えば１２７
−（−１２８）＝２５５）、オーバーフロー状態を取り
扱うため余分の回路が採用される。状態はオーバーフロ
ー検出器２３５により決定される。しかして、その入力
は、「Ｉ」の符号（線２３５ａ上）、「ｕ」の符号（Ｍ
２３５ｂ上）およびｒｘ−ｕＪの符号（１ｓ２３５　ｃ
上）である。The input arguments x and u, carried over lines 238, 240 from registers 208, 210 respectively, are 8-bit numbers from -128 to +127. Since the difference output of an 8-bit subtracter can overflow to 9 bits (for example, 127 - (-128) = 255), extra circuitry is employed to handle the arithmetic overflow condition. The condition is determined by an overflow detector 235, whose inputs are the sign of x (over line 235a), the sign of u (over line 235b), and the sign of x - u (over line 235c).

次に第７図を参照すると１オーバーフロー検出器は、こ
の例示の具体例においては、３人力譚ωゲート２６８．
２７０およびＯＲゲート２７２を有する組合せ回路であ
る。第８図の真髄表は、オーバーフロー条件を入力の関
数として表わしている。Referring now to FIG. 7, the overflow detector is, in this illustrated embodiment, a combinatorial circuit having three-input AND gates 268, 270 and an OR gate 272. The truth table of FIG. 8 defines the overflow condition as a function of its inputs.

オーバー７日−条件は、マルチプレックサ２３６、（こ
れは正の減算器出力を選択する回路である）で４つの選
択を行なうことにより処理される。選択は、＋１ｌ１２
４２および２４４上の２進レベルで定められる。＄２４
２上のレベルは、ｘ−ｕの符号を表わす。２４４上の符
号は、１ならばオーバーフローを表わす。The overflow condition is handled by providing four choices in the multiplexer 236, the circuit that selects the positive subtracter output. The choices are defined by the binary levels on lines 242 and 244. The level on line 242 represents the sign of x - u. The level on line 244 represents an overflow if it is 1.

かくして、選択は次のごとくなる。Thus, the choices are as follows.

線２２４線２２４０　　０　　　減算器２３２の出力を選択１　　　０　
　　減算器２３４の出力を選択マルチプレックサはこの
ように制御されて、８極４位置スイッチのように作用す
る。シフト動作は、組合セにより減算出力を適当なマル
チプレクサに接続することＫより遂行される。シフトは
、算術的に２で分割する効果をもつ。
Line 242   Line 244   Selection
0          0          select the output of subtracter 232
1          0          select the output of subtracter 234
The multiplexer is thus controlled to act like an 8-pole, 4-position switch. The shift operation is performed combinatorially by connecting the subtracter outputs to the appropriate multiplexer inputs. The shift has the effect of dividing arithmetically by two.

減算中にオーバー７０−が起こると、マルチプレクサの
出力は、減算器の出力を４で分割した出力となる。それ
ゆえ、最終結果を２で乗算して正しいスケールファクタ
を取り戻すことができるように＼計算の後段でこの条件
を思い出させることが必要である。この復旧は、最後の
パイプ処理レジスタの後のマルチプレックサで行なわれ
る。それゆえ、パイプツイン処理レジスタ２２０，２２
２％２２４．２２６には余分のビットが設けられており
、第２のマルチプレクサ２４８．２５０を制御する。If an overflow occurs during the subtraction, the output of the multiplexer is the output of a subtracter divided by two. It is therefore necessary to remember that condition later in the computation so that the final result can be multiplied by two to restore the correct scale factor. This restoration is performed in the multiplexer that follows the last pipeline register. For that reason, an extra bit is provided in pipeline registers 220, 222, 224, 226 to control the second multiplexers 248, 250.

後者のマルチプレクサは、オーバーフロービット（１に
等しい）の場合、それぞれ８×８ビツトの乗算器２５２
．２５４の乗算積を１ビツトだけシフトアップし、２を
乗算する。乗算演算は８ビツト数を受は入れその積を出
力するＴＲＹ　ＭＰＹ−８−ＨＪのごとき標準的集積回
路装置で実施できる。The latter multiplexers shift the multiplication product of the 8 x 8 bit multipliers 252, 254, respectively, up by one bit, multiplying it by two, whenever the overflow bit is set (equal to 1). The multiplication can be carried out in a standard integrated circuit device, such as the TRW MPY-8-HJ, which accepts two 8-bit numbers and outputs their product.
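The overflow handling described above (detect the overflow from three sign bits, halve the magnitude, multiply, then double the product) can be mimicked in software. The sketch below is only a behavioural model under stated assumptions: the detector uses the usual two's-complement overflow rule written as two three-input AND terms feeding an OR, which is assumed to match the truth table of FIG. 8 (not reproduced here); bit widths are not modelled exactly; and all function names are hypothetical.

```python
def to_int8(value):
    """Wrap an integer into the 8-bit two's-complement range -128..127."""
    return ((value + 128) & 0xFF) - 128


def overflow_detected(x, u):
    """Two 3-input AND terms and an OR on the signs of x, u and the 8-bit x - u."""
    sx, su = x < 0, u < 0
    sd = to_int8(x - u) < 0            # sign of the wrapped 8-bit difference
    return (not sx and su and sd) or (sx and not su and not sd)


def scaled_term(x, u, s):
    """Approximate s * |x - u| with the shift-down / shift-up overflow handling."""
    magnitude = abs(x - u)             # the positive subtracter output
    overflow = overflow_detected(x, u)
    if overflow:
        magnitude >>= 1                # first multiplexer: output shifted down one bit
    product = s * magnitude            # 8 x 8 bit multiplication
    if overflow:
        product <<= 1                  # second multiplexer: restore the scale factor
    return product


print(scaled_term(10, -5, 2))          # no overflow: exact result
print(scaled_term(127, -128, 3))       # overflow: least significant bit is lost
```

With overflow, the halving discards the least significant bit of the difference, so the restored product can differ slightly from the exact value (762 instead of 3 x 255 = 765 in the second call).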

かくして、乗算器２５２．２５４は、各クロックパルス
で百および１ｘ−ｕｌの積を生ずる（百の値は余分のデ
ータレジスタ２５６．２５８により正しく調時される）
。乗算器２５２．２５４の出力は、レジスタ２２４．２
２６にノくツファ記憶され、線２６０．２６２を介し、
加算器２０６を経て残りの回路に出力される。The multipliers 252, 254 thus produce the product of s and |x - u| at each clock pulse (the value of s being properly timed by the extra data registers 256, 258). The outputs of multipliers 252, 254 are buffered in registers 224, 226 and are output to the remaining circuitry over lines 260, 262 and through adder 206.

同じ専用ハードウエアモジヱールは、マトリックス乗算
において必要とされるような２ベクトルの内部積を計算
するのにも採用できる。これは、減算・絶対値回路２２
８．２３０において側路を可能とするゲート回路２６４
．２６６で遂行される。この動作モードにおいては、デ
ータデおよび百入カバスは、乗算器人力として、ノ（イ
ブライン処理レジスタ２２０．２２２に直接加えられる
０次に、７−）’レベルパターン整列について説明する
。The same special purpose hardware module can also be employed to compute the inner product of two vectors, as required in matrix multiplication. This is accomplished by gating circuits 264, 266, which allow the subtract and absolute value circuits 228, 230 to be bypassed. In this mode of operation, the data x and s input buses are applied directly to the pipeline registers 220, 222 as the multiplier inputs. Word-level pattern alignment is described next.

未知の入力音声と各語禽ワードテンプレート間の対応を
最適化するためＫは、動的なプログラミング（１０１）
が採用されるのが好ましい。各ワードテンプレートは、
上述の一連の基準パターン統計データだけでなく、各基
準パターンと関漣する最小および最大のドウエル時間を
含むのがよい。Dynamic programming (at 101) is preferably employed to optimize the correspondence between the unknown input speech and each vocabulary word template. Each word template preferably contains not only the sequence of reference pattern statistics noted above, but also a minimum and a maximum dwell time associated with each reference pattern.

動的プログラミング法にしたがえば、各語粱ワードに対
して１つの記憶レジスタが提供される。レジスタの数は
、そのワードを構成する基準パターンの最大のドウエル
時間の和に等しい。すなわち、もつとも長い許容ワード
継続時間に比例する。これらのレジスタは１第４図の円
に対応し、各日に対して１つのレジスタがある。入力音
声の各フレームに対して、全レジスタが読み取られ１書
き込まれる０各レジスタは１追って詳述されるように、
指示された語粟ワードが話されつつあるということ、お
よびそのワードにおける現在位置が、そのレジスタの特
定の基準パターンおよびドウエル時間に対応するという
仮定に対応する累積された尤度スコアを含む。全レジス
タは、低い尤度スコアを含むようにイニシャライズされ
、上記の仮定が、最初いずれも容認できるほどに起こり
そうでないことを指示する。According to the dynamic programming method, a set of storage registers is provided for each vocabulary word. The number of registers is equal to the sum of the maximum dwell times of the reference patterns making up that word; that is, it is proportional to the longest permissible word duration. These registers correspond to the circles of FIG. 4, one register for each circle. For every frame of input speech, all the registers are read and written. As described in detail below, each register contains the accumulated likelihood score corresponding to the hypothesis that the indicated vocabulary word is being spoken and that the current position in the word corresponds to the particular reference pattern and dwell time associated with that register. All the registers are initialized to contain poor likelihood scores, indicating that initially none of the represented hypotheses is acceptably likely.

レジスタ更新の規則は下記のごとくである０各’７−ド
テンプレートの最初のレジスタ（すなわち＼そのワード
がいま発声され始めたという仮定に対応するレジスタ）
は、（ａ）そのワードの第１の基準パターンに関する現
在のフレームの尤度スコアと、（ｂ）全語業ワードの全
レジスタの最良のスコア（すなわち、あるワードが先行
のフレーム上で完了されたという仮定に対する累積尤度
スコア）の和を含む。The rules for updating the registers are as follows. The first register of each word template (i.e., the register corresponding to the hypothesis that the word has just begun to be uttered) contains the sum of (a) the likelihood score of the present frame relative to the first reference pattern of the word and (b) the best score among the last registers of all vocabulary words (i.e., the accumulated likelihood score for the hypothesis that some word was completed on the previous frame).

ワードテンプレートの第２のレジスタは、（ａ）そのワ
ードの第１の基準パターンに関する現在のフレームの尤
度スコアと、（ｂ）先行のフレームからの第１のレジス
タの内容を含む。かくして、第２のレジスタは、指示さ
れたワードが発声されつつあり、それが先行のフレーム
で始まったという仮定のスコアを含む。The second register of a word template contains the sum of (a) the likelihood score of the present frame relative to the first reference pattern of the word and (b) the contents of the first register from the previous frame. Thus, the second register contains the score of the hypothesis that the indicated word is being uttered and that it began on the previous frame.

最小および最大の継続時間の間のドウエル時間（任意ド
ウエル期間）に対応するこれらレジスタの更新処理中、
各逐次の「現在フレーム」に対する任意的ドウエル期間
に対応するレジスタに、最良の累積された尤度スコア（
レジスタの内容）を記憶するため、別個のメモリレジス
タが採用される。先行のフレーム時間に見出されたこの
最良のスコアは、そのワードに対する次のターゲットパ
ターンまたはテンプレートの必須ドウエル時間に対応す
る第１のレジスタの次の内容を計算するのに使用される
。このように１次の基準パターンの最初のレジスタの現
在の内容は１その最良のスコア（先行するターゲットパ
ターンの）を、前記の次の基準またはターゲットパター
ンに関する現在の入力フレームの尤度スコアに加えるこ
とにより発生される。During the updating of those registers corresponding to dwell times between the minimum and maximum durations (the optional dwell interval), a separate memory register is employed to store the best accumulated likelihood score (register contents) found in the registers corresponding to the optional dwell interval for each successive "present frame". This best score, found at the previous frame time, is used to calculate the next contents of the first register corresponding to the required dwell time of the next target pattern or template for the word. Thus, the present contents of the first register of the next reference pattern are generated by adding that best score (of the preceding target pattern) to the likelihood score of the present input frame relative to said next reference or target pattern.

第４図において、基準パターンの必須ドウエル間隔の第
１のレジスタ１２８に至る多重の矢印は、任意ドウエル
時間レジスタまたは状態から必須ドウエル時間レジスタ
または状態への変換が、任意ドウエル時間間隔中の任意
の時点に、または必須ドウエル時間間隔の最後のレジス
タから生ずることを指示することを意味している。かく
して、現在の情報に基づくと、ワードテンプレートと入
力パターン間の最良の適応は、次のパターンが丁度始ま
りつつあるとき、先行のパターンが、先行の任意ドウエ
ル期間の最良のスコアを含むレジスタ十先行の必須時間
間隔の最後のレジスタ（例示の具体例においてレジスタ
３００）に対応する継続時間をもったということを仮定
するものである。In FIG. 4, the multiple arrows leading into the first register 128 of the required dwell interval of a reference pattern are meant to indicate that the transition from an optional dwell time register or state to the required dwell time register or state can occur at any time during the optional dwell time interval, or from the last register of the required dwell time interval. Thus, on the basis of current information, the best fit between the word template and the input patterns is the one which assumes that, when the next pattern is just beginning, the preceding pattern has had a duration corresponding to the register containing the best score among the preceding optional dwell interval plus the last register of the preceding required time interval (register 300 in the illustrated embodiment).

動的プログラミングの理論によれば、全部の可能なドウ
エル時間に対応する先に累積されたスコアを保存してお
く必要はない。それは鴬この理論によると、低スコアを
生じたドウエル時間変換点は、将来の全処理段階におい
て低スコアを発生し続けるからである。According to the theory of dynamic programming, it is not necessary to save the previously accumulated scores corresponding to all possible dwell times, because, according to the theory, any dwell time transition that produced a poor score will continue to represent a poor score at all future stages of processing.

分析は、全ワードテンプレートの全基準パターンの全レ
ジスタを使って上述の態様で進行する。The analysis proceeds in the manner described above using all registers of all reference patterns of all word templates.

各ワードテンプレートの最後のパターンの最後のレジス
タ（単数または複数）は、ワードがいま丁度終了したと
いう仮定のスコアを含む。The last register(s) of the last pattern of each word template contains the score of the hypothesis that the word has just ended.
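The register update rules described above can be sketched for a single word template as follows. This is a simplified illustration, not the patented implementation: scores are assumed to be "smaller is better" (as with the adjusted per-frame scores), an infinite value stands for the "poor" initial score, the score bias and duration counts are omitted, and the names `Pattern`, `new_registers`, `exit_score` and `update_word` are hypothetical.

```python
from dataclasses import dataclass

POOR = float("inf")   # initial "poor" score placed in every register


@dataclass
class Pattern:
    min_dwell: int    # required dwell time in frames (at least 1)
    max_dwell: int    # maximum dwell time in frames (at least min_dwell)


def new_registers(template):
    """One register per pattern per possible dwell time."""
    return [[POOR] * p.max_dwell for p in template]


def exit_score(pattern, registers):
    """Best score among the last required register and the optional registers."""
    return min(registers[pattern.min_dwell - 1:])


def update_word(template, regs, frame_scores, entry_score):
    """Advance one word template by one frame.

    frame_scores[p] is the score of the present frame against pattern p;
    entry_score is the best score of any word that ended on the previous frame.
    Returns the new registers and the score of "this word ends now".
    """
    new = []
    for p, pattern in enumerate(template):
        if p == 0:
            incoming = entry_score
        else:
            # Best exit score of the preceding pattern, taken from the previous frame.
            incoming = exit_score(template[p - 1], regs[p - 1])
        current = [incoming + frame_scores[p]]
        # Each later register continues the hypothesis held one register earlier.
        current += [regs[p][d - 1] + frame_scores[p] for d in range(1, pattern.max_dwell)]
        new.append(current)
    return new, exit_score(template[-1], new[-1])


template = [Pattern(1, 2), Pattern(1, 2)]
regs = new_registers(template)
regs, end1 = update_word(template, regs, [0.0, 3.0], entry_score=0.0)
regs, end2 = update_word(template, regs, [2.0, 0.0], entry_score=POOR)
print(end1, end2)
```

After the first frame no complete path through both patterns exists, so the word-ending score is still "poor"; after the second frame the path that spends one frame in each pattern ends with a score of zero.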

尤度スコアの累積中、一連の継続時間計数値は、各フレ
ーム時間で終了する最良のワードの継続時間を決定する
ため維持される。計数は、ワードの第１テンプレートパ
ターンの第ルジスタで１１」で開始される０テンプレー
トパターンの各第２および後続のレジスタに対して、種
々のレジスタと関連される計数値は「１」だけインクリ
メントされる。しかしながら、基準パターン（１つのワ
ードの第１基準パターン以外の）の開始点に対応する各
レジスタ、すなわち例えば必須ドウエル時間間隔の第ル
ジスタ１２８については、先行のフレーム時間において
最良の尤度スコアを有する先行の基準パターンの任意ド
ウエル時間レジスタ（または最後の必須ドウエル時間レ
ジスタ）の計数値が、レジスタに対する継続時間計数値
を形成するようにインクリメントされる。During the accumulation of the likelihood scores, a sequence of duration counts is maintained in order to determine the duration of the best word ending at each frame time. The count is started at "1" at the first register of the first template pattern of the word. For each second and succeeding register of a template pattern, the count associated with the previous register is incremented by "1". However, for each register corresponding to the beginning of a reference pattern (other than the first reference pattern of a word), that is, for example, the first register 128 of the required dwell time interval, it is the count of the optional dwell time register (or last required dwell time register) of the preceding reference pattern having the best likelihood score in the previous frame time that is incremented to form the duration count for the register.

追つ詳細に記載されるバックトラッキング、機構を提供
するため、各フレーム時間ごとに、その時間で終わる最
良スコアのワードおよびその継続時間についての情報は
、循環バッファメモリに転送される。一連のワードが終
了すると、記憶されたワード継続時間は、最後の「最良
」ワードの終端から、その継続時間を逆上って、「最後
のワードｊ直前で終了する最良の先行ワードに至るまで
など、ワードストリングの全ワードが識別されてしまう
までバックトレーシングすることを可能にする。To provide the backtracking mechanism described in more detail later, the identity of the best-scoring word ending at each frame time, together with its duration, is transferred to a circular buffer memory at every frame time. When a sequence of words ends, the stored word durations make it possible to trace backward from the end of the last "best" word, through its duration, to the best preceding word ending just before the "last word", and so on, until all the words of the word string have been identified.
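The backtracing step can be sketched as follows. The sketch assumes a plain list in place of the circular buffer, one (word, duration) entry per frame for the best word ending at that frame, durations of at least one frame, and a hypothetical `SILENCE` marker; it starts at a frame where silence is the best word and stops at the preceding silence, as the text describes.

```python
SILENCE = "<silence>"   # hypothetical name for the silence vocabulary word


def backtrace(best_ending, end_frame):
    """Recover the word string that ends just before the silence at end_frame.

    best_ending[t] is (word, duration_in_frames) for the best word ending at frame t.
    """
    word, duration = best_ending[end_frame]
    if word != SILENCE:
        raise ValueError("backtracing starts where silence is the best word")
    t = end_frame - duration            # last frame of the final spoken word
    words = []
    while t >= 0:
        word, duration = best_ending[t]
        if word == SILENCE:             # reached the preceding silence: stop
            break
        words.append(word)
        t -= duration                   # jump to the end of the preceding word
    words.reverse()
    return words


# Silence (frames 0-2), "one" (3-5), "two" (6-7), silence (8-9).
best_ending = [
    (SILENCE, 1), (SILENCE, 2), (SILENCE, 3),
    ("one", 1), ("one", 2), ("one", 3),
    ("two", 1), ("two", 2),
    (SILENCE, 1), (SILENCE, 2),
]
print(backtrace(best_ending, 9))
```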

連続的に発声される語粟ワードのストリングは、サイレ
ントにより境界を定められる。それゆえ、「サイレント
」は、シフテムが応答・認識するｌ索ワードＪの範囲の
限界を定める制御ワードとして働く。前述のように、装
置がある期間の間の最小振幅信号を検出し、「サイレン
ト」として示すことは珍しくなくない。A string of continuously uttered vocabulary words is bounded by silence. "Silence" therefore acts as a control word that delimits the extent of the vocabulary words to which the system responds and which it recognizes. As noted earlier, it is not uncommon for an apparatus to detect a minimum-amplitude signal over a period of time and to denote it as "silence".

しかしながら、本発明によると、ワードテンプレートの
１つが、サイレントまたはバックグラウンドノイズに対
応している。サイレントワードが最良の尤度スコアを有
すれば、一連のワードが丁度終了しそして新しい一連の
ワードが始まることが推定される。認識のプロセスの最
後のイニシャライズ以後サイレント以外のワードが最良
のスコアを有したか否かを知るたゆ、フラグレジスタが
試験される。［サイレンＨ以への少なくともｌワードが
最良のスコアを有すれば（１０３）循環バッファ内のワ
ードストリングがバックトレースされ（１０５）、生じ
た認識されたメッセニジが、表示装置または他の制御装
置に伝達される。次いで＼循環バッファはクリヤされて
メツセージの反復伝達を阻止し、フラグレジスタはクリ
ヤされる。According to the present invention, however, one of the word templates corresponds to silence, or background noise. Whenever the silence word has the best likelihood score, it is presumed that a sequence of words has just ended and that a new sequence is about to begin. A flag register is tested to see whether any word other than silence has had the best score since the last initialization of the recognition process. If at least one word other than "silence" has had the best score (at 103), the word string in the circular buffer is backtraced (at 105), and the resulting recognized message is transmitted to a display or other controlled equipment. The circular buffer is then cleared to prevent repeated transmission of the message, and the flag register is cleared.

このようＫして、装置は次のワードストリングなｗｔｙ
ｌＩ４するようにイニシャライズされる（１０７）。The apparatus is thus initialized to recognize the next word string (at 107).

有利なことには、本発明の好ましい具体例においては、
他の「キーワード」スペリングと同じように、１以上の
「サイレント」スペリングを採用できる。すなわち、装
置は、単に、波線的な１組の規準に一致するときにすな
わち演譚的ターゲットパターンに一致するときにサイレ
ントを検出することに限定されるだけでなく、動的に変
化するターゲットパターンまたはテンプレートを採用し
て、装置の「サイレント」検出能力をさらに改善できる
。このようにして、上述のように、音声の先行の１また
は２秒の部分を周期的に試験し、例えば最後の数秒中の
最小振幅を有する代表的パターンを選択することＫよっ
て動的に変化する「すイレント」モデルを決定し、先行
の動的サイレントモデルを更新し、あるいは後述のトレ
ーニング法にしたがって新しい「動的」なサイレントモ
デルを形成できる。このようＫして、「サイレント」は
、ターケラトパターンの２以上の「スペリン勿により限
定することができ、サイレントの正確な検出を改善する
可能性は向上される。Advantageously, in the preferred embodiment of the invention, more than one "silence" spelling can be employed, just as with other "keyword" spellings. That is, the apparatus is not limited to detecting silence only when the input matches an a priori set of criteria, i.e., an a priori target pattern; it can also employ a dynamically changing target pattern or template to further improve its ability to detect "silence". Thus, as noted above, a preceding one or two seconds of speech can be examined periodically, and a dynamically changing "silence" model can be determined by choosing, for example, typical patterns having minimum amplitude over the last few seconds, either to update a previous dynamic silence model or to form a new "dynamic" silence model in accordance with the training method described below. "Silence" can in this way be defined by more than one "spelling" of target patterns, which improves the likelihood of detecting silence accurately.

次に１基準ハターンのトレーニングについて説明する。The training of the reference patterns is described next.

基準パターンの構成のためサンプル平均ｕおよびパリア
ンスＳ′を得るためには、各語業ワードの多数の発声が
音声識別システムに装入され、対応する予処理されたス
ペクトルフレームの全統計データが求められる。装置の
重要で好結果をもたらす動作（ｉ、どの人カスベクトル
フレームがどのターゲットまたは基準パターンに対応す
べきかの選択である。To obtain the sample means u and variances s' for constructing the reference patterns, a number of utterances of each vocabulary word are entered into the speech recognition system, and the ensemble statistics of the corresponding preprocessed spectrum frames are evaluated. Crucial to the successful operation of the equipment is the choice of which input spectrum frames should correspond to which target or reference patterns.

人力ワードに対して人間により選ばれた重要な音響的音
素のような十分な情報が不存在の場合、話されたワード
の始点と終点間の時間間隔は、多数の一様に離間された
サブインターバルに分割さ汗る。これらのサブインター
バルの各々は１−唯一の基準パターンと対応せしめられ
る。各間隔において始まる１または複数の３７レームノ
くターンが形成され、その間隔と関連する基準パターン
にしたがって分類される。同じ語禽ワードの後続の例は
、同様に１同数の一様に離間された間隔に分割される。In the absence of more adequate information, such as manually chosen significant acoustical phonemes for the input word, the time interval between the beginning and the end of a spoken word is divided into a number of uniformly spaced subintervals. Each of these subintervals is made to correspond to a unique reference pattern. One or more three-frame patterns beginning in each interval are formed and classified according to the reference pattern associated with that interval. Subsequent examples of the same vocabulary word are similarly divided into a like number of uniformly spaced intervals.

対応する順番の間隔から抽出された３フレームハターン
の要素の平均値およびパリアンスは１語禽ワードの利用
可能な全列について累積され１そのワードに対する１組
の基準パターンを形成する。間隔の数（基準パターンの
数）は、語業ワードに含まれる単位の言語学的音素当り
約２または３とすべきである。The mean values and variances of the elements of the three-frame patterns extracted from correspondingly ordered intervals are accumulated over all available examples of the vocabulary word to form the set of reference patterns for that word. The number of intervals (number of reference patterns) should be about two or three per linguistic phoneme contained in the vocabulary word.
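The first training pass, with uniformly spaced subintervals, can be sketched as follows. This is an illustration under stated assumptions: each example of a word is a list of pattern vectors, every example has at least as many frames as there are intervals, the population variance is used (the text does not say which estimator is meant), and the function names are hypothetical. One-element vectors stand in for the 96-element patterns.

```python
from statistics import fmean, pvariance


def uniform_intervals(n_frames, n_intervals):
    """Interval index of each frame when the word is cut into equal subintervals."""
    return [t * n_intervals // n_frames for t in range(n_frames)]


def train_reference_patterns(examples, n_intervals):
    """Accumulate per-interval means and variances over all examples of one word."""
    buckets = [[] for _ in range(n_intervals)]
    for frames in examples:
        for t, k in enumerate(uniform_intervals(len(frames), n_intervals)):
            buckets[k].append(frames[t])
    patterns = []
    for bucket in buckets:
        columns = list(zip(*bucket))             # one column per pattern element
        means = [fmean(c) for c in columns]
        variances = [pvariance(c) for c in columns]
        patterns.append((means, variances))
    return patterns


# Two examples of the same word, four frames each, one element per frame.
examples = [
    [[0.0], [2.0], [10.0], [12.0]],
    [[2.0], [4.0], [14.0], [16.0]],
]
for means, variances in train_reference_patterns(examples, n_intervals=2):
    print(means, variances)
```

A second pass, in which the interval boundaries come from the recognizer's own alignment rather than from uniform spacing, would reuse the same accumulation with a different frame-to-interval assignment.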

最良の結果を得るためには、記録された音声波形および
スペクトルフレームの人間による試験を含む手続きによ
り、各語★ワードの始点と終点′がマークされる。この
手続を自動的に実施するためＫは、装置がワードの境界
を正確に見つけるように１ワードを１時に１つずつ話し
、サイレン）Ｋより境界を定めることが必要である。基
準ノ（ターンは、隔絶して話された各ワードの１つのこ
のようなす／プルからイニシャライズされよう。しかし
て、全パリアンスは、基準７くターンにおし１て都合の
よい定数に設定される。その後、トレーニング資料は、
認識されるべき発声を表わしぶつ認識プロセスにより見
出されるようなワードおよび分節境界をもつ発声を含む
ことができる。For best results, the beginning and end of each vocabulary word are marked through a procedure involving manual examination of the recorded audio waveform and spectrum frames. To carry out this procedure automatically, it is necessary to have the words spoken one at a time, bounded by silence, so that the apparatus can find the word boundaries accurately. The reference patterns may be initialized from one such sample of each word spoken in isolation, with all variances set to a convenient constant in the reference patterns. Thereafter, the training material can comprise utterances typical of those to be recognized, with word and segment boundaries as found by the recognition process.

適当数のトレーニング発声を含む統計的データが累積し
た後、そのようにして見出された基準）くターンが、初
基準パターンの代わり−に利用される０次いで、トレー
ニング資料による２回目のノくスが行なわれる。このと
き、ワードは、第３図におけるように認識プロセッサに
よりなさまた決定に基づいてインターバルに分割される
。各３フレーム入カバターン（または、各基準）（ター
ンに対する１つの代表釣人カバターン）が、前述のパタ
ーン整合法によりある基準パターンと関連づけられる。When statistics from a suitable number of training utterances have been accumulated, the reference patterns so found are used in place of the initial reference patterns. A second pass through the training material is then made. This time the words are divided into intervals on the basis of the decisions made by the recognition processor, as in FIG. 3. Every three-frame input pattern (or one typical input pattern for each reference pattern) is associated with some reference pattern by the pattern alignment method described above.

平均値およびパリアンスは、それらが認識装置により使
用される方法と完全に適合した態様で誘導される最終の
１組の基準パターンを形成するように１秒間累積される
。The mean values and variances are accumulated a second time to form the final set of reference patterns, derived in a manner wholly compatible with the way in which they are used by the recognition apparatus.

各トレーニングバス中、認識プロセッサにより正しく認
識されないトレーニングクレーズを無視するのが好まし
い。これは、誤認識された発声は、インターバル境界を
不完全に設定したかも知れないからである。そのトレー
ニングパスの完了の際１先に誤認識されたフレーズは、
新しい基準パターンで再度試みることができ、そのとき
認識が成功すれば、基準パターンはさらに更、新できる
。During each training pass, it is preferable to ignore any training phrases that are not correctly recognized by the recognition processor, since a misrecognized utterance is likely to have poorly placed interval boundaries. On completion of the training pass, the previously misrecognized phrases can be attempted again with the new reference patterns, and the reference patterns can be further updated if recognition is then successful.

An alternative to ignoring misrecognized phrases is to form a multiple-word template for each training utterance. This template is simply a concatenation of the templates for each word of the utterance, in the correct order. The speaker is prompted by a script to speak the indicated word sequence, and the recognition processor references only the multiple-word template and the silence template. The word boundaries and the reference pattern classification will then be optimal for the given script and the available reference patterns. A disadvantage of this procedure is that a larger number of passes through the training script may be required.

For the highest possible recognition accuracy, it is preferable to begin the training procedure with a set of previously determined speaker-independent reference patterns for the vocabulary to be recognized.

The speaker-independent patterns are obtained from phrases typical of those to be recognized, spoken by at least several different speakers. The word boundaries may be determined by human examination of recorded audio waveforms. The two-step procedure just described is then employed to develop the speaker-independent patterns: in the first pass, subintervals are uniformly spaced within each word; in the second pass, subintervals are determined by the recognition process using the reference patterns from the first pass. Ensemble statistics over all speakers are derived in each pass.

The system can advantageously be trained for a particular speaker by using the previously generated speaker-independent patterns, in combination with the silence template, to determine the boundaries of the speaker-dependent speech input. Preferably, the speaker-dependent speech input is provided as a continuous word string rather than in isolated form. Using continuous speech in the training process yields more accurate results. Thus, the boundaries of the speaker-dependent speech are determined using the speaker-independent reference patterns available to the apparatus, and the multi-pass process described above is then used to train the apparatus: uniformly spaced subintervals are placed in each word during the first pass, and in the second pass the subintervals are determined by the recognition process using the patterns generated in the first pass.

Surprisingly, a similar method can be employed to advantage for previously unknown vocabulary words.

That is, the boundaries of an unknown vocabulary word are determined using (1) the speaker-independent patterns for other vocabulary words, applied to recognize the unknown keyword, and (2) the a priori knowledge that the occurrence of silence at the beginning and end of the word delimits the word. The boundaries are then determined by the relatively better score formed when the speaker-independent reference patterns are matched to the unknown vocabulary word rather than to "silence". Using this result, the boundaries of the unknown vocabulary word can be set, and the two-step method described above can then be employed: during the first pass the word is divided uniformly into subintervals to obtain ensemble statistics, and during the second pass the normal recognition process is used together with the reference patterns generated in the first pass. This automatic machine method compares favorably with, for example, having a human set the boundaries of the unknown word.

It should be clear that the "silence" recognition method, which uses at least two silence spellings, one of which is preferably determined dynamically, provides striking advantages in connection with training the apparatus for a new speaker. It should also be pointed out, in this connection, that the silence "word" acts as a control word that triggers a response from the apparatus. Other control words could also be employed, provided their recognition is sufficiently certain, and in some circumstances a plurality of control words could be used to act as "signposts" during the recognition process. In the preferred embodiment, however, the silence "vocabulary word" is the only control word used.

The minimum (required) and maximum (required plus optional) dwell times are preferably determined during the training process. In the preferred embodiment of the invention, the apparatus is trained using several speakers, as described above. Further, as described above, the recognition process automatically determines the pattern boundaries during the training procedure. The boundaries are thus recorded, and the dwell times are stored for each keyword identified by the apparatus.

At the end of a training run, the dwell times for each pattern are examined, and the minimum and maximum dwell times for the pattern are chosen. In the preferred embodiment of the invention, a histogram of the dwell times is formed, and the minimum and maximum dwell times are set at the 25th and 75th percentiles. This gives high recognition accuracy while maintaining a low false alarm rate. Other choices of minimum and maximum dwell time are possible, but there is a trade-off between recognition accuracy and false alarm rate: if a low minimum dwell time and a large maximum dwell time are chosen, higher recognition accuracy is generally obtained at the cost of a correspondingly higher false alarm rate.
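The percentile rule above can be sketched as follows. This is a minimal illustration only: the function name `dwell_limits` and the nearest-rank percentile definition are assumptions made here, since the specification does not say how the percentiles are computed from the histogram.

```python
def dwell_limits(dwell_times, low_pct=25, high_pct=75):
    """Return (minimum, maximum) dwell time for one pattern.

    dwell_times: dwell times (in frames) observed for the pattern during training.
    Percentiles use the nearest-rank definition (an assumption of this sketch).
    """
    ordered = sorted(dwell_times)

    def nearest_rank(pct):
        # Integer ceiling of pct * n / 100, clamped to at least 1.
        rank = max(1, -(-pct * len(ordered) // 100))
        return ordered[rank - 1]

    return nearest_rank(low_pct), nearest_rank(high_pct)


# Dwell times observed for one hypothetical pattern across training utterances.
print(dwell_limits([3, 5, 4, 8, 6, 7, 5, 4]))
```

Widening the window (for example 10th and 90th percentiles) would correspond to the trade-off described above: more utterances fit the pattern, at the cost of more false alarms.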

The syntax processor is described next.

The concatenation of two or three specific word templates is a trivial example of syntax control in the decision process.

Referring to FIG. 9, a syntax circuit 308 for detecting word strings containing an odd number (1, 3, 5, 7, ...) of words has two independent sets of pattern alignment registers 310, 312 maintained for each vocabulary word. The score entering the first template is either the score for silence or the best score from the set of second templates, whichever is better. The score entering the second template is the best score from the first set of templates. This score is also fed to a second silence detector template at node 313. When silence at the end of the utterance is detected, as measured by the detector template at node 313, the labels and durations of the words uttered can be traced back alternately from the traceback buffers of the first and second sets of templates. Importantly, the position of the silence detector template ensures that only silence following a word string with an odd number of words can be detected.

Somewhat more complicated syntax networks can be implemented by associating a list of acceptable word string lengths with each syntax node, such as node 313a of FIG. 9. For example, in the syntax network of FIG. 9, which accepts any string containing an odd number of words, the string length can be fixed at a particular odd number, say 5, by examining the string length at the input to the second silence register 313a. If the string length at that point is not 5, the register becomes inactive (for that analysis interval) and no string score is reported from it; if the string length is 5, a string detection can be reported. Similarly, the first vocabulary register 310 can be enabled when the incoming string length is 0, 2, or 4, and the second register only when the incoming string length is 1 or 3. Optimal results for a five-word string would require five complete sets of dynamic programming accumulators, but this method allows a smaller number of accumulators to perform multiple duty with only a slight reduction in typical recognition accuracy.
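The string-length gating described above can be sketched in a few lines. The register labels and the helper names `is_enabled` and `trace` are hypothetical; only the acceptable lengths (0, 2, 4 for register 310; 1, 3 for register 312; 5 for silence register 313a) come from the text, and scores are left out entirely.

```python
# Acceptable incoming string lengths per register, as stated in the description.
ALLOWED_LENGTHS = {
    "words_310": {0, 2, 4},    # first set of vocabulary registers
    "words_312": {1, 3},       # second set of vocabulary registers
    "silence_313a": {5},       # second silence register
}


def is_enabled(register, incoming_length):
    """A register is active only for the incoming string lengths on its list."""
    return incoming_length in ALLOWED_LENGTHS[register]


def trace(num_words):
    """Return the register that handles each word, or None if the string is rejected."""
    path = []
    for length in range(num_words):
        for register in ("words_310", "words_312"):
            if is_enabled(register, length):
                path.append(register)
                break
        else:
            return None  # no register accepts a word at this string length
    return path if is_enabled("silence_313a", num_words) else None


print(trace(5))  # the two register sets alternate
print(trace(3))  # rejected: the final silence register requires length 5
```

The two register sets alternate roles across the five words, which is the "multiple duty" the paragraph refers to.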

In the particular embodiment disclosed herein, the apparatus is designed to recognize either a string of five digits or a known vocabulary word that is not a digit. This grammatical syntax is illustrated in FIG. 9A. In FIG. 9A, each of the nodes 314a, 314b, ..., 314h represents a stage in the recognition process. Nodes 314a and 314g represent the recognition of silence; nodes 314b, 314c, 314d, 314e, and 314f represent the recognition of a digit; and node 314h represents the recognition of a non-digit vocabulary word that is not silence. Thus, according to the syntax control of the apparatus, the silence corresponding to node 314a must be recognized first. At that point, recognition of a digit moves control to node 314b, while recognition of a non-digit moves control to node 314h. (A "move" here means an acceptable, or "legal", progression through the grammatical syntax.)

At node 314b, the only acceptable progression away from the node is to node 314c, which is a digit node. At node 314h, on the other hand, the only acceptable progression away from the node is to node 314g, which is silence. These are the only acceptable, or legal, progressions allowed by the controlling syntax processor 308 described in connection with FIG. 10. Importantly, the syntax processor of FIG. 9A, like that of FIG. 9, can be simplified considerably by folding (collapsing) the node structure upon itself and using "augments" to control the progression through the "folded" or "collapsed" syntax node structure. FIG. 9A can thus be redrawn as in FIG. 9B, provided that certain restrictions are placed on the progression from one node to another along the connecting line segments.

FIG. 9B shows the collapsed syntax node structure schematically. In this figure, node 314x becomes the only silence node; nodes 314u, 314v, and 314w are the new digit nodes (corresponding to old nodes 314b, 314c, 314d, 314e, and 314f); and node 314h remains the non-digit node, which is not a silence node. The silence node now plays a "double role": silence node 314x represents either the silence at the beginning of word string recognition or the silence at the end of word string recognition.

Similarly, nodes 314u and 314v also play double roles: node 314u represents either the first or the fourth digit of the word string, and node 314v represents the second or the fifth digit (the source text reads "second or third", but the word counts given below, under which node 314v feeds the silence node at a count of five, imply the fifth).

In operation, the input to each node is accepted according to the digit word count.

The nodes in FIG. 9B represent computations that proceed in parallel for alternative hypotheses. The arcs represent the dependence of the alternative hypotheses on one another. In FIG. 9B only three digit hypotheses are kept active, instead of the five digit hypotheses of FIG. 9A. In operation, this reduction in the number of active hypotheses is achieved by accepting the data on an input arc only if the word count associated with the data is a proper one, that is, one of the acceptable word counts in the set of alternative word counts for that arc. Thus, node 314u accepts input arc data from node 314x only when the word count associated with the data is zero; this will always be the case, because the data on all arcs leaving the silence node have their word count set to zero. Node 314u also accepts input arc data from node 314w when the word count associated with the data is three. A node chooses the best-scoring data from all acceptable inputs. Node 314u therefore represents either the hypothesis that a digit is matching the first digit of the utterance or the hypothesis that a digit is matching the fourth digit of the utterance, depending only on whether the data from node 314x or the data from node 314w was chosen. Similarly, the silence node accepts arc data from node 314v whenever node 314v has an associated word count of five. The silence node also accepts input from node 314h and from itself, node 314x. The silence node then chooses the best-scoring data from these acceptable inputs.

The effect of providing the "folded" syntax structure is to reduce both the memory requirements and the computational load of the apparatus. On the other hand, by discarding certain data and forcing a decision, there is a risk that good information will be discarded and an incorrect decision made. However, where the recognition accuracy is high, as in the apparatus described herein, the likelihood of discarding "good" data is very low. For example, when node 314u discards the input from node 314x in favor of the input from node 314w, the effect is to discard much less probable data from the silence node. This is a preferred method of operation, since at any given time the apparatus need only decide whether the string is just beginning or whether three words of it have already been spoken.
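The word-count test on the arcs of the folded network can be illustrated with a short sketch. It assumes that lower scores are better (the description later calls large scores "bad"), and it covers only the arcs whose acceptable word counts are stated in the text; the names `Hypothesis`, `ARC_COUNTS`, and `select_input` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    score: int        # accumulated score; lower is better (assumed convention)
    word_count: int   # words recognized so far in the string


# Acceptable word counts per (source, destination) arc, as stated in the text.
# Arcs whose counts the text does not state (314h -> 314x, 314x -> 314x) are omitted.
ARC_COUNTS = {
    ("314x", "314u"): {0},   # string is just starting
    ("314w", "314u"): {3},   # three digits already spoken
    ("314v", "314x"): {5},   # five digits spoken, closing silence
}


def select_input(destination, candidates):
    """Pick the best-scoring hypothesis among inputs with an acceptable word count.

    candidates maps a source node name to the Hypothesis arriving on that arc.
    Returns None when no input is acceptable.
    """
    acceptable = [
        hyp for source, hyp in candidates.items()
        if hyp.word_count in ARC_COUNTS.get((source, destination), set())
    ]
    return min(acceptable, key=lambda hyp: hyp.score, default=None)


inputs = {"314x": Hypothesis(120, 0), "314w": Hypothesis(40, 3)}
print(select_input("314u", inputs))
```

Whichever input wins determines whether node 314u currently stands for the first or the fourth digit; the losing hypothesis is discarded, which is the forced decision discussed above.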

The probability of making an error in this decision is extremely low. The folded, or collapsed, syntax does require one additional register per node to keep count of the number of words recognized. (In a somewhat more general case, the count could be the number of words recognized in a grammatical syntax string.) However, the advantages of the folded syntax, namely reduced memory and computation, outweigh the disadvantage noted above. A further advantage of using a syntax in keyword recognition is that the decision whether or not silence has occurred is made using a priori knowledge (the grammatical syntax). This syntax allows the apparatus to detect "silence" more reliably and to define precisely the boundaries between the continuous word string and "silence". An important element of the method of the invention is the detection of silence in combination with the word string. That is, silence is reliably detected at the end of a word string because the score for the silence "spelling", when it corresponds to recognition of a word string that meets the requirements of the grammatical syntax, includes a "good likelihood score" for the previously received speech signal.

It is the determination of silence within its syntax that allows more precise and reliable recognition. This is a clear advantage over, for example, methods that recognize silence as an amplitude minimum irrespective of the speech syntax.

An apparatus implemented using the present speech recognition method is described next.

In the preferred embodiment of the invention, the signal and data manipulation beyond that performed by the preprocessor of FIG. 2 is implemented on a Digital Equipment Corporation PDP-11 computer working in combination with a special-purpose vector computer processor such as that described in U.S. Pat. No. 4,228,498.

In addition to the use of computer programming, the method of the invention can be implemented in hardware.

In operation, the apparatus 10 of the invention operates according to dynamic programming techniques.

Each new sequence of likelihood scores, that is, the sequence of likelihood scores for each reference pattern in a known, predetermined order, is supplied from the computer over line 320 and applied to the existing scores in one of memories 322 and 324.

The memories alternate functions, as described below, under the control of (a) the syntax processor 308, which receives the scores corresponding to the end of each possible word, (b) a minimum score register 326, which can replace the output of memories 322 and 324 depending on the memory select and next-phoneme signals, and (c) other control and clock signals.

In operation, the circuit follows the rules for updating the registers corresponding to each of the circles of FIG. 4, so as to provide, at each recognition of a pause or silence, a decision mechanism by which the best match can be achieved.

Memories 322 and 324 have the same configuration and are interchanged every 10 milliseconds, that is, each time a new frame is analyzed. Each memory contains a plurality of 32-bit words, the number of 32-bit words corresponding to the number of registers (the circles of FIG. 4) associated with the words of the machine vocabulary. Initially one memory, for example memory 322, is filled with "bad" likelihood scores, that is, scores which in the present example have a large value. Thereafter, memory 322 is read sequentially, in a predetermined order corresponding to the order of the new likelihood scores supplied from the vector processor over line 320, and the scores are updated as described below and rewritten into the other memory, memory 324. In the next 10 millisecond frame, the now old scores are read from memory 324 and the new scores are written into the other memory, memory 322. This alternating function continues under the control of the syntax processor, the minimum score register 326, and other control and clock signals. As noted above, each word of memories 322 and 324 is a 32-bit number. The lower 16 bits, bits 0 to 15, are employed to store the accumulated likelihood score. Bits 16 to 23 are employed to record the phoneme duration, and bits 24 to 31 are employed to store the word duration at that register.
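The 32-bit register layout just described (bits 0 to 15 for the accumulated score, bits 16 to 23 for the phoneme duration, bits 24 to 31 for the word duration) can be illustrated in software as follows. The function names and the range checks are assumptions of this sketch; the patent describes a hardware memory word, not this code.

```python
SCORE_MASK = 0xFFFF      # bits 0-15: accumulated likelihood score
DURATION_MASK = 0xFF     # 8-bit fields for phoneme and word durations


def pack_register(score, phoneme_duration, word_duration):
    """Pack the three fields into one 32-bit word."""
    if not 0 <= score <= SCORE_MASK:
        raise ValueError("score does not fit in 16 bits")
    if not 0 <= phoneme_duration <= DURATION_MASK:
        raise ValueError("phoneme duration does not fit in 8 bits")
    if not 0 <= word_duration <= DURATION_MASK:
        raise ValueError("word duration does not fit in 8 bits")
    return score | (phoneme_duration << 16) | (word_duration << 24)


def unpack_register(word):
    """Return (score, phoneme_duration, word_duration) from a 32-bit word."""
    return (
        word & SCORE_MASK,
        (word >> 16) & DURATION_MASK,
        (word >> 24) & DURATION_MASK,
    )


word = pack_register(0x1234, 5, 17)
print(hex(word), unpack_register(word))
```

With 8-bit duration fields and a 10 millisecond frame, each duration count can represent up to 255 frames.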

The likelihood scores arriving from the computer are stored, for each frame time, in a pattern score memory 328.

This information is supplied from the computer in a burst, at a very high data transfer rate, and is read out of the pattern score memory at the slower rate employed by the circuit of FIG. 10. Absent any intervening control from the syntax processor or the minimum score register, the output of the selected memory 322 or 324 is applied to line 334 through the corresponding selected gate 330 or 332. Line 334 is connected to adders 336, 338, and 340, which update the likelihood score, the phoneme (or target pattern) duration count, and the word duration count respectively. Thus, the new likelihood score corresponding to the previous-frame score coming from one of memories 322, 324 is output from the pattern score memory over line 342, added to the old likelihood score, and stored in the other memory, that is, the one not being read during this frame. The memory select function is provided by the signal level on line 344. At the same time, the word and phoneme duration counts are incremented by one.

This is how the word duration count, the phoneme duration count, and the likelihood scores are normally updated.

The two exceptions to the normal update rule described above occur at the beginning of a new phoneme and at the beginning of a new word. At the beginning of a new phoneme that is not the beginning of a new word, the first register of the phoneme is not updated according to the normal rule. Instead, the likelihood score on line 342 is added to the minimum score taken over the optional dwell time registers of the previous reference frame (phoneme) and the last register of the previous phoneme's required dwell time. This is implemented by means of the minimum score register 326. The output of the minimum score register represents the minimum score, at the previous frame time, for the previous phoneme. This score is obtained by continuously updating the contents of the minimum score register whenever a new minimum score is presented. The new minimum score is loaded into the minimum score register using the sign bit output of a subtraction element 346, which compares the current minimum score with the new score from the register just updated. The minimum score register also stores the word duration count and the phoneme duration count corresponding to the register holding the minimum score. All of this information is output onto line 334 at the start of a new phoneme. This output process is controlled by a gate that is enabled at the start of a new phoneme, in combination with control signals that disable gates 332 and 330 during the start of a new phoneme.

The syntax processor 308 (corresponding to FIG. 9B) is employed to update the first register of the first phoneme of a new word with the best score, taking the syntax into account, of a word ending in the previous frame.

Thus, when the score of the register corresponding to the first register of the first phoneme of a new word is to be updated by an incoming likelihood score, it is not the output of one of memories 322, 324 that is employed. Instead, the best likelihood score of the words ending in the previous frame, preferably taking the syntax into account, is used. This function is made possible by disabling gates 330 and 332 and simultaneously enabling a gate 350, which places the best available score, stored in a register 352, onto line 334, where it is added to the incoming pattern likelihood score on line 342.
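The three update cases described above (normal update, start of a new phoneme, start of a new word) can be pictured for a single word with the following sketch. It is only an illustration under several assumptions not fixed by this passage: register k is assumed to be fed from register k-1 of the previous frame, scores saturate at a large "bad" value, duration counts are omitted, and the function and parameter names are hypothetical.

```python
BAD = 0xFFFF  # "bad" (large) score used for initialization; lower scores are better


def update_word(old, phonemes, frame_scores, entry_score):
    """Compute one frame's new register scores for a single word.

    old:          scores read from the memory written in the previous frame,
                  one per register, phoneme after phoneme
    phonemes:     list of (required, optional) dwell-register counts per phoneme
    frame_scores: likelihood score of this frame against each phoneme's pattern
    entry_score:  best score of a word ending in the previous frame
                  (the role played by the syntax processor)
    Returns the list to be written into the other memory.
    """
    new = []
    start = 0
    feed = entry_score  # feeds the first register of the first phoneme (new word)
    for (required, optional), score in zip(phonemes, frame_scores):
        size = required + optional
        for k in range(size):
            # First register: fed by the entry score or the previous phoneme's
            # minimum; other registers: fed by the preceding register.
            source = feed if k == 0 else old[start + k - 1]
            new.append(min(source + score, BAD))
        # Minimum over the last required register and all optional registers
        # (the role played by the minimum score register).
        feed = min(old[start + required - 1:start + size])
        start += size
    return new


old = [10, 20, 30, 40, 50]
print(update_word(old, phonemes=[(2, 1), (1, 1)], frame_scores=[1, 2], entry_score=5))
```

Calling the function once per frame and swapping the `old` and returned lists mirrors the alternation between memories 322 and 324.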

In this way, each register corresponding to a dwell time of a reference frame is continuously updated in this hardware embodiment. When the likelihood scores represent the silence word, the syntax processor is designed to provide the control needed to enable the hardware or computer apparatus to trace back and determine the recognized words.

In view of the foregoing description, it will be seen that the several objects of the invention are achieved and that advantageous results are obtained.

It will be appreciated that the word string continuous speech recognition method and apparatus disclosed herein include the recognition of isolated speech as a special application. Additions, deletions, and other modifications of the disclosed embodiments will be apparent to those skilled in the art and are within the scope of the claims.

[Brief description of the drawings]

FIG. 1 is a flowchart illustrating in general terms the sequence of operations performed in accordance with the method of the invention.
FIG. 1A is an electrical block diagram of the apparatus of a preferred embodiment of the invention.
FIG. 2 is a schematic block diagram of electronic apparatus for performing certain processing operations in the overall process illustrated in FIG. 1.
FIG. 3 is a flowchart of a digital computer program that performs certain processes in the process of FIG. 1.
FIG. 4 is a diagram of the pattern alignment process of the invention.
FIG. 5 is an electrical block diagram of the likelihood function processor of a preferred embodiment of the invention.
FIG. 6 is a schematic electrical block diagram of the subtraction and absolute value circuit of a preferred embodiment of the invention.
FIG. 7 is an electrical circuit diagram of the overflow detection logic circuit of a preferred embodiment of the invention.
FIG. 8 is a truth table for the circuit of FIG. 7.
FIG. 9 is a schematic flow diagram of a syntax processor of one preferred embodiment of the preprocessor of the invention.
FIG. 9A is a schematic flow diagram of a syntax processor that recognizes five-digit word strings bounded by silence.
FIG. 9B is a schematic flow diagram obtained by folding the flow diagram of FIG. 9A to reduce the number of nodes.
FIG. 10 is an electrical circuit diagram of the sequential decoding pattern alignment circuit of a preferred specific embodiment of the invention.

13: A/D converter
45: Control processor
46: Preprocessor
48a: Vector processor
48b: Likelihood function processor
49: Sequential decoding processor
51: Clock oscillator
52: Frequency divider
53: Latch
56: Digital multiplier
58: 32-word circulating shift register
59: Multiplexer
60: B select circuit
63: 32-word shift register memory
65: 32-bit adder
67: Gate
71: Computer interrupt circuit
73: Interface

Agents: 倉内 基弘, 倉橋 暎

Clean copy of the drawings (no change in content)

Procedural Amendment (Formality)
May 24, 1983 (Showa 58)
To: Kazuo Wakasugi (若杉 和夫), Commissioner of the Patent Office
Indication of the case: Patent Application No. 552 of 1983 (Showa 58)
Title of the invention: Improvement in continuous speech recognition
Person making the amendment: relationship to the case: patent applicant; name: Exxon Corporation
Agent: Yushi Kogyo Kaikan, 3-13-11 Nihonbashi, Chuo-ku, Tokyo 103
Subject of the amendment: the applicant column of the application; the power of attorney and its translation (one copy each); the drawings (one copy); the "Detailed Description of the Invention" and "Brief Description of the Drawings" sections of the specification
Content of the amendment: as per the attached sheet; clean copy of the drawings (no change in content)

The Detailed Description of the Invention and the Brief Description of the Drawings in the specification are amended as follows.

1. On page 27 of the specification, lines 10-11 and lines 14-15, "FIG. 10" is corrected to "FIG. 9".
2. On page 65, line 4, "FIG. 8" is corrected to "the following".
3. On page 65, line 5, the following table is inserted after the words "represents.":

1 1 1 0
1 1 0 0
1 0 1 1 (overflow)
1 0 0 0
0 1 1 0
0 1 0 1 (overflow)
0 0 1 0
0 0 0 0

4. On page 83, line 15, "FIG. 9" is corrected to "FIG. 8".
5. On page 84, lines 15 and 19, "FIG. 9" is corrected to "FIG. 8".
6. On page 86, lines 1 and 2, "FIG. 9A" is corrected to "FIG. 8A".
7. On page 87, line 1, "FIG. 10" is corrected to "FIG. 9".
8. On page 87, lines 4 and 10, "FIG. 9A" is corrected to "FIG. 8A".
9. On page 87, lines 5 and 12, "FIG. 9" is corrected to "FIG. 8".
10. On page 87, line 14, "FIG. 9B" is corrected to "FIG. 8B".
11. On page 88, line 14, "FIG. 9B" is corrected to "FIG. 8B".
12. On page 88, line 15, "FIG. 9A" is corrected to "FIG. 8A".
13. On page 97, line 14, "FIG. 9B" is corrected to "FIG. 8B".
14. On page 100, line 6, the words "第７図の・・・第９図は" ("... of FIG. 7 ... FIG. 9 is") are deleted.
15. On page 100, lines 5 and 9, "FIG. 9A" is corrected to "FIG. 8A".
16. On page 100, line 8, "FIG. 9B" is corrected to "FIG. 8B".
17. On page 100, line 10, "FIG. 10" is corrected to "FIG. 9".
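As a rough illustration of the overflow table that the amendment inserts at page 65, line 5, the sketch below transcribes its eight rows and checks them against a simple closed form. This section does not name the table's columns, so reading the first three columns as input bits and the fourth as the overflow output is an assumption, and `overflow_flag` is a hypothetical helper, not a description of the circuit of FIG. 7.

```python
# Rows as read from the inserted table; the fourth column is 1 on the two
# rows marked "(overflow)".
# ASSUMPTION: columns 1-3 are input bits and column 4 is the overflow output.
TABLE = [
    (1, 1, 1, 0),
    (1, 1, 0, 0),
    (1, 0, 1, 1),  # (overflow)
    (1, 0, 0, 0),
    (0, 1, 1, 0),
    (0, 1, 0, 1),  # (overflow)
    (0, 0, 1, 0),
    (0, 0, 0, 0),
]


def overflow_flag(a: int, b: int, c: int) -> int:
    """Hypothetical closed form: 1 when a equals c and b differs from them."""
    return int(a == c and b != a)


def table_matches() -> bool:
    """Check that the closed form reproduces every row of the table."""
    return all(overflow_flag(a, b, c) == out for a, b, c, out in TABLE)


if __name__ == "__main__":
    print(table_matches())
```

Under that reading, the flag is set only when the first and third bits agree and the middle bit differs, which is the pattern a sign-bit overflow check would produce.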

Claims

(1) In a speech analysis device that recognizes at least one keyword in an audio signal, a method of recognizing silence in the incoming audio signal, comprising: generating at least first and second target templates representing alternative descriptions of silence in the incoming audio signal; comparing the incoming audio signal with the first and second target templates; generating a numerical value representing the result of the comparison; and determining, based at least on this numerical value, whether silence has been detected.

(2) The silence recognition method of claim 1, wherein the step of generating the target templates includes generating, for one of the first and second target templates, a dynamically changing silence target template in response to the incoming audio signal.

(3) In a speech analysis device that recognizes at least one keyword in an audio signal, a method of recognizing silence in the audio signal, characterized by: generating a numerical value representing the likelihood that the currently incoming audio signal portion corresponds to a reference pattern representing silence; effectively modifying this value according to a syntax-dependent measure representing the recognition, according to a grammatical syntax, of the immediately preceding portion of the audio signal; and determining from the effectively modified value whether the current signal portion corresponds to silence.

(4) In a speech analysis device that recognizes at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, a method of forming a reference pattern that represents the keyword and is tailored to a speaker, comprising: forming a speaker-independent reference pattern representing the keyword; using this speaker-independent reference pattern to determine the boundaries of the keyword in an audio signal spoken by the speaker; and training the speech analysis device to the speaker using the boundaries determined by the device for the keyword spoken by the speaker.

(5) The method of claim 4, wherein the training step comprises: dividing an incoming audio signal representing the keyword from the speaker into a plurality of sub-intervals using the keyword boundaries; forcing each sub-interval to correspond to a unique reference pattern; repeating the dividing and forcing steps for a plurality of audio input signals representing the same keyword; generating statistical data describing the reference pattern associated with each sub-interval; and making a second pass through the audio input signals representing the keyword, using the generated statistical data, to provide device-generated sub-intervals for the keyword.

(6) In a speech analysis device that recognizes at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, a method of forming a reference pattern representing a previously unknown keyword, comprising: forming a speaker-independent reference pattern representing a keyword; determining the boundaries of the unknown keyword using the speaker-independent reference pattern; and training the speech analysis device, using the boundaries determined by the device, to generate statistical data describing the previously unknown keyword.

(7) The reference pattern forming method of claim 6, comprising providing, in isolated form, an audio signal representing the unknown keyword spoken by the speaker.

(8) The reference pattern forming method of claim 6, wherein the training step comprises: dividing the incoming audio signal corresponding to the previously unknown keyword into a plurality of sub-intervals using the boundaries; forcing each sub-interval to correspond to a unique reference pattern; repeating the dividing and forcing steps for a plurality of audio input signals representing the same keyword; generating statistical data describing the reference pattern associated with each sub-interval; and making a second pass through the audio input signals representing the previously unknown keyword, using the generated statistical data, to provide device-generated sub-intervals for the keyword.

(9) In a speech analysis device that recognizes a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern, and each sequence of keywords in the audio signal being described by a grammatical syntax characterized by a plurality of connected decision nodes, a speech recognition method characterized by: employing dynamic programming to generate a series of numerical scores for recognizing keywords in the audio signal; employing the grammatical syntax to determine which scores form an acceptable progression in the recognition process; and reducing the number of decision nodes by folding the syntax, thereby reducing the computational load on the device.

(10) In a speech analysis device that recognizes at least one keyword in an audio signal, an apparatus for recognizing silence in the incoming audio signal, comprising: means for generating at least first and second target templates representing alternative descriptions of silence in the incoming audio signal; means for comparing the incoming audio signal with the first and second target templates; means for generating a numerical value representing the result of this comparison; and means for determining, based at least on the numerical value, whether silence has been detected.

(11) The silence recognition apparatus of claim 10, wherein the generating means includes means for generating, for one of the first and second target templates, a dynamically changing target template in response to the incoming audio signal.

(12) In a speech analysis device that recognizes at least one keyword in an audio signal, an apparatus for recognizing silence in the audio signal, comprising: means for generating a numerical value representing the likelihood that the currently incoming audio signal portion corresponds to a reference pattern representing silence; means for adding to the numerical value a syntax-dependent value, representing the recognition according to a grammatical syntax of the immediately preceding portion of the audio signal, to form a score; and means for determining from the score whether the current signal portion corresponds to silence.

(13) In a speech analysis device that recognizes at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, an apparatus for forming a reference pattern that represents the keyword and is tailored to a speaker, comprising: means for forming a speaker-independent reference pattern representing the keyword; means for determining, using the speaker-independent reference pattern, the boundaries of the keyword in an audio signal spoken by the speaker; and means for training the speech analysis device to the speaker using the boundaries determined by the device for the keyword spoken by the speaker.

(14) The reference pattern forming apparatus of claim 13, wherein the training means comprises: means for iteratively dividing an incoming audio signal representing the keyword from the speaker into a plurality of sub-intervals using the keyword boundaries; means for iteratively forcing each sub-interval to correspond to a unique reference pattern; means for generating statistical data describing the reference pattern associated with each sub-interval; and means for making a second pass through the audio signals representing the keyword, using the generated statistical data, to provide device-generated sub-intervals for the keyword.

(15) In a speech analysis device that recognizes at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, an apparatus for forming a reference pattern representing a keyword previously unknown to the device, comprising: means for forming a speaker-independent reference pattern representing a keyword; means for determining the boundaries of the unknown keyword using the speaker-independent reference pattern; and means for training the speech analysis device, using the boundaries determined by the device, to generate statistical data describing the unknown keyword.

(16) The reference pattern forming apparatus of claim 15, comprising means for providing, in isolated form, an audio signal representing the unknown keyword spoken by the speaker.

(17) The reference pattern forming apparatus of claim 15, wherein the training means comprises: means for iteratively dividing the incoming audio signal corresponding to the previously unknown keyword into a plurality of sub-intervals using the boundaries; means for iteratively forcing each sub-interval to correspond to a unique reference pattern; means for generating statistical data describing the reference pattern associated with each sub-interval; and means for making a second pass through the audio signals representing the previously unknown keyword, using the generated statistical data, to provide device-generated sub-intervals for the keyword.

(18) In a speech analysis device that recognizes a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern, and each sequence of keywords in the audio signal being described by a grammatical syntax characterized by a plurality of connected decision nodes, a speech recognition apparatus comprising: means employing dynamic programming to generate a series of numerical scores for recognizing keywords in the audio signal; means employing the grammatical syntax to determine which scores form an acceptable progression in the recognition process; and means for reducing the number of decision nodes, thereby reducing the computational load on the device.

(19) In a speech analysis device that recognizes a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern, and each sequence of keywords in the audio signal being described by a grammatical syntax characterized by a plurality of connected decision nodes, a speech recognition apparatus comprising: means employing dynamic programming to provide a series of numerical scores for recognizing keywords in the audio signal; means employing the grammatical syntax to determine which scores form an acceptable progression in the recognition process; and means for suspending an acceptable progression using an increment, so that a normally acceptable progression is abandoned in accordance with the syntax.

(20) In a speech analysis device that recognizes a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern, and each sequence of keywords in the audio signal being described by a grammatical syntax characterized by a plurality of connected decision nodes, a speech recognition method characterized by: employing dynamic programming to generate a series of numerical scores for recognizing keyword representations in the audio signal; employing the grammatical syntax to determine which scores form an acceptable progression in the recognition process; and using an increment to suspend an acceptable progression, so that a normally acceptable progression is abandoned in accordance with the syntax.
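The silence-recognition claims (two target templates, one of them changing dynamically with the incoming signal, and a numerical value on which the decision is based) can be pictured with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the claims: the squared-distance measure, the use of the smaller of the two distances as the numerical value, the fixed threshold, the smoothing rate, and the names `distance` and `SilenceDetector`.

```python
from typing import List, Sequence


def distance(frame: Sequence[float], template: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed comparison measure)."""
    return sum((x - t) ** 2 for x, t in zip(frame, template))


class SilenceDetector:
    """Hypothetical two-template silence detector.

    `fixed` never changes; `adaptive` tracks frames judged to be silence,
    standing in for a dynamically changing silence template.
    """

    def __init__(self, fixed_template: Sequence[float],
                 threshold: float, rate: float = 0.1) -> None:
        self.fixed: List[float] = list(fixed_template)
        self.adaptive: List[float] = list(fixed_template)
        self.threshold = threshold  # assumed decision threshold
        self.rate = rate            # assumed smoothing rate

    def is_silence(self, frame: Sequence[float]) -> bool:
        # Compare the incoming frame with both templates and keep the
        # better (smaller) distance as the numerical value.
        score = min(distance(frame, self.fixed),
                    distance(frame, self.adaptive))
        silent = score < self.threshold
        if silent:
            # Move the adaptive template toward the frame just accepted.
            self.adaptive = [(1 - self.rate) * t + self.rate * x
                             for t, x in zip(self.adaptive, frame)]
        return silent
```

A caller would feed one spectral frame at a time, for example `SilenceDetector([0.0, 0.0], threshold=1.0).is_silence([0.1, 0.1])`. The syntax-dependent adjustment of claims 3 and 12 is not modeled here.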