JPH0372995B2

Info

Publication number: JPH0372995B2
Application number: JP61032052A
Authority: JP
Inventors: Rai Baaru Raritsuto; Binsento Desooza Piitaa; Reroi Maasaa Robaato; Aran Pichenii Maikeru
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1986-02-18
Filing date: 1986-02-18
Publication date: 1991-11-20
Also published as: JPS62194295A

Description

[Detailed Description of the Invention] The present invention is described in the following order.

A. Field of Industrial Application
B. Summary of the Disclosure
C. Prior Art
D. Problems to Be Solved by the Invention
E. Means for Solving the Problems
F. Embodiments
F1 Environment of the speech recognition system (FIGS. 2 to 4)
F2 Auditory model and its implementation in the acoustic processor of the speech recognition system (FIGS. 8 to 14)
F3 Precise matching (FIGS. 4, 15 and 16)
F4 Phoneme tree structure (FIG. 17)
F5 Language model (FIG. 2)
F6 Shaping by approximation (FIG. 18)
F7 Extension of word paths with words selected by acoustic matching (FIGS. 5 to 7 and 19)
F8 Markov models constructed from multiple utterances of a word (FIGS. 1A, 1B, 1C and 20 to 26)
G. Effects of the Invention

A. Field of Industrial Application
The present invention relates generally to the field of speech recognition and, more particularly, to the construction of basic forms in a speech recognition system.

B. Summary of the Disclosure
The present invention addresses the problem of constructing a fenemic basic form that takes into account the utterance-to-utterance variation in the pronunciation of a word. (A basic form is a word Markov model. A set of labels obtained from the front-end signal processor, each representing an acoustic type that can be assigned to a short time interval, is taken as the set of elementary phonemes, and the partial Markov models corresponding to these labels are concatenated to build the word Markov model. The result is therefore called a basic form of the feneme type.) Specifically, the invention relates to a method of constructing a fenemic basic form for a word in a vocabulary of word segments, the method including the steps of:

(a) converting each of multiple utterances of the word into a respective feneme string;
(b) forming a set of fenemic Markov model phoneme machines;
(c) determining the single best phoneme machine P₁ for generating the multiple feneme strings;
(d) determining the best two-phoneme basic form, of the form P₁P₂ or P₂P₁, for generating the multiple feneme strings;
(e) aligning the best two-phoneme basic form against each feneme string;
(f) dividing each feneme string into a left part and a right part, the left part corresponding to the first phoneme machine of the two-phoneme basic form and the right part corresponding to the second phoneme machine;
(g) identifying each left part as a left substring and each right part as a right substring;
(h) processing the set of left substrings and the set of right substrings in the same way as the set of feneme strings corresponding to the multiple utterances, including the further step of prohibiting further division of a substring when its single-phoneme basic form generates the substring with a higher probability than the best two-phoneme basic form does; and
(i) concatenating the undivided single phonemes in an order corresponding to the order of the feneme substrings to which they correspond.

C. Prior Art
In some speech recognition systems, an acoustic processor receives speech as input and generates a string of labels in response. The acoustic processor selects these labels from an alphabet, that is, a set of distinct labels, based on certain characteristics of the input speech.

Typically, the acoustic processor examines characteristics of the power spectrum of the input speech over centisecond intervals and assigns a label (called a feneme) to each interval. The acoustic processor thus generates a string of fenemes corresponding to the input speech.

In statistical approaches to speech recognition, a finite set of models is defined. Each model is a Markov model, that is, a probabilistic finite-state phoneme machine, which generates fenemes. This approach is described, for example, in L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

Under the probabilistic approach, each phoneme machine is characterized by:
(a) a plurality of states;
(b) transitions between the states, each transition having a probability associated with it; and
(c) for each of at least some of the transitions, a plurality of output probabilities, each representing the probability of producing a particular feneme at that transition.
A phoneme machine may include null transitions, which produce no feneme. Typically, each non-null transition has a probability assigned to every feneme in the alphabet.
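The three characteristics listed above can be held in a small data structure. The following is a minimal sketch, not the patent's implementation: the class names `Transition` and `PhoneMachine`, the `check` helper, and all numeric values are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    src: int                  # state the transition leaves
    dst: int                  # state the transition enters
    prob: float               # probability of taking this transition
    # label -> output probability; left empty for a null transition
    outputs: dict = field(default_factory=dict)

    @property
    def is_null(self):
        return not self.outputs

@dataclass
class PhoneMachine:
    num_states: int
    transitions: list

    def check(self, tol=1e-9):
        """Return True if transition and output probabilities each sum to 1."""
        for s in range(self.num_states):
            leaving = [t for t in self.transitions if t.src == s]
            if leaving and abs(sum(t.prob for t in leaving) - 1.0) > tol:
                return False
        for t in self.transitions:
            if not t.is_null and abs(sum(t.outputs.values()) - 1.0) > tol:
                return False
        return True

# Hypothetical two-state machine over a two-label alphabet {"A", "B"}.
machine = PhoneMachine(2, [
    Transition(0, 0, 0.3, {"A": 0.9, "B": 0.1}),   # non-null self-loop
    Transition(0, 1, 0.6, {"A": 0.9, "B": 0.1}),   # non-null forward arc
    Transition(0, 1, 0.1),                         # null arc, emits nothing
])
```

Keeping the output probabilities on the transitions, rather than on the states, mirrors item (c) of the list.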

Once the input speech has been converted into a string of fenemes, a phoneme machine can be examined to determine the likelihood that it generates a given substring of fenemes in the string. The examination can be carried out phoneme by phoneme, giving the likelihood that each phoneme machine generates the substring. Similarly, a sequence of phonemes can be examined to determine the likelihood that the phonemes in the sequence generate the fenemes in the string.

Research at IBM has identified several types of phoneme machine. One type is the "phonetic phoneme machine," which stores statistics reflecting the likelihood that a given phonetic element, when uttered, produces the fenemes of a feneme string. Another type is the "fenemic phoneme machine," which stores statistics reflecting the likelihood that a given fenemic element, when uttered, produces the fenemes of a feneme string.

A fenemic phoneme machine has two states, S₁ and S₂. One non-null transition leads from S₁ back to itself. A second non-null transition and a null transition run between S₁ and S₂.
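For a machine this small, the likelihood of generating a given feneme substring can be written in closed form. The sketch below assumes that both S₁-S₂ transitions run from S₁ to S₂ and that the machine starts in S₁ and ends in S₂. The function name and all probability values are hypothetical.

```python
def feneme_machine_prob(labels, p_self, p_fwd, p_null, out_self, out_fwd):
    """Probability that the two-state machine emits exactly `labels`.

    p_self  : probability of the non-null S1 -> S1 transition
    p_fwd   : probability of the non-null S1 -> S2 transition
    p_null  : probability of the null S1 -> S2 transition
    out_*   : label -> output probability on the matching non-null transition
    """
    if not labels:
        return p_null                      # only the null arc emits nothing
    stay = 1.0                             # all labels but the last use the self-loop
    for y in labels[:-1]:
        stay *= p_self * out_self.get(y, 0.0)
    last = labels[-1]
    via_fwd = stay * p_fwd * out_fwd.get(last, 0.0)               # last label on S1 -> S2
    via_null = stay * p_self * out_self.get(last, 0.0) * p_null   # self-loop, then null arc
    return via_fwd + via_null

out = {"A": 0.9, "B": 0.1}                 # assumed output distribution
p_one = feneme_machine_prob(["A"], 0.3, 0.6, 0.1, out, out)
p_two = feneme_machine_prob(["A", "A"], 0.3, 0.6, 0.1, out, out)
```

With these toy values a single "A" is far more likely than "AA", because every extra label costs one more pass through the self-loop.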

Each word in the vocabulary is represented by a predetermined sequence of phonemes (that is, phoneme machines) called a "word basic form." A fenemic basic form is a sequence of fenemic phonemes concatenated to represent a given word. A phonetic basic form is a sequence of phonetic phonemes concatenated to represent a given word.

The likelihood that a word matches the input speech reflects the probability that the word's basic form produces the fenemes in the string. That is, the basic form with the highest probability of producing the feneme string represents the word that most closely matches the input speech.

How well a basic form corresponds to the word it represents is therefore an important factor in the accuracy achieved by the probabilistic approach.

One technique for determining a basic form for each word in the vocabulary is called the singleton fenemic basic form technique. In this technique, each word is uttered only once. Each feneme generated by that single utterance of the word is associated with the phoneme machine that has the highest probability of producing that feneme.

In the singleton fenemic basic form technique, each phoneme machine is associated with one feneme. For each feneme generated in the string there is therefore one most likely corresponding phoneme machine. The sequence of phoneme machines corresponding to the utterance of a word represents that word.
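The singleton technique amounts to a per-label lookup. A minimal sketch, assuming each phoneme machine is summarized by a table giving the probability that it produces each label; the machine names and numbers are hypothetical.

```python
def singleton_basic_form(feneme_string, machines):
    """Map each feneme of one utterance to its most likely phoneme machine.

    machines: machine name -> {label: probability that the machine produces it}
    """
    return [
        max(machines, key=lambda name: machines[name].get(f, 0.0))
        for f in feneme_string
    ]

# Hypothetical machines, one per label of a two-label alphabet.
machines = {
    "P_A": {"A": 0.8, "B": 0.2},
    "P_B": {"A": 0.1, "B": 0.9},
}
basic_form = singleton_basic_form(["A", "A", "B"], machines)
```

The resulting basic form has exactly one machine per feneme of the single utterance, which is why it inherits every peculiarity of that utterance.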

D. Problems to Be Solved by the Invention
The singleton fenemic basic form technique has several problems. The pronunciation of a particular word can vary considerably. If the single utterance from which the basic form is constructed differs significantly from the way the word is pronounced on other occasions, recognition quality may suffer.

However, constructing a basic form based on multiple utterances of each word is not straightforward. The phoneme sequence, that is, the basic form $B = P_1 P_2 \cdots P_m$ (where each $P_i$, $i = 1, 2, \ldots, m$, is a phoneme), that has the maximum joint probability for $N$ utterances is the one that maximizes

$$\prod_{i=1}^{N} \Pr(B \mid f_{i1} f_{i2} f_{i3} \cdots f_{i l_i}),$$

where $f_{i1} \cdots f_{i l_i}$ is the feneme string of the $i$-th utterance. Computing this by any known method is unacceptably expensive.

An object of the present invention is to improve on the singleton fenemic basic form technique by taking into account the variations in pronunciation that can occur between one utterance of a word segment and another. A word segment may be an ordinary word or a portion of one.

A further object of the invention is to refine an initial basic form by an iterative process.

A further object of the invention is to allow basic forms constructed from phonetic phonemes or other types of phonemes to be used in the same way.

E. Means for Solving the Problems
According to the invention, each basic form is constructed from multiple utterances of the corresponding word segment by a so-called divide-and-conquer approach, which allows basic forms to be built effectively without severe demands on time or computation.

Specifically, one embodiment of the invention includes the following steps.

(a) Convert each of multiple utterances of a word segment into a respective feneme string.

(b) Form a set of fenemic Markov model phoneme machines.

(c) Determine the single best phoneme machine P₁ for generating the multiple feneme strings.

(d) Determine the best two-phoneme basic form, of the form P₁P₂ or P₂P₁, for generating the multiple feneme strings.

(e) Align the best two-phoneme basic form against each feneme string.

(f) Divide each feneme string into a left part and a right part, the left part corresponding to the first phoneme machine of the two-phoneme basic form and the right part corresponding to the second phoneme machine.

(g) Identify each left part as a left substring and each right part as a right substring.

(h) Process the set of left substrings in the same way as the set of feneme strings corresponding to the multiple utterances, and prohibit further division of a substring when its single-phoneme basic form generates the substring with a higher probability than the best two-phoneme basic form does.

(i) Process the set of right substrings in the same way as the set of feneme strings corresponding to the multiple utterances, and prohibit further division of a substring when its single-phoneme basic form generates the substring with a higher probability than the best two-phoneme basic form does.

(j) Concatenate the undivided single phonemes in an order corresponding to the order of the feneme substrings to which they correspond.
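Steps (c) through (j) can be sketched as a short recursive routine. This is an illustrative reading of the steps under several assumptions, not the patent's implementation:

- every phoneme machine is a two-state fenemic machine with the same hypothetical transition probabilities;
- the alignment of step (e) is approximated by choosing the single most likely split point;
- probabilities are multiplied directly, where a practical system would work with logarithms.

All names and numbers are invented for the example.

```python
from math import prod

# Assumed transition probabilities shared by every toy machine:
# S1->S1 (emits), S1->S2 (emits), S1->S2 (null).
P_SELF, P_FWD, P_NULL = 0.3, 0.6, 0.1

def phone_prob(out, labels):
    """Probability that one machine with outputs `out` emits exactly `labels`."""
    if not labels:
        return P_NULL
    p = 1.0
    for y in labels[:-1]:
        p *= P_SELF * out.get(y, 0.0)
    # Last label: forward arc, or self-loop followed by the null arc.
    return p * out.get(labels[-1], 0.0) * (P_FWD + P_SELF * P_NULL)

def split_score(out1, out2, labels, k):
    return phone_prob(out1, labels[:k]) * phone_prob(out2, labels[k:])

def pair_prob(out1, out2, labels):
    """Probability that machine 1 followed by machine 2 emits `labels`."""
    return sum(split_score(out1, out2, labels, k) for k in range(len(labels) + 1))

def best_split(out1, out2, labels):
    """Most likely boundary between the two machines (steps e and f)."""
    return max(range(len(labels) + 1),
               key=lambda k: split_score(out1, out2, labels, k))

def grow_basic_form(strings, phones):
    def joint(score):
        return prod(score(s) for s in strings)

    # Step (c): best single phoneme machine for all strings jointly.
    p1 = max(phones, key=lambda p: joint(lambda s: phone_prob(phones[p], s)))
    single = joint(lambda s: phone_prob(phones[p1], s))

    # Step (d): best two-phoneme form P1P2 or P2P1.
    def pair_joint(pair):
        a, b = pair
        return joint(lambda s: pair_prob(phones[a], phones[b], s))
    pairs = [(p1, q) for q in phones] + [(q, p1) for q in phones]
    a, b = max(pairs, key=pair_joint)

    # Steps (h), (i): stop splitting when one phoneme does at least as well.
    if single >= pair_joint((a, b)):
        return [p1]

    # Steps (e) to (g): align, then cut every string into left and right parts.
    cuts = [best_split(phones[a], phones[b], s) for s in strings]
    left = [s[:k] for s, k in zip(strings, cuts)]
    right = [s[k:] for s, k in zip(strings, cuts)]
    if not any(left) or not any(right):      # guard against a degenerate split
        return [p1]

    # Step (j): concatenate the results in substring order.
    return grow_basic_form(left, phones) + grow_basic_form(right, phones)

phones = {"PA": {"A": 0.9, "B": 0.1}, "PB": {"A": 0.1, "B": 0.9}}
utterances = [list("AB"), list("AB"), list("AAB")]
basic_form = grow_basic_form(utterances, phones)
```

Here three utterances of different lengths yield one shared two-phoneme basic form. The recursion terminates because each non-degenerate split strictly shortens both sets of substrings.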

F. Embodiments
F1 Environment of the speech recognition system (FIGS. 2 to 4)
FIG. 2 is a general block diagram of a speech recognition system 1000 that provides the environment for the present invention. The system includes a stack decoder 1002 and, connected to it, an acoustic processor (AP) 1004, an array processor 1006 that performs fast approximate acoustic matching, an array processor 1008 that performs precise acoustic matching, a language model 1010, and a workstation 1012.

The acoustic processor 1004 is designed to convert a speech waveform input into a string of labels, or fenemes, each of which roughly identifies a corresponding sound type. The sound types are generally defined by a clustering algorithm that can reflect a Gaussian or other distribution of spectral energy or other features. In the present system, the acoustic processor 1004 is based on a unique model of human hearing and is described in U.S. Patent Application No. 06/665401 (filed October 26, 1984).

The labels, or fenemes, from the acoustic processor 1004 are sent to the stack decoder 1002. FIG. 3 shows the logical elements of the stack decoder 1002. The stack decoder 1002 includes a search device 1020 and, connected to it, the workstation 1012 and interfaces 1022, 1024, 1026 and 1028. These interfaces are connected to the acoustic processor 1004, the array processors 1006 and 1008, and the language model 1010, respectively. In the system shown in FIG. 2, the fenemes from the acoustic processor 1004 are passed by the search device 1020 to the array processor 1006 (fast matching). The array processor 1006 is designed to examine the words in the vocabulary and reduce the number of candidate words for a given string of incoming labels. Fast matching is performed with probabilistic finite-state machines (also referred to herein as Markov model phoneme machines).

Precise matching preferably examines those words from the fast-match candidate list that, according to language model computations, have a reasonable likelihood of being the spoken word.

Alternatively, precise matching can be applied to every word in the vocabulary, in which case fast matching is omitted. When the phonemes are of the phonetic type, precise matching is performed with Markov model phoneme machines such as the one shown in FIG. 4.

After precise matching, the language model is preferably invoked again to determine word likelihood.

The sequence of fast matching, language model, precise matching, and language model should be understood as one example of a system in which the invention can be used. A system that includes only precise matching (with phonetic, fenemic, or other types of phonemes) can use the invention equally well.

The purpose of the stack decoder 1002 is to determine the word string W that has the highest probability given the string of labels y₁y₂y₃….

Mathematically, this is expressed as follows.

$$\max_{W} \Pr(W \mid Y) \tag{1}$$

This is the maximum, over all word strings W, of the probability of W given Y. As is well known, Pr(W|Y) can be written as

$$\Pr(W \mid Y) = \frac{\Pr(W)\,\Pr(Y \mid W)}{\Pr(Y)} \tag{2}$$

where Pr(Y) is independent of W.
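Because Pr(Y) does not depend on W, maximizing Pr(W|Y) is the same as maximizing the product Pr(W)·Pr(Y|W). The toy sketch below illustrates only that point. The candidate word strings and both probability tables are invented, and exhaustive enumeration is used purely for illustration.

```python
def decode(candidates, language_prob, acoustic_prob):
    """Return the word string W maximizing Pr(W) * Pr(Y | W).

    language_prob[W] : Pr(W), as supplied by a language model
    acoustic_prob[W] : Pr(Y | W), as supplied by acoustic matching
    Pr(Y) is omitted because it is the same for every candidate.
    """
    return max(candidates, key=lambda w: language_prob[w] * acoustic_prob[w])

# Hypothetical scores for one fixed label string Y.
language_prob = {"to be": 0.6, "two bee": 0.1}
acoustic_prob = {"to be": 0.2, "two bee": 0.5}
best = decode(list(language_prob), language_prob, acoustic_prob)
```

In this example the acoustically better candidate loses because the language model rates it much lower (0.05 against 0.12).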

One way to determine the most likely path (that is, sequence) of consecutive words W* is to examine every possible path and determine, for each, the probability that it produces the label string being decoded. The path with the highest associated probability is then selected. For a vocabulary of 5000 words, this approach becomes unwieldy and impractical, especially when the word sequences are long.

Two other known methods for finding the most likely word sequence W* are Viterbi decoding and stack decoding. Each of these techniques is described, in its own section, in L. R. Bahl et al., "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983.

The stack decoding technique in that article concerns single-stack decoding. That is, paths of different lengths are listed in a single stack according to likelihood, and decoding is based on this single stack. Single-stack decoding must contend with the fact that likelihood depends to some extent on path length, so normalization is generally applied. However, if the normalization factor is not estimated properly, normalization can lead to excessive searching and search errors.
Improper searching can result in excessive searching and searching errors.

The Viterbi technique does not require normalization, but it is generally practical only for small tasks.

With large vocabularies, the Viterbi algorithm, which is fundamentally time-synchronous, may have to interface with an acoustic matching component that is asynchronous. In that case, the resulting interface is not a good fit.

An alternative novel apparatus and method invented by L. R. Bahl et al., described below, decodes the most likely word sequence W* with lower computational requirements and higher accuracy than the other techniques. In particular, the technique is characterized by multi-stack decoding and a unique decision strategy for determining which word sequence should be extended at a given time. Under this strategy, a relatively short path is not penalized for its shortness but is instead judged on its relative likelihood. The novel apparatus and method, shown in FIGS. 5, 6 and 7, are described in detail below.

The stack decoder 1002 in effect controls the other elements but does not itself perform much computation. The stack decoder 1002 therefore preferably includes a 4341 processor running under the IBM VM/370 operating system (Model 155, VS2, Release 1.7). The array processors, which perform a considerable amount of computation, are implemented with the commercially available Floating Point Systems (FPS) 190L.

F2 Auditory model and its implementation in the acoustic processor of the speech recognition system (FIGS. 8 to 14)
FIG. 8 shows a specific embodiment of an acoustic processor 1100 as described above. An acoustic wave input (for example, natural speech) enters an A/D converter 1102, which samples it at a predetermined rate. A typical sampling rate is one sample every 50 microseconds. A time window generator 1104 is provided to shape the edges of the digital signal. The output of the time window generator 1104 enters an FFT (fast Fourier transform) unit 1106, which provides a frequency spectrum output for each time window.
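The chain of sampling, windowing and spectrum analysis can be illustrated with a few lines of standard-library Python. This is only a sketch: the Hann window is an assumption (the text says only that the edges of the signal are shaped), a naive DFT stands in for the FFT unit, and the 32-sample frame is a toy size chosen to keep the example fast.

```python
import cmath
import math

SAMPLE_PERIOD_US = 50                              # from the text: one sample per 50 microseconds
SAMPLE_RATE_HZ = 1_000_000 // SAMPLE_PERIOD_US     # 20 000 samples per second

def hann_window(n):
    """Assumed window shape that tapers both edges of a frame to zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Window one frame and return |DFT| for bins 0..n/2 (naive O(n^2) DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(n))]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]

# Toy frame: a pure tone that falls exactly on bin 4 of a 32-sample frame.
frame = [math.cos(2 * math.pi * 4 * i / 32) for i in range(32)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE_HZ / len(frame)
```

At one sample every 50 microseconds the sampling rate is 20 kHz, so bin 4 of a 32-sample frame corresponds to 2500 Hz.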

The output of the FFT unit 1106 is then processed to produce labels L₁L₂…L_f. A feature selector 1108, a cluster unit 1110, a prototype unit 1112, and a labeling unit 1114 cooperate to generate the labels. In generating the labels, prototypes are formed as points (or vectors) in a space based on the selected features. The acoustic input is characterized by the same selected features, so that it yields corresponding points (or vectors) in the space that can be compared with the prototypes.

Specifically, in defining the prototypes, the cluster unit 1110 groups sets of points into clusters. The clustering method is based on a probability distribution (such as a Gaussian distribution) applied to speech. The prototype of each cluster is generated by the prototype unit 1112 (in relation to the centroid or other characteristics of the cluster). The generated prototypes and the acoustic input, both characterized by the same selected features, enter the labeling unit 1114. The labeling unit 1114 performs a comparison procedure and, as a result, assigns a label to each particular acoustic input.
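The comparison step can be pictured as a nearest-prototype search. The sketch below assumes squared Euclidean distance as the comparison measure, which the text does not specify, and uses made-up two-dimensional feature vectors and label names.

```python
def label_frames(frames, prototypes):
    """Assign each feature vector the label of its closest prototype.

    prototypes: label name -> prototype vector (same features as the frames)
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(prototypes, key=lambda name: dist2(frame, prototypes[name]))
        for frame in frames
    ]

# Hypothetical prototypes and three input frames.
prototypes = {"L1": (0.0, 0.0), "L2": (10.0, 10.0)}
frames = [(1.0, 0.0), (9.0, 8.0), (4.0, 4.0)]
labels = label_frames(frames, prototypes)
```

The output is one label per frame, which is the form of feneme string that the decoder consumes.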

Selecting appropriate features is a key factor in deriving labels that represent the acoustic (speech) wave input. The acoustic processor described here involves an improved feature selector 1108. For this acoustic processor, a unique auditory model is derived and applied. The auditory model is explained with reference to FIG. 9.

FIG. 9 shows part of the human inner ear. Specifically, it shows in detail an inner hair cell 1200 with end portions 1202 extending into a fluid-containing channel 1204. Upstream of the inner hair cell 1200 are outer hair cells 1206, with end portions 1208 extending into the channel 1204. Nerves that convey information to the brain are connected to the inner hair cell 1200 and the outer hair cells 1206. Electrochemical changes are stimulated by mechanical motion of the basilar membrane 1210.

It has long been known that the basilar membrane 1210 acts as a frequency analyzer for acoustic wave inputs and that portions along the basilar membrane 1210 respond to respective critical frequency bands. The different portions of the basilar membrane 1210, each responding to its corresponding frequency band, affect the perceived loudness of the acoustic waveform input. That is, two tones are perceived as louder when they lie in separate critical frequency bands than when two tones of similar power occupy the same frequency band. The basilar membrane 1210 has been found to define on the order of 22 critical frequency bands.

In keeping with the frequency response of the basilar membrane 1210, the invention in its preferred form resolves the input acoustic waveform into some or all of the critical frequency bands and then examines the signal component of each defined critical frequency band separately. This is done by suitably filtering the signal from the FFT unit 1106 (FIG. 8) so as to supply the feature selector 1108 with a separate signal for each critical frequency band examined.

The separate inputs are also blocked into time frames (preferably of 25.6 milliseconds) by the time window generator 1104. The feature selector 1108 therefore preferably includes 22 signals, each of which represents the sound intensity in a given frequency band for each time frame.

The signals are preferably filtered by a conventional critical band filter 1300 of FIG. 10. The signals are then processed individually by a loudness equalization converter 1302, which accounts for the variation of perceived loudness as a function of frequency. The perceived loudness of a first tone at a given dB level at one frequency may differ from that of a second tone at the same dB level at another frequency. Based on empirical data, the loudness equalization converter 1302 converts the signals in the respective frequency bands so that each is measured on the same loudness scale. For example, the loudness equalization converter 1302 can map acoustic energy to equivalent loudness using a slightly modified version of the 1933 work of Fletcher and Munson. FIG. 11 shows the result of that modification. According to FIG. 11, a 1 kHz tone at 40 dB corresponds in loudness level to a 100 Hz tone at 60 dB.

The loudness equalization converter 1302 adjusts loudness according to the curves shown in FIG. 11 so as to produce equivalent loudness regardless of frequency.

周波数への依存性のほか、第１１図で特定の周
波数を調べれば明らかなように、パワーの変化は
音量の変化に対応しない。すなわち、音の強度、
すなわち振幅の変動は、すべての点で、知覚され
た音量の同様の変化に反映されない。例えば、
100Hzの周波数では、110dB付近における10dBの
知覚された音量変化は、20dB付近における10dB
の知覚された音量変化よりもずつと大きい。この
差は、所定の方法で音量を圧縮する音量圧縮装置
１３０４により処理する。音量圧縮装置１３０４
は、ホン単位の音量振幅測定値をソーン単位に置
換えることにより、パワーＰをその立方根P^1/3に
圧縮することができる。 Apart from the dependence on frequency, a change in power does not correspond to an equal change in loudness, as can be seen by following a single frequency in FIG. 11. That is, variations in sound intensity (amplitude) are not reflected by like changes in perceived loudness at all levels. For example, at a frequency of 100 Hz, the perceived change in loudness for a 10 dB change near 110 dB is much larger than that for a 10 dB change near 20 dB. This difference is handled by a volume compression device 1304, which compresses loudness in a predefined fashion. The volume compression device 1304 can compress the power P to its cube root, $P^{1/3}$, by replacing loudness amplitude measured in phons with loudness in sones.

第１２図は、経験的に決められた既知のホン対
ソーンの関係を示す。ソーン単位の使用により、
本発明のモデルは大きな音声信号振幅でもほぼ正
確な状態を保持する。１ソーンは、1kHzのトー
ンで40dBの音量と規定されている。 FIG. 12 shows the known, empirically determined relationship between phons and sones. By using sones, the present model remains substantially accurate even at large speech signal amplitudes. One sone is defined as the loudness of a 1 kHz tone at 40 dB.

第１０図には、新規の時変レスポンス装置１３
０６が示されている。この装置は、各臨界周波数
バンドに関連した音量等化および音量圧縮信号に
より動作する。詳細に述べれば、検査された周波
数バンドごとに、神経発火率ｆが各時間フレーム
で決められている。発火率ｆは本発明の音響プロ
セツサに従つて次のように定義される。 FIG. 10 also shows a novel time-varying response device 1306. This device operates on the equalized and compressed loudness signals associated with each critical frequency band. Specifically, for each frequency band examined, a neural firing rate f is determined for each time frame. In the acoustic processor of the present invention, the firing rate f is defined as follows.

ｆ＝（So＋DL）ｎ (1) ただし、ｎは神経伝達物質の量；Soは音響波
形入力と無関係に神経発火にかかわる自発的な発
火定数；Ｌは音量測定値；Ｄは変位定数である。
So・ｎは音量波入力の有無に無関係に起きる自
発的な神経発火率に相当し、DLnは音響波入力に
よる発火率に相当する。
$$f = (S_o + D L)\,n \qquad (1)$$
where n is the amount of neurotransmitter; $S_o$ is a spontaneous firing constant that relates to neural firing independent of the acoustic wave input; L is the measured loudness; and D is a displacement constant. $S_o n$ corresponds to the spontaneous neural firing rate, which occurs whether or not there is an acoustic wave input, and $D L n$ corresponds to the firing rate due to the acoustic wave input.

重要な点は、本発明では、ｎの値は次式により
時間とともに変化するという特徴を有することで
ある。 An important point is that the present invention has the characteristic that the value of n changes with time according to the following equation.

dn／dt＝Ao−（So＋Sh＋DL）ｎ (2) ただし、Aoは補充定数；Shは自発的な神経伝
達物質減衰定数である。式(2)に示す新しい関数
は、神経伝達物質が一定の割合Aoで生成されな
がら、(a) 減衰（Sh・ｎ）、(b) 自発的な発火
（So・ｎ）、および(c) 音響波入力による神経発
火（DL・ｎ）により失われることを考慮してい
る。これらのモデル化された現象は第９図に示さ
れた場所で起きるものと仮定する。
$$\frac{dn}{dt} = A_o - (S_o + S_h + D L)\,n \qquad (2)$$
where $A_o$ is a replenishment constant and $S_h$ is a spontaneous neurotransmitter decay constant. The novel relationship in equation (2) takes into account that neurotransmitter is produced at a constant rate $A_o$ and is lost through (a) decay ($S_h n$), (b) spontaneous firing ($S_o n$), and (c) neural firing due to the acoustic wave input ($D L n$). These modeled phenomena are assumed to occur at the locations shown in FIG. 9.

式(2)で明らかなように、神経伝達物質の次量お
よび次発火率が少なくとも神経伝達物質の現量の
自乗に比例しており、本発明の音響プロセツサが
非線形であるという事実を示している。すなわ
ち、状態（ｔ＋△ｔ）での神経伝達物質の量は、
状態（ｔ＋dn／dt・△ｔ）での神経伝達物質の
量に等しい。よつて、ｎ（ｔ＋△ｔ）＝ｎ（ｔ）＋（dn／dt）・△ｔ (3) が成立する。 Equation (2) also shows that the acoustic processor of the present invention is nonlinear: the next amount of neurotransmitter and the next firing rate depend multiplicatively on the current conditions, at least on the current amount of neurotransmitter. That is, the amount of neurotransmitter at time (t+Δt) equals the amount at time t plus (dn/dt)·Δt. Therefore,
$$n(t+\Delta t) = n(t) + \frac{dn}{dt}\,\Delta t \qquad (3)$$
holds.

式(1)、(2)および(3)は、時変信号分析器の動作を
表す。時変信号分析器は、聴覚器官系が時間に適
応性を有し、聴神経の信号が音響波入力と非直線
的に関連させられるという事実を示している。ち
なみに、本発明の音響プロセツサは、神経系統の
明白な時間的変化によりよく追随するように、音
声認識システムで非線形信号処理を実施する最初
のモデルを提供するものである。 Equations (1), (2) and (3) represent the operation of the time-varying signal analyzer. Time-varying signal analyzers point to the fact that the auditory system is time-adaptive, and the signals of the auditory nerve are non-linearly related to the acoustic wave input. Incidentally, the acoustic processor of the present invention provides the first model for implementing nonlinear signal processing in a speech recognition system to better track the apparent temporal changes in the neural system.
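
The behavior described by equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_firing_rate`, the unit time step (one frame), the choice of starting n at its resting value, and the default constants (the preferred values quoted elsewhere in this description, taken here to be in centisecond units) are all choices made for the example rather than details of the disclosed processor.

```python
def simulate_firing_rate(loudness, so=0.0888, sh=0.11111, d=0.00666,
                         ao=1.0, dt=1.0):
    """Return the firing rate f for each frame of one frequency band.

    loudness: sequence of loudness values L (in sones), one per frame.
    Assumption: n starts at its resting steady state Ao / (So + Sh),
    i.e. the solution of dn/dt = 0 in equation (2) for L = 0.
    """
    n = ao / (so + sh)
    rates = []
    for level in loudness:
        f = (so + d * level) * n                   # equation (1)
        dn_dt = ao - (so + sh + d * level) * n     # equation (2)
        n = n + dn_dt * dt                         # equation (3)
        rates.append(f)
    return rates


if __name__ == "__main__":
    # Hypothetical input: 10 silent frames, then 40 frames of a 20-sone tone.
    rates = simulate_firing_rate([0.0] * 10 + [20.0] * 40)
    print(rates[0], rates[10], rates[-1])
```

With these constants the output jumps at tone onset and then decays toward a lower steady level, which is the time-adaptive, nonlinear behavior the text attributes to the time-varying response stage.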

式(1)および(2)において未知の項数を少なくする
ため、本発明では、一定の音量Ｌに適用される次
式を用いる。 In order to reduce the number of unknown terms in equations (1) and (2), the present invention uses the following equation that is applied to a constant volume L.

So＋Sh＋DL＝１／Ｔ (4) ただし、Ｔはオーデイオ波入力が生成された
後、聴覚レスポンスがその最大値の37％に低下す
るまでの時間の測定値である。Ｔは、音量の関数
であり、本発明の音響プロセツサにより、種々の
音量レベルのレスポンスの減衰を表示する既知の
グラフから取出す。すなわち、一定の音量のトー
ンが生成されると、最初、高いレベルのスレポン
スが生じ、その後、レスポンスは時定数Ｔによ
り、安定した状態のレベルに向つて減衰する。音
響波入力がない場合、Ｔ＝T₀である。これは50
ミリ秒程度である。音量がL_naxの場合、Ｔ＝
T_naxである。これは30ミリ秒程度である。Ao＝
１に設定することにより、１／（So＋Sh）は、
Ｌ＝０の場合、５センチ秒と決定される。Ｌが
L_naxで、L_nax＝20ソーンの場合、次式が成立つ。
$$S_o + S_h + D L = 1/T \qquad (4)$$
where T is a measure of the time it takes for the auditory response to drop to 37% of its maximum after an acoustic wave input is generated. T is a function of loudness and, in the acoustic processor of the present invention, is derived from existing graphs that show the decay of the response at various loudness levels. That is, when a tone of fixed loudness is generated, it initially produces a response at a high level, after which the response decays toward a steady-state level with time constant T. With no acoustic wave input, $T = T_0$, which is on the order of 50 milliseconds. At a loudness of $L_{max}$, $T = T_{max}$, which is on the order of 30 milliseconds. With $A_o = 1$, $1/(S_o + S_h)$ is determined to be 5 centiseconds when L = 0. When L is $L_{max}$ and $L_{max}$ = 20 sones, the following holds.

So＋Sh＋Ｄ（20）＝１／30 (5) 前記データおよび式により、SoおよびShは下
記に示す式(6)および(7)により決まる。
$$S_o + S_h + D(20) = 1/30 \qquad (5)$$
From the foregoing data and equations, $S_o$ and $S_h$ are given by equations (6) and (7) below.

So＝DL_nax／〔Ｒ＋（DL_naxT₀Ｒ）−１〕 (6) Sh＝１／T₀−S_p (7) ただし、Ｒ＝ｆ安定状態｜／ｆ安定状態｜ L_nax／Ｌ＝０ (8) f安定状態は、dn／dtが０の場合、所与の音量
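での発火率を表わす。
$$S_o = \frac{D L_{max}}{R + (D L_{max} T_0 R) - 1} \qquad (6)$$
$$S_h = \frac{1}{T_0} - S_o \qquad (7)$$
where
$$R = \frac{f_{\text{steady state}}\big|_{L = L_{max}}}{f_{\text{steady state}}\big|_{L = 0}} \qquad (8)$$
and $f_{\text{steady state}}$ is the firing rate at a given loudness when dn/dt is zero.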

Ｒは、音響プロセツサに残つている唯一の変数
である。それゆえ、このプロセツサの性能はＲを
変えるだけで変更される。すなわち、Ｒは、性能
を変更するのに調整することができる１つのパラ
メータで、通常は、過渡状態の効果に対し安定状
態の効果を最小限にすることを意味する。類似の
音声入力の場合に出力パターンが一貫性に欠ける
ことは一般に、周波数レスポンスの相違、話者の
差異、背景雑音ならびに、（音声信号の安定状態
部分には影響するが過度部分には影響しない）歪
みにより生ずるから、安定状態の効果を最小限に
することが望ましい。Ｒの値は、完全な音声認識
システムのエラー率を最適化するように設定する
ことが望ましい。このようにして見つかつた最適
値はＲ＝1.5である。その場合、SoおよびShの値
はそれぞれ0.0888および0.11111であり、Ｄの値
は0.00666が得られる。 R is the only variable left in the acoustic processor, so the performance of the processor is altered simply by changing R. In other words, R is a single parameter that can be adjusted to change performance, which normally means minimizing steady-state effects relative to transient effects. Minimizing steady-state effects is desirable because inconsistent output patterns for similar speech inputs generally result from differences in frequency response, speaker differences, background noise, and distortion, all of which affect the steady-state portions of the speech signal but not the transient portions. The value of R is preferably set by optimizing the error rate of the complete speech recognition system. The optimum value found in this way is R = 1.5. The values of So and Sh are then 0.0888 and 0.11111, respectively, and D is 0.00666.
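
As a check on the arithmetic, the quoted constants can be reproduced from equations (4), (6), and (7). The sketch below is illustrative only: the function name `derive_constants` is hypothetical, and times are taken in centiseconds (T₀ = 5, T_max = 3), an assumption made because that unit reproduces the values of So, Sh, and D given in the text.

```python
def derive_constants(t0=5.0, t_max=3.0, l_max=20.0, r=1.5):
    """Solve equations (4), (6) and (7) for So, Sh and D.

    Assumption: t0 and t_max are expressed in centiseconds.
    """
    # Equation (4) at L = 0 gives So + Sh = 1/T0; at L = L_max it gives
    # So + Sh + D*L_max = 1/T_max. Subtracting yields D*L_max directly.
    d_lmax = 1.0 / t_max - 1.0 / t0
    so = d_lmax / (r + d_lmax * t0 * r - 1.0)   # equation (6)
    sh = 1.0 / t0 - so                           # equation (7)
    d = d_lmax / l_max
    return so, sh, d


if __name__ == "__main__":
    print(derive_constants())
```

Changing only `r` in this sketch shifts So and Sh while leaving D fixed, which matches the statement that R is the single remaining tuning parameter.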

第１３図は本発明による音響プロセツサの動作
の流れ図である。できれば、20kHzでサンプリン
グされた、25.6ミリ秒の時間フレーム中のデイジ
タル化音声は、ハニング窓１３２０を通過し、そ
の出力は10ミリ秒間隔で、DFT１３２２におい
て２重フーリエ変換されることが望ましい。変換
出力はブロツク１３２４で濾波され、少なくとも
１つの周波数バンド（できればすべての臨界周波
数バンドか、または少なくとも20のバンド）の
各々にパワー密度出力を供給する。次いで、パワ
ー密度はブロツク１３２６で、記録された大きさ
から音量レベルに変換される。この動作は、第１
１図のグラフの変更により、または、後に第１４
図に概要を示すプロセスにより取出された限界値
に基づいて実行される。 FIG. 13 is a flow diagram of the operation of the acoustic processor according to the present invention. Digitized speech in a 25.6 millisecond time frame, preferably sampled at 20 kHz, passes through a Hanning window 1320, the output of which is preferably subjected to a double Fourier transform in a DFT 1322 at 10 millisecond intervals. The transform output is filtered in block 1324 to provide a power density output for each of at least one frequency band (preferably all the critical frequency bands, or at least 20 of them). The power density is then converted from log magnitude to loudness level in block 1326. This conversion is performed either with the modified graph of FIG. 11 or on the basis of threshold values derived by the process outlined below with reference to FIG. 14.
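
The framing and windowing step can be illustrated as follows. This is a sketch under assumptions, not the disclosed implementation: the function names are hypothetical, a plain discrete Fourier transform stands in for the transform stage 1322, and the critical-band filtering of block 1324 is omitted because the band edges are not given here. At 20 kHz, a 25.6 millisecond frame is 512 samples and a 10 millisecond step is 200 samples.

```python
import cmath
import math


def hann_window(length):
    """Return Hann (Hanning) window coefficients for one frame."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def frame_power_spectra(samples, frame_len=512, hop=200):
    """Window each frame and return its power spectrum (bins 0..frame_len/2).

    Assumption: frame_len=512 and hop=200 correspond to 25.6 ms frames
    taken every 10 ms at a 20 kHz sampling rate.
    """
    window = hann_window(frame_len)
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w
                 for s, w in zip(samples[start:start + frame_len], window)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Plain O(N^2) discrete Fourier transform, kept simple on purpose.
            acc = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                      for i, x in enumerate(frame))
            spectrum.append(abs(acc) ** 2)
        spectra.append(spectrum)
    return spectra
```

In practice a fast Fourier transform would replace the inner loop; the direct form is used here only to keep the example dependency-free.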

第１４図において、最初に濾波された周波数バ
ンドｍの各々の間隔限界T_fおよび可聴限界T_hが
それぞれ、120dBおよび0dBになるように設定さ
れる（ブロツク１３４０）。その後、音声カウン
タ、合計フレーム・レジスタおよびヒストグラ
ム・レジスタをリセツトする（ブロツク１３４
２）。 In FIG. 14, a threshold of feeling $T_f$ and a threshold of hearing $T_h$ are initially set to 120 dB and 0 dB, respectively, for each filtered frequency band m (block 1340). A speech counter, a total-frames register, and a histogram register are then reset (block 1342).

ヒストグラムの各々はピン（bin）を含み、ブ
ンの各々は、（所与の周波数バンドで）パワーま
たは類似の測定値がそれぞれのレンジ内にある間
のサンプル数すなわちカウントを表す。本発明で
は、スヒトグラムは、（所与の周波数バンドごと
に）音量が複数の音量レンジの各々の中にある期
間のセンチ秒数を表すことが望ましい。例えば、
第３の周波数バンドでは、10dBと20dBのパワー
の間が20センチ秒の場合がある。同様に、第20の
周波数バンドでは、50dBと60dBの間に、合計
1000センチ秒のうちの150センチ秒がある場合が
ある。合計サンプル数（すなわちセンチ秒）およ
びビンに含まれたカウントから百分位数が取出さ
れる。 Each histogram includes bins, each of which indicates the number of samples, or counts, during which the power or some similar measure (in a given frequency band) lies in a respective range. In the present invention, a histogram preferably represents, for each given frequency band, the number of centiseconds during which the loudness lies in each of a plurality of loudness ranges. For example, in the third frequency band there may be 20 centiseconds with power between 10 dB and 20 dB. Similarly, in the twentieth frequency band there may be 150 centiseconds, out of a total of 1000 centiseconds, between 50 dB and 60 dB. Percentiles are derived from the total number of samples (that is, centiseconds) and the counts held in the bins.

ブロツク１３４４で、それぞれの周波数バンド
のフイルタ出力のフレームが検査され、ブロンク
１３４６で、適切なヒストグラム（フイルタ当り
１つ）中のビンが増分される。ブロツク１３４８
で、振幅が55dBを越えるビンの合計数がフイル
タ（すなわち周波数バンド）ごとに集計され、音
声の存在を示すフイルタ数を決定する。ブロツク
１３５０で、音声の存在を示す最小限（例えば20
のうちの６）のフイルタがない場合、ブロツク１
３４４で次のフレームを検査する。音声の存在を
示す十分なフイルタがある場合、ブロツク１３５
２で、音声カウンタを増分する。音声カウンタ
は、ブロツク１３５４で音声が10秒間現われ、ブ
ロツク１３５６で新しいT_fおよびT_hの値がフイ
ルタごとに決定されるまで増分される。 At block 1344, a frame of the filter output for each frequency band is examined, and at block 1346 a bin in the appropriate histogram (one histogram per filter) is incremented. At block 1348, the total number of bins in which the amplitude exceeds 55 dB is summed for each filter (that is, frequency band), and the number of filters indicating the presence of speech is determined. At block 1350, if fewer than a minimum number of filters (for example, 6 of 20) indicate speech, the next frame is examined at block 1344. If enough filters indicate speech, the speech counter is incremented at block 1352. The speech counter continues to be incremented until 10 seconds of speech have occurred (block 1354), whereupon new values of $T_f$ and $T_h$ are determined for each filter (block 1356).

所与のフイルタの新しいT_fおよびT_hの値は次
のように決定される。T_fの場合、1000ビンの最
上位から35番目のサンプルの保持するビンのdB
値（すなわち、音量の96.5番目の百分位数）は
BIN_Hと定義され、T_fはT_f＝BIN_H＋40dBに設定
される。T_hの場合、最下位のビンから（0.01）
（ビン総数−音声カウント）番目の値を保持する
ビンのdB値がBIN_Lと定義される。すなわち、
BIN_Lは、ヒストグラム中の、音声として分類さ
れたものを除いたサンプル数の１％のビンであ
る。T_hはT_h＝BIN_L−30dBと定義される。 The new values of $T_f$ and $T_h$ for a given filter are determined as follows. For $T_f$, the dB value of the bin holding the 35th sample from the top of 1000 bins (that is, the 96.5th percentile of loudness) is defined as $BIN_H$, and $T_f$ is set to $T_f = BIN_H + 40$ dB. For $T_h$, the dB value of the bin holding the (0.01)(total bins − speech count)-th value from the lowest bin is defined as $BIN_L$. That is, $BIN_L$ is the bin at 1% of the number of samples in the histogram, excluding the samples classified as speech. $T_h$ is then defined as $T_h = BIN_L - 30$ dB.
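
One possible reading of this threshold update is sketched below. The details are assumptions made for illustration: the function name `update_thresholds`, the representation of a histogram as parallel lists of counts and ascending dB values, and the interpretation of the ranks (the 35th sample counted down from the loudest bin, and 1% of the non-speech samples counted up from the quietest bin) are not spelled out in this form in the text.

```python
def update_thresholds(hist, bin_db, speech_count, top_rank=35):
    """Return (T_f, T_h) for one filter from its histogram.

    hist[i]   : number of frames (centiseconds) that fell in bin i
    bin_db[i] : dB value of bin i, in ascending order (assumed layout)
    """
    total = sum(hist)

    # BIN_H: bin holding the top_rank-th sample counted from the top.
    seen = 0
    bin_h = bin_db[0]
    for i in range(len(hist) - 1, -1, -1):
        seen += hist[i]
        if seen >= top_rank:
            bin_h = bin_db[i]
            break

    # BIN_L: bin holding the sample at 1% of the non-speech frames,
    # counted from the bottom.
    low_rank = max(1, round(0.01 * (total - speech_count)))
    seen = 0
    bin_l = bin_db[-1]
    for i in range(len(hist)):
        seen += hist[i]
        if seen >= low_rank:
            bin_l = bin_db[i]
            break

    return bin_h + 40.0, bin_l - 30.0
```

For a hypothetical filter with 10 dB wide bins, `update_thresholds([5, 95, 200, 300, 200, 150, 40, 10], [0, 10, 20, 30, 40, 50, 60, 70], speech_count=400)` places BIN_H in the 60 dB bin and BIN_L in the 10 dB bin.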

第１３図のブロツク１３３０および１３３２
で、音の振幅は、前述のように、限界値を更新
し、更新された限界値に基づいてソーン単位に変
換され、圧縮される。ソーン単位を導入し圧縮す
る代替方法は、（ビンが増分された後）フイルタ
振幅“ａ”を取出し、次式によりdBに変換する。 In blocks 1330 and 1332 of FIG. 13, the thresholds are updated as described above, and the sound amplitudes are converted to sones and compressed on the basis of the updated thresholds. An alternative way of deriving sones and compressing is to take the filter amplitude "a" (after the bins have been incremented) and convert it to dB as follows.

a^dB＝20log₁₀（ａ）−10 (9) 次に、フイルタ振幅の各々は、次式により同等
の音量を与えるように0dBと120dBの間のレンジ
に圧縮される。
$$a^{dB} = 20\log_{10}(a) - 10 \qquad (9)$$
Each filter amplitude is then scaled to a range between 0 dB and 120 dB to give equal loudness, as follows.

a^eq1＝120（a^dB−T_h）／（T_f−T_h） (10) 次にa^eq1は次式により、音量レベル（ホン単
位）からソーン単位の音量の近似値に変換
（40dBで1kHzの信号を１に写像）することが望ま
しい。
$$a^{eql} = \frac{120\,(a^{dB} - T_h)}{T_f - T_h} \qquad (10)$$
$a^{eql}$ is then preferably converted from a loudness level (in phons) to an approximation of loudness in sones (with a 1 kHz signal at 40 dB mapped to 1) as follows.

L^dB＝（a^eq1−30）／４（11）次に、ソーン単位の音量の近似値L_sは次式で与
えられる。
$$L^{dB} = \frac{a^{eql} - 30}{4} \qquad (11)$$
The approximate loudness in sones, $L_s$, is then given by the following.

L_s＝10（L^dB）／20 （12）ステツプ１３３４で、L_sは式(1)および(2)の入力
として使用され、周波数バンドごとの出力発火率
ｆを決定する。22周波数バンドの場合、22次元の
ベクトルが、連続する時間フレームにわたる音響
波入力を特徴づける。しかしながら、一般に、20
周波数バンドは、メルでスケーリングされた通常
のフイルタ・バンクを用いて検査する。
$$L_s = 10^{L^{dB}/20} \qquad (12)$$
At step 1334, $L_s$ is used as the input to equations (1) and (2) to determine the output firing rate f for each frequency band. With 22 frequency bands, a 22-dimensional vector characterizes the acoustic wave input over successive time frames. In general, however, 20 frequency bands are examined using a conventional mel-scaled filter bank.
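
The chain of conversions in equations (9) through (12) can be written out as a short function. This is a sketch: the function name `amplitude_to_sones` is hypothetical, the default thresholds are the initial values of 0 dB and 120 dB, and equation (12) is read as ten raised to the power L^dB/20, which is an assumption about where the exponent belongs.

```python
import math


def amplitude_to_sones(a, t_h=0.0, t_f=120.0):
    """Map a filter amplitude to approximate loudness in sones.

    a   : filter amplitude (must be positive)
    t_h : threshold of hearing in dB for this filter
    t_f : threshold of feeling in dB for this filter
    """
    a_db = 20.0 * math.log10(a) - 10.0              # equation (9)
    a_eql = 120.0 * (a_db - t_h) / (t_f - t_h)      # equation (10)
    l_db = (a_eql - 30.0) / 4.0                     # equation (11)
    return 10.0 ** (l_db / 20.0)                    # equation (12), assumed form


if __name__ == "__main__":
    for amplitude in (100.0, 1000.0, 10000.0):
        print(amplitude, amplitude_to_sones(amplitude))
```

The returned value is what would be supplied as L to equations (1) and (2) for the corresponding frequency band.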

次の時間フレームを処理する前に、ブロツク１
３３７で、ｎの“次状態”を式(3)に従つて決定す
る。 Before the next time frame is processed, the "next state" of n is determined at block 1337 in accordance with equation (3).

前述の音響プロセツサは、発火率ｆおよび神経
伝達物質量ｎが大きいDCペデスタルを有する場
合の使用についての改善を必要とする。すなわ
ち、ｆおよびｎの式の項のダイナミツクレンジが
重要な場合、下記の式を導いてペデスタルの高さ
を下げる。 The acoustic processor described above can be improved for applications in which the firing rate f and the neurotransmitter amount n have large DC pedestals. That is, where the dynamic range of the terms of the f and n equations is important, the following equations are derived to reduce the pedestal height.

安定状態で、かつ音響波入力信号が存在しない
（Ｌ＝０）場合、式(2)は次のように安定状態の内
部状態n′について解くことができる。 In a stable state and when there is no acoustic wave input signal (L=0), equation (2) can be solved for the stable internal state n' as follows.

n′＝Ａ／（So＋Sh）（13）神経伝達物質の量ｎ（ｔ）の内部状態は、次の
ように安定状態部分および変動部分として示され
る。
$$n' = \frac{A}{S_o + S_h} \qquad (13)$$
The internal state of the neurotransmitter amount n(t) can be expressed as a steady-state portion and a varying portion, as follows.

ｎ（ｔ）＝n′＋n″（ｔ）（14）式（１）および（14）を結合すると、次のよう
に発火率が得られる。
$$n(t) = n' + n''(t) \qquad (14)$$
Combining equations (1) and (14) gives the firing rate as follows.

ｆ（ｔ）＝（So＋Ｄ・Ｌ）（n′＋n″（ｔ））（15） So・n′の項は定数であるが、他のすべての項
は、ｎの変動部分か、または（Ｄ・Ｌ）により表
わされた入力信号を含む。爾後の処理は出力ベク
トル間の差は二乗のみに関連するので、定数項は
無視される。式（15）および（13）から次式が得
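られる。
$$f(t) = (S_o + D L)\,(n' + n''(t)) \qquad (15)$$
The term $S_o n'$ is a constant, while all other terms include either the varying part of n or the input signal represented by (D·L). Since subsequent processing involves only squared differences between output vectors, the constant term can be disregarded. Equations (15) and (13) then give the following.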

f″(t)＝(So+D・L)・〔{n″（ｔ）＋Ｄ・Ｌ・Ａ}/(So
＋Sh）〕（16）式(3)を考慮すると、“次状態”は次のようにな
る。
$$f''(t) = (S_o + D L)\,n''(t) + \frac{D L A}{S_o + S_h} \qquad (16)$$
Considering equation (3), the "next state" becomes the following.

ｎ(t+△t)＝n′(t+△t)＋n″（ｔ＋△ｔ）（17）ｎ(t+△t)＝n″（ｔ）＋Ａ−（So＋Sh ＋Ｄ・Ｌ）・（n′＋n″（ｔ））（18）ｎ(t+△t)＝n″(t)-(Sh・n″(t) −（So＋Ao・L^A）・n″(t)-(Ao・L^A・Ｄ）／（So＋Sh）＋Ao−（So・Ao）＋（Sh・Ao））／（So＋Sh）（19）式（19）はすべての常数項を無視すれば次のよ
うになる。
$$n(t+\Delta t) = n'(t+\Delta t) + n''(t+\Delta t) \qquad (17)$$
$$n(t+\Delta t) = n''(t) + A - (S_o + S_h + D L)\,(n' + n''(t)) \qquad (18)$$
$$n(t+\Delta t) = n''(t) - S_h\,n''(t) - (S_o + A_o L^A)\,n''(t) - \frac{A_o L^A D}{S_o + S_h} + A_o - \frac{(S_o A_o) + (S_h A_o)}{S_o + S_h} \qquad (19)$$
Ignoring all constant terms, equation (19) becomes the following.

n″（ｔ＋△ｔ）＝n″（ｔ）（１−So・△ｔ）−f″（
ｔ）
（20）式（15）および（20）は、それぞれの10ミリ秒
時間フレーム中に各フイルタに適用される出力式
および状態更新式を構成する。これらの式の使用
結果は10ミリ秒ごとの20要素のベクトルであり、
このベクトルの各要素は、メルでスケーリングさ
れたフイルタ・バンクにあけるそれぞれの周波数
バンドの発火率に対応する。
$$n''(t+\Delta t) = n''(t)\,(1 - S_o\,\Delta t) - f''(t) \qquad (20)$$
Equations (15) and (20) constitute the output equation and the state-update equation applied to each filter during each 10 millisecond time frame. The result is a vector of 20 elements every 10 milliseconds, each element corresponding to the firing rate for a respective frequency band in the mel-scaled filter bank.

前述の実施例に関し、第13図の流れ図は、発火
率ｆおよび“次状態”ｎ（ｔ＋△ｔ）の特別の場
合の式をそれぞれ定義する式（11）および（16）
により、ｆ、dn／dtおよびｎ（ｔ＋△ｔ）の式を
置換える以外は当てはまる。 For the embodiment just described, the flow diagram of FIG. 13 still applies, except that the equations for f, dn/dt, and n(t+Δt) are replaced by equations (11) and (16), which define special-case expressions for the firing rate f and the "next state" n(t+Δt), respectively.

それぞれの式の項に特有の値（すなわち、t₀
（5csec、t_L＝3csec、Ao＝１、Ｒ＝1.5およびL_nax
＝20）は他の値に設定することができ、So，Sh
およびＤの項は、他の項が異なつた値に設定され
ると、それぞれの望ましい値0.0888、0.11111、
および0.00666とは異なる値になる。 The values assigned to the terms in the various equations (namely $t_0$ = 5 csec, $t_L$ = 3 csec, Ao = 1, R = 1.5, and $L_{max}$ = 20) can be set differently, in which case So, Sh, and D take values other than their preferred 0.0888, 0.11111, and 0.00666, respectively.

本発明は種々のソフトウエアまたはハードウエ
アにより実施することができる。 The invention can be implemented with a variety of software or hardware.

F3 精密突合せ（第４図、第１５図、第１６図）第４図は一例として音標型の音素マシン２００
０を示す。音標型突合せの各マシンは、確率的に
限定された状態マシンであり、 (a) 複数の状態Si； (b) 複数の遷移tr（Sj→Si）：ある遷移は異なつた
状態間で、ある遷移は同じ状態間で遷移し、各
遷移は対応する確率を有する。； (c) 特定の遷移で生成しうるラベルごとに対応す
る実際のラベル確率を有することを特徴とする。F3 Precision matching (FIGS. 4, 15, and 16) FIG. 4 shows, by way of example, a phonetic-type phoneme machine 2000. Each phonetic-type precision matching machine is a probabilistic finite-state machine characterized by:
(a) a plurality of states Si;
(b) a plurality of transitions tr(Sj→Si), some between different states and some from a state back to itself, each transition having a corresponding probability; and
(c) for each label that can be generated at a particular transition, a corresponding actual label probability.

第４図では、７つの状態S₁〜S₇ならびに13の遷
移tr1〜tr13が精密突合せ音素マシン２０００に
設けられ、その中の３つの遷移tr11，tr12および
tr13のパスは破線で示されている。これらの３つ
の遷移の各々で、音素はラベルを生成せずに１つ
の状態から別の状態に変わることがある。従つ
て、このような遷移はナル遷移と呼ばれる。遷移
tr1〜tr10に沿つて、ラベルを生成することがで
きる。詳細に述べれば、遷移tr1〜tr10の各々に
沿つて少なくとも１つのラベルは、そこに生成さ
れる独特の確率を有することがある。遷移ごと
に、システムで生成することができる各ラベルに
関連した確率がある。すなわち、もし選択的に音
響チヤンネルより生成することができるラベルが
200あれば、（ナルではない）各遷移はそれに関連
した“実際のラベル確率”を200有し、その各々
は、対応するラベルが特定の遷移で音素により生
成される確率に対応する。遷移tr1の実際のラベ
ル確率は、図示のように、記号Ｐと、それに続く
ブラケツトに囲まれた１〜200の列で表わされる。
これらの数字の各々は所与のラベルを表す。ラベ
ル１の場合は、精密突合せ音素マシン２０００が
遷移tr1でラベル１を生成する確率Ｐ〔１〕があ
る。種々の実際のラベル確率は、ラベルおよび対
応する遷移に関連して記憶されている。 In FIG. 4, the precision matching phoneme machine 2000 has seven states S₁ through S₇ and thirteen transitions tr1 through tr13, of which the three transitions tr11, tr12, and tr13 are drawn with dashed lines. At each of these three transitions, the phoneme can change from one state to another without producing a label; such a transition is therefore called a null transition. Labels can be produced along transitions tr1 through tr10. Specifically, along each of the transitions tr1 through tr10, one or more labels may each have a distinct probability of being generated there. For each transition there is a probability associated with every label that the system can generate. That is, if there are 200 labels that can be selectively generated by the acoustic channel, each non-null transition has 200 "actual label probabilities" associated with it, each corresponding to the probability that the respective label is generated by the phoneme at that particular transition. The actual label probabilities for transition tr1 are represented, as shown, by the symbol P followed by the bracketed column of numbers 1 through 200, each number representing a given label. For label 1, there is a probability P[1] that the precision matching phoneme machine 2000 generates label 1 at transition tr1. The various actual label probabilities are stored in association with the label and the corresponding transition.

ラベルy₁y₂y₃‥‥のストリングが、所与の音素
に対応する精密突合せ音素マシン２０００に提示
されると、突合せ手順が実行される。精密突合せ
音素マシンに関連した手順について第１５図によ
り説明する。 When a string of labels y₁y₂y₃… is presented to the precision matching phoneme machine 2000 corresponding to a given phoneme, a matching procedure is performed. The procedure associated with the precision matching phoneme machine is explained with reference to FIG. 15.

第１５図は第４図の音素マシンのトレリス図で
ある。前記音素マシンの場合のように、このトレ
リス図も状態S₁から状態S₇へのナル遷移、状態S₁
から状態S₂への遷移、および状態S₁から状態S₄へ
の遷移を示す。他の状態間の遷移も示されてい
る。また、トレリス図は水平方向に、測定された
時刻を示す。開始時確率q₀，q₁、およびq₂は、音
素がその音素の時刻ｔ＝t₀，ｔ＝t₁またはｔ＝t₂
のそれぞれにおいて開始時刻を有する確率を表
す。各開始時刻におけるそれぞれの遷移も示され
ている。ちなみに、連続する開始（および終了）
時刻の間隔は、ラベルの時間間隔に等しい長さで
あることが望ましい。 FIG. 15 is a trellis diagram of the phoneme machine of FIG. 4. As in the phoneme machine, the trellis diagram shows a null transition from state S₁ to state S₇ and transitions from state S₁ to state S₂ and from state S₁ to state S₄. The transitions between the other states are also shown. The trellis diagram also shows time measured in the horizontal direction. The start-time probabilities q₀, q₁, and q₂ represent the probabilities that the phoneme has its start time at t = t₀, t = t₁, or t = t₂, respectively. The various transitions at each start time are also shown. The interval between successive start (and end) times is preferably equal in length to the time interval of a label.

精密突合せ音素マシン２０００を用いて所与の
音素が到来ストリングのラベルにどれくらいぴつ
たりと突合されるかを決定する際、その音素の終
了時刻分布を探索して、その音素の突合せ値を決
めるのに使用する。精密な突合せを実行するため
終了時刻分布を生成する際、精密突合せ音素マシ
ン２０００は、正確で複雑な計算を必要とする。 In using the precision matching phoneme machine 2000 to determine how closely a given phoneme matches the labels of an incoming string, an end-time distribution for the phoneme is sought and used in determining a match value for the phoneme. In generating the end-time distribution for a precise match, the precision matching phoneme machine 2000 involves computations that are exact and complicated.

最初に、第１５図のトレリス図により、時刻ｔ
＝t₀で開始時刻および終了時刻を得るのに必要な
計算について調べる。第４図に示された音素マシ
ン構造の例の場合は、下記の確率式が当てはま
る。 First, with reference to the trellis diagram of FIG. 15, consider the computations needed for both the start time and the end time to fall at t = t₀. For the example phoneme machine structure shown in FIG. 4, the following probability equation applies.

Pr（S₇，ｔ＝t₀）＝q₀・Ｔ（１→７）＋ Pr（S₂，ｔ＝t₀）・Ｔ（２→７）＋ Pr（S₃，ｔ＝t₀）・Ｔ（３→７）（21）ただし、Prは確率を表し、Ｔは括弧内の２つ
の状態の間の遷移確率を表す。この式は、ｔ＝t₀
で終了時刻になることがある３つの状態のそれぞ
れの確率が、この例では、状態S₇における終了時
刻生起に限定されることを示する。
$$\Pr(S_7, t{=}t_0) = q_0\,T(1 \to 7) + \Pr(S_2, t{=}t_0)\,T(2 \to 7) + \Pr(S_3, t{=}t_0)\,T(3 \to 7) \qquad (21)$$
where Pr denotes a probability and T denotes the transition probability between the two states in parentheses. This equation gives the respective probabilities for the three conditions under which the end time can occur at t = t₀, and shows that, in this example, an end time at t = t₀ is limited to occurrence at state S₇.

次に、終了時刻ｔ＝t₁を調べると、状態S₁以外
のあらゆる状態に関する計算を行わなければなら
ない。状態S₁は前の音素の終了時刻で開始する。
説明の都合上、状態S₄に関する計算だけを示す。 Turning next to the end time t = t₁, a calculation must be made for every state other than state S₁. State S₁ starts at the end time of the previous phoneme. For purposes of explanation, only the calculation for state S₄ is shown.

S₄の場合、計算は次のようになる。 For S₄, the calculation is as follows.

Pr（S₄，ｔ＝t₁）＝Pr（S₁，ｔ＝t₀）・Ｔ（１→４）・Pr（y₁，１→４）＋Pr（S₄，ｔ＝t₀）・T(4→4) ・Pr（y₁，４→４）（22）式（22）は、時刻ｔ＝t₁で音素マシンが状態S₄
である確率は下記の２つの項： (a) 時刻ｔ＝t₀で状態S₁である確率に、状態S₁か
ら状態S₄への遷移確率を乗じ、更に、生成中の
ストリング中の所与のラベルy₁が状態S₁から状
態S₄へ遷移する確率を乗じて得た値と、(b) 時
刻ｔ＝t₀で状態S₄である確率に、状態S₄からそ
れ自身への遷移確率を乗じ、更に、状態S₄から
それ自身に遷移するものとして所与のラベルy₁
を生成する確率を乗じて得た値との和によつて決まることを示す。
$$\Pr(S_4, t{=}t_1) = \Pr(S_1, t{=}t_0)\,T(1 \to 4)\,\Pr(y_1, 1 \to 4) + \Pr(S_4, t{=}t_0)\,T(4 \to 4)\,\Pr(y_1, 4 \to 4) \qquad (22)$$
Equation (22) shows that the probability of the phoneme machine being in state S₄ at time t = t₁ is the sum of two terms: (a) the probability of being in state S₁ at time t = t₀, multiplied by the probability of the transition from state S₁ to state S₄, and further multiplied by the probability that the given label y₁ in the string is generated on the transition from state S₁ to state S₄; and (b) the probability of being in state S₄ at time t = t₀, multiplied by the probability of the transition from state S₄ to itself, and further multiplied by the probability that the given label y₁ is generated on the transition from state S₄ to itself.

同様に、（状態S₁を除く）他の状態に関する計
算も実行され、その音素が時刻ｔ＝t₁で特定の状
態である対応する確率を生成する。一般に、所与
の時刻に対象状態である確率を決定する際、精密
な突合せは、 (a) 対象状態に導く遷移を生じる前の各状態およ
び前記前の各状態のそれぞれの確率を認識し、 (b) 前記前の状態ごとに、そのラベル・ストリン
グに適合するように、前記前の各状態と現在の
状態の間の遷移で生成しなければならないラベ
ルの確率を表す値を認識し、 (c) 前の各状態の確率とラベル確率を表すそれぞ
れの値を組合せて、対応する遷移による対象状
態の確率を与える。 Similarly, calculations are performed for the other states (excluding state S₁) to generate the corresponding probabilities that the phoneme is in a particular state at time t = t₁. In general, in determining the probability of being in a subject state at a given time, the precise match:
(a) identifies each previous state that has a transition leading to the subject state, together with the probability of each such previous state;
(b) identifies, for each such previous state, a value representing the probability of the label that must be generated at the transition between that previous state and the current state in order to conform to the label string; and
(c) combines the probability of each previous state with the respective value representing the label probability to give the subject state probability over the corresponding transition.

対象状態である全体的な確率は、それに導くす
べての遷移による対象状態確率から決定される。
状態S₇に関する計算は、３つのナル遷移に関する
項を含み、その音素が状態S₇で終了する音素によ
り時刻ｔ＝t₁で開始・終了することを可能にす
る。 The overall probability of being in the subject state is determined from the subject state probabilities over all the transitions leading to it. The calculation for state S₇ includes terms for the three null transitions, which allow the phoneme to start and end at time t = t₁ with the phoneme ending in state S₇.
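
The trellis computation described above can be sketched for an arbitrary small machine. This is an illustration under assumptions rather than the disclosed implementation: the function name `end_time_distribution`, the tuple layout used for transitions, the zero-based state numbering, and the requirement that null transitions always lead to a higher-numbered state are all choices made for the example.

```python
def end_time_distribution(num_states, transitions, labels, start_probs):
    """Return Pr(final state, t) for t = 0 .. len(labels).

    transitions : list of (src, dst, prob, emit), where emit maps a label
                  to its probability on that transition, or is None for a
                  null transition (assumed to satisfy src < dst)
    labels      : the incoming label string y1, y2, ...
    start_probs : start_probs[t] is the start-time probability q_t
    """
    emitting = [tr for tr in transitions if tr[3] is not None]
    nulls = sorted((tr for tr in transitions if tr[3] is None),
                   key=lambda tr: tr[0])
    prev = [0.0] * num_states
    ends = []
    for t in range(len(labels) + 1):
        cur = [0.0] * num_states
        if t < len(start_probs):
            cur[0] += start_probs[t]          # phoneme may start at time t
        if t > 0:
            y = labels[t - 1]
            for src, dst, prob, emit in emitting:
                # Same shape as each term of equation (22).
                cur[dst] += prev[src] * prob * emit.get(y, 0.0)
        for src, dst, prob, _ in nulls:
            # Null transitions consume no label, as in equation (21).
            cur[dst] += cur[src] * prob
        ends.append(cur[num_states - 1])
        prev = cur
    return ends
```

For a hypothetical two-state machine with a self-loop (0.3), an emitting transition (0.5), and a null transition (0.2) out of the first state, a single label "a" generated with probability 1 on both emitting transitions, and q₀ = 1, the function returns an end-time probability of 0.2 at t₀ and 0.56 at t₁.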

時刻ｔ＝t₀およびｔ＝t₁に関する確率を決定す
る場合のように、他の終了時刻の組の確率の決定
は、終了時刻分布を形成するように行うことが望
ましい。所与の音素の終了時刻分布の値は、所与
の音素がどれ位良好に到来ラベルに突合されるか
を表示する。 As with the probability determinations for times t = t₀ and t = t₁, probability determinations for the other end times are preferably made so as to form an end-time distribution. The values of the end-time distribution for a given phoneme indicate how well the given phoneme matches the incoming labels.

ワードがどれ位良好に到来ラベルに突合される
かを決定する際、そのワードを表す音素は順次に
処理される。各音素は確率値の終了時刻分布を生
成する。音素の突合せ値は、終了時刻確率を合計
し、その合計の対数をとることにより得られる。
次の音素の開始時刻分布は終了時刻分布を正規化
することにより引出される。この正規化では、例
えば、それらの値の各々を、それらの合計で割る
ことによりスケーリングし、スケーリングされた
値の合計が１になるようにする。 In determining how well a word matches an incoming label, the phonemes representing the word are processed sequentially. Each phoneme generates an end time distribution of probability values. The phoneme matching value is obtained by summing the end time probabilities and taking the logarithm of the sum.
The start time distribution of the next phoneme is derived by normalizing the end time distribution. This normalization involves, for example, scaling each of the values by dividing them by their sum, such that the scaled values sum to one.
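
The three steps in this paragraph (match value, normalization, and summing over a word) are simple enough to state directly in code. The function names below are hypothetical and the natural logarithm is an assumption, since the text does not specify the base.

```python
import math


def phoneme_match_value(end_time_probs):
    """Logarithm of the summed end-time probabilities of one phoneme."""
    return math.log(sum(end_time_probs))


def next_start_distribution(end_time_probs):
    """Scale the end-time distribution so that its values sum to one."""
    total = sum(end_time_probs)
    return [p / total for p in end_time_probs]


def word_match_score(match_values):
    """Sum of the match values of the phonemes processed for a word."""
    return sum(match_values)


if __name__ == "__main__":
    ends = [0.2, 0.56]   # illustrative end-time probabilities
    print(phoneme_match_value(ends), next_start_distribution(ends))
```

The normalized list would serve as the start-time probabilities q₀, q₁, … when the next phoneme of the word is processed.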

所与のワードまたはワード・ストリングの検査
すべき音素数ｈを決定する方法が少なくとも２つ
ある。深さ優先方法では、計算は基本形式に沿つ
て行う（連続する音素の各々により連続して小計
を計算する）。この小計がそれに沿つた所与の音
素位置の所定の限界値以下であると分つた場合、
計算は終了する。もう一つの方法、幅優先方法で
は、各ワードにおける類似の音素位置の計算を行
う。計算は、各ワードの第１の音素の計算、続い
て各ワードの第２の音素の計算というように、順
次に行う。幅優先方法では、それぞれのワードの
同数の音素に沿つた計算値は、相対的に同じ音素
位置で比較する。いずれの方法でも、突合せ値の
最大の和を有するワードが、求めていた目的ワー
ドである。 There are at least two ways of determining the number h of phonemes to be examined for a given word or word string. In the depth-first method, computation proceeds along a basic form, computing a running subtotal with each successive phoneme. When the subtotal is found to be at or below a predefined threshold for a given phoneme position along the basic form, the computation terminates. In the alternative, breadth-first method, a computation is made for similar phoneme positions in each word. The computations proceed in order: the first phoneme of each word, then the second phoneme of each word, and so on. In the breadth-first method, the values computed along the same number of phonemes for the various words are compared at the same relative phoneme positions. In either method, the word having the largest sum of match values is the sought object.

精密な突合せはAPAL（アレイ・プロセツサ・
アセンブリ言語）で実現されている。これは、フ
ローテイング・ポイント・システムズ社
（Floating Point Systems，Inc.）製のアセンブ
ラ190Lである。ちなみに、精密な突合せは、実
際のラベル確率（すなわち、所与の音素が所与の
遷移で所与のラベルとｙを生成する確率）、音素
マシンごとの遷移確率、および所与の音素が所定
の開始時刻後の所与の時刻で所与の状態である確
率の各々を記憶するためにかなりのメモリを必要
とする。前述の190Lは、終了時刻、できれば終
了時刻確率の対数和に基づいた突合せ値、前に生
成された終了時刻確率に基づいた開始時刻、およ
びワード中の順次音素の突合せ値に基づいたワー
ド突合せ得点のそれぞれの計算をするようにセツ
トアツプされる。更に、精密な突合せは、突合せ
手順の末尾確率を計算することが望ましい。末尾
確率はワードとは無関係に連続するラベルの尤度
を測定する。簡単な実施例では、所与の末尾確率
はもう１つのラベルに続くラベルの尤度に対応す
る。この尤度は、例えば、或るサンプル音声によ
り生成されたラベルのストリングから容易に決定
される。 The precise match has been implemented in APAL (Array Processor Assembly Language), the native assembler of the Floating Point Systems, Inc. 190L. The precise match requires considerable memory for storing each of the actual label probabilities (that is, the probability that a given phoneme generates a given label y at a given transition), the transition probabilities for each phoneme machine, and the probabilities of a given phoneme being in a given state at a given time after a defined start time. The 190L is set up to make the various computations of end times, of match values based preferably on the logarithmic sum of end-time probabilities, of start times based on the previously generated end-time probabilities, and of word match scores based on the match values of the sequential phonemes in a word. In addition, the precise match preferably computes tail probabilities for the matching procedure. A tail probability measures the likelihood of successive labels without regard to words. In a simple embodiment, a given tail probability corresponds to the likelihood of a label following another label. This likelihood is readily determined from, for example, strings of labels generated by some sample speech.

それ故、精密な突合せは基本形式、マルコフ・
モデルの統計値、および末尾確率を含むのに十分
な記憶装置を備える。各ワードが約10の音素を含
む5000ワードの語彙の場合、基本形式は5000×10
の記憶量を必要とする。（音素ごとにマルコフ・
モデルを有する）70の別個の音素、200の別個の
ラベル、および任意のラベルが生成される確率を
有する10の遷移がある場合、統計値は70×10×
200の記憶ロケーシヨンを必要とすることになる。
しかしながら、音素マシンは３つの部分（開始部
分、中間部分および終了部分）に分割され、統計
表はそれに対応することが望ましい（３つの自己
ループの１つが各部分に含まれることが望まし
い）。従つて、記憶要求は60×２×200に減少す
る。末尾確率に関しては、200×200の記憶ロケー
シヨンが必要である。この配列では、50Kの整数
および82Kの浮動小数点の記憶装置であれば満足
に動作する。 The precise match is therefore provided with storage sufficient to hold the basic forms, the statistics for the Markov models, and the tail probabilities. For a 5000-word vocabulary in which each word comprises approximately 10 phonemes, the basic forms require 5000 × 10 storage locations. With 70 distinct phonemes (each with its own Markov model), 200 distinct labels, and 10 transitions at which any label has a probability of being produced, the statistics would require 70 × 10 × 200 locations. However, the phoneme machines are preferably divided into three portions (a start portion, a middle portion, and an end portion), with the statistics table corresponding thereto (one of the three self-loops preferably being included in each portion). The storage requirement is thereby reduced to 70 × 3 × 200. For the tail probabilities, 200 × 200 storage locations are needed. In this arrangement, 50K of integer storage and 82K of floating-point storage perform satisfactorily.
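
The storage figures can be checked with a few multiplications. The sketch below is illustrative: the function name `storage_requirements` is hypothetical, and the reduced statistics table is taken to be 70 phonemes × 3 portions × 200 labels, an assumption chosen because, together with the 200 × 200 tail probabilities, it is the reading that reproduces the stated 82K floating-point total.

```python
def storage_requirements(words=5000, phonemes_per_word=10, phoneme_types=70,
                         transitions=10, portions=3, labels=200):
    """Return storage counts (in locations) for the precise match."""
    basic_forms = words * phonemes_per_word            # integer storage
    full_stats = phoneme_types * transitions * labels  # one table per transition
    reduced_stats = phoneme_types * portions * labels  # one table per portion
    tail_probs = labels * labels
    return {
        "basic_forms": basic_forms,
        "full_stats": full_stats,
        "reduced_stats": reduced_stats,
        "tail_probs": tail_probs,
        "floating_point_total": reduced_stats + tail_probs,
    }


if __name__ == "__main__":
    for name, count in storage_requirements().items():
        print(name, count)
```

Tying the statistics to three portions instead of ten transitions cuts the label-probability table from 140,000 to 42,000 entries in this sketch.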

以上の説明は、第４図に示すような音標型音素
マシンのシーケンスを含む音標基本形式に関する
ものである。 The foregoing description concerns phonetic basic forms, each comprising a sequence of phonetic-type phoneme machines such as the one shown in FIG. 4.

However, fenemic base forms may also be used in a precise match similar to the one outlined above. FIG. 16 shows a lattice based on fenemic phoneme machines (an example of which is shown in FIG. 19). The figure shows that, at any given time, any of three transitions can occur. The null transition (shown as a dashed line) moves from one state to another without producing a label. The second transition is a self-loop from a state back to itself, during which a label can be produced. The third transition moves from one state to another and can produce a label in doing so.

As suggested earlier, the fast match is optional (although it is shown in FIG. 2). The following description relates to an environment that includes the fast match, which reduces the number of words examined by the precise match. If desired, however, the fast match can be omitted, in which case every word is processed by the precise match.

F4 Phoneme tree structure (FIG. 17)

Once the phoneme match values have been determined, they are compared along the branches of the tree structure 4100, as shown in FIG. 17, to determine which paths of phonemes are most probable. In FIG. 17, the sum of the phoneme match values for the phonemes DH and UH1 of the spoken word "the" (leading from point 4102 to branch 4104) should be much higher than the sum for any of the phoneme sequences branching from phoneme MX. Note that the phoneme match value of the first phoneme MX is computed only once and is then used for every base form extending from it (see branches 4104 and 4106). Moreover, when the total score computed along a first sequence of branches is found to be much lower than a threshold, or much lower than the total scores of other sequences of branches, all base forms extending from that first sequence can be eliminated from the candidate words at once. For example, the base forms associated with branches 4108 through 4118 are discarded together once it is determined that MX is not a likely path. With the fast match embodiment and the tree structure, an ordered list of candidate words is produced with a large saving in computation.
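The sharing and pruning described above can be sketched as a walk over a prefix tree. This is a minimal illustration, not the patented procedure: it assumes one fixed log-domain match value per phoneme (in the actual match the value depends on the position in the label string), and the names `build_tree`, `candidates`, and the phonemes other than DH, UH1, and MX are hypothetical.

```python
def build_tree(base_forms):
    """Build a nested-dict prefix tree; the key "#" holds the word ending at a node."""
    root = {}
    for word, phonemes in base_forms.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})
        node["#"] = word
    return root

def candidates(tree, match_value, threshold):
    """Return (word, score) pairs, best first, pruning weak branches as a whole."""
    found = []
    stack = [(tree, 0.0)]
    while stack:
        node, score = stack.pop()
        for key, child in node.items():
            if key == "#":
                found.append((child, score))
                continue
            s = score + match_value[key]   # added once for every word below this branch
            if s >= threshold:             # otherwise the whole subtree is dropped
                stack.append((child, s))
    return sorted(found, key=lambda ws: ws[1], reverse=True)

base_forms = {"the": ["DH", "UH1"], "then": ["DH", "EH", "N"],
              "me": ["MX", "IY"], "my": ["MX", "AY"]}
match_value = {"DH": -1.0, "UH1": -0.5, "EH": -3.0, "N": -1.0,
               "MX": -9.0, "IY": -0.5, "AY": -0.5}
ranked = candidates(build_tree(base_forms), match_value, threshold=-6.0)
print(ranked)
```

Because MX scores below the threshold, both words that begin with it are removed in a single step, which is the saving the text describes.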

As for storage requirements, the phoneme tree structure, the phoneme statistics, and the tail probabilities must be stored. For the tree structure, there are 25000 arcs and four data words characterizing each arc. The first data word is an index to the successor arcs or phonemes. The second data word gives the number of successor phonemes along the branch. The third data word indicates the node of the tree at which the arc is located. The fourth data word identifies the current phoneme. The tree structure therefore requires 25000×4 storage locations. In the fast match there are 100 distinct phonemes and 200 distinct fenemes. Since each feneme has a single probability of being produced anywhere in a phoneme, 100×200 storage locations are required for the statistical probabilities. For the tail structure, 200×200 storage locations are required. For the fast match, storage for 100K integers and 60K floating-point values is sufficient.
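The four data words per arc and the resulting totals can be written down as follows. This is an illustrative sketch; the record name `Arc` and its field names are assumptions, and only the counts come from the text.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    successor_index: int   # first data word: index of the successor arc (phoneme)
    n_successors: int      # second data word: successor phonemes along the branch
    node: int              # third data word: tree node at which the arc is located
    phoneme: int           # fourth data word: the current phoneme

N_ARCS = 25_000
tree_integers = N_ARCS * len(Arc._fields)   # 25000 x 4 integer locations
fast_match_stats = 100 * 200                # phonemes x fenemes
tail_table = 200 * 200                      # tail structure
floating_point = fast_match_stats + tail_table

print(tree_integers, floating_point)
```

The two totals agree with the 100K integer and 60K floating-point figures quoted above.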

F5 Language model (FIG. 2)

As mentioned above, the probability of selecting the correct word can be increased by including a language model that stores information (such as trigrams) about words in context. Language models are described in the paper cited above.

The language model 1010 preferably has a distinctive character. Specifically, a modified trigram method is used. In accordance with the invention, a sample text is examined to determine the likelihood of each ordered word triple, each ordered word pair, and each single word in the vocabulary. Lists of the most probable word triples and word pairs are then formed. In addition, the likelihood of a triple not being in the triple list and the likelihood of a pair not being in the pair list are each determined.

According to the language model, when a subject word follows two words, it is determined whether the subject word and the two preceding words are in the triple list. If they are, the stored probability assigned to that triple is used. If they are not, it is determined whether the subject word and the immediately preceding word are in the pair list. If they are, the probability of that pair is multiplied by the probability of a triple not being in the triple list, and the product is assigned to the subject word. If neither the triple nor the pair containing the subject word is in its respective list, the probability of the subject word alone is multiplied by the probability of a triple not being in the triple list and by the probability of a pair not being in the pair list, and the product is assigned to the subject word.
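The three-level lookup just described can be sketched as a single function. The function name, the dictionary-based lists, and the numeric values are assumptions chosen for illustration; the order of the tests and the products follow the text.

```python
def word_probability(w1, w2, w3, triples, pairs, singles,
                     p_triple_unlisted, p_pair_unlisted):
    """Probability assigned to subject word w3 when it follows w1, w2."""
    if (w1, w2, w3) in triples:
        return triples[(w1, w2, w3)]
    if (w2, w3) in pairs:
        return pairs[(w2, w3)] * p_triple_unlisted
    return singles[w3] * p_triple_unlisted * p_pair_unlisted

# Hypothetical lists built from some sample text.
triples = {("to", "be", "or"): 0.2}
pairs = {("be", "not"): 0.1}
singles = {"or": 0.05, "not": 0.04, "cat": 0.01}

p_listed = word_probability("to", "be", "or", triples, pairs, singles, 0.3, 0.5)
p_pair = word_probability("to", "be", "not", triples, pairs, singles, 0.3, 0.5)
p_single = word_probability("to", "be", "cat", triples, pairs, singles, 0.3, 0.5)
print(p_listed, p_pair, p_single)
```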

F6 Training with approximations (FIG. 18)

The flowchart of FIG. 18 shows the training of the phoneme machines used in acoustic matching. In block 5002, a vocabulary of words (typically on the order of 5000 words) is defined. In block 5004, each word is represented by a sequence of phoneme machines. The phoneme machines are shown, by way of example, as phonetic-type phoneme machines, but a word may alternatively be represented by a sequence of fenemic phonemes. The representation of words by sequences of phonetic-type or fenemic-type phoneme machines is described below. The phoneme machine sequence for a word is called a word base form.

In block 5006, the word base forms are arranged in the tree structure described above. The statistics for each phoneme machine in each word base form are determined by training with the well-known forward-backward algorithm described in F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, 1976, pp. 532-556 (block 5008). In block 5009, the values used in the precise match are stored. In block 5010, approximations corresponding to the fast match procedure are applied to the respective models. The approximations may involve replacing actual statistics with approximate statistics and/or limiting the number of labels examined in the match.

The approximate parameter values used in the fast match are set in block 5012. At this point, each phoneme machine of each word base form has been trained with the desired approximations. Phoneme machines for the precise match are also formed. Acoustic matching can be performed with the precise match alone or together with the fast match. The phonemes of each word base form are examined along the paths of the tree structure.

F7 Extending word paths with words selected by acoustic matching (FIGS. 5 to 7 and FIG. 19)

A preferred stack decoding method for use in the speech recognition system of FIG. 2 is described next.

FIGS. 5 and 6 show a plurality of successive labels $y_1, \ldots$ generated at successive "label intervals," or "label positions."

FIG. 6 also shows a plurality of generated word paths, namely path A, path B, and path C. In the context of FIG. 5, path A might correspond to the entry "to be or", path B to the entry "two b", and path C to the entry "too". For a subject word path, there is a label (or, equivalently, a label interval) at which that word path has the highest probability of having ended. Such a label is called a "boundary label."

For a word path W representing a sequence of words, the most probable end time (represented in the label string as a "boundary label" between two words) can be found by known methods such as that described in L. R. Bahl et al., "Faster Acoustic Match Computation," IBM Technical Disclosure Bulletin, Vol. 23, No. 4, September 1980. Briefly, that paper describes methods of addressing two important questions:
(a) how much of a label string Y is accounted for by a word (or word sequence), and
(b) at which label interval a partial sentence (corresponding to a part of the label string) ends.

For any given word path, there is a "likelihood value" associated with each label, or label interval, from the first label of the label string up to and including the boundary label. Taken together, all of the likelihood values for a given word path constitute the "likelihood vector" for that word path. Accordingly, each word path has a corresponding likelihood vector. The likelihood values $L_t$ are shown in FIG. 6.

The "likelihood envelope" $\Lambda_t$ at label interval $t$ for a collection of word paths $W^1, W^2, \ldots, W^s$ is defined mathematically as

$\Lambda_t = \max\bigl(L_t(W^1), \ldots, L_t(W^s)\bigr)$

That is, for each label interval, the likelihood envelope contains the highest likelihood value associated with any word path in the collection. A likelihood envelope 8040 is shown in FIG. 6.
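The definition amounts to an element-wise maximum over the likelihood vectors. The sketch below assumes that each vector is a list of log-likelihoods running from the first label interval to that path's boundary label, and that intervals not reached by any path stay at negative infinity; the function name is illustrative.

```python
import math

def likelihood_envelope(vectors, n_intervals):
    """Element-wise maximum of the likelihood vectors of a set of word paths."""
    envelope = [-math.inf] * n_intervals
    for vector in vectors:
        for t, value in enumerate(vector):
            envelope[t] = max(envelope[t], value)
    return envelope

# Two hypothetical word paths with boundary labels at intervals 2 and 1.
envelope = likelihood_envelope([[-1.0, -3.0, -5.0], [-2.0, -2.0]], n_intervals=4)
print(envelope)
```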

A word path is considered "complete" if it corresponds to a complete sentence. A complete path is preferably identified by the speaker providing an input, for example by pressing a button, on reaching the end of a sentence. The input is synchronized with the label interval that marks the sentence end. A complete word path cannot be extended by appending words to it. A partial word path corresponds to an incomplete sentence and can be extended.

Partial paths are classified as "live" or "dead". A word path is "dead" if it has already been extended and "live" if it has not. With this classification, a path that has already been extended to form at least one longer word path is never considered for extension again at a later time.

Each word path can also be characterized as "good" or "bad" relative to the likelihood envelope. A word path is good if, at the label corresponding to its boundary label, it has a likelihood value that lies within the maximum likelihood envelope. Otherwise, the word path is bad. Preferably, though not necessarily, each value of the maximum likelihood envelope is reduced by a fixed amount to serve as the good/bad threshold level.

There is a stack element for each label interval. Each live word path is assigned to the stack element for the label interval corresponding to its boundary label. A stack element may have zero, one, or more word path entries (listed in order of likelihood value).

The steps performed by the stack decoder 1002 of FIG. 2 are described next.

Forming the likelihood envelope and determining which word paths are good are interrelated, as shown in the flowchart of the stack decoding technique in FIG. 7.

In the flowchart of FIG. 7, a null path is first entered into the first stack (0) in block 8050. In block 8052, a (complete) stack element containing the previously determined complete paths, if any, is provided. Each complete path in the (complete) stack element has a likelihood vector associated with it. The likelihood vector of the complete path having the highest likelihood at its boundary label initially defines the maximum likelihood envelope. If the (complete) stack element contains no complete path, the maximum likelihood envelope is initialized to −∞ at each label interval. The maximum likelihood envelope may also be initialized to −∞ when no complete path is specified. The envelope is initialized in blocks 8054 and 8056.

After it has been initialized, the maximum likelihood envelope is reduced by a predetermined amount Δ, forming a Δ-good region above the reduced likelihoods and a Δ-bad region below them. The larger Δ is, the larger the number of word paths considered eligible for extension. When $\log_{10}$ is used in determining $L_t$, a Δ value of 2 gives satisfactory results. It is preferable, though not necessary, that the value of Δ be uniform along the length of the label intervals.

If a word path has a likelihood at its boundary label that lies within the Δ-good region, the word path is marked "good". Otherwise, it is marked "bad".

As shown in FIG. 7, the loop that updates the likelihood envelope and marks word paths as "good" (eligible for extension) or "bad" begins at block 8058, which searches for the longest unmarked word path. If two or more unmarked word paths are in the stack corresponding to the longest word path length, the one with the highest likelihood at its boundary label is selected. When a word path is found, block 8060 checks whether its likelihood at the boundary label lies within the Δ-good region. If it does not, the path is marked as being in the Δ-bad region in block 8062, and block 8058 searches for the next unmarked live path. If it does, the path is marked as being in the Δ-good region in block 8064, and in block 8066 the likelihood envelope is updated to include the likelihood values of the path marked "good". That is, for each label interval, the updated likelihood value is determined as the greater of
(a) the current likelihood value in the likelihood envelope, and
(b) the likelihood value associated with the word path marked "good".
This operation is performed in blocks 8064 and 8066. After the envelope has been updated, control returns to block 8058 to search again for the longest, best unmarked live word path.
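The marking loop of blocks 8058 to 8066 can be sketched as below. This is a simplified reading under stated assumptions: path length is measured by the boundary label, each likelihood vector ends at the boundary label, and the class `WordPath` and function `mark_paths` are hypothetical names.

```python
import math
from dataclasses import dataclass

@dataclass
class WordPath:
    words: tuple
    vector: list          # log-likelihood at each label interval up to the boundary
    good: bool = False

    @property
    def boundary(self):
        return len(self.vector) - 1

def mark_paths(live_paths, envelope, delta=2.0):
    """Mark each live path good or bad and return the updated envelope."""
    env = list(envelope)
    # Longest path first; ties broken by the higher likelihood at the boundary.
    order = sorted(live_paths, key=lambda p: (p.boundary, p.vector[-1]), reverse=True)
    for path in order:
        b = path.boundary
        path.good = path.vector[b] >= env[b] - delta    # inside the delta-good region?
        if path.good:
            for t, value in enumerate(path.vector):     # fold the path into the envelope
                env[t] = max(env[t], value)
    return env

a = WordPath(("to", "be"), [-1.0, -2.0, -3.0])
b = WordPath(("two",), [-1.0, -6.0])
c = WordPath(("too",), [-2.0, -3.5])
env = mark_paths([a, b, c], [-math.inf] * 3)
print(env, a.good, b.good, c.good)
```

With the envelope initialized to negative infinity, the first path examined is always good, which matches the initialization described for the case with no complete path.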

This loop is repeated until no unmarked word paths remain. When none remain, the shortest word path marked "good" is selected in block 8070. If two or more "good" word paths have the shortest length, the one with the highest likelihood at its boundary label is selected in block 8072, and the selected shortest path is extended. That is, at least one likely follower word is determined, as described above, preferably by performing the fast match, language model, precise match, and language model procedures. For each likely follower word, an extended word path is formed. Specifically, an extended word path is formed by appending a likely follower word to the end of the selected shortest word path.

After the selected shortest word path has been used to form the extended word paths, it is removed from the stack in which it was an entry, and each extended word path is inserted into the appropriate stack in its place. In particular, each extended word path becomes an entry in the stack corresponding to its boundary label (block 8072).
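The selection and extension step can be sketched as follows. The stacks are modelled as a dictionary from boundary label to a list of (likelihood, words) entries, and `propose` is a hypothetical stand-in for the fast match, language model, and precise match sequence that yields likely follower words with their new boundary labels and likelihoods.

```python
def extend_shortest(stacks, good, propose):
    """Extend the shortest good path; return the path that was extended, or None."""
    for boundary in sorted(stacks):                 # shortest boundary label first
        entries = [e for e in stacks[boundary] if e[1] in good]
        if entries:
            break
    else:
        return None                                 # no good path left to extend
    best = max(entries)                             # highest likelihood at the boundary
    stacks[boundary].remove(best)                   # the extended path leaves its stack
    for word, new_boundary, new_likelihood in propose(best[1]):
        stack = stacks.setdefault(new_boundary, [])
        stack.append((new_likelihood, best[1] + (word,)))
        stack.sort(reverse=True)                    # keep entries in likelihood order
    return best[1]

stacks = {3: [(-4.0, ("to",)), (-5.0, ("two",))], 7: [(-9.0, ("to", "be", "or"))]}
good = {("to",), ("to", "be", "or")}
extended = extend_shortest(
    stacks, good, lambda words: [("be", 6, -7.0), ("bee", 6, -8.5)])
print(extended, stacks)
```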

The operation of extending the selected path in block 8072 is described with reference to the flowchart of FIG. 19.

In block 6000 of FIG. 19, the acoustic processor 1004 (of FIG. 2) generates a string of labels. The label string is supplied as input to block 6002, where one of the basic or improved approximate matching procedures is performed to obtain an ordered list of candidate words. The language model is then applied in block 6004 as described above. After the language model has been applied, the remaining subject words are passed, together with the generated labels, to the precise match processor in block 6006. The precise match yields a list of remaining candidate words, which is preferably submitted to the language model in block 6008. The likely words (as determined by the approximate match, the precise match, and the language model) are used to extend the path found in block 8070 of FIG. 7. Each of the likely words determined in block 6008 (FIG. 19) is appended separately to the found word path, so that a plurality of extended word paths can be formed (block 6010).

In FIG. 7, after the extended paths have been formed and the stacks re-formed, the process returns to block 8052 and is repeated.

Thus, at each iteration, the shortest, best "good" word path is selected and extended. A word path marked "bad" in one iteration may become "good" in a later iteration. The characterization of a live word path as "good" or "bad" is therefore made independently at each iteration. In practice, the likelihood envelope does not change greatly from one iteration to the next, so the computation that decides whether a word path is good or bad is performed efficiently. Moreover, normalization is not required.

When complete sentences are identified, block 8074 is preferably included. That is, decoding ends when no live word paths remain unmarked and there are no "good" word paths to extend. The complete word path having the highest likelihood at its boundary label is identified as the most likely word sequence for the input label string.

In the case of continuous speech in which sentence ends are not identified, path extension proceeds continuously or up to a predetermined number of words, as desired by the user of the system.

F8 Markov models constructed from multiple utterances of a word (FIGS. 1A, 1B, 1C, and FIGS. 20 to 26)

The flowchart made up of FIGS. 1A and 1B (whose arrangement is shown in FIG. 1C) outlines the steps for constructing a basic base form. A "base form" is a sequence of phoneme machines representing a word segment (preferably a word) found in the vocabulary of the speech recognition system. A word segment is preferably a dictionary word, but may also be a predetermined part of a dictionary word, such as a syllable.

In the first step (block 9000) of the embodiment of the invention shown in FIGS. 1A and 1B, each utterance of a word segment is converted into a string of fenemes (i.e., labels). As described above, an acoustic processor typically generates a string of fenemes in response to an utterance of a word segment. For each utterance there is a corresponding feneme string.

FIG. 20 shows N feneme strings $FS_1$ to $FS_N$, each generated in response to a corresponding utterance of a given word segment. Each block represents a feneme in a string. In each string, the fenemes are identified as fenemes 1 through $l_i$.

In accordance with the invention, a set of phoneme machines (i.e., Markov models) is formed. Each phoneme machine is characterized by: at least two states; transitions, each extending from a state to a state; a probability associated with each transition; and, for at least some transitions, a plurality of output probabilities, each corresponding to the likelihood of producing a given feneme at the particular transition. FIG. 21 shows a simple sample fenemic phoneme machine 9002.

Phoneme machine 9002 has states $S_1$ and $S_2$. One transition $t_1$ leaves state $S_1$ and returns to it, with probability $P_{t_1}(S_1 \mid S_1)$. For transition $t_1$ there is a respective probability associated with producing each of the fenemes $f_1$ to $f_n$ at that transition. Similarly, transition $t_2$ between states $S_1$ and $S_2$ has
(a) an associated probability $P_{t_2}(S_2 \mid S_1)$, and
(b) a respective probability of producing each of the fenemes $f_1$ to $f_n$.
The null transition $t_3$ represents a transition at which no output, i.e., no feneme, is produced, and has an associated probability $P_{t_3}(S_2 \mid S_1)$. Phoneme machine 9002 thus allows any number of fenemes to be produced (as when transition $t_1$ is repeated), and produces no feneme when transition $t_3$ is followed.
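For a machine of this shape, the probability of producing a given feneme string can be computed with a short forward recursion. The sketch below assumes the string must be produced exactly on the way from $S_1$ to $S_2$; the class `PhonemeMachine`, its field names, and the numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PhonemeMachine:
    p_t1: float      # probability of the self-loop t1 (S1 to S1)
    p_t2: float      # probability of t2 (S1 to S2, produces a feneme)
    p_t3: float      # probability of the null transition t3 (S1 to S2, no feneme)
    out_t1: dict     # feneme -> output probability at t1
    out_t2: dict     # feneme -> output probability at t2

def string_probability(fenemes, m):
    """Probability that machine m produces exactly `fenemes` and ends in S2."""
    in_s1, in_s2 = 1.0, 0.0
    for f in fenemes:
        in_s2 = in_s1 * m.p_t2 * m.out_t2[f]   # leave S1 while producing f
        in_s1 = in_s1 * m.p_t1 * m.out_t1[f]   # stay in S1 while producing f
    return in_s2 + in_s1 * m.p_t3              # or leave afterwards by the null transition

machine = PhonemeMachine(0.5, 0.3, 0.2,
                         {"f1": 0.6, "f2": 0.4}, {"f1": 0.1, "f2": 0.9})
p_empty = string_probability([], machine)
p_one = string_probability(["f1"], machine)
p_two = string_probability(["f1", "f2"], machine)
print(p_empty, p_one, p_two)
```

The empty string is produced only by following the null transition, and longer strings are produced by repeating the self-loop, as the paragraph above states.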

Each phoneme machine has different probabilities, i.e., statistics, associated with it. It is preferable, though not necessary, that the phoneme machines of a set have the same structure and differ only in their statistics. The statistics are generally determined during a training session.

Once the set of phoneme machines has been formed, it is determined which phoneme machine provides the best base form of length 1 when applied to all of the feneme strings generated by utterances of the given word segment (block 9004). The best base form of phoneme length 1 ($P_1$) is found by examining each phoneme machine in the set and determining, for each phoneme, the probability of its producing each of the feneme strings $FS_1$ to $FS_N$. The N probabilities obtained for a particular phoneme machine are multiplied together to give the joint probability assigned to that machine. The phoneme machine yielding the highest joint probability is selected as the best base form of length 1, $P_1$.
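The selection in block 9004 can be sketched as a product over strings followed by an arg-max. The example assumes two-state machines like the sample machine 9002, stored as dictionaries; the function names, machine names "A" and "B", and all numeric values are illustrative.

```python
import math

def string_probability(fenemes, m):
    """Probability that a two-state machine m produces exactly `fenemes`."""
    s1, s2 = 1.0, 0.0
    for f in fenemes:
        s2 = s1 * m["p_t2"] * m["out_t2"][f]   # leave S1 on this feneme
        s1 = s1 * m["p_t1"] * m["out_t1"][f]   # stay in S1 (self-loop)
    return s2 + s1 * m["p_t3"]                 # or leave by the null transition

def best_single_phoneme(machines, strings):
    """Return (name, joint probability) of the best base form of length 1."""
    scored = {
        name: math.prod(string_probability(s, m) for s in strings)
        for name, m in machines.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]

def make(out):   # hypothetical helper: same output table on both transitions
    return {"p_t1": 0.5, "p_t2": 0.3, "p_t3": 0.2, "out_t1": out, "out_t2": out}

machines = {"A": make({"f1": 0.9, "f2": 0.1}), "B": make({"f1": 0.1, "f2": 0.9})}
strings = [["f1", "f1"], ["f1"]]               # N = 2 utterances
name, joint = best_single_phoneme(machines, strings)
print(name, joint)
```

For long strings or many utterances, a sum of logarithms would avoid numerical underflow; the plain product is kept here to mirror the text.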

Keeping phoneme $P_1$, the best base form of length 2 having the form $P_1P_2$ or $P_2P_1$ is then sought. That is, each phoneme of the set is appended to the end of $P_1$ to form a respective ordered phoneme pair, and each phoneme is also appended to the front of $P_1$ to form a respective ordered phoneme pair. The joint probability of each ordered pair is then computed. The ordered phoneme pair yielding the highest joint probability of producing the feneme strings is taken to be the best base form of length 2 (block 9006).
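The search in block 9006 can be sketched by scoring every ordered pair that contains $P_1$. The example assumes that the end state of one machine is the start state of the next, and uses a forward recursion over the concatenated machines; function names, machine names, and numeric values are hypothetical.

```python
import math

def sequence_probability(fenemes, seq):
    """Probability that a sequence of two-state machines produces `fenemes`."""
    k_last = len(seq)

    def follow_nulls(alpha):
        for k, m in enumerate(seq):            # null transitions consume no feneme
            alpha[k + 1] += alpha[k] * m["p_t3"]

    alpha = [1.0] + [0.0] * k_last             # alpha[k]: at the start of machine k
    follow_nulls(alpha)
    for f in fenemes:
        new = [0.0] * (k_last + 1)
        for k, m in enumerate(seq):
            new[k] += alpha[k] * m["p_t1"] * m["out_t1"][f]       # self-loop
            new[k + 1] += alpha[k] * m["p_t2"] * m["out_t2"][f]   # move on
        follow_nulls(new)
        alpha = new
    return alpha[k_last]

def best_pair(p1, machines, strings):
    """Best ordered pair of the form (P1, X) or (X, P1) and its joint probability."""
    pairs = [(p1, x) for x in machines] + [(x, p1) for x in machines]
    def joint(pair):
        seq = [machines[name] for name in pair]
        return math.prod(sequence_probability(s, seq) for s in strings)
    best = max(pairs, key=joint)
    return best, joint(best)

def make(out):   # hypothetical helper: same output table on both transitions
    return {"p_t1": 0.5, "p_t2": 0.3, "p_t3": 0.2, "out_t1": out, "out_t2": out}

machines = {"A": make({"f1": 0.9, "f2": 0.1}), "B": make({"f1": 0.1, "f2": 0.9})}
strings = [["f1", "f1", "f2", "f2"], ["f1", "f2"]]
pair, joint_probability = best_pair("A", machines, strings)
print(pair, joint_probability)
```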

Next, the best base form of length 2, i.e., the ordered pair of highest joint probability, is aligned against each feneme string, for example by the well-known Viterbi alignment (block 9008). Briefly, the alignment indicates which fenemes in each string correspond to each phoneme of the ordered pair. (At this point a phoneme is represented by a phoneme machine, so the terms "phoneme" and "phoneme machine" refer to corresponding entities.)

Following the alignment, a match point is found in each of the feneme strings $FS_1$ to $FS_N$. For each string, the match point is defined as the point at which phonemes $P_1$ and $P_2$ (of the best base form of length 2) most probably meet. Viewed another way, the match point is the point that divides each of the feneme strings $FS_1$ to $FS_N$ into a left part and a right part, such that the left parts of all the feneme strings represent a common sound segment and the right parts of all the feneme strings likewise represent a common sound segment (block 9010). Each left part is regarded as a left substring and each right part as a right substring (block 9012).
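For a pair of two-state machines, the match point given by a Viterbi alignment can be found by trying every split and keeping the one whose best single path is most probable. The sketch below uses that exhaustive form for clarity, under the assumption that the machines have the shape of sample machine 9002; the function names and numeric values are illustrative.

```python
def best_path_probability(fenemes, m):
    """Probability of the single most probable path through m producing `fenemes`."""
    s1, s2 = 1.0, 0.0
    for f in fenemes:
        s2 = s1 * m["p_t2"] * m["out_t2"][f]
        s1 = s1 * m["p_t1"] * m["out_t1"][f]
    return max(s2, s1 * m["p_t3"])

def match_point(fenemes, left, right):
    """Index at which the two phonemes most probably meet in this string."""
    return max(
        range(len(fenemes) + 1),
        key=lambda c: best_path_probability(fenemes[:c], left)
        * best_path_probability(fenemes[c:], right),
    )

def split_strings(strings, left, right):
    """Divide every string at its match point into left and right substrings."""
    cuts = [match_point(s, left, right) for s in strings]
    return ([s[:c] for s, c in zip(strings, cuts)],
            [s[c:] for s, c in zip(strings, cuts)])

def make(out):   # hypothetical helper: same output table on both transitions
    return {"p_t1": 0.5, "p_t2": 0.3, "p_t3": 0.2, "out_t1": out, "out_t2": out}

p1, p2 = make({"f1": 0.9, "f2": 0.1}), make({"f1": 0.1, "f2": 0.9})
strings = [["f1", "f1", "f2", "f2"], ["f1", "f2", "f2"]]
lefts, rights = split_strings(strings, p1, p2)
print(lefts, rights)
```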

The left substrings and the right substrings are then treated in a similar manner, but separately, under a scheme that processes the two sets independently.

For the left substrings, the best single-phoneme basic form P_L, i.e., the one with the highest joint probability, is found (block 9014). Keeping the phoneme P_L, ordered pairs are formed by appending each phoneme in the set before it, and further ordered pairs are formed by appending each phoneme in the set after it. The ordered pair P_L P_A or P_A P_L with the highest joint probability of producing the fenemes in the left substrings is then found (block 9016). As before, this pair is regarded as the best basic form of length 2 for the left substrings.

The joint probability of the best basic form of length 2 for the left substrings is compared with the joint probability of P_L alone (block 9018 in FIG. 1B). If the joint probability of P_L is greater, the phoneme P_L is placed in the concatenated basic form (block 9020). If the joint probability of P_L is smaller, P_L P_A or P_A P_L is aligned against the left substrings (block 9022). A matching point is located in each of the left substrings, and each left substring is split at that point into a (new) left portion and a (new) right portion (block 9024).

The same procedure is applied to each right substring of the initially split feneme strings FS₁ to FS_N. The best single-phoneme basic form P_R (from block 9030) is compared in block 9032 with the best basic form of length 2, P_R P_B or P_B P_R, found in block 9034. If the joint probability of P_R is greater, the phoneme P_R is placed in the concatenated basic form (block 9020). If it is smaller, the two-phoneme basic form is aligned and each right substring is split at its matching point into a left portion and a right portion (block 9036).

The splitting cycle is repeated for every substring whose best basic form of length 2 has a higher joint probability than its best single-phoneme basic form. That is, a substring is split into two portions, and either or both of these may (after alignment) be split again into new substrings, and so on, until only single-phoneme basic forms remain.
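The splitting cycle can be sketched as a short recursive routine. The following is a minimal illustration under simplifying assumptions, not the PL/I implementation mentioned later in the text: the emission table `EMIT`, the fixed per-phoneme cost `PHONEME_COST`, and every function name are hypothetical, and each phoneme is modelled as a bag of independent feneme emissions rather than as a Markov phoneme machine.

```python
import math

# Hypothetical toy statistics: probability of each feneme label under each phoneme.
EMIT = {
    "A": {"x": 0.8, "y": 0.1, "z": 0.1},
    "B": {"x": 0.1, "y": 0.8, "z": 0.1},
    "C": {"x": 0.1, "y": 0.1, "z": 0.8},
}
# Assumed fixed cost per phoneme, standing in for transition probabilities.
PHONEME_COST = math.log(0.05)


def section_logp(phoneme, section):
    """Log probability that `phoneme` produces `section` (possibly empty)."""
    return PHONEME_COST + sum(math.log(EMIT[phoneme][f]) for f in section)


def single_logp(phoneme, strings):
    """Joint log probability (sum over utterances) of a one-phoneme basic form."""
    return sum(section_logp(phoneme, s) for s in strings)


def best_cut(pair, s):
    """Most probable point at which the two phonemes of `pair` meet in `s`."""
    return max(
        range(len(s) + 1),
        key=lambda c: section_logp(pair[0], s[:c]) + section_logp(pair[1], s[c:]),
    )


def pair_logp(pair, strings):
    """Joint log probability of a two-phoneme basic form, each string cut optimally."""
    total = 0.0
    for s in strings:
        c = best_cut(pair, s)
        total += section_logp(pair[0], s[:c]) + section_logp(pair[1], s[c:])
    return total


def grow_basic_form(strings):
    """Split recursively until one phoneme beats every two-phoneme extension."""
    best_one = max(EMIT, key=lambda p: single_logp(p, strings))
    # Keep the best single phoneme and try every phoneme after it and before it.
    pairs = [(best_one, q) for q in EMIT] + [(q, best_one) for q in EMIT]
    best_two = max(pairs, key=lambda pr: pair_logp(pr, strings))
    if single_logp(best_one, strings) >= pair_logp(best_two, strings):
        return [best_one]  # splitting stops; the phoneme is placed
    cuts = [best_cut(best_two, s) for s in strings]
    left = [s[:c] for s, c in zip(strings, cuts)]
    right = [s[c:] for s, c in zip(strings, cuts)]
    # Concatenate the results in the same order as the substrings.
    return grow_basic_form(left) + grow_basic_form(right)


print(grow_basic_form(["xxyy", "xyy", "xxxy"]))
```

In this toy model the per-phoneme cost is what makes a single phoneme preferable once a substring is homogeneous; without some such cost, a pair with an empty second section would never score worse than the single phoneme.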

The single-phoneme basic forms are concatenated in the same order as the substrings they represent. The concatenated basic form thus represents successive single phonemes corresponding to successive substrings of the feneme strings FS₁ to FS_N. As explained later, a substring may contain zero, one, or more fenemes, which accounts for the variation in pronunciation from utterance to utterance.

The concatenated basic form described above represents a fundamental basic form of a word segment, for example a vocabulary word. A refinement of the concatenated basic form is embodied in the flowchart of FIG. 22. FIG. 22 continues from block 9020 of FIG. 1B, in which the single phonemes (P_L and P_R) are placed to form the concatenated basic form. In this refinement, the concatenated basic form is aligned against the feneme strings (block 9050). For each of the feneme strings FS₁ to FS_N, the alignment indicates which fenemes in the string (if any) correspond to each phoneme machine, and so serves to partition the string according to the phoneme correspondence (block 9052).

Each partitioned section is analyzed to determine the single phoneme that is best for that section (block 9054). As a result of the alignment, the best single phoneme for the fenemes in a partitioned section may differ from the single phoneme in the previously aligned concatenated basic form.

If the two differ (block 9058), each best single phoneme replaces the corresponding single phoneme in the previously aligned concatenated basic form, producing a new concatenated basic form (block 9056). The new basic form may then, as necessary, undergo alignment (block 9050), partitioning (block 9052), the search for new best phonemes (block 9054), and the replacement of phonemes in the concatenated basic form. As shown in the flowchart of FIG. 22, this cycle can be repeated to obtain successively processed basic forms.

If the old best phoneme in the concatenated basic form is the same as the new best phoneme for a given partition (block 9058), that phoneme is fixed in its position in the concatenated basic form (block 9060). When all phonemes have been fixed in their respective ordered positions, the refined basic form results (block 9062).

Fenemic basic forms will now be described with reference to FIGS. 23 to 26. It is first found that P₁ is the best basic form of length 1 for the feneme strings FS₁ to FS_N.

With P₁ as one phoneme, a second phoneme is determined so as to form the best ordered phoneme pair for the feneme strings FS₁ to FS_N. This is shown in FIG. 23. In FIG. 24, each of the feneme strings FS₁ to FS_N is split at the point where the phoneme P₁ is most likely to meet the phoneme P₂. In FIG. 25, the left portions and the right portions are defined. These portions are subsequently examined separately, in the same way as the feneme strings of FIG. 23. Through this separate processing, the fenemes in each string come to be represented by successively more phonemes. When the probability of a given single phoneme is greater than the probability of the two phonemes derived from it, splitting stops and the given phoneme is placed at its position along the sequence of such unsplit phonemes.

FIG. 26 shows a sample of one placed phoneme P₁ together with the corresponding substrings of the feneme strings FS₁ to FS_N. In FS₁, the phoneme P₁ is associated with one feneme; in FS₂, P₁ is associated with a null; in FS₃, P₁ is associated with the generation of two fenemes; and so on.

To refine the basic form, a Viterbi alignment of the feneme string of each utterance against the concatenated basic form is performed. For each phoneme of the concatenated basic form in turn, the fenemes aligned against it are determined. If no fenemes are aligned against a phoneme, that phoneme is deleted. Otherwise, the phoneme that maximizes the probability of producing the fenemes aligned against it (that is, the fenemes in the section partitioned for it) is found, and it replaces the existing phoneme if the existing phoneme is less probable. This step may or may not be iterated, as desired. If it is iterated, the iteration ends once no phoneme is replaced.
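The refinement step can be illustrated with a small dynamic-programming alignment followed by a re-selection of the best phoneme for each section. This is a sketch under stated assumptions: `EMIT`, `PHONEME_COST`, the function names, and the `max_iterations` guard are all hypothetical, and the alignment segments each string under a simplified independent-emission model rather than running a Viterbi pass through real phoneme machines.

```python
import math

# Hypothetical toy statistics: probability of each feneme label under each phoneme.
EMIT = {
    "A": {"x": 0.80, "y": 0.15, "z": 0.05},
    "B": {"x": 0.10, "y": 0.80, "z": 0.10},
    "C": {"x": 0.12, "y": 0.08, "z": 0.80},
}
# Assumed fixed cost per phoneme, standing in for transition probabilities.
PHONEME_COST = math.log(0.05)


def section_logp(phoneme, section):
    """Log probability that `phoneme` produces `section` (possibly empty)."""
    return PHONEME_COST + sum(math.log(EMIT[phoneme][f]) for f in section)


def align(basic_form, string):
    """Segment `string` into one (possibly empty) section per phoneme, best score first."""
    n, m = len(basic_form), len(string)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(j + 1):
                if best[i - 1][k] == neg:
                    continue
                score = best[i - 1][k] + section_logp(basic_form[i - 1], string[k:j])
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    sections, j = [], m
    for i in range(n, 0, -1):  # trace the best segmentation backwards
        k = back[i][j]
        sections.append(string[k:j])
        j = k
    return sections[::-1]


def sections_logp(phoneme, sections):
    """Joint log probability of one phoneme over its sections in all utterances."""
    return sum(section_logp(phoneme, sec) for sec in sections)


def refine(basic_form, strings, max_iterations=10):
    """Align, delete phonemes with no fenemes, replace less probable phonemes; repeat."""
    for _ in range(max_iterations):
        per_phoneme = [[] for _ in basic_form]
        for s in strings:
            for i, sec in enumerate(align(basic_form, s)):
                per_phoneme[i].append(sec)
        revised = []
        for old, secs in zip(basic_form, per_phoneme):
            if not any(secs):
                continue  # no fenemes aligned against this phoneme: delete it
            best = max(EMIT, key=lambda p: sections_logp(p, secs))
            better = sections_logp(best, secs) > sections_logp(old, secs)
            revised.append(best if better else old)
        if revised == basic_form:
            break  # nothing was replaced or deleted
        basic_form = revised
    return basic_form


print(refine(["A", "C", "B"], ["xxyy", "xyy", "xxxy"]))  # "C" receives no fenemes
print(refine(["A", "B"], ["xxzz", "xzz", "xxxz"]))       # "B" is a poor fit for "z"
```

The iteration cap is a safeguard added for the sketch only; the text itself leaves the number of iterations open.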

The present invention has been implemented in PL/I under MVS on an IBM System 3084, but it may be implemented in any of various languages on any of various computing systems.

In the embodiment described above, the best basic form is characterized as the basic form with the highest joint probability, where the joint probability is the product of the probabilities associated with the individual feneme strings. The best basic form and the highest joint probability may, in accordance with the invention, be defined in other ways. For instance, the highest mean probability, or some predefined distribution, may be used in determining the highest joint probability.
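The difference between the product criterion and a mean-probability criterion can be seen in a small numerical sketch. The candidate names and probability values below are invented purely for illustration; only the two scoring rules come from the text.

```python
import math


def joint_log_probability(per_string_probs):
    """Log of the product of the per-utterance probabilities."""
    return sum(math.log(p) for p in per_string_probs)


def mean_probability(per_string_probs):
    """Arithmetic mean of the per-utterance probabilities."""
    return sum(per_string_probs) / len(per_string_probs)


# Hypothetical probabilities of three utterances under two candidate basic forms.
candidates = {
    "form_1": [0.9, 0.9, 0.001],  # fits two utterances well and one very badly
    "form_2": [0.3, 0.3, 0.3],    # fits all three utterances moderately
}

best_by_product = max(candidates, key=lambda c: joint_log_probability(candidates[c]))
best_by_mean = max(candidates, key=lambda c: mean_probability(candidates[c]))
print(best_by_product, best_by_mean)
```

The product penalizes a basic form heavily for any single utterance it explains poorly, whereas the mean tolerates an outlier; this is why the two criteria can select different basic forms.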

Furthermore, the invention may be practiced by splitting into three or more portions at a time. For example, the feneme string of each utterance may initially be split into three portions (left, middle, and right sections). Each section is then examined separately in the same separate-processing manner. However, splitting into two sections is preferred over splitting into three or more.

Nor is the order of splitting and alignment restricted. In one embodiment, splitting and alignment are carried out on successively smaller left portions until splitting stops, so that the leftmost phoneme of the concatenated basic form is determined first; the second phoneme from the left of the concatenated basic form is determined next. Alternatively, the invention also contemplates other splitting and alignment routines for selecting the desired phonemes of the concatenated basic form.

G. Effects of the Invention

According to the present invention, the basic forms constructed in a speech recognition system can be improved.

[Brief explanation of drawings]

FIG. 1 is a diagram showing how FIGS. 1A and 1B are arranged. FIGS. 1A and 1B are a flowchart showing a method of constructing a fundamental basic form of a word segment from multiple utterances according to the present invention. FIG. 2 is a general block diagram of a system environment in which the present invention may be practiced. FIG. 3 is a block diagram showing in detail the stack decoder in the system environment of FIG. 2. FIG. 4 is a diagram showing a phonetic-type phoneme machine that is identified in storage and represented by statistics obtained during a training session. FIG. 5 is a diagram showing successive steps of stack decoding. FIG. 6 is a diagram showing the stack decoding technique. FIG. 7 is a flowchart of the stack decoding technique. FIG. 8 is a diagram showing the elements of the acoustic processor. FIG. 9 is a diagram showing a portion of a typical human ear, indicating where the components of the acoustic model are formed. FIG. 10 is a block diagram showing portions of the acoustic processor. FIG. 11 is a diagram showing the relationship between sound intensity and frequency used in designing the acoustic processor. FIG. 12 is a diagram showing the relationship between sones and phons. FIG. 13 is a flowchart showing how sound is characterized by the acoustic processor of FIG. 8. FIG. 14 is a flowchart showing how the thresholds are updated in FIG. 13. FIG. 15 is a diagram showing a trellis, or lattice, of the precision matching procedure. FIG. 16 is a diagram showing a phoneme machine used in performing the matching. FIG. 17 is a diagram showing a phoneme tree structure that allows multiple words to be processed simultaneously. FIG. 18 is a flowchart showing the training of Markov model phoneme machines. FIG. 19 is a flowchart showing the extension of word paths. FIG. 20 is a diagram showing feneme sequences obtained from N utterances of one word segment. FIG. 21 is a diagram showing a sample fenemic phoneme machine. FIG. 22 is a flowchart that is added to the flowchart of FIGS. 1A and 1B in order to improve the basic form of a word segment. FIG. 23 is a diagram showing the best basic form of length 2 applied to each feneme string, each string being generated in response to one of multiple utterances. FIG. 24 is a diagram showing each feneme string split at the point determined to be where the phoneme P₁ consistently meets the phoneme P₂. FIG. 25 is a diagram showing the split portions identified as left portions and right portions. FIG. 26 is a diagram showing a phoneme and the corresponding portion of each of the feneme strings FS₁ to FS_N.

1000: speech recognition system; 1002: stack decoder; 1004: acoustic processor; 1006, 1008: array processors; 1010: language model; 1012: workstation; 1020: search device; 1022, 1024, 1026, 1028: interfaces.

Claims

[Claims] 1. A method of generating a word Markov model for use in speech recognition, in which a word Markov model, formed by concatenating word-part Markov models corresponding to a set of labels each representing an acoustic type assignable to a minute time interval, is compared with the labels of unknown input speech to find the likelihood that the unknown input speech is generated in relation to the word Markov model, the method being characterized by comprising the following steps:
(a) generating a plurality of label strings from a plurality of utterances of the word for which the model is to be generated;
(b) repeatedly dividing the plurality of label strings into label string portions, step (b) comprising: (i) a substep of determining the pair of word-part Markov models that generates, with maximum probability, the plurality of label strings or label string portions in the target range to be divided, and dividing the target range into left and right target ranges based on matching against that pair of models; and (ii) a substep of repeating substep (i) above until the plurality of label string portions in the preceding target range are generated;
(c) after the final division in step (b), assigning to each divided portion of the plurality of label strings the word-part Markov model that generates the labels contained in that portion with the highest probability, thereby generating a chain of Markov models, and taking this chain as the word Markov model of the word for which the model is to be generated.