JPH0431116B2

Info

Publication number: JPH0431116B2
Application number: JP59003590A
Authority: JP
Priority date: 1984-01-13
Filing date: 1984-01-13
Publication date: 1992-05-25
Also published as: JPS60149096A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for carrying out both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1 to 3. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs segmentation for each phoneme, likelihood calculation, word similarity calculation, and so on. Further, 4 is a phoneme standard pattern unit that stores the phoneme standard patterns, which are prepared in advance through preliminary experiments and represent, for each phoneme, the distribution of the various parameters in the form of a mean value ($\mu_i$) and a covariance matrix among the parameters ($\Sigma_i$); and 5 is a word dictionary unit that stores a word dictionary in which all words to be recognized are written as symbol strings in phoneme units. In the word dictionary, for example, the words "Sapporo", "Asahikawa", "Akita", "Shima", and "Shisa" are written as "SAQPORO", "ASAHIKAWA", "AKITA", "SIMA", and "SISA", respectively.

Next, the operation of the above conventional example is described. The parameter extraction unit 1 analyzes the input speech in frames of 10 ms, extracts the parameters, and creates a parameter time series. The probability density calculation unit 2 matches the parameters obtained for each frame against the phoneme standard patterns and calculates, from the parameter values, the probability density of generation from each phoneme. Next, using these parameters and the probability density values obtained, the word recognition unit 3 segments the speech one phoneme at a time for each dictionary entry, following the dictionary phoneme sequence that constitutes the entry, and calculates the likelihood $l$ of the segmented interval corresponding to each phoneme, according to the type of that phoneme, by the following equation. The similarity for the dictionary entry is then obtained as the average of the likelihoods of its phonemes. Here, let the phoneme be X, let $N_s$ and $N_e$ be the frame numbers at the start and end of the interval segmented for X, and let $C_n$ be the values of the parameters in the n-th frame. The likelihood $l_x$ of phoneme X is then defined by the following equation.

$\phi_i(C_n)$ denotes the probability density of a phoneme $i$ and is defined by the following equation.

$$\phi_i(C_n) = \frac{1}{(2\pi)^{j/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\,(C_n-\mu_i)^{T}\,\Sigma_i^{-1}\,(C_n-\mu_i)\right]$$

$C$: the $j$ parameters in one frame (vector)
$\mu_i$: mean value of the parameters of phoneme $i$ (vector)
$\Sigma_i$: covariance matrix

In the likelihood equation, the range of the summation index $i$ in the denominator of the probability density quotient depends on what phoneme X is; for example, when X is the phoneme A, $i$ ranges over the five vowels A, E, I, O, and U.
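As an illustration of the density $\phi_i(C_n)$, the following minimal Python sketch evaluates a Gaussian density for one frame from a mean vector and a covariance matrix. It uses the standard normalization $(2\pi)^{j/2}$ for $j$ parameters per frame. The function names and the Cholesky-based evaluation are assumptions made for this example; the patent does not state how the density is computed numerically.

```python
import math


def cholesky(cov):
    """Lower-triangular factor L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(cov[r][r] - s)
            else:
                low[r][c] = (cov[r][c] - s) / low[c][c]
    return low


def gaussian_density(c_n, mean, cov):
    """Density of frame vector c_n under the phoneme pattern (mean, cov)."""
    j = len(c_n)
    low = cholesky(cov)
    diff = [x - m for x, m in zip(c_n, mean)]
    # Forward substitution: solve low * y = diff, so that
    # y . y == diff^T * cov^{-1} * diff (the quadratic form in the exponent).
    y = []
    for r in range(j):
        s = sum(low[r][k] * y[k] for k in range(r))
        y.append((diff[r] - s) / low[r][r])
    quad = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(low[r][r]) for r in range(j))
    log_norm = -0.5 * (j * math.log(2.0 * math.pi) + log_det)
    return math.exp(log_norm - 0.5 * quad)
```

Working in the log domain until the final `exp` keeps the normalization term and the quadratic form from underflowing separately, which is a design choice of this sketch rather than something taken from the patent.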

The word similarity $L_M$ obtained as described above was computed for each dictionary entry according to the following equation, and the dictionary entry with the largest $L_M$ was taken as the recognized word.
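The selection rule of the conventional example (average the per-phoneme likelihoods of each dictionary entry, then take the entry with the largest average) can be sketched as follows. The function names and the likelihood values are hypothetical and chosen only for illustration.

```python
def word_similarity(phoneme_likelihoods):
    """Average of the per-phoneme likelihoods of one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)


def recognize(likelihoods_by_entry):
    """Return the dictionary entry whose word similarity is largest."""
    return max(
        likelihoods_by_entry,
        key=lambda entry: word_similarity(likelihoods_by_entry[entry]),
    )


# Hypothetical per-phoneme likelihoods for two dictionary entries.
scores = {
    "SIMA": [0.8, 0.6, 0.7, 0.9],
    "SISA": [0.8, 0.1, 0.2, 0.9],
}
best = recognize(scores)
```

Because the similarity is a plain average, a single badly segmented phoneme lowers the score of the whole entry, which is why segmentation errors matter for recognition.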

$$L_M = \frac{1}{NP}\sum_{i=1}^{NP} l_i$$

$L_M$: similarity of the M-th word in the dictionary
$l_i$: likelihood of phoneme $i$ in the dictionary phoneme sequence
$NP$: number of phonemes in the dictionary entry

In the above conventional example, segmentation and likelihood calculation are performed for each single phoneme of a dictionary entry using the phoneme probability density values. FIG. 2 shows the time variation of the probability density of each phoneme when /SiMA/ ("island") is uttered. In this case, segmentation and likelihood calculation follow the time variation of the probability density values $\phi_S$, $\phi_i$, $\phi_M$, $\phi_A$ of the phonemes /S/, /i/, /M/, /A/. For the word-initial /S/, the frame a at which $\phi_S$ falls and $\phi_i$ rises is taken as the end of /S/, and the likelihood is calculated using $\phi_S$ over the segmented interval (SF-a). Likewise, for the second phoneme /i/ following the initial /S/, the frame b at which $\phi_i$ falls and $\phi_M$ rises is taken as the end of /i/, and the likelihood is calculated using $\phi_i$ over the segmented interval (a to b).
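The per-phoneme boundary search of the conventional example can be sketched as below. The patent only says that the boundary is the frame where the current phoneme's density falls and the next phoneme's density rises; treating this as the first frame at which the next density exceeds the current one is an assumption of this sketch, and the density values are invented to resemble the /SiMA/ case.

```python
def find_boundary(phi_cur, phi_next, start=0):
    """First frame n >= start where phi_next[n] > phi_cur[n], or None if there is none."""
    for n in range(start, len(phi_cur)):
        if phi_next[n] > phi_cur[n]:
            return n
    return None


# Hypothetical per-frame densities for /S/, /i/ and /M/ in /SiMA/.
phi_S = [0.9, 0.8, 0.6, 0.2, 0.1, 0.0, 0.0, 0.0]
phi_i = [0.0, 0.1, 0.3, 0.7, 0.8, 0.3, 0.1, 0.0]
phi_M = [0.0, 0.0, 0.0, 0.1, 0.2, 0.6, 0.8, 0.2]

a = find_boundary(phi_S, phi_i)           # end of the word-initial /S/
b = find_boundary(phi_i, phi_M, start=a)  # end of /i/
```

With a clearly voiced /i/, each phoneme's density dominates in turn, so the search finds one boundary per phoneme.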

FIG. 3 shows the time variation of the probability density of each phoneme when /SiSA/ ("suggestion") is uttered. Segmentation and likelihood calculation follow the time variation of the probability density values $\phi_S$, $\phi_i$, $\phi_S$, $\phi_A$ of the phonemes /S/, /i/, /S/, /A/. When the word-initial /S/ is segmented, however, $\phi_i$ is very small because the following /i/ is devoiced, and $\phi_S$ stays high beyond the proper interval of the initial /S/ (FS-c) and even beyond the proper interval (c to d) of the /i/ that follows it. As a result, the end e of the /S/ that follows /i/ was output as the end of the word-initial /S/, causing a segmentation error.

Consequently, the segmentation of the phonemes /i/, /S/, and /A/ that follow the word-initial /S/ also fails and their likelihoods become low, so that words containing devoiced vowels were prone to misrecognition.

(Object of the Invention) The present invention eliminates the above drawback of the conventional example. Its object is to improve the accuracy of segmentation and likelihood calculation and thereby improve the word recognition rate.

(Structure of the Invention) The present invention provides a word dictionary in which the words to be recognized are written as symbol strings in phoneme units, and a standard pattern for each phoneme expressed as the distribution of that phoneme's acoustic parameters. When recognizing a word in the input speech, the input speech is matched against each dictionary entry of the word dictionary; for each phoneme, following the dictionary phoneme sequence that constitutes the entry, the probability density of generation from that phoneme is calculated using its phoneme standard pattern and the input speech is segmented; and the similarity between each dictionary entry and the input speech is obtained using those probability density values over the segmented speech intervals, whereby the word is recognized. In this word speech recognition method, the invention is characterized in that, when segmentation and likelihood calculation are performed for a devoiced vowel sandwiched between voiceless consonants, the three consecutive phonemes containing the devoiced vowel (voiceless consonant, devoiced vowel, voiceless consonant) are segmented together and their likelihood is calculated together, using the probability density values of the phonemes. This has the effect of improving the accuracy of segmentation and likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIG. 1. In the figure, the parameter extraction unit 1, the probability density calculation unit 2, and the phoneme standard pattern unit 4 are the same as in the conventional example described above; the differences lie mainly in the contents of the word dictionary unit 5 and in part of the segmentation and likelihood calculation of the word recognition unit 3. The word dictionary stored in the word dictionary unit 5 writes the words to be recognized as phoneme symbol strings, as before, but differs from the conventional example in that vowels that are easily devoiced, for example the circled I in "ASAHⒾKAWA", "AKⒾTA", "SⒾMA", and "SⒾSA", are given a code in advance that marks them as such.

In the method of this embodiment, the parameter extraction unit 1 first obtains the parameters for each frame from the input speech, and the probability density calculation unit 2 then uses those parameter values to calculate the probability density given by each phoneme standard pattern. Up to this point the method is the same as the conventional example. Next, for each dictionary entry in the word dictionary unit 5, the word recognition unit 3 segments each phoneme X following the dictionary phoneme sequence that constitutes the entry, and calculates the likelihood $l_x$ of the interval segmented for that phoneme X. When the dictionary phoneme sequence contains a devoiced vowel V sandwiched between voiceless consonants $C_1$ and $C_2$, the probability density values in the devoiced vowel do not show the character of a vowel but that of a voiceless consonant. Therefore, in the above segmentation, for a sequence of voiceless consonant, devoiced vowel, voiceless consonant ($C_1VC_2$), the three phonemes are segmented together using the respective phoneme probability density values, in accordance with the types of the phonemes in the sequence and their order, and the likelihood $l_{C_1VC_2}$ is calculated for the segmented interval.
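One way to picture the grouping step is the sketch below, which turns a dictionary phoneme sequence with devoicing marks into segmentation units and merges a marked vowel with its neighbours only when both neighbours are voiceless consonants. The representation of an entry as (symbol, mark) pairs, the contents of `VOICELESS`, and the function name are assumptions made for this example; the patent does not specify the dictionary encoding.

```python
# Assumed set of voiceless consonant symbols; not taken from the patent.
VOICELESS = {"S", "K", "T", "H", "P"}


def group_units(entry):
    """Split a dictionary entry into segmentation units.

    entry is a list of (symbol, easily_devoiced) pairs. A marked vowel that
    sits between two voiceless consonants is merged with them into one
    three-phoneme unit; every other phoneme forms a unit of its own.
    """
    units = []
    k = 0
    while k < len(entry):
        is_c1_v_c2 = (
            k + 2 < len(entry)
            and entry[k][0] in VOICELESS
            and entry[k + 1][1]
            and entry[k + 2][0] in VOICELESS
        )
        if is_c1_v_c2:
            units.append(tuple(symbol for symbol, _ in entry[k:k + 3]))
            k += 3
        else:
            units.append((entry[k][0],))
            k += 1
    return units


sisa = [("S", False), ("I", True), ("S", False), ("A", False)]
sima = [("S", False), ("I", True), ("M", False), ("A", False)]
```

Here `group_units(sisa)` yields one merged unit for S, I, S followed by A, while `group_units(sima)` keeps four single-phoneme units because M is not a voiceless consonant. The scan is greedy from left to right, which is a simplification of this sketch.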

In FIG. 3, the probability density $\phi_i$ of the /i/ within /SiS/ is almost zero; instead, the probability density $\phi_S$ of the word-initial /S/ dominates from the start of the word up to e, the end of the third phoneme /S/.

Accordingly, segmentation is performed using the probability densities $\phi_S$ and $\phi_A$ of the third phoneme /S/ of the three-phoneme sequence containing the devoiced vowel and of the vowel /A/ that follows it, and the likelihood is calculated using $\phi_S$ over the segmented interval. In this way the three consecutive phonemes /SiS/ (voiceless consonant, devoiced vowel, voiceless consonant) are matched to the interval (FS to e); because the segmentation is good, the accuracy of the likelihood calculation also improves.
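The effect on a /SiSA/-like utterance can be illustrated with the sketch below. The crossover rule (first frame at which the next density exceeds the current one) and all density values are assumptions invented for this example; they only mimic the situation of FIG. 3, in which $\phi_i$ stays near zero throughout /SiS/.

```python
def first_crossover(phi_cur, phi_next, start=0):
    """First frame n >= start where phi_next[n] > phi_cur[n], or None if there is none."""
    for n in range(start, len(phi_cur)):
        if phi_next[n] > phi_cur[n]:
            return n
    return None


# Hypothetical per-frame densities for /SiSA/ with a devoiced /i/.
phi_S = [0.9, 0.9, 0.8, 0.8, 0.9, 0.9, 0.3, 0.1]
phi_i = [0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]
phi_A = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.7, 0.9]

# Per-phoneme search: phi_i never overtakes phi_S, so no /S/-/i/ boundary is found.
per_phoneme = first_crossover(phi_S, phi_i)

# Merged /SiS/ unit: search only for the end of the whole unit, using the
# density of its last consonant and of the following vowel /A/.
e = first_crossover(phi_S, phi_A)
```

In this toy data the merged search returns a single boundary at the point where $\phi_A$ takes over, so the interval from the start of the word to that frame can be scored with $\phi_S$ as one unit.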

In this embodiment a devoiced vowel is not treated as a single phoneme. Instead, the phoneme sequence of voiceless consonant, devoiced vowel, and voiceless consonant that contains it is segmented and its likelihood calculated as a unit, which has the advantage of improving the recognition rate for words containing devoiced vowels.

(Effects of the Invention) When segmentation and likelihood calculation are performed for a devoiced vowel sandwiched between voiceless consonants, the present invention uses the probability density values of the phonemes to segment the three consecutive phonemes containing the devoiced vowel (voiceless consonant, devoiced vowel, voiceless consonant) together and to calculate their likelihood together. It therefore has the advantage of performing segmentation and likelihood calculation with higher accuracy than the conventional method.

[Brief explanation of drawings]

FIG. 1 is a functional block diagram of an apparatus for carrying out both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. FIG. 2 is a diagram showing the time variation of the probability density of each phoneme when /SiMA/ ("island") is uttered. FIG. 3 is a diagram showing the variation of the probability density of each phoneme when /SiSA/ ("suggestion") is uttered.

1: parameter extraction unit, 2: probability density calculation unit, 3: word recognition unit, 4: phoneme standard pattern unit, 5: word dictionary unit.

Claims

[Claims]

1. A word speech recognition method in which, when recognizing a word in input speech, the input speech is matched against each dictionary entry of a word dictionary in which the words to be recognized are written as symbol strings in phoneme units; for each phoneme, following the dictionary phoneme sequence that constitutes each dictionary entry, the probability density of generation from that phoneme is calculated using the standard pattern of each phoneme, expressed as the distribution of that phoneme's acoustic parameters, and the input speech is segmented; and the similarity between each dictionary entry and the input speech is obtained using said probability density values over the segmented speech intervals, whereby the word is recognized; the method being characterized in that, when segmentation and likelihood calculation are performed for a devoiced vowel sandwiched between voiceless consonants, the three consecutive phonemes containing the devoiced vowel are segmented together and their likelihood is calculated using the probability density values of the phoneme sequence of voiceless consonant, devoiced vowel, and voiceless consonant.