JPH0715638B2

JPH0715638B2 - Syllable pattern cutting device

Info

Publication number: JPH0715638B2
Application number: JP63240248A
Authority: JP
Inventors: 伸神谷; 徹上田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1988-09-26
Filing date: 1988-09-26
Publication date: 1995-02-22
Anticipated expiration: 2010-02-22
Also published as: JPH0289098A

Description

[Detailed Description of the Invention]
<Industrial Field of Application>
The present invention relates to an improvement of a syllable pattern cutout device for use in a voice input device.

<Prior Art>
In devices that convert speech into character strings, such as voice word processors, one conventional method cuts syllable patterns out of input speech patterns in advance, registers them as syllable standard patterns, and then recognizes syllables based on the distance between these standard patterns and the syllable test patterns cut out of the input speech. To cut out (segment) the syllable patterns, syllable sections are extracted from the non-silent section using acoustic feature parameters such as the power and spectral change of the input speech. However, when syllable sections are cut out using these feature parameters alone, noise and coarticulation can make the cutout inaccurate. An incorrect syllable standard pattern is then created, which significantly degrades syllable recognition performance.

To address this, a voice word processor has been proposed that echoes back the speech waveform corresponding to each syllable section to be cut out, plays it as audio, and has the operator confirm it, thereby preventing the creation of incorrect syllable standard patterns.

<Problems to be Solved by the Invention>
As described above, the conventional syllable pattern cutout method echoes back the speech waveform corresponding to the syllable pattern cutout section so that the operator can confirm the syllable section to be cut out (that is, the syllable boundary positions), which prevents incorrect syllable standard patterns from being created.

However, this syllable pattern cutout method has the problem that an inattentive operator may fail to hear the echo-back corresponding to a syllable pattern cutout section. In addition, time is needed for the echo-back after every utterance, so the registration work takes a long time.

Accordingly, an object of the present invention is to provide a syllable pattern cutout device that can determine syllable boundary positions easily and accurately without requiring echo-back audio output.

<Means for Solving the Problem>
To achieve the above object, the syllable pattern cutout device of the present invention comprises: a speech analysis unit that extracts feature parameters of the input speech and detects the section lengths of phonemes; a syllable boundary position detection unit that detects syllable boundary position candidates in the input speech based on the feature parameters; an average syllable length estimation unit that obtains an estimated average syllable length based on the speech section length of a word whose utterance content is known and the number of syllables constituting that word; a vowel-consonant-vowel sequence generation unit that generates the vowel-consonant-vowel sequence constituting the known word; a syllable boundary position candidate evaluation unit that evaluates, for each syllable boundary position candidate, the proportion of phoneme section length corresponding to the vowel-consonant-vowel sequence that lies within a predetermined range determined from the estimated average syllable length; and a syllable boundary determination unit that performs DP matching in which, for each pair of a syllable boundary position candidate and a vowel-consonant-vowel sequence, the evaluation result of that proportion is the grid point value and the difference between the length between syllable boundary position candidates and the estimated average syllable length is the weight of the path connecting grid points, and that determines the syllable boundaries from among the syllable boundary position candidates based on the DP matching result.

<Operation>
The speech analysis unit extracts the feature parameters of the input speech and detects the section lengths of phonemes, and the syllable boundary position detection unit detects syllable boundary position candidates in the input speech based on these feature parameters. Meanwhile, the average syllable length estimation unit obtains the estimated average syllable length from the speech section length of a word whose utterance content is known and the number of syllables constituting that word. The vowel-consonant-vowel sequence generation unit generates the vowel-consonant-vowel sequence constituting the known word. Referring to the phoneme sequence extracted from the input speech by matching against phoneme standard patterns, the syllable boundary position candidate evaluation unit evaluates, for each syllable boundary position candidate, the proportion of phoneme section length corresponding to the vowel-consonant-vowel sequence that lies within the predetermined range determined from the estimated average syllable length, and uses the result as a reliability. During the DP matching that associates syllable boundary position candidates with the vowel-consonant-vowel sequence, the evaluation result of that proportion is used as the grid point value, and the difference between the length between syllable boundary position candidates and the estimated average syllable length is used as the weight of the path connecting grid points. The syllable boundaries are determined from among the candidates based on the DP matching result, and the sections delimited by the determined syllable boundaries are cut out as syllable patterns.

In other words, the operator does not need to confirm the cut-out syllable sections by echo-back or similar means, and the syllable boundaries can be determined correctly and automatically.

<Embodiment>
The present invention is described in detail below with reference to the illustrated embodiment.

FIG. 1 is a block diagram of the syllable cutout device according to the present invention. This device takes as input a speech signal and the romanized spelling of the spoken word (phrase), and outputs syllable patterns (time series of cepstrum coefficients) from the input speech, using the following procedure.

When syllable standard patterns are registered with this syllable cutout device, the utterance content is known in advance from the romanized input, so information that cannot be used in ordinary speech recognition can be exploited in a top-down manner. First, because the number of syllables in the input speech is known in advance from the romanized input, the estimated average syllable length can be obtained by dividing the speech section length by the number of syllables. Next, the syllable boundaries are determined, based on the estimated average syllable length, from among the syllable boundary position candidates detected from spectral change, power change, and the like. The reliability used in this determination is a value based on (a) the number of phoneme types corresponding to the vowel of the preceding syllable, the consonant of the following syllable, and the vowel of the following syllable that lie within a fixed range (estimated average syllable length × constant) before and after the syllable boundary position, and (b) the difference between the length between syllable boundary position candidates (the syllable length) and the average syllable length.

Next, an overview of the syllable cutout device is given with reference to FIG. 1.

From the input speech, the speech analysis unit 1 obtains, for each frame (period 8 ms), feature parameters such as power and cepstrum coefficients, together with the phonological classification symbols and phoneme symbols described in detail later. From these feature parameters, phonological symbols, and phoneme symbols, the syllable boundary detection unit 2 obtains syllable boundary position candidates. Meanwhile, from the romanized spelling of the input word, the VCV generation unit 3 obtains the sequence of VCVs (vowel-consonant-vowel sequences) constituting the word. The VCV spotter 4 then calculates a reliability for each pair of a syllable boundary position candidate and a VCV, the DP (dynamic programming) unit 5 searches for the pairing of syllable boundary position candidates and VCVs with the highest reliability, and the syllable boundary determination unit 6 cuts out the corresponding sections as syllable patterns.

Each part of the syllable cutout device is described in detail below.

(1) Speech analysis unit 1
Feature parameters such as the linear prediction (LPC) cepstrum, power, and differential power are obtained from the speech input through the microphone under the conditions shown in Table 1.

Using these feature parameters, the rough phonological characteristics of each frame are encoded into the six kinds of symbols shown in Table 2 (hereinafter called phonological classification symbols) and output. In addition, frame-by-frame matching against standard patterns of phonemes (the five vowels plus /n/ and /s/) cut out automatically from isolated monosyllables produces phoneme symbol strings such as those shown in FIG. 4. The following two segmentation parameters are also calculated from the feature parameters.

・ Power dip: the first-order derivative of the power.

・ Spectral change: the cepstrum coefficients between frames that are 8 frames (or 4 frames) apart.

(2) Syllable boundary detection unit 2
Syllable boundary position candidates are obtained using the phonological classification symbols, power dip, and spectral change obtained by the speech analysis unit 1 (see Table 3).

The boundary symbols "(" and ")" are attached to the frames corresponding to the change points in the phonological classification symbol string: silence → sound, sound → silence, unvoiced → voiced, and voiced → unvoiced. The boundary symbol "P" is attached to frames where the power dip is large, and the boundary symbol "s" is attached to frames where the spectral change is large.

(3) Average mora length estimation unit 7
The estimated average mora length LM is obtained as follows from the number of moras M, obtained from the romanized input, and the speech section length LT between the boundary symbols "(" and ")" detected by the syllable boundary detection unit 2.

The initial value of the estimated average mora length LM is 21 frames (corresponding to a speaking rate of 6 moras per second), and it is updated with each word utterance. If the speech section length when a word consisting of M moras is uttered is LT frames, the average mora length L in frames is L = LT/M. If 16 ≤ L < 31, then LM = L. To allow for cases where the speech section length LT is detected incorrectly because of noise or poor articulation, if L < 16 or 31 ≤ L, then LM = (21 + L)/2.
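The update rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `estimate_average_mora_length` and the constant names are hypothetical and do not appear in the patent, and only the thresholds 16, 31, and 21 come from the text.

```python
# Hypothetical names; the numeric values are the ones given in the description.
INITIAL_LM = 21.0   # frames; roughly 6 moras/s at an 8 ms frame period
MIN_L, MAX_L = 16, 31


def estimate_average_mora_length(lt_frames: float, num_moras: int) -> float:
    """Return the estimated average mora length LM (in frames) for one word."""
    if num_moras <= 0:
        raise ValueError("num_moras must be positive")
    l = lt_frames / num_moras          # L = LT / M
    if MIN_L <= l < MAX_L:
        return l                       # plausible value: LM = L
    # LT was probably mis-detected: pull L halfway toward the default of 21
    return (INITIAL_LM + l) / 2
```

For example, a 6-mora word detected as 60 frames long gives L = 10, which is outside the accepted range, so the sketch returns (21 + 10)/2 = 15.5 frames.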

(4) VCV generation unit 3
When the romanized spelling of the spoken word is input, the sequence of VCVs constituting the word is obtained. Each VCV consists of three elements defined at a syllable boundary: the vowel V2 of the preceding syllable, the consonant C1 of the following syllable, and the vowel V1 of the following syllable. For a word of n syllables, n + 1 VCVs are generated, including the boundaries with the silent sections (see FIG. 2).

However, for a syllable consisting only of a vowel, the part corresponding to C1 is represented by the same symbol as V1, and for a syllable consisting of consonant + contracted sound (yōon) + vowel, the symbol for the contracted sound is omitted.
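The VCV generation rules can be illustrated with a short Python sketch. Several details here are assumptions rather than statements from the patent: the input is taken to be a list of already separated romanized syllables that each end in a vowel, silence at the word edges is written as ".", and the final VCV is filled with silence for both C1 and V1 because FIG. 2 is not reproduced in the text. The names `split_syllable` and `generate_vcv` are hypothetical, and the moraic nasal is not handled.

```python
VOWELS = "aiueo"
SILENCE = "."   # assumed symbol for the silent sections at the word edges


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a romanized syllable such as 'kya' into (consonant, vowel)."""
    if not syllable or syllable[-1] not in VOWELS:
        raise ValueError(f"unsupported syllable: {syllable!r}")
    onset, vowel = syllable[:-1], syllable[-1]
    # consonant + contracted sound + vowel: drop the glide symbol
    if len(onset) > 1 and onset.endswith("y"):
        onset = onset[:-1]
    return onset, vowel


def generate_vcv(syllables: list[str]) -> list[tuple[str, str, str]]:
    """Return the n + 1 (V2, C1, V1) triples for a word of n syllables."""
    vcvs = []
    prev_vowel = SILENCE
    for syllable in syllables:
        consonant, vowel = split_syllable(syllable)
        c1 = consonant if consonant else vowel   # vowel-only syllable: C1 = V1
        vcvs.append((prev_vowel, c1, vowel))
        prev_vowel = vowel
    vcvs.append((prev_vowel, SILENCE, SILENCE))  # boundary with trailing silence
    return vcvs
```

Calling `generate_vcv(["a", "kya", "ku"])` yields four triples for the three syllables, one per syllable boundary including the two word edges.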

(5) VCV spotter 4
For each pair of a syllable boundary position candidate detected by the syllable boundary detection unit 2 and a V2C1V1 generated by the VCV generation unit 3, the reliability D as a syllable boundary is obtained as follows. Note that a smaller value of D means a higher likelihood of being a syllable boundary position.

First, according to Table 4, the phoneme symbol strings from the speech analysis unit 1 that are to be searched (phoneme 1 and phoneme 2, written with nine kinds of uppercase letters) are determined from the lowercase letters that make up V2C1V1. Here, phoneme 1 and phoneme 2 mean that there are two phonemes to be searched.

Let the position of a syllable boundary position candidate be frame t. Let c(V2) be the number of frames of the phoneme corresponding to V2 in the interval from (t - LM/2) to t, c(C1) the number of frames of the phoneme corresponding to C1 in the interval from t to (t + LM/2), and c(V1) the number of frames of the phoneme corresponding to V1 in the interval from (t + LM/2) to (t + LM). Then, referring to phoneme 1 and phoneme 2 obtained from Table 4 as described above, the reliability D is obtained as follows.

When C1 is not "B": D ← LM - c(V2) - c(C1)
When C1 is "B": D ← LM - c(V2) - c(V1)
When C1 is "B" or "S": D ← LM - c(V2) - MAX(c(C1 = "S"), c(V1))
Here, MAX( , ) selects the larger of the values in the parentheses.

Furthermore, depending on the type of the syllable boundary position candidate:
"(": D ← D - LM/2
"(": D ← D - LM × 2/3
"p": D ← D - LM/3
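The frame counting behind the reliability D can be sketched as follows. This is a simplified illustration under stated assumptions: the per-frame phoneme symbols are given as a list of single-character strings, each of `v2`, `c1`, and `v1` is a string of acceptable symbols (standing in for phoneme 1 and phoneme 2 from Table 4, which is not reproduced here), and LM is an integer number of frames. Only the first two cases of the D formula are implemented; the third case and the boundary-type adjustment are left out. The function names are hypothetical.

```python
def count_frames(phonemes: list[str], start: int, end: int, targets: str) -> int:
    """Count frames in [start, end) whose phoneme symbol is one of `targets`."""
    lo = max(0, start)
    hi = max(lo, min(len(phonemes), end))
    return sum(1 for p in phonemes[lo:hi] if p in targets)


def reliability(phonemes: list[str], t: int, lm: int,
                v2: str, c1: str, v1: str) -> int:
    """Return D for a boundary candidate at frame t (smaller is more likely)."""
    half = lm // 2
    c_v2 = count_frames(phonemes, t - half, t, v2)        # (t - LM/2) .. t
    c_c1 = count_frames(phonemes, t, t + half, c1)        # t .. (t + LM/2)
    c_v1 = count_frames(phonemes, t + half, t + lm, v1)   # (t + LM/2) .. (t + LM)
    if c1 == "B":
        return lm - c_v2 - c_v1
    return lm - c_v2 - c_c1
```

With a frame string of ten "A" frames followed by ten "S" frames and ten "I" frames, a candidate placed exactly at the A/S transition gets D = 0 for LM = 20, while a candidate six frames later gets a larger D.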

(6) DP unit 5
Using the reliability D(i, j) obtained by the VCV spotter 4, DP matching is performed over the pairs of the i-th syllable boundary position candidate and the j-th V2C1V1 to find the syllable boundary positions along the path with the smallest cumulative distance G(i, j) (see FIG. 3). Both endpoints are left free in order to exclude noise such as breath that may be present at the beginning and end of the word. The cumulative distance G(i, j) at point (i, j) is given by equation (1).

Here, S(i, k) represents the deviation of the frame length between syllable boundary position candidates (corresponding to the syllable length) from the estimated average mora length LM. If the i-th syllable boundary position candidate is at frame t(i), S(i, k) is given by equation (2).

S(i, k) = | |t(i) - t(k)| - LM |  ... (2)
That is, based on the assumption that syllable lengths do not vary much within the speech of a word (phrase), the difference between the syllable length and the estimated average mora length is used as a weight in DP matching.
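A possible shape for this DP matching is sketched below. Equation (1) is not reproduced in this text, so the recurrence used here, G(i, j) = D(i, j) + min over k < i of [G(k, j-1) + S(i, k)], is an assumption chosen to match the description of D as the grid point value and S as the path weight; it is not necessarily the patent's exact formula. Free endpoints are modeled by letting the first VCV start at any candidate and taking the minimum over all candidates for the last VCV. The function name `dp_align` is hypothetical.

```python
def dp_align(t: list[int], d: list[list[float]], lm: float) -> list[int]:
    """Pick one boundary candidate per VCV.

    t[i]    : frame position of the i-th candidate (ascending)
    d[i][j] : reliability D of candidate i for the j-th VCV (smaller is better)
    Returns the chosen candidate index for each VCV, in order.
    """
    n, m = len(t), len(d[0])
    if n < m:
        raise ValueError("fewer boundary candidates than VCVs")
    inf = float("inf")
    g = [[inf] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    for i in range(n):
        g[i][0] = d[i][0]                      # free start point
    for j in range(1, m):
        for i in range(j, n):
            for k in range(j - 1, i):
                s = abs(abs(t[i] - t[k]) - lm)  # equation (2)
                cost = g[k][j - 1] + s + d[i][j]
                if cost < g[i][j]:
                    g[i][j] = cost
                    back[i][j] = k
    i = min(range(n), key=lambda r: g[r][m - 1])  # free end point
    path = [i]
    for j in range(m - 1, 0, -1):
        i = back[i][j]
        path.append(i)
    path.reverse()
    return path
```

In this sketch the triple loop costs O(n²m) for n candidates and m VCVs, which is small for single words; the backpointer table lets the chosen boundaries be read off after the forward pass.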

(7) Syllable boundary determination unit 6
The sections between the syllable boundary position candidates obtained by the DP unit 5 are cut out as syllable patterns (see FIG. 4). If any of the following conditions is met, the syllable cutout from that word utterance is rejected. Consequently, when syllable standard patterns are being registered, a rejected word must be uttered again and re-input.

・ The word contains a short syllable whose length is LM/4 or less.

・ The word contains a long syllable whose length is 2LM or more.

・ The word contains an uncertain syllable boundary whose reliability D is LM/2 or more.
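The three rejection conditions above can be checked with a small helper. This is a sketch with hypothetical names: `boundaries` is assumed to hold the frame positions of the boundaries chosen by the DP unit in ascending order, and `reliabilities` the D value of each chosen boundary.

```python
def should_reject(boundaries: list[int], reliabilities: list[float],
                  lm: float) -> bool:
    """Return True if the cutout for this word should be rejected."""
    lengths = [b - a for a, b in zip(boundaries, boundaries[1:])]
    if any(length <= lm / 4 for length in lengths):   # too-short syllable
        return True
    if any(length >= 2 * lm for length in lengths):   # too-long syllable
        return True
    return any(d >= lm / 2 for d in reliabilities)    # uncertain boundary
```

A word for which the helper returns True would have to be spoken again during registration, as the text describes.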

Here, the first and second phoneme candidates in FIG. 4 are the candidates selected in ascending order of matching distance in the matching against standard patterns based on isolated monosyllables performed in the speech analysis unit 1.

The above is the configuration and basic algorithm of the syllable cutout device. To improve performance so that the device can cope with phonological rules, fluctuations in utterance, and the like, the following exceptional rules are added.

Devoicing
In the above V2C1V1, if C1 is an unvoiced consonant and V1 is "I" or "U", the phoneme corresponding to V1 at the syllable boundary position paired with that V2C1V1 in the input speech is allowed to be silence ".".

Vowel sequence
In the above V2C1V1, if V2 is "A" and C1 is "I", the phoneme corresponding to C1 at the syllable boundary position paired with that V2C1V1 in the input speech is allowed to be "E".

Word-initial syllables
When /i/, /hi/, /hu/, or /tsu/ is at the beginning of a word, the frame length between syllable boundary position candidates tends to be short. Therefore, the deviation S(i, k) of the frame length of the syllable boundary position candidate section from the average mora length LM, which is used when the DP unit 5 calculates the cumulative distance G(i, k), is changed as follows.

S(i, k) = | |t(i) - t(k)| - LM/2 |

Word-final syllables
When /i/ or /N/ is at the end of a word, the frame length between syllable boundary positions tends to be short. Therefore, the deviation S(i, k) is changed as follows.

S(i, k) = | |t(i) - t(k)| - LM/2 |

As described above, in the present invention, feature parameters are obtained from the input speech and syllable boundary position candidates are obtained from these feature parameters, while the estimated average syllable length is obtained from the input speech. Based on this estimated average syllable length and the frame counts of V2, C1, and V1, the reliability of each syllable boundary position candidate as a syllable boundary is obtained, and the syllable boundary positions are determined based on this reliability to cut out the syllable patterns. Therefore, syllable patterns can be cut out accurately regardless of differences in speaking rate arising from the characteristics of individual operators.

Moreover, with the present invention there is no need to echo back the speech waveform corresponding to the syllable cutout section for the operator to confirm, so the burden on the operator and the time required for registration work are reduced. Syllable boundary positions can therefore be determined easily and accurately.

In the above embodiment, the utterance content of the spoken word is input in romanized form, but the present invention is not limited to this, and kana may be input instead.

In the above embodiment, the present invention is used when registering standard patterns. However, the invention is not limited to this and can also be used, for example, when recognizing input speech. In that case, in contrast to the above embodiment, the phoneme strings of the multiple recognition candidates obtained are input to the VCV generation unit 3 as the utterance content of the spoken word, and the recognition candidate that is consistent with the syllable boundaries of the input speech is output as the recognition result.

<Effects of the Invention>
As is clear from the above, the syllable pattern cutout device of the present invention comprises: a speech analysis unit that extracts feature parameters of the input speech and detects the section lengths of phonemes; a syllable boundary position detection unit that detects syllable boundary position candidates in the input speech based on the feature parameters; an average syllable length estimation unit that obtains an estimated average syllable length based on the speech section length of a word whose utterance content is known and the number of syllables constituting that word; a vowel-consonant-vowel sequence generation unit that generates the vowel-consonant-vowel sequence constituting the known word; a syllable boundary position candidate evaluation unit that evaluates, for each syllable boundary position candidate, the proportion of phoneme section length corresponding to the vowel-consonant-vowel sequence that lies within a predetermined range determined from the estimated average syllable length; and a syllable boundary determination unit that performs DP matching in which, for each pair of a syllable boundary position candidate and a vowel-consonant-vowel sequence, the evaluation result of that proportion is the grid point value and the difference between the length between syllable boundary position candidates and the estimated average syllable length is the weight of the path connecting grid points, and that determines the syllable boundaries from among the candidates based on the DP matching result. As a result, the syllable boundary position candidate with the highest reliability as a syllable boundary among the detected candidates can be determined automatically as the syllable boundary. Therefore, according to the present invention, syllable boundaries can be determined easily and accurately without requiring echo-back audio output.

[Brief description of drawings]

FIG. 1 is a block diagram of the syllable cutout device according to the present invention, FIG. 2 is a diagram showing an example of VCV sequence generation, FIG. 3 is a diagram showing an example of a DP matching path in syllable boundary determination, and FIG. 4 is a diagram showing an example of syllable cutout.
1: speech analysis unit, 2: syllable boundary detection unit, 3: VCV generation unit, 4: VCV spotter, 5: DP unit, 6: syllable boundary determination unit, 7: average mora length estimation unit.

Claims

[Claims]

1. A syllable pattern cutout device comprising: a speech analysis unit that extracts feature parameters of an input speech and detects the section lengths of phonemes; a syllable boundary position detection unit that detects syllable boundary position candidates in the input speech based on the feature parameters; an average syllable length estimation unit that obtains an estimated average syllable length based on the speech section length of a word whose utterance content is known and the number of syllables constituting the word; a vowel-consonant-vowel sequence generation unit that generates the vowel-consonant-vowel sequence constituting the known word; a syllable boundary position candidate evaluation unit that evaluates, for each syllable boundary position candidate, the proportion of phoneme section length corresponding to the vowel-consonant-vowel sequence that lies within a predetermined range determined based on the estimated average syllable length; and a syllable boundary determination unit that performs DP matching in which, for each pair of a syllable boundary position candidate and a vowel-consonant-vowel sequence, the evaluation result of the proportion is the grid point value and the difference between the length between syllable boundary position candidates and the estimated average syllable length is the weight of the path connecting the grid points, and that determines the syllable boundaries from among the syllable boundary position candidates based on the DP matching result.