JP3110025B2 - Utterance deformation detection device

Info

Publication number: JP3110025B2
Application number: JP01195154A
Authority: JP
Inventors: 真二古賀
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1989-07-27
Filing date: 1989-07-27
Publication date: 2000-11-20
Anticipated expiration: 2015-11-20
Also published as: JPH0358099A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to an utterance deformation detection device that automatically detects, with high performance, utterance deformations occurring in uttered speech.

(Prior Art) Conventional methods for recognizing unknown speech generally compute the similarity between a speech pattern obtained from the unknown speech and each of a plurality of standard models (that is, standard patterns) created from previously uttered speech data, and take the category of the standard model giving the highest similarity as the recognition result. The speech data used to create the standard patterns must therefore have known utterance content, and this requirement becomes stricter when a unit smaller than a word, such as a phoneme, is used as the recognition unit.

On the other hand, even when the same word is uttered, utterance deformations such as devoicing or vowel lengthening may occur depending on the word. For example, the two /u/ phonemes in "applause" /hakusyu/ may be devoiced, and the /ei/ in "movie" /eiga/ may be lengthened to /ee/. As a result, when compared phoneme by phoneme, there are segments that are the same phoneme according to the word's transcription yet have different speech patterns. Hereinafter, "phoneme" means not only the smallest basic unit of speech in the phonological sense but also a broader range of speech units, including syllables and concatenations of multiple phonemes. One method for detecting utterance deformation is visual inspection, as described in the paper by Takeda, Sagisaka, Katagiri, et al., "Phonological Labeling for Speech Database Construction," Proceedings of the 1987 Spring Meeting of the Acoustical Society of Japan, Vol. I, pp. 69-70 (hereinafter, Reference 1). There, phonological labels are assigned to the speech based on spectrograms, waveforms, and the like of the speech data, and utterance deformations are detected in the process.

(Problems to Be Solved by the Invention) When utterance deformations are detected by visual inspection as described in Reference 1, the work becomes very laborious once the amount of speech data grows large. There is also the problem that the detection results may differ depending on the person who performed the detection.

An object of the present invention is to eliminate these drawbacks and to provide a device that automatically detects, with high performance, utterance deformations occurring in uttered speech.

(Means for Solving the Problems) To solve the above problems, the utterance deformation detection device according to the present invention comprises: a feature analysis unit that analyzes a speech signal and outputs a feature vector time series; an utterance deformation information detection unit that extracts the phoneme names of phonemes in the speech signal that may undergo utterance deformation, together with a plurality of expected utterance deformation patterns drawn from utterance deformation patterns stored in advance, and outputs them as utterance deformation information; a standard model storage unit that stores in advance standard models in units of phonemes; a deformed phoneme extraction unit that, based on the feature vector time series, the utterance deformation information, and the standard models stored in the standard model storage unit, extracts the phoneme names and position information of the phonemes that may undergo utterance deformation and outputs them as deformed phoneme information; a deformed phoneme information storage unit that stores the deformed phoneme information; and an utterance deformation detection unit that, based on the feature vector time series, the deformed phoneme information stored in the deformed phoneme information storage unit, and the standard models stored in the standard model storage unit, determines which of the plurality of utterance deformation patterns extracted by the utterance deformation information detection unit has occurred.

(Operation) The operation of the utterance deformation detection device according to the present invention is described below.

For uttered input speech, the present invention cuts out the speech segment corresponding to each phoneme that may undergo utterance deformation (hereinafter, a deformable phoneme) and automatically detects whether utterance deformation occurred in the input speech from the standard models for that phoneme and the speech pattern of that segment.

To detect utterance deformation in the input speech, the utterance deformation patterns that may occur for its utterance content must first be obtained. For many utterance deformations, especially those due to allophones, the likelihood of deformation can be expressed as rules based on the context of the preceding and following phonemes. Examples include "the vowels /i/ and /u/ are easily devoiced when they fall between voiceless consonants or between a voiceless consonant and the end of a word" and "the diphthongs /ei/ and /ou/ are easily lengthened to /ee/ and /oo/, respectively." Patterns generated by these rules, together with other patterns empirically known to undergo utterance deformation, are taken as the utterance deformation patterns.
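The two example rules can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than part of the patent: the function name `candidate_patterns`, the contents of the `VOICELESS` set, and the convention of writing a devoiced vowel in upper case.

```python
from itertools import product

# Assumed, incomplete set of voiceless consonants (illustration only).
VOICELESS = {"k", "s", "t", "h", "p", "sy", "ts", "ty", "f"}
DIPHTHONGS = {("e", "i"), ("o", "u")}


def candidate_patterns(phonemes):
    """Return every utterance deformation pattern the two rules allow.

    Each position contributes one or two alternatives; the result is the
    Cartesian product of those alternatives, flattened to phoneme lists.
    """
    options = []
    i, n = 0, len(phonemes)
    while i < n:
        p = phonemes[i]
        nxt = phonemes[i + 1] if i + 1 < n else None
        if (p, nxt) in DIPHTHONGS:
            # Rule: /ei/ -> /ee/ and /ou/ -> /oo/ (vowel lengthening).
            options.append([(p, nxt), (p, p)])
            i += 2
            continue
        if (p in ("i", "u") and i > 0 and phonemes[i - 1] in VOICELESS
                and (nxt is None or nxt in VOICELESS)):
            # Rule: /i/, /u/ between voiceless consonants, or between a
            # voiceless consonant and the word end, may be devoiced.
            options.append([(p,), (p.upper(),)])
        else:
            options.append([(p,)])
        i += 1
    return [[ph for chunk in combo for ph in chunk]
            for combo in product(*options)]
```

For the phoneme sequence `["e", "i", "g", "a"]` this yields the two patterns /eiga/ and /eega/; for `["h", "a", "k", "u", "sy", "u"]` it yields four patterns, one for each combination of the two /u/ phonemes being voiced or devoiced.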

To find the position of a deformable phoneme in the input speech, for example, standard models in units of phonemes (hereinafter, phoneme models) are concatenated according to each of the utterance deformation patterns corresponding to the utterance content of the input speech, creating a model for each pattern (hereinafter, a deformation model). For example, the deformation models corresponding to the utterance content "movie" are the two models /eiga/ and /eega/. As the phoneme model, for example, a hidden Markov model (hereinafter, HMM) can be used, as described in S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074 (hereinafter, Reference 2). An HMM is a kind of state transition network in which a state transition probability and a vector output probability are defined for each state. The HMM parameters are estimated from training speech by the forward-backward algorithm described in Reference 2.
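As a rough sketch of how a deformation model might be assembled, the block below concatenates left-to-right phoneme HMMs in the order given by one deformation pattern. The `State` class, the discrete emission tables, and the `owner` list that records which pattern position each state came from are all assumptions made for illustration; the patent does not specify the HMM topology or the output distributions.

```python
from dataclasses import dataclass


@dataclass
class State:
    stay: float  # self-loop probability; 1 - stay is the advance probability
    emit: dict   # assumed discrete output table: symbol -> probability


def build_deformation_model(pattern, phoneme_models):
    """Concatenate phoneme models in the order given by `pattern`.

    pattern        -- list of phoneme names, e.g. ["ee", "g", "a"]
    phoneme_models -- dict mapping phoneme name -> list of State
    Returns (states, owner), where owner[j] is the index in `pattern`
    of the phoneme that state j belongs to.
    """
    states, owner = [], []
    for pos, name in enumerate(pattern):
        for state in phoneme_models[name]:
            states.append(state)
            owner.append(pos)
    return states, owner
```

Keeping the `owner` list alongside the states is one simple way to recover, after alignment, which frames were absorbed by the deformable phoneme model.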

After the deformation models are created, the position of the deformable phoneme is found for each model using the feature vector time series obtained from the input speech (for "movie," the positions of /ei/ and /ee/ are found). The feature vector time series can be obtained, for example, by the mel-cepstrum method or by LPC analysis, as described on pages 154-160 of Furui, "Digital Speech Processing," Tokai University Press, 1985 (hereinafter, Reference 3).
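To make the idea of a feature vector time series concrete, here is a simplified sketch that computes a plain real cepstrum for each windowed frame. It is not the mel-cepstrum method of Reference 3: the mel frequency warping is omitted, and the frame length, hop size, Hamming window, and function names are assumptions chosen for illustration.

```python
import cmath
import math


def real_cepstrum(frame, n_coef):
    """Low-order real cepstrum of one frame via a naive O(N^2) DFT."""
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    # A small floor avoids log(0) for empty spectral bins.
    log_mag = [math.log(abs(s) + 1e-10) for s in spectrum]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n for q in range(n_coef)]


def feature_vectors(signal, frame_len, hop, n_coef):
    """Split the signal into overlapping frames and return one
    cepstral vector per frame (the feature vector time series)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [x * w for x, w in zip(signal[start:start + frame_len], window)]
        features.append(real_cepstrum(frame, n_coef))
    return features
```

A practical system would use an FFT and a mel-scaled analysis; the naive DFT here only keeps the sketch free of external dependencies.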

The position of the deformable phoneme for a given deformation model is obtained, for example, by using the Viterbi algorithm described in Reference 2 to find the optimal state transition path through the model, and taking the segment of the input speech that corresponds to the portion of that path passing through the phoneme model for the deformable phoneme (hereinafter, the deformable phoneme model).
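The alignment step can be sketched as follows, assuming a strictly left-to-right model with discrete observation symbols in place of real feature vectors. The argument layout (`stay`, `emit`, `owner`) and the function names are hypothetical; the path is forced to start in the first state and end in the last one.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi_align(obs, stay, emit):
    """Best state path through a left-to-right HMM (log domain).

    obs  -- list of observation symbols, one per frame
    stay -- stay[j] is the self-loop probability of state j
    emit -- emit[j] maps a symbol to its output probability in state j
    Returns (path, log_score) with path[t] the state index at frame t.
    """
    n, frames = len(stay), len(obs)
    delta = [[NEG_INF] * n for _ in range(frames)]
    back = [[0] * n for _ in range(frames)]
    delta[0][0] = _log(emit[0].get(obs[0], 0.0))
    for t in range(1, frames):
        for j in range(n):
            best, arg = delta[t - 1][j] + _log(stay[j]), j
            if j > 0:
                cand = delta[t - 1][j - 1] + _log(1.0 - stay[j - 1])
                if cand > best:
                    best, arg = cand, j - 1
            delta[t][j] = best + _log(emit[j].get(obs[t], 0.0))
            back[t][j] = arg
    path = [n - 1]  # the path must end in the final state
    for t in range(frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, delta[frames - 1][n - 1]


def segment_of(path, owner, pos):
    """First and last frame index aligned to the phoneme at position
    `pos`, where owner[j] gives the phoneme position of state j."""
    frames = [t for t, j in enumerate(path) if owner[j] == pos]
    return frames[0], frames[-1]
```

With one state per phoneme and `owner = [0, 1, 2]`, calling `segment_of(path, owner, 1)` returns the start and end frames occupied by the middle phoneme, which corresponds to the position information described above.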

Whether utterance deformation occurred can be determined, for example, as follows. For each deformable phoneme model (for "movie," the phoneme models for /ei/ and /ee/), the forward-backward algorithm or the Viterbi algorithm is used to compute the probability of the speech pattern in the speech segment previously obtained for that model, and the model with the highest probability is judged to be the phoneme of that segment.
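The decision step can be sketched with the forward pass of the forward-backward algorithm. As before, the left-to-right topology, the discrete output tables, and the names `forward_log_prob` and `detect_deformation` are assumptions for illustration. Each candidate carries its own segment, since the segment boundaries are found separately for each deformation model.

```python
import math


def forward_log_prob(obs, stay, emit):
    """Log probability that a left-to-right HMM generates `obs`,
    starting in the first state and ending in the last state."""
    n = len(stay)
    alpha = [0.0] * n
    alpha[0] = emit[0].get(obs[0], 0.0)
    for symbol in obs[1:]:
        new = [0.0] * n
        for j in range(n):
            p = alpha[j] * stay[j]
            if j > 0:
                p += alpha[j - 1] * (1.0 - stay[j - 1])
            new[j] = p * emit[j].get(symbol, 0.0)
        alpha = new
    return math.log(alpha[-1]) if alpha[-1] > 0 else float("-inf")


def detect_deformation(candidates):
    """Pick the deformable phoneme model with the highest probability.

    candidates -- dict: phoneme name -> (segment, stay, emit), where
                  segment is the observation subsequence cut out for
                  that model.
    Returns (best_name, scores).
    """
    scores = {name: forward_log_prob(seg, stay, emit)
              for name, (seg, stay, emit) in candidates.items()}
    return max(scores, key=scores.get), scores
```

Plain probabilities are used here for brevity and will underflow on long segments; a scaled or log-domain forward pass would be needed in practice.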

(Embodiment) Next, an embodiment of the utterance deformation detection device according to the present invention is described with reference to the drawing.

FIG. 1 is a block diagram showing one embodiment of the present invention.

The standard model storage unit 3 holds phoneme models M that use HMMs as described in Reference 2. These can be created from a large amount of speech data by the forward-backward algorithm described in Reference 2.

The input speech signal S is supplied to the feature analysis unit 1 and the utterance deformation information detection unit 2.

The feature analysis unit 1 converts the speech signal S into a feature vector time series V using the mel-cepstrum method described in Reference 3.

The utterance deformation information detection unit 2 derives, according to rules, the utterance deformation patterns that may occur for the utterance content of the speech signal S, and outputs them, together with the names of the deformable phonemes in the utterance content, as utterance deformation information P.

Alternatively, the utterance deformation patterns can be obtained by preparing in advance a memory that covers the utterance deformation patterns of all speech to be input, and extracting the required patterns from that memory.

The deformed phoneme extraction unit 4 receives the feature vector time series V, the utterance deformation information P, and the phoneme models M held in the standard model storage unit 3. For each utterance deformation pattern, it concatenates phoneme models M to create a deformation model and uses the Viterbi algorithm described in Reference 2 to find the optimal state transition path for the feature vector time series V. The start and end of the segment of V corresponding to the portion of that path occupied by the deformable phoneme model are taken as the position information of the deformable phoneme and are output, together with the deformable phoneme name, as deformed phoneme information I.

This deformed phoneme information I is stored in the deformed phoneme information storage unit 5.

The utterance deformation detection unit 6 receives the deformed phoneme information I′ stored in the deformed phoneme information storage unit 5, the feature vector time series V of the input speech signal, and the phoneme models M. For each phoneme model corresponding to a deformable phoneme name in I′, it uses the forward-backward algorithm to compute the probability of the subsequence of V cut out according to the position information of the deformable phoneme for that model in I′. The model with the highest probability is judged to be the phoneme of that segment, and its phoneme name R is output as the detection result.

(Effects of the Invention) As described above, the present invention cuts out from the input speech the speech segment for each phoneme that may undergo utterance deformation and automatically detects whether utterance deformation occurred from the standard models for that phoneme and the speech pattern of that segment. It can therefore realize a high-performance utterance deformation detection device that reduces the work of the person performing the detection.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention. 1: feature analysis unit, 2: utterance deformation information detection unit, 3: standard model storage unit, 4: deformed phoneme extraction unit, 5: deformed phoneme information storage unit, 6: utterance deformation detection unit.

Continuation of front page (56) References: JP-A-63-5395 (JP, A); JP-A-63-205699 (JP, A); JP-A-1-126694 (JP, A); The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074

Claims

(57) [Claims]

1. An utterance deformation detection device comprising: a feature analysis unit that analyzes a speech signal and outputs a feature vector time series; an utterance deformation information detection unit that extracts the phoneme names of phonemes in the speech signal that may undergo utterance deformation, together with a plurality of expected utterance deformation patterns drawn from utterance deformation patterns stored in advance, and outputs them as utterance deformation information; a standard model storage unit that stores in advance standard models in units of phonemes; a deformed phoneme extraction unit that, based on the feature vector time series, the utterance deformation information, and the standard models stored in the standard model storage unit, extracts the phoneme names and position information of the phonemes that may undergo utterance deformation and outputs them as deformed phoneme information; a deformed phoneme information storage unit that stores the deformed phoneme information; and an utterance deformation detection unit that, based on the feature vector time series, the deformed phoneme information stored in the deformed phoneme information storage unit, and the standard models stored in the standard model storage unit, determines which of the plurality of utterance deformation patterns extracted by the utterance deformation information detection unit has occurred.