JP2816163B2

JP2816163B2 - Speaker verification method

Info

Publication number: JP2816163B2
Application number: JP63282031A
Authority: JP
Inventors: 博喜内山; 博雄北川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-01-20
Filing date: 1988-11-08
Publication date: 1998-10-27
Anticipated expiration: 2013-10-27
Also published as: JPH02236599A

Description

[Field of industrial application]

The present invention relates to the field of speaker recognition, and more particularly to a speaker verification method that identifies a speaker by voice.

[Conventional technology]

In speaker recognition, directly comparing the input speech waveform with a registered waveform is generally inefficient, so both are first converted into acoustic parameters such as the frequency spectrum or linear prediction coefficients and then compared. Other acoustic parameters include the fundamental (pitch) frequency, speech energy, formant frequencies, PARCOR coefficients, log area ratios, and the zero-crossing count.

These acoustic parameters carry phonological information primarily and speaker-individuality information only secondarily. For speaker verification, new speaker-specific features are therefore derived from them. Conventionally, the input speech is divided into frames of several milliseconds, and the acoustic parameters obtained for each frame (spectrum, cepstrum) are either averaged along the time axis over the entire voiced section or subjected as a whole to statistical analysis, to convert them into speaker-specific features.

A known document related to speaker verification is, for example, JP-A-61-278896.

[Problems to be solved by the invention]

Averaging the parameters along the time axis has the advantages that processing is relatively simple and the features (dictionary capacity) are small. However, a relatively long speech sample is needed to obtain stable features, and because summing time-series acoustic parameters along the time axis compresses away the temporal information, the approach is unsuited to text-dependent speaker verification. Moreover, averaging out the phonological information also averages out the individuality information superimposed on the phonemes, so the resulting features do not adequately reflect individuality.

An object of the present invention is to provide a speaker verification method that identifies a speaker by faithfully extracting the individuality information contained in the speech signal.

[Means for solving the problem]

The present invention divides the input speech signal into frames of a predetermined duration and calculates acoustic parameters for each frame; detects the speech section from the input speech signal; divides this speech section along the time axis into a plurality of blocks according to the values of the acoustic parameters; generates a feature for each block, for example by averaging the acoustic parameters along the time axis; and performs the verification decision using these features. In dividing the speech section, voiced, unvoiced, and silent sections are discriminated using the power of the input speech, the spectrum change rate, the pitch, and the like, and each extracted voiced section is treated as one block.

The present invention is further characterized by a preprocessing step at registration in which the uttered speech is divided into blocks in advance by its power, power dips, spectrum change rate (dynamic measure), and the like, and by dividing the input speech signal into blocks on the basis of the reference blocks obtained by this preprocessing.

The present invention is further characterized in that the spectrum is used as the acoustic parameter. For block division at registration, each of a predetermined number of utterances is first divided into a plurality of blocks according to the value of the first moment of its spectrum; the blocks of the utterance whose division is best are adopted as the reference blocks; and all the registration utterances are then divided into blocks again on the basis of these reference blocks. At verification, the unknown speech is divided into blocks on the basis of the reference blocks created at registration.

The present invention is further characterized in that the pitch period is added to the features, and the speaker is identified by a combined distance decision over spectrum and pitch.

In extracting the pitch period, the input speech signal is passed through a low-pass filter to remove high-frequency components and is then binarized with each of a plurality of thresholds; for each threshold, a counter measures the time intervals between transitions of the binarized signal in a specific direction; and the pitch period or frequency is estimated from the measured intervals.
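The extraction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit of the patent: a moving average stands in for the low-pass filter, sample counts stand in for the hardware counter, and taking the median of the measured intervals is an assumption, since the text does not say how the intervals are combined. The names `low_pass`, `rising_intervals`, `estimate_pitch_period`, and `sine_wave` are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence


def low_pass(signal: Sequence[float], width: int) -> List[float]:
    """Causal moving average over up to `width` samples (assumed stand-in filter)."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def rising_intervals(signal: Sequence[float], threshold: float) -> List[int]:
    """Sample counts between successive upward crossings of `threshold`.

    An upward crossing is a 0 -> 1 transition of the binarized signal.
    """
    crossings = [i for i in range(1, len(signal))
                 if signal[i - 1] < threshold <= signal[i]]
    return [b - a for a, b in zip(crossings, crossings[1:])]


def estimate_pitch_period(signal: Sequence[float],
                          thresholds: Sequence[float],
                          width: int = 5) -> float:
    """Estimate the pitch period in samples from intervals at several thresholds."""
    smoothed = low_pass(signal, width)
    intervals: List[int] = []
    for th in thresholds:
        intervals.extend(rising_intervals(smoothed, th))
    if not intervals:
        raise ValueError("no repeated threshold crossings found")
    # Assumption: combine the per-threshold measurements with a median.
    return median(intervals)


def sine_wave(period: int, n_samples: int) -> List[float]:
    """Synthetic tone with a known period in samples, for trying the sketch."""
    return [math.sin(2 * math.pi * n / period) for n in range(n_samples)]
```

For a synthetic tone, `estimate_pitch_period(sine_wave(40, 400), [0.1, 0.2, 0.4])` recovers the 40-sample period. Using several thresholds makes the estimate less sensitive to a single threshold that happens to catch a secondary peak of the waveform.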

[Action]

In the present invention, the speech section of the input speech is divided along the time axis into several blocks and a feature is calculated for each block. This adds information on temporal change to the features and also emphasizes the individuality information.

With this alone, however, block division relies on the power, power dips, spectrum change rate, and the like of the speech, so fluctuations in the power of the input speech from one utterance to the next make the extracted division positions unstable, and the block positions differ between utterances. Therefore, as preprocessing for block division, the speech uttered at registration is divided into blocks in advance by its power and the like, and the input speech signal is divided into blocks on the basis of the reference blocks so obtained, which stabilizes the block division.

In particular, applying to block division a duration-controlled state transition model that uses the state of the first moment of the speech spectrum allows the blocks to be roughly matched to one another, which yields stable block division. The amount of computation is also far smaller than for block division by DP matching or the like.

Next, the present invention improves the verification rate by adding the speech pitch period to the features. To describe an individual's characteristics, it is important to include pitch in addition to the spectrum, which is a phonological parameter, and doing so is expected to improve the verification rate. However, for an acoustic parameter such as pitch, in which both the value and its pattern of change over time carry individuality, an average cannot express the variation within a block and is not an effective feature transformation. The present invention therefore uses, as the pitch feature, a histogram of the pitch period for each block. The value of the pitch period and the way it changes over time are thereby reflected in the feature itself, giving a feature that is effective for speaker verification and improving the verification rate.
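The per-block pitch histogram described above might be computed as in the sketch below. The bin range, the bin count, and the normalization to relative frequencies are assumptions made for illustration; the text only states that a histogram of the pitch period is used for each block. The function name `pitch_histogram` is hypothetical.

```python
from typing import List, Sequence


def pitch_histogram(periods: Sequence[float], lo: float, hi: float,
                    n_bins: int) -> List[float]:
    """Relative-frequency histogram of pitch periods falling in [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in periods:
        if lo <= p < hi:
            # min() guards against a rounding error pushing p into bin n_bins.
            counts[min(int((p - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    # Assumption: normalize so blocks of different lengths are comparable.
    return [c / total for c in counts]
```

Two blocks with the same mean pitch period but different variation, such as `[4, 4, 4, 4]` and `[2, 6, 2, 6]`, have identical averages yet clearly different histograms, which is the point the paragraph makes about averages.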

Furthermore, accurate pitch determination generally requires complex processing, but applying a simple pitch extraction method that combines a low-pass filter with counters makes it possible to reduce the scale of the speaker verification system as a whole.

Here, the first moment of the spectrum is used as the feature for block division, and the average spectrum and the pitch histogram are used as the individual's features; the LPC cepstrum may be used in place of the spectrum.

[Embodiments]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram of a first embodiment, used to explain the basic principle of the present invention. This embodiment comprises a microphone 1, an acoustic parameter conversion unit 11, a speech section detection unit 12, a sound/silence discrimination unit 13, a sound section extraction unit 14, a feature generation unit 15, a standard pattern storage unit 16, a distance calculation unit 17, a decision unit 18, and a control target 2.

The speech signal input from the microphone 1 is converted by the acoustic parameter conversion unit 11 into a time series of acoustic parameters, one set per frame. The acoustic parameter conversion unit 11 may, for example, pass the input speech through a low-pass filter (LPF) to cut components at or above half the sampling frequency, quantize it into a discrete signal sequence with an analog-to-digital converter, cut this into short-time waveforms, multiply each by a Hamming window or the like, and convert the result into various features (spectrum, cepstrum, PARCOR, pitch, and so on). Alternatively, it may use a bank of band-pass filters to obtain the acoustic parameters as spectrum time-series information. The description here uses an example with a bank of 29 band-pass filter channels whose center frequencies are spaced 1/6 octave apart from 250 to 6300 Hz.

The acoustic parameter conversion unit 11 divides the input speech signal into frames of 10 msec and converts them successively into a spectrum time-series pattern f ({f_1, f_2, ..., f_29}). The spectrum f_ij of the frame at time i is expressed as

$$f_{ij} = (f_{i1}, f_{i2}, \ldots, f_{i29}) \qquad (j = 1, 2, \ldots, 29)$$

The speech section detection unit 12 detects the speech section in the input speech signal using the time-series pattern obtained by the acoustic parameter conversion unit 11.

The sound/silence discrimination unit 13 and the sound section extraction unit 14 use the time-series vectors f_ij to divide the speech section into several blocks, each sounded section forming one block. FIG. 2 shows the procedure for division by the power of the input speech signal. One frame is input (step S1); the 29 spectrum values obtained for that frame are summed to give the frame power (step S2); this power is compared with a preset threshold to classify each frame of the speech section as sounded or silent (steps S3, S4); and each run of consecutive sounded frames is taken as one block (step S5). Steps S1 to S4 are handled by the sound/silence discrimination unit 13, and step S5 by the sound section extraction unit 14.
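Steps S1 to S5 can be sketched in Python as below. This is a minimal sketch: the function name `split_by_power` is hypothetical, and treating a frame whose power equals the threshold as sounded is an assumption, since the text only says the power is compared with the threshold.

```python
from typing import List, Sequence, Tuple


def split_by_power(frames: Sequence[Sequence[float]],
                   threshold: float) -> List[Tuple[int, int]]:
    """Return one (start, end) frame-index pair per sounded block, end exclusive."""
    blocks: List[Tuple[int, int]] = []
    start = None
    for i, spectrum in enumerate(frames):      # S1: take the next frame
        power = sum(spectrum)                  # S2: frame power = sum of channels
        sounded = power >= threshold           # S3, S4: sounded or silent
        if sounded and start is None:
            start = i                          # a sounded run begins
        elif not sounded and start is not None:
            blocks.append((start, i))          # S5: a sounded run ends -> one block
            start = None
    if start is not None:                      # run still open at the last frame
        blocks.append((start, len(frames)))
    return blocks
```

With 29-channel frames from the filter bank, each returned pair delimits one block; the toy call `split_by_power([[0, 0], [1, 1], [2, 2], [0, 0], [3, 3], [0, 0]], 1.0)` yields two blocks, frames 1 to 2 and frame 4.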

FIG. 3 shows a concrete example of block division. Part (a) shows the time spectrum pattern (TSP) of an utterance, and part (b) shows the relationship between the change over time of the power in (a) and the threshold.

With the threshold in (b), the speech section is divided into three blocks of sounded sections, I, II, and III. Part (c) shows the positions of the sections indicated in (b) on the time spectrum pattern.

The feature generation unit 15 obtains a feature for each block extracted by the sound section extraction unit 14. Here, the time-direction average of the spectrum time series within the block is used as the feature. For example, let the spectrum sequence belonging to the k-th block be

$$f_{kij} = (f_{ki1}, f_{ki2}, \ldots, f_{ki29})$$

where k is the block index, i is the time within the k-th block, and j is the channel number of the band-pass filter. The feature X_kavr of block k is then

$$X_{kavr} = (\bar{f}_{k1}, \bar{f}_{k2}, \ldots, \bar{f}_{k29}), \qquad \bar{f}_{kj} = \frac{1}{m} \sum_{i=1}^{m} f_{kij}$$

where m is the number of frames in the block and \bar{f}_{kj} (j = 1, 2, ..., 29) is the average spectrum of each channel. In addition, X_kavr may be regarded as a single spectrum, its least-squares line calculated, and the slope of that line or the like added as an element of the feature vector.
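The block feature just described, a per-channel time average plus the optional slope of the least-squares line through the averaged spectrum, can be sketched as follows. The function names are hypothetical, and using the channel index 0, 1, 2, ... as the x-axis of the line fit is an assumption; the text does not state which frequency scale the line is fitted against.

```python
from typing import List, Sequence


def block_average(block: Sequence[Sequence[float]]) -> List[float]:
    """Average each channel over the m frames of one block."""
    m = len(block)
    n_channels = len(block[0])
    return [sum(frame[j] for frame in block) / m for j in range(n_channels)]


def spectral_slope(avg: Sequence[float]) -> float:
    """Slope of the least-squares line through (channel index, average value).

    Needs at least two channels; the x-axis is the channel index (assumed).
    """
    n = len(avg)
    mean_x = (n - 1) / 2
    mean_y = sum(avg) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(avg))
    return sxy / sxx


def block_feature(block: Sequence[Sequence[float]]) -> List[float]:
    """Averaged spectrum with the slope appended as one extra element."""
    avg = block_average(block)
    return avg + [spectral_slope(avg)]
```

For a two-frame, three-channel toy block `[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]`, the averaged spectrum is `[2.0, 3.0, 4.0]` and the fitted slope is 1.0 per channel.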

At registration, the feature obtained by the feature generation unit 15 is stored in the standard pattern storage unit 16 as the standard feature (dictionary feature), with the speaker's file name attached. At verification, the distance calculation unit 17 calculates the distance between the feature of the unknown speaker obtained by the feature generation unit 15 and the feature of the registered speaker stored in the standard pattern storage unit 16 for each pair of corresponding blocks, and combines these over all blocks, for example by a weighted average. The distance may be the Euclidean distance, the Mahalanobis distance, or the like. The dimensionality of the features may also be reduced beforehand using the F ratio (ratio of between-speaker to within-speaker variance) or the like.

The decision unit 18 decides on the speaker by comparing the distance obtained by the distance calculation unit 17 with a threshold set in advance for each speaker. The control target 2 is controlled according to the result. Possible control targets 2 include a banking service system, a room-entry management system using a voice key, and response devices such as toys.
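The block-wise distance calculation and threshold decision of units 17 and 18 can be sketched as follows. This sketch uses the Euclidean distance, one of the options the text names; the weights and the rule that a distance equal to the threshold is accepted are assumptions, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def utterance_distance(unknown: Sequence[Sequence[float]],
                       enrolled: Sequence[Sequence[float]],
                       weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of the distances between corresponding blocks."""
    if len(unknown) != len(enrolled):
        raise ValueError("both utterances must have the same number of blocks")
    if weights is None:
        weights = [1.0] * len(unknown)          # assumption: equal weights
    total = sum(w * euclidean(u, e)
                for w, u, e in zip(weights, unknown, enrolled))
    return total / sum(weights)


def accept(distance: float, speaker_threshold: float) -> bool:
    """Accept the claimed identity when the distance is within the threshold."""
    return distance <= speaker_threshold
```

Because each block is compared only with its counterpart, the sketch presupposes that the unknown and enrolled utterances were divided into the same number of blocks.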

FIG. 4 is a block diagram of a second embodiment of the present invention. In this embodiment, only the voiced sections are extracted from the speech section of the input speech signal detected by the speech section detection unit 22, using a voiced/unvoiced/silence discrimination unit 23 and a voiced section extraction unit 24, and each such section is taken as one block. Everything else is the same as in FIG. 1. Power, spectrum change rate (dynamic measure), pitch, and the like are used to discriminate voiced, unvoiced, and silent sections.

FIG. 5 shows the relationship between a speech pattern divided into voiced, unvoiced, and silent sections and the extracted blocks. In FIG. 5, V denotes a voiced section, Vl an unvoiced section, and S a silent section; the V sections are taken as the blocks.

FIG. 6 is a block diagram of a third embodiment of the present invention. Whereas the embodiments of FIG. 1 and FIG. 4 extract only the sounded or voiced sections of the speech section and consider only those sections, this embodiment detects the power dips of the input speech signal and compares them with a preset threshold to divide the entire speech section into several blocks. Specifically, the power dip detection unit 33 detects the power dips, and the block detection unit 34 first compares them with the threshold set in advance by the threshold setting unit 39. The speech section is then cut at the positions where a power dip is below the threshold, and the resulting sections are taken as the blocks. Everything else is the same as in FIG. 1.

FIG. 7 shows a concrete example. In FIG. 7(a), four power dips are detected, of which three are selected by the threshold. The speech section is therefore divided at the positions of these three dips. FIG. 7(b) shows the resulting blocks.

FIG. 8 is a block diagram of a fourth embodiment of the present invention. Whereas the embodiment of FIG. 6 divides the entire speech section into blocks by detecting the power dips of the input speech and comparing them with a preset threshold, this embodiment uses the spectrum change rate (dynamic measure) instead of the power dip and divides the entire speech section by detecting its local peaks. Specifically, the spectrum change rate calculation unit 43 first calculates the change rate of the spectrum of the input speech signal, and the local peak detection unit 44 detects its local peaks. The block division unit 45 compares the detected local peaks with a predetermined threshold from the threshold setting unit 50, cuts the speech section at the positions where a local peak exceeds the threshold, and sets the resulting sections as the blocks. The remaining processing of the features is the same as in FIG. 1.

FIG. 9 shows a concrete example. Numerous local peaks are detected, but six are selected by the threshold. The speech section is therefore divided into seven blocks, with the positions of these six peaks as the division points. In FIG. 9, I to VII denote the resulting blocks.
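The peak-based division of the fourth embodiment can be sketched as below. The text does not define the spectrum change rate, so the Euclidean distance between consecutive frame spectra is used here purely as an assumed stand-in; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def change_rate(frames: Sequence[Sequence[float]]) -> List[float]:
    """Assumed change rate: distance between each frame and the previous one."""
    rates = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        rates.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    return rates


def local_peaks(values: Sequence[float]) -> List[int]:
    """Indices whose value rises from the left and does not rise to the right."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]


def split_at_peaks(frames: Sequence[Sequence[float]],
                   threshold: float) -> List[Tuple[int, int]]:
    """Cut the speech section at local peaks above `threshold`.

    N selected peaks give N + 1 blocks, returned as (start, end), end exclusive.
    """
    rate = change_rate(frames)
    cuts = [i for i in local_peaks(rate) if rate[i] > threshold]
    edges = [0] + cuts + [len(frames)]
    return list(zip(edges, edges[1:]))
```

As in the FIG. 9 example, where six selected peaks give seven blocks, the number of blocks is always one more than the number of peaks that exceed the threshold.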

FIG. 10 is a block diagram of a fifth embodiment of the present invention. The embodiments of FIG. 6 and FIG. 8 detect the power dips of the input speech or the local peaks of the spectrum change rate, compare them with a predetermined threshold to place cuts in the speech section, and divide the speech section into blocks at these cut positions. In this embodiment, by contrast, the number of blocks is set in advance in the threshold adjustment unit 60, and the threshold is varied so that this number of blocks is obtained. The rest of the processing is the same as in FIG. 1.

FIG. 11 shows the processing procedure of the threshold adjustment unit 60. Suppose that a predetermined number of blocks, for example 7 (anb = 7), is set in the threshold adjustment unit 60. The block division unit 55 performs block division by comparing the detected local peaks of the spectrum change rate with a threshold (Th). If the resulting number of blocks (nb) is greater than 7, the threshold adjustment unit 60 raises the threshold; if it is less than 7, it lowers the threshold. The threshold is changed in this way and the division repeated until the number of blocks becomes 7.
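The adjustment loop of FIG. 11 might look like the following sketch. The text does not give the step size or a stopping rule other than reaching the target, so the fixed initial step, the halving of the step whenever the direction reverses, and the iteration limit are all assumptions, as is the function name `adjust_threshold`. The input is the list of heights of the detected local peaks.

```python
from typing import Sequence


def adjust_threshold(peak_heights: Sequence[float], target_blocks: int,
                     threshold: float, step: float,
                     max_iter: int = 100) -> float:
    """Vary the threshold until the peaks above it give `target_blocks` blocks."""
    last_direction = 0
    for _ in range(max_iter):
        # N peaks above the threshold cut the speech section into N + 1 blocks.
        n_blocks = sum(1 for h in peak_heights if h > threshold) + 1
        if n_blocks == target_blocks:
            return threshold
        # Too many blocks -> raise the threshold; too few -> lower it.
        direction = 1 if n_blocks > target_blocks else -1
        if last_direction and direction != last_direction:
            step /= 2          # assumption: shrink the step after overshooting
        threshold += direction * step
        last_direction = direction
    raise RuntimeError("no threshold found for the requested number of blocks")
```

With eleven peaks of heights 1 to 11, a starting threshold of 0.5 gives 12 blocks; the loop then moves the threshold to 5.5, where six peaks remain and the speech section is divided into 7 blocks. If several peaks share the same height, some block counts are unreachable, which is why the sketch stops after `max_iter` iterations.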

FIG. 12 shows a concrete example of this block division. In part (a), the initial threshold gives 12 blocks; in part (b), changing the threshold as shown gives 7 blocks.

Power dips may be used instead of the spectrum change rate. In that case the threshold is changed in the opposite sense: it is lowered when the number of blocks exceeds the predetermined number and raised when the number of blocks is below it.

FIG. 13 is a block diagram of a sixth embodiment of the present invention. In the embodiments described so far, block division relies on the power, power dips, spectrum change rate, and the like of the speech, so fluctuations in the power of the input speech from one utterance to the next make the extracted division positions unstable, and the block positions differ between utterances. In this embodiment, as preprocessing at registration, the uttered speech is divided into blocks in advance by its power and the like, and the input speech signal is divided into blocks on the basis of the reference blocks so obtained, which stabilizes the block division.

The speech signal input from the microphone 1 is converted by the acoustic parameter conversion unit 61 into a time series of acoustic parameters, one set per frame, and the speech section detection unit 62 uses these features to detect the speech section in the input speech signal. Up to this point the processing is the same as in the preceding embodiments.

The block division unit 64 divides the speech section into several blocks along the time axis. For example, using the spectrum time-series vectors f_ij described above, the 29 spectrum values obtained for each frame are summed to give the frame power, and the speech section is divided by matching this power against the preset reference blocks. The block division preprocessing unit 63, for its part, divides the input speech signal into blocks in the same way by its power and the like as preprocessing at registration, and the result is used as the reference blocks in the subsequent block division by the block division unit 64.

The feature generation unit 65 obtains a feature for each block extracted by the block division unit 64. Here, the time-direction average of the spectrum time series within the block is used as the feature. For example, let the spectrum sequence belonging to the k-th block be

$$f_{kij} = (f_{ki1}, f_{ki2}, \ldots, f_{ki29})$$

where k is the block index, i is the time within the k-th block, and j is the channel number of the band-pass filter. The feature X_kavr of block k is then

$$X_{kavr} = (\bar{f}_{k1}, \bar{f}_{k2}, \ldots, \bar{f}_{k29}), \qquad \bar{f}_{kj} = \frac{1}{m} \sum_{i=1}^{m} f_{kij}$$

where m is the number of frames in the block and \bar{f}_{kj} (j = 1, 2, ..., 29) is the average spectrum of each channel. In addition, X_kavr may be regarded as a single spectrum, its least-squares line calculated, and the slope of that line or the like added as an element of the feature vector.

At registration, the feature obtained by the feature generation unit 65 is stored in the standard pattern storage unit 66 as the standard feature (dictionary), with the speaker's file name attached. At verification, the distance calculation unit 67 calculates the distance between the feature of the unknown speaker obtained by the feature generation unit 65 and the feature of the registered speaker stored in the standard pattern storage unit 66 for each pair of corresponding blocks, and combines these over all blocks, for example by a weighted average. The distance may be the Euclidean distance, the Mahalanobis distance, or the like. The dimensionality of the features may also be reduced beforehand using the F ratio (ratio of between-speaker to within-speaker variance) or the like.

The decision unit 68 decides on the speaker by comparing the distance obtained by the distance calculation unit 67 with a threshold set in advance for each speaker. The control target 2 is controlled according to the result.

The processing in the block division preprocessing unit 63 and the block division unit 64 is described in detail below.

FIG. 14 shows the procedure of one embodiment of the block division preprocessing and the block division. First, as preprocessing for block division, the block division preprocessing unit 63 divides the first utterance into blocks using the power, power dips, spectrum change rate, and the like of that speech, and records the block division points. For the second and subsequent input utterances, the block division unit 64 performs DP matching with the first utterance as the dictionary pattern to find the frame positions that correspond to the block division points, and divides the speech into blocks at those positions.
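The transfer of division points by DP matching can be sketched with a basic dynamic time warping (DTW) alignment. This is an assumed, simplified form of DP matching: it uses the Euclidean frame distance and unconstrained symmetric steps, with no slope constraints or weights, none of which the text specifies. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frames = Sequence[Sequence[float]]


def dtw_path(ref: Frames, inp: Frames) -> List[Tuple[int, int]]:
    """Minimum-cost monotonic alignment of ref frames to input frames."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(ref[i - 1], inp[j - 1])))
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def map_division_points(ref: Frames, inp: Frames,
                        ref_points: Sequence[int]) -> List[int]:
    """Map block start frames of the reference onto frames of the new utterance."""
    first_match: Dict[int, int] = {}
    for i, j in dtw_path(ref, inp):
        first_match.setdefault(i, j)   # first input frame aligned to ref frame i
    return [first_match[p] for p in ref_points]
```

If the reference utterance changes spectrum at frame 2 and a slower repetition of it changes at frame 3, the division point 2 is mapped to 3, so both utterances are cut at the same acoustic event rather than at the same time.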

FIG. 15 shows another embodiment of block division. First, as preprocessing for block division, the block division preprocessing unit 63 divides the first utterance into blocks using the power, power dips, spectrum change rate, and the like of that speech, and obtains a feature for each block, for example by averaging the acoustic parameters within the block along the time axis. For the second and subsequent input utterances, the block division unit 64 uses the per-block features of the first utterance as the dictionary pattern and applies a duration-controlled state transition model or the like to find the frame positions that correspond to the reference blocks, and divides the speech into blocks accordingly. FIG. 16 shows the correspondence between the features of the reference blocks and the frames of the second and subsequent utterances. In the figure, the numbered labels denote the features of the blocks of the standard pattern, and the labels イ to チ denote the blocks of the input speech.

FIG. 17 shows yet another embodiment of block division. First, as in FIG. 15, the block-division preprocessing unit 63 divides the first utterance into blocks using its power, power dips, spectral change rate, and the like, and obtains a feature value for each block, for example by averaging the acoustic parameters within the block along the time axis. When the block division unit 64 divides the second and subsequent input utterances, it first splits each utterance into semi-blocks, using the same power, power-dip, and spectral-change-rate cues but dividing more finely than when the first utterance was divided into blocks. It then obtains a feature value for each semi-block, for example by averaging the acoustic parameters within the semi-block along the time axis. Using the per-block feature values of the first utterance as the dictionary pattern, it performs DP matching or the like against the semi-block feature values to find the semi-block positions corresponding to the reference blocks, and divides the utterance into blocks accordingly. FIG. 18 shows the correspondence between the feature values of the reference blocks and the semi-blocks of the second and subsequent utterances. In that figure, the unprimed numerals denote the feature values of the blocks of the standard pattern, and the primed numerals denote the feature values of the semi-blocks of the input utterance.

FIG. 19 shows yet another embodiment of block division. As preprocessing in the block-division preprocessing unit 63, every utterance spoken at registration is first divided into blocks using its power, power dips, spectral change rate (a dynamic measure), and the like, and a representative number of blocks is obtained, for example by averaging the block counts of the individual utterances. The block division unit 64 then takes the blocks of an utterance pattern that has this representative number of blocks as the reference, and divides the input utterances into blocks both at registration and at verification.

In this case, the blocks may be aligned by DP matching against the speech of the reference blocks. Alternatively, a feature value may be obtained for each reference block, for example by averaging its acoustic parameters along the time axis, and these feature values may be used as the dictionary pattern with a duration-control type state transition model or the like to find the frame positions corresponding to the reference blocks, at which the utterance is then divided. As a further alternative, utterances whose block count differs from the representative number may be divided into semi-blocks in advance; DP matching or the like between the semi-block feature values (obtained, for example, by averaging the acoustic parameters within each semi-block along the time axis) and the reference-block feature values then gives the semi-block positions corresponding to the reference blocks, at which the utterance is divided.

FIG. 20 is a block diagram of a seventh embodiment of the present invention.

In general, when block division relies on acoustic parameters such as power, power dips, and spectral change rate, the processing is complicated, and it is difficult to obtain reliably the features of speech that vary from one utterance to the next. Likewise, when features are created at registration and at verification, it is difficult to divide all utterances into blocks at the same positions, so aligning block positions by linear time warping or DP matching has been considered. These methods, however, establish correspondence strictly frame by frame. They are ill suited to finding the boundaries of a sequence of block regions, and their computational cost is high.

In this embodiment, therefore, the first moment of the spectrum of each speech frame (frames being taken at fixed time intervals) is used. Rather than classifying the speech section into voiced, unvoiced, and silent parts, the first moment itself serves as the criterion: each run of frames over which the first moment is stable is treated as one section (block), which yields division points specific to the speaker and to the uttered speech. Blocks of different utterances of the same speech are then put into correspondence as follows. The average first moment within each preset reference block is taken as the state of that block, and a duration-control type state transition model is applied to divide the first-moment sequence of the input speech section into blocks. With this method, block alignment reduces to aligning states with frames. The number of operations is determined by the number of lattice points, so very few operations are needed. Moreover, the operation at each lattice point is a single subtraction of first moments, which is far cheaper than DP matching and similar methods, where a vector operation is normally performed at every lattice point.

The embodiment of FIG. 20 is described below. As before, the acoustic parameter conversion unit 71 divides the input speech signal from the microphone 1 into frames every 10 msec and converts them successively into a time series of 29-channel spectra. As already explained, the spectrum of the frame at time i is $f_i = (f_{i1}, f_{i2}, \ldots, f_{i29})$, where $f_{ij}$ is the output of channel j (j = 1, 2, …, 29).

The speech section detection unit 72 detects the speech section of the input signal from the time series produced by the acoustic parameter conversion unit 71. The power $Pw_i$ of each frame is computed and compared with a predetermined threshold Thpw, and the section in which the power exceeds Thpw is detected as the speech section. The power $Pw_i$ at time i is obtained from the spectrum of that frame. Next, the first moment calculation unit 73 computes the first moment of the spectrum of each frame. The first moment at time i is

$$m_i = \frac{\sum_{j=1}^{29} j \, f_{ij}}{\sum_{j=1}^{29} f_{ij}}.$$

The value $m_i$ is the position of the center of gravity of the frame spectrum and lies between 1 and 29.
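The two per-frame quantities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the power is taken to be the sum of the channel outputs (the source equation for the power is not reproduced here), the speech section is simplified to the span from the first to the last frame above the threshold, and the function names are hypothetical.

```python
from typing import List, Tuple


def frame_power(frame: List[float]) -> float:
    # Assumption: power is the sum of the channel outputs of the frame.
    return sum(frame)


def speech_section(frames: List[List[float]], th_pw: float) -> Tuple[int, int]:
    """Return (first, last) indices of the frames whose power exceeds th_pw.

    Simplification: the section is the span between the first and the last
    such frame. Returns (-1, -1) when no frame exceeds the threshold.
    """
    active = [i for i, f in enumerate(frames) if frame_power(f) > th_pw]
    if not active:
        return (-1, -1)
    return (active[0], active[-1])


def first_moment(frame: List[float]) -> float:
    """Center of gravity of the spectrum, with channels numbered from 1."""
    total = sum(frame)
    if total == 0:
        raise ValueError("first moment is undefined for an all-zero frame")
    return sum(j * v for j, v in enumerate(frame, start=1)) / total
```

With 29 channels the returned moment lies between 1 and 29, as stated in the text.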

In general, the spectrum of voiced sounds is concentrated in the low band and that of unvoiced sounds mainly in the high band, so the first moment takes small values in voiced intervals and large values in unvoiced intervals. In silent intervals it takes a value near the center (around 15). However, a phoneme whose spectral power is concentrated in the middle channels gives a first moment similar to that of a silent interval, and the two cannot be clearly distinguished. Furthermore, noise in silent intervals causes large swings in the first moment as defined above, so the first moment lacks stability. To stabilize it, a predetermined frequency band is therefore weighted in advance. The weight may be added to a predetermined channel among the 29 channels, or an additional virtual frequency band may be set up below the lowest channel or above the highest channel and given a predetermined weight. Here, a virtual channel is placed above the highest channel and a predetermined weight W is assigned to it. The frame spectrum at time i then becomes $f_i = (f_{i1}, f_{i2}, \ldots, f_{i29}, W)$, and the first moment becomes

$$m_i = \frac{\sum_{j=1}^{29} j \, f_{ij} + 30\,W}{\sum_{j=1}^{29} f_{ij} + W}.$$

FIG. 21 shows the spectrum and the time course of the first moment for the utterance [akujunkan].
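The effect of the virtual channel can be illustrated with a small Python sketch. It assumes the virtual channel is numbered one above the highest real channel (30 for a 29-channel frame); the function name and the example weight are hypothetical.

```python
from typing import List


def weighted_first_moment(frame: List[float], w: float) -> float:
    """First moment with a virtual channel of weight w above the top channel.

    Assumption: the virtual channel is numbered len(frame) + 1.
    """
    n = len(frame)
    numerator = sum(j * v for j, v in enumerate(frame, start=1)) + (n + 1) * w
    denominator = sum(frame) + w
    if denominator == 0:
        raise ValueError("first moment is undefined when all weights are zero")
    return numerator / denominator
```

In this sketch a frame with no energy evaluates to 30 regardless of small noise, while a frame with strong low-band energy is barely affected by W, which is one way to read the stabilizing effect described above.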

At registration, the block-division preprocessing unit 74 creates the reference blocks. First, every utterance spoken at registration is divided into blocks by comparing the first moment of its spectrum with a predetermined threshold. Although it depends on the individual and on the uttered word, the first moment changes in a comparatively stable manner. In the example of FIG. 21, the first moment is small in intervals that appear to be voiced and large in unvoiced and silent intervals. Setting a predetermined threshold Th therefore divides the utterance into blocks within which the first moment is stable.

FIG. 22 is a flowchart illustrating one way of setting the division points. After the initialization i = 0, n = 0 (step S1), i is incremented, i = i + 1 (step S2), and the position at which the first moment m_i falls below the threshold Th is sought (step S3). When m_i falls below Th, that position is stored in memory as sp[n], the start of the n-th block (step S4). While m_i remains below Th, i is incremented (step S5). When m_i exceeds Th, the value of i at that moment is stored in memory as ep[n], the end of the n-th block (steps S6, S7). Then n is incremented and the same processing is repeated until the speech section ends. This block division is carried out for all registered utterances (steps S8, S9).
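The flowchart steps can be sketched as a single scan over the first-moment sequence. The sketch below uses 0-based frame indices, records the end of a block as the first frame at which the moment is no longer below the threshold (or the sequence length if the speech section ends first), and uses a hypothetical function name.

```python
from typing import List, Tuple


def split_blocks(moments: List[float], th: float) -> Tuple[List[int], List[int]]:
    """Return (sp, ep): start and end positions of runs with moment < th.

    sp[n] is the first frame of the n-th block; ep[n] is the first frame
    after it (exclusive end), mirroring the sp[n]/ep[n] arrays in the text.
    """
    sp: List[int] = []
    ep: List[int] = []
    i = 0
    while i < len(moments):
        if moments[i] < th:
            sp.append(i)                     # block starts (step S4)
            while i < len(moments) and moments[i] < th:
                i += 1                       # stay in block (step S5)
            ep.append(i)                     # block ends (steps S6, S7)
        else:
            i += 1                           # keep searching (steps S2, S3)
    return sp, ep
```

For example, the moment sequence `[20, 5, 6, 20, 20, 4, 20]` with `th = 10` gives two blocks, frames 1 to 2 and frame 5.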

After the block-division preprocessing has been completed for all registered utterances in this way, the reference block setting unit 75 selects the reference blocks. The selection may be based on the maximum, minimum, or median of the block counts, or the corresponding blocks of all utterances may be compared, for example by computing the average block length and choosing the blocks of the utterance whose block lengths are closest to the average. Here, the utterance with the smallest number of blocks supplies the reference blocks.
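A minimal sketch of this selection rule, together with the per-block average first moment that the re-division step uses as the block state, might look as follows. Blocks are represented as (start, end) frame pairs with an exclusive end; that representation and both function names are assumptions made for illustration.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, exclusive end) in frames; assumed layout


def select_reference(utterance_blocks: List[List[Block]]) -> int:
    """Index of the utterance with the fewest blocks (first one on ties)."""
    if not utterance_blocks:
        raise ValueError("no registered utterances")
    return min(range(len(utterance_blocks)),
               key=lambda u: len(utterance_blocks[u]))


def block_states(moments: List[float], blocks: List[Block]) -> List[float]:
    """Average first moment inside each block, used as the block's state."""
    return [sum(moments[s:e]) / (e - s) for s, e in blocks]
```

The other criteria mentioned in the text (maximum, median, closest to the mean block length) would only change the key used in `select_reference`.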

The re-division processing unit 76 divides all input utterances into blocks again, including the utterance from which the reference blocks were created, on the basis of the average first moment within each reference block. A duration-control type state transition model, for example, may be used for this. FIG. 23 illustrates the procedure: the vertical axis shows the states of the reference blocks (index j), the horizontal axis shows the time course of the first moment over the speech section of the input utterance (index i), and the duration-control type state transition model establishes the correspondence between the two axes. FIG. 24 shows the result of re-dividing all registered utterances for the example of FIG. 21.
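The alignment of states with frames can be sketched as a small dynamic program. The sketch below is a generic segmentation with minimum and maximum duration limits, not necessarily the exact model of the patent: each state takes one contiguous run of frames, the cost at a lattice point is the absolute difference between the frame's first moment and the state value, and the names `align_states`, `d_min`, and `d_max` are hypothetical.

```python
from typing import List, Optional


def align_states(moments: List[float], states: List[float],
                 d_min: int = 1, d_max: Optional[int] = None) -> List[int]:
    """Return the start frame of each state's segment.

    Segments are contiguous, in state order, and cover every frame.
    Each segment length lies in [d_min, d_max] (d_max=None means no limit).
    """
    t_len, j_len = len(moments), len(states)
    if d_max is None:
        d_max = t_len
    inf = float("inf")
    # prefix[j][t]: cost of assigning frames 0..t-1 to state j
    prefix = [[0.0] * (t_len + 1) for _ in range(j_len)]
    for j in range(j_len):
        for t in range(t_len):
            prefix[j][t + 1] = prefix[j][t] + abs(moments[t] - states[j])
    # best[j][t]: minimal cost with the first j states covering frames 0..t-1
    best = [[inf] * (t_len + 1) for _ in range(j_len + 1)]
    back = [[-1] * (t_len + 1) for _ in range(j_len + 1)]
    best[0][0] = 0.0
    for j in range(1, j_len + 1):
        for t in range(1, t_len + 1):
            for s in range(max(0, t - d_max), t - d_min + 1):
                if best[j - 1][s] == inf:
                    continue
                cost = best[j - 1][s] + prefix[j - 1][t] - prefix[j - 1][s]
                if cost < best[j][t]:
                    best[j][t], back[j][t] = cost, s
    if best[j_len][t_len] == inf:
        raise ValueError("no alignment satisfies the duration limits")
    starts, t = [], t_len
    for j in range(j_len, 0, -1):            # trace the boundaries back
        t = back[j][t]
        starts.append(t)
    return starts[::-1]
```

Because each lattice point costs only one scalar subtraction, the work grows with the number of state and frame pairs, which matches the low-cost argument made earlier for this embodiment.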

The feature creation unit 78 computes the speaker's feature values for each block. The computation is the same as in the feature generation unit 65 of FIG. 13; for example, the average spectrum is computed for each block. At registration, the feature creation unit 78 stores the computed feature values in the standard pattern storage unit 81 with the speaker's file name or the like attached. That is, the personal dictionary pattern at registration is obtained by collecting, for each corresponding block, the in-block average spectra computed from all registered utterances and computing, for example, their mean and variance.

At verification, the block division unit 77 uses the duration-control type state transition model to match the first moment of the speech section of the unknown utterance, computed by the first moment calculation unit 73, against the states of the claimed registered speaker's reference blocks stored in the standard pattern storage unit 81, and so obtains the division points. The feature creation unit 78 computes the average spectrum of each resulting block as its feature value. The distance calculation unit 79 then computes the distance between the feature values of the unknown speaker and those of the registered speaker.

The decision unit 80 compares the distance obtained in this way with a preset threshold and decides whether the speaker is the claimed person or an impostor.

FIG. 25 is a block diagram of an eighth embodiment of the present invention. The acoustic parameter conversion unit 81, the speech section detection unit 82, and the block division unit 83 operate as in the preceding embodiments.

The feature generation unit 84 computes the feature values of each block obtained by the block division unit 83. It consists of an average spectrum calculation unit 84-1 and a least-squares line calculation unit 84-2.

Here again, the sequence of frame spectra belonging to the k-th block is written fkij = (fki1, fki2, …, fki29), where k is the index of the block, i is the time within the k-th block, and j is the channel number of the band-pass filter.

The time average of the spectral time series within the block is used as a feature value. Denoting this feature by Xavr, and the number of frames in the block by m,

$$X_{avr} = (\bar{f}_{k1}, \bar{f}_{k2}, \ldots, \bar{f}_{k29}), \qquad \bar{f}_{kj} = \frac{1}{m} \sum_{i=1}^{m} f_{kij},$$

as explained, for example, with reference to FIG. 13.

The feature Xavr is computed by the average spectrum calculation unit 84-1. Xavr is then regarded as a single spectrum, its least-squares line is computed by the least-squares line calculation unit 84-2, and the slope Xa of that line is also used as a feature.

The feature integration unit 85 combines the features Xavr and Xa. That is, the overall feature Xk of the k-th block is Xk = (Xavr, Xa).
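As an illustration of how Xavr and Xa could be obtained for one block, the sketch below averages the frame spectra channel by channel and fits an ordinary least-squares line to the result over channel numbers 1 to n. It is plain Python with hypothetical function names, and it assumes at least two channels per frame.

```python
from typing import List, Tuple


def block_average(frames: List[List[float]]) -> List[float]:
    """Channel-wise time average of the frame spectra in one block (Xavr)."""
    m = len(frames)
    return [sum(f[j] for f in frames) / m for j in range(len(frames[0]))]


def least_squares_line(spectrum: List[float]) -> Tuple[float, float]:
    """(slope, intercept) of the least-squares line over channels 1..n."""
    n = len(spectrum)
    mean_x = (n + 1) / 2
    mean_y = sum(spectrum) / n
    sxx = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(range(1, n + 1), spectrum))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def block_feature(frames: List[List[float]]) -> Tuple[List[float], float]:
    """Combined block feature (Xavr, Xa) in the sense of the text."""
    xavr = block_average(frames)
    xa, _intercept = least_squares_line(xavr)
    return xavr, xa
```

The slope summarizes the overall spectral tilt of the block in one number, which is why it is cheap to store alongside the 29-channel average.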

At registration, the feature (Xk) combined by the feature integration unit 85 is stored in the standard pattern storage unit 86 as the standard feature (dictionary pattern), with the speaker's file name or the like attached. At verification, the distance calculation unit 87 computes, block by block, the distance between the unknown speaker's features obtained through the feature integration unit 85 and the registered speaker's features (dictionary pattern) stored in the standard pattern storage unit 86.

Let X be the dictionary pattern and Z the pattern of the unknown utterance. The distance d(X, Z) is obtained as the sum over the blocks,

$$d(X, Z) = \sum_{k} dis(X_k, Z_k).$$

Here dis(Xk, Zk) is the sum of the distances between corresponding features of corresponding blocks:

dis(Xk, Zk) = w1 × dist(Xavr, Zavr) + w2 × dist(Xa, Za)

The distance dist(a, b) between corresponding features may be the Euclidean distance, the Mahalanobis distance, or the like. The weights w1 and w2 applied to the individual distances are set in advance using, for example, the F ratio (the ratio of inter-speaker to intra-speaker variation).
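The weighted block distance can be sketched as follows, assuming the Euclidean distance for dist and a simple sum over blocks for d(X, Z). Each block is represented as a tuple of feature vectors (a scalar such as the slope is wrapped in a one-element list); this layout and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Feature = List[float]
BlockFeatures = Tuple[Feature, ...]  # e.g. (Xavr, [Xa]); assumed layout


def euclid(a: Feature, b: Feature) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(xk: BlockFeatures, zk: BlockFeatures,
                   weights: Sequence[float]) -> float:
    """dis(Xk, Zk): weighted sum of per-feature distances for one block."""
    return sum(w * euclid(a, b) for w, a, b in zip(weights, xk, zk))


def total_distance(x: List[BlockFeatures], z: List[BlockFeatures],
                   weights: Sequence[float]) -> float:
    """d(X, Z): sum of block distances over corresponding blocks."""
    if len(x) != len(z):
        raise ValueError("patterns must have the same number of blocks")
    return sum(block_distance(xk, zk, weights) for xk, zk in zip(x, z))
```

Swapping `euclid` for a Mahalanobis distance would require the per-block variances stored at registration; the weights themselves would come from an F-ratio analysis that is outside this sketch.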

The decision unit 88 verifies the speaker by comparing the distance obtained by the distance calculation unit 87 with a threshold set in advance for each speaker. The controlled object 2 is controlled according to the result.

FIG. 26 is a block diagram of a ninth embodiment of the present invention. The feature generation unit 94 consists of an average spectrum calculation unit 94-1, a least-squares line calculation unit 94-2, and a difference calculation unit 94-3. For each block, the least-squares line of the in-block average spectrum is computed, and the spectral shape that remains after subtracting that line from the average spectrum is used as the block feature. In all other respects the embodiment is the same as that of FIG. 25.

FIG. 27 is a block diagram of a tenth embodiment of the present invention. The feature generation unit 104 consists of an average spectrum calculation unit 104-1, a least-squares line calculation unit 104-2, and a difference spectrum calculation unit 104-3. For every frame in a block, the least-squares line of the spectral shape is computed and subtracted from it; the resulting spectrum is taken as the individuality information of that frame, and these spectra are averaged along the time axis to give the block feature.
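The ninth and tenth embodiments both remove a least-squares line from a spectrum; they differ in whether the line is removed from the block average or from each frame before averaging. The following sketch shows both orders in plain Python. The function names are hypothetical, and channels are numbered from 1.

```python
from typing import List


def detrend(spectrum: List[float]) -> List[float]:
    """Subtract the least-squares line (over channels 1..n) from a spectrum."""
    n = len(spectrum)
    mean_x = (n + 1) / 2
    mean_y = sum(spectrum) / n
    sxx = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(range(1, n + 1), spectrum))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept)
            for x, y in zip(range(1, n + 1), spectrum)]


def average(frames: List[List[float]]) -> List[float]:
    """Channel-wise time average of a list of spectra."""
    m = len(frames)
    return [sum(f[j] for f in frames) / m for j in range(len(frames[0]))]


def feature_ninth(frames: List[List[float]]) -> List[float]:
    """Average the block first, then remove the line (FIG. 26 order)."""
    return detrend(average(frames))


def feature_tenth(frames: List[List[float]]) -> List[float]:
    """Remove the line from every frame, then average (FIG. 27 order)."""
    return average([detrend(f) for f in frames])
```

In both cases the result describes the fine shape of the spectrum with its overall tilt and level taken out.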

FIG. 28 is a block diagram of an eleventh embodiment of the present invention. To add the fine structure of each frame's spectrum to the features, this embodiment introduces the positions of the spectral peaks and bottoms (valleys) as an efficient description of that structure. The feature generation unit 114 consists of a peak/valley pattern conversion unit 114-1, a unit 114-2 that extracts the peak and valley positions of the spectral shape, and an averaging unit 114-3.

First, the spectrum of the i-th frame in the k-th block is examined to find which channels have a peak. Each channel j is marked "1" if it is a peak and "0" otherwise, so that the peak positions of each frame spectrum are expressed as follows.

TOPkij = (TOPki1, TOPki2, …, TOPki29), where TOPkij is 1 when the j-th channel is a peak and 0 otherwise, k is the index of the block, i is the time within the k-th block, and j is the frequency channel number.

As the feature within the block, the time average of the TOPkij obtained for each frame is used. Denoting it TOPavr,

$$TOP_{avr} = (\overline{TOP}_{k1}, \overline{TOP}_{k2}, \ldots, \overline{TOP}_{k29}), \qquad \overline{TOP}_{kj} = \frac{1}{m} \sum_{i=1}^{m} TOP_{kij}.$$

Here m is the number of frames in the block, and $\overline{TOP}_{kj}$ is the in-block average of the occurrence of a spectral peak at channel j.

The bottoms (valleys) of the spectrum are handled in the same way. That is, the i-th spectrum in the k-th block is examined to find which channels have a valley, and the result is written BTMkij = (BTMki1, BTMki2, …, BTMki29), where k is the index of the block, i is the time within the k-th block, and j is the frequency channel number.

Here BTMkij is "1" if the j-th channel is a valley and "0" otherwise.

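As the feature within the block, the time average of the BTMkij obtained for each frame is used. Denoting it BTMavr, and the number of frames in the block by m,

$$BTM_{avr} = (\overline{BTM}_{k1}, \overline{BTM}_{k2}, \ldots, \overline{BTM}_{k29}), \qquad \overline{BTM}_{kj} = \frac{1}{m} \sum_{i=1}^{m} BTM_{kij}.$$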

Combining these features, the overall feature Xk of the k-th block is Xk = (XTOPavr, XBTMavr).
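A sketch of the peak and valley patterns and their in-block averages is given below. The text does not say how a peak is detected, so the sketch assumes a strict local maximum (or minimum) relative to both neighbouring channels and never marks the two edge channels; the function names are hypothetical.

```python
from typing import List


def peak_pattern(frame: List[float]) -> List[int]:
    """1 where a channel is a strict local maximum, else 0 (edges are 0)."""
    n = len(frame)
    return [1 if 0 < j < n - 1 and frame[j] > frame[j - 1]
            and frame[j] > frame[j + 1] else 0 for j in range(n)]


def valley_pattern(frame: List[float]) -> List[int]:
    """1 where a channel is a strict local minimum, else 0 (edges are 0)."""
    n = len(frame)
    return [1 if 0 < j < n - 1 and frame[j] < frame[j - 1]
            and frame[j] < frame[j + 1] else 0 for j in range(n)]


def average_pattern(patterns: List[List[int]]) -> List[float]:
    """Per-channel fraction of frames in the block that carry a 1."""
    m = len(patterns)
    return [sum(p[j] for p in patterns) / m for j in range(len(patterns[0]))]


def block_peak_valley_feature(frames: List[List[float]]):
    """(TOPavr, BTMavr) for one block, in the sense of the text."""
    tops = average_pattern([peak_pattern(f) for f in frames])
    btms = average_pattern([valley_pattern(f) for f in frames])
    return tops, btms
```

Each averaged value lies between 0 and 1 and can be read as how often a peak or valley appears at that channel within the block.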

For the distance calculation, these features are used to compute the distance between corresponding blocks of the dictionary pattern and the unknown pattern. Let X be the dictionary pattern and Z the pattern of the unknown utterance. The distance d(X, Z) is obtained as the sum over the blocks,

$$d(X, Z) = \sum_{k} dis(X_k, Z_k).$$

Here dis(Xk, Zk) is the sum of the distances between corresponding features of corresponding blocks:

dis(Xk, Zk) = w3 × dist(XTOPavr, ZTOPavr) + w4 × dist(XBTMavr, ZBTMavr)

The distance dist(a, b) between corresponding features may be the Euclidean distance, the Mahalanobis distance, or the like. The weights w3 and w4 applied to the individual distances are set in advance using the F ratio or the like.

Only the positions of the spectral peaks and valleys have been used here as the feature vector of each block, but the in-block average spectrum and its slope described in the earlier embodiment may be added, giving Xk = (Xavr, Xa, XTOPavr, XBTMavr).

In this case, the sum of the distances between corresponding blocks, dis(Xk, Zk), is computed as

dis(Xk, Zk) = w1 × dist(Xavr, Zavr) + w2 × dist(Xa, Za) + w3 × dist(XTOPavr, ZTOPavr) + w4 × dist(XBTMavr, ZBTMavr)

The distance dist(a, b) between corresponding features may be the Euclidean distance, the Mahalanobis distance, or the like. The weights w1 to w4 applied to the individual distances are set in advance using the F ratio or the like.

FIG. 29 is a block diagram of a twelfth embodiment of the present invention.

The embodiments described so far are all based on dividing the input speech into frames of a few msec, dividing the speech section into blocks using the acoustic parameters obtained for each frame, and using the mean and variance of the acoustic parameters within each block as the individual's features. To describe an individual's features, however, it is important to make active use of the pitch in addition to the spectrum, which is primarily a phonological parameter; doing so can be expected to improve the verification rate. Pitch is an acoustic parameter whose individuality lies both in its value and in its pattern of change over time. Taking its average cannot adequately express the variation within a block and is not an effective feature conversion. The embodiment of FIG. 29 uses, as the pitch feature, a histogram of the pitch periods within each block, so that both the values of the pitch period and the way they change over time are reflected in the feature itself, with the aim of improving the verification rate.

The embodiment of FIG. 29 is described below. The speech signal input from the microphone 1 is divided into frames every few msec and converted by the spectrum extraction unit 121-1 and the pitch extraction unit 121-2 into time series of spectra and pitch periods. For spectrum extraction in unit 121-1, the input speech may, for example, be passed through a low-pass filter (LPF) to remove components at or above half the sampling frequency, quantized into a discrete signal sequence by an analog-to-digital converter, cut into short-time waveform segments, multiplied by a Hamming window or the like, and converted into a spectrum. Alternatively, a bank of band-pass filters may be used to obtain the spectral time series. Here, as in the preceding embodiments, the description assumes 29 band-pass filter channels with center frequencies from 250 to 6300 Hz spaced at 1/6-octave intervals. The pitch period in unit 121-2 may be detected by numerical computation such as the autocorrelation method, or by a dedicated hardware board using a simple extraction method such as a combination of a low-pass filter and a counter. The concrete configuration of the pitch extraction is described later.

Let the spectrum of the frame at time i be fij = (fi1, fi2, …, fi29) (j = 1, 2, …, 29), and let the pitch period at time i be pi.

Using these quantities, the speech section detection unit 122 detects the speech section of the input speech. One way to do this is to compute the power of each frame, compare it with a predetermined threshold, and take the section in which the power exceeds the threshold as the speech section.

The block division unit 123 divides the speech section into blocks using, for example, the first moment of the spectrum of each frame. It also establishes the correspondence between blocks.

Next, the feature creation unit 124 computes the speaker's features for each block. For the spectrum, the average spectrum computed for each block is used. As already described, when the sequence of spectra in the k-th block is written fkij = (fki1, fki2, …, fki29), with k the index of the block, i the time within the k-th block, and j the channel number of the band-pass filter, the spectral feature Xkavr of the k-th block is

$$X_{kavr} = (\bar{f}_{k1}, \bar{f}_{k2}, \ldots, \bar{f}_{k29}), \qquad \bar{f}_{kj} = \frac{1}{m} \sum_{i=1}^{m} f_{kij}.$$

Here m is the number of frames in the block, and $\bar{f}_{kj}$ (j = 1, 2, …, 29) is the average spectrum of each channel.

For the pitch, a histogram of the pitch within each block is used. Let $P_{ki}$ be the pitch period at time i in the k-th block. The pitch histogram of this block is obtained by counting, within block k, the number of occurrences of each pitch period, and this count is taken as the feature quantity $P_{kL}$. Here, L is a variable corresponding to the pitch period, and $P_{kL}$ is the number of occurrences of pitch period L in the k-th block.
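The per-block pitch histogram can be sketched with `collections.Counter`. The convention that a pitch period of 0 marks a frame with no detected pitch is an assumption of this example, as is the function name.

```python
from collections import Counter
from typing import Iterable


def pitch_histogram(pitch_periods: Iterable[int]) -> Counter:
    """Count how often each pitch period L occurs in one block.

    Frames with no detected pitch are assumed to carry the value 0 and are skipped.
    """
    return Counter(p for p in pitch_periods if p > 0)
```

Looking up a period that never occurred returns 0, which matches the meaning of an empty histogram bin.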

At registration, the in-block average spectra calculated for all utterances are grouped by corresponding block, and their mean, variance, and similar statistics are calculated and used as the spectral feature quantities. For the pitch histogram as well, $P_{kL}$ is averaged over the corresponding blocks and used as a feature quantity. The generated feature quantities are tagged with the speaker's file name and stored in the standard pattern storage unit 125 as standard feature quantities (dictionary feature quantities).
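The registration step for one block can be sketched as follows. The use of the population variance is an assumption (the text only says "mean and variance"), and both function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance
from typing import Dict, List, Mapping, Sequence, Tuple


def spectrum_template(block_averages: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-channel mean and variance of one block's average spectrum across utterances."""
    channels = list(zip(*block_averages))  # regroup values channel by channel
    means = [mean(c) for c in channels]
    variances = [pvariance(c) for c in channels]  # population variance (assumption)
    return means, variances


def histogram_template(histograms: Sequence[Mapping[int, int]]) -> Dict[int, float]:
    """Average the pitch histograms of one block over all registration utterances."""
    total = Counter()
    for h in histograms:
        total.update(h)  # add the counts bin by bin
    n = len(histograms)
    return {period: count / n for period, count in total.items()}
```

Each registered speaker would then hold one such pair of templates per block.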

At verification, the first moment of the speech section of the unknown utterance is pattern-matched against the states of the reference blocks of the registered speaker, registered in advance, using a duration-control-type state transition model to obtain the division points. The average spectrum and the pitch period histogram are then calculated for each block to obtain the feature quantities.

The distance calculation unit 126 calculates the distance between the feature quantities of the unknown speaker generated by the feature generation unit 124 and the dictionary feature quantities of the registered speaker stored in the standard pattern storage unit 125. The Euclidean distance or a similar measure is used for this calculation.

The decision unit 127 compares the distance obtained in this way with a predetermined threshold set in advance and determines whether the speaker is the claimed person or an impostor. The controlled object 2 is controlled according to this decision.
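The distance calculation and the threshold decision can be sketched together. The Euclidean distance is the measure named in the text; treating a distance equal to the threshold as an acceptance is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def is_claimed_speaker(unknown: Sequence[float], template: Sequence[float], threshold: float) -> bool:
    """Accept when the distance does not exceed the threshold (boundary handling assumed)."""
    return euclidean_distance(unknown, template) <= threshold
```

The boolean result corresponds to the decision that drives the controlled object, for example releasing a lock only on acceptance.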

FIG. 30 shows the processing flow of another embodiment in which the feature generation unit 124 uses the pitch histogram as a feature quantity. First, the extracted pitch is input (step 131) and a histogram is created for each block (step 132). At registration, the pitch histograms of the corresponding blocks of all uttered speech samples are summed (step 133), the sum is smoothed in the pitch period direction (step 134) and normalized by the total power (step 135), and the result is used as the dictionary feature quantity for the pitch. At verification, the pitch histogram of each block is smoothed in the pitch period direction (step 136) and normalized by the total frequency count (step 137). In other words, this embodiment introduces smoothing and normalization of the histogram so that the pitch histogram is stable when used as a dictionary feature quantity.

FIG. 31 shows the processing concretely. In this example, the dictionary is created from three utterances, (1) to (3). In the figure, I shows the change in the spectrum and II shows the variation in the pitch. The graphs under each utterance are the pitch histograms corresponding to the respective blocks. To explain how the pitch period feature quantity is created, consider one block as an example. First, only that block is taken from each utterance, and the histograms are added for each corresponding pitch period. The summed histogram is shown in the figure. Next, it is smoothed in the pitch period direction (the horizontal axis), for example by taking a moving average over every 3 to 5 points. Finally, it is normalized, for example by adding the frequencies of all pitch periods to obtain the total pitch count and normalizing by this value. The result is used as the dictionary feature quantity for the pitch.
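The three steps for one block (summation, moving-average smoothing, normalization by the total count) can be sketched as below. Histograms are represented as equal-length lists indexed by pitch-period bin, and bins outside the histogram are treated as zero during smoothing; both choices, and all function names, are assumptions of this sketch.

```python
from typing import List, Sequence


def sum_histograms(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Add the histograms of the same block from several utterances, bin by bin."""
    return [sum(bins) for bins in zip(*histograms)]


def smooth(hist: Sequence[float], width: int = 3) -> List[float]:
    """Moving average over `width` bins along the pitch-period axis (zero padding at the edges)."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    n = len(hist)
    out = []
    for i in range(n):
        window = [hist[k] if 0 <= k < n else 0.0 for k in range(i - half, i + half + 1)]
        out.append(sum(window) / width)
    return out


def normalize(hist: Sequence[float]) -> List[float]:
    """Divide every bin by the total count so that the bins sum to 1."""
    total = sum(hist)
    if total == 0:
        return list(hist)  # empty histogram: nothing to normalize
    return [v / total for v in hist]


def build_pitch_template(histograms: Sequence[Sequence[float]], width: int = 3) -> List[float]:
    """Sum, smooth, and normalize the pitch histograms of one block."""
    return normalize(smooth(sum_histograms(histograms), width))
```

Smoothing spreads each count over neighbouring pitch periods, so a small shift in the measured pitch between registration and verification changes the template only slightly.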

FIG. 32 is a block diagram of a thirteenth embodiment of the present invention.

If the distance between the feature quantities of the unknown speaker and those of the registered speaker is compared with a predetermined threshold over all blocks combined (for example, by summing the per-block distances over all blocks), errors in individual blocks are not reflected in the decision. Furthermore, when the distance is calculated from two quantities of different dimensions, the spectrum and the pitch period, each parameter is normalized by its variance to make it dimensionless and the resulting distances are added. The spectrum and the pitch are then used with equal weight, so a case in which the spectral distance is large but the pitch distance is small cannot be distinguished from the opposite case in which the spectral distance is small but the pitch distance is large. The embodiment of FIG. 32 therefore provides a method in which, in the distance calculation between the unknown speaker and the registered speaker, the spectrum and the pitch are each compared separately with predetermined thresholds, the results are combined and expressed as a score, and the success or failure of verification is determined comprehensively by comparing the sum of the scores over all blocks with a predetermined threshold.

In FIG. 32, the processing from the spectrum and pitch extraction units 141-1 and 141-2 to the feature generation unit 144 is the same as in FIG. 29; the processing from the distance calculation unit 146 onward differs. At verification, the distance calculation unit 146 calculates the spectral distance and the pitch distance and compares them with the spectrum and pitch thresholds preset for each block. A score such as that shown in FIG. 33 is then given according to whether the spectrum and the pitch each exceed their thresholds. In FIG. 33, ○ indicates a distance within the threshold (accept) and × indicates a distance at or above the threshold (reject). The scores given are ◎, ○, △, and ×; for example, ◎ is 2 points, ○ is 1 point, △ is 0 points, and × is −10 points. For example, when the spectral distance of a block is within its threshold and the pitch distance is at or above its threshold, the score is ○. The score calculation unit 158 calculates the score for each block. The overall decision unit 159 makes the decision based on the per-block scores. For this decision, all scores may be summed and the speaker accepted if the sum is larger than a predetermined value and rejected otherwise, or the speaker may be accepted when the blocks scoring 1 point or more exceed 90% of the total number of blocks and rejected otherwise.
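The per-block scoring and the two overall decision rules can be sketched as follows. Only one cell of the FIG. 33 table is stated in the text (spectrum within, pitch not within gives ○); the other three cells used here are assumptions, as are the function names.

```python
from typing import Sequence

# Point values given as an example in the text.
SCORE_POINTS = {"◎": 2, "○": 1, "△": 0, "×": -10}


def block_mark(spec_dist: float, pitch_dist: float, spec_thr: float, pitch_thr: float) -> str:
    """Mark one block from its spectral and pitch distances."""
    spec_ok = spec_dist < spec_thr      # "within the threshold"
    pitch_ok = pitch_dist < pitch_thr
    if spec_ok and pitch_ok:
        return "◎"                      # assumption: both within
    if spec_ok:
        return "○"                      # stated in the text
    if pitch_ok:
        return "△"                      # assumption: only the pitch within
    return "×"                          # assumption: neither within


def accept_by_total(marks: Sequence[str], min_total: float) -> bool:
    """Accept when the summed score is larger than a predetermined value."""
    return sum(SCORE_POINTS[m] for m in marks) > min_total


def accept_by_ratio(marks: Sequence[str], ratio: float = 0.9) -> bool:
    """Accept when blocks scoring 1 point or more exceed the given share of all blocks."""
    good = sum(1 for m in marks if SCORE_POINTS[m] >= 1)
    return good > ratio * len(marks)
```

The large negative value for × means that a single block failing on both measures can outweigh several good blocks in the summed score, which is how individual block errors come to affect the overall decision.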

FIG. 34 is a block diagram of a fourteenth embodiment of the present invention. This embodiment is basically the same as that of FIG. 32, but a plurality of thresholds are set for the spectral and pitch distances of each block, giving the thresholds a margin and thereby reducing errors near the thresholds. That is, when a single threshold is set for each block, the score of the block is determined by that one value, so a pattern whose distance fluctuates around the threshold is misrecognized. This embodiment therefore sets a plurality of thresholds to allow some latitude in scoring the distance, with the aim of reducing misrecognition.

Here, the scoring when two thresholds are used is described. A strict threshold and a loose threshold are set for each of the spectrum and the pitch. For example, the spectral or pitch distance is judged ◎ when it is within the strict threshold, ○ when it is at or below the loose threshold, and × otherwise. The combinations of judgments that the spectrum and the pitch can take in each block are shown in FIG. 35. The resulting score again has four levels: ◎ (2 points), ○ (1 point), △ (0 points), and × (−10 points). Suppose the distance calculation unit 156 calculates the spectral and pitch distances and compares them with the spectrum and pitch thresholds of each block, and the result is ○ for the spectrum and ◎ for the pitch; the score is then ◎ according to FIG. 35. The overall decision over all blocks in the overall decision unit 159 is performed in the same way as in FIG. 32.
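The two-threshold judgment of a single distance can be sketched as below. The function name is hypothetical, and the table of FIG. 35 that combines the spectrum and pitch judgments into a block score is not reproduced because only one of its entries appears in the text.

```python
def grade(distance: float, strict_thr: float, loose_thr: float) -> str:
    """Judge one distance against a strict and a loose threshold."""
    if strict_thr > loose_thr:
        raise ValueError("the strict threshold must not exceed the loose threshold")
    if distance <= strict_thr:
        return "◎"  # within the strict threshold
    if distance <= loose_thr:
        return "○"  # outside the strict threshold but within the loose one
    return "×"      # outside both thresholds
```

A distance that drifts slightly past the strict threshold now drops only from ◎ to ○ instead of flipping from accept to reject, which is the latitude the embodiment is aiming for.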

The above description has dealt with speech divided into a plurality of blocks, but the method is naturally also applicable when the speech can be divided into only a single block. Also, although the first moment of the spectrum was used as the feature quantity for block division and the average spectrum and the pitch histogram were used as the individual feature quantities, the LPC cepstrum may be used in place of the spectrum.

Next, a specific configuration example of the pitch extraction units 121-2, 141-2, and 151-2 in FIG. 29, FIG. 32, and FIG. 34 is described.

As mentioned earlier, verification accuracy improves when the pitch period is used as an individual feature quantity together with the spectrum, the LPC cepstrum, and the like. In general, however, accurate pitch extraction requires complicated processing such as calculating the autocorrelation coefficients of the speech signal. However simple the segmentation and other processing described so far is made, if pitch extraction requires complicated processing, it accounts for a large share of the total computation for speaker verification, and the overall system also becomes large. A simple pitch extraction mechanism that enables highly accurate pitch extraction with a simple hardware configuration is therefore described here. Using this mechanism keeps the overall speaker verification system relatively small.

FIG. 36 shows an embodiment of the simple pitch extraction mechanism. It comprises a plurality of comparator circuits that binarize, with preset amplitude thresholds, the speech waveform from which high-frequency components have been removed by a low-pass filter; a circuit that generates a clock of a predetermined frequency; and counting circuits that count the clock output and output the number of clocks in the interval marked by the binarized signal rising from 0 to 1. The pitch frequency within a given section is estimated from the plurality of measured count values.

The operation of FIG. 36 is described below. The speech signal input from the microphone has its high-frequency components removed by the low-pass filter 161, whose cutoff frequency is about 300 Hz, and is then input to a plurality of comparators 162₁, 162₂, …, 162ₙ, where it is binarized with different thresholds Th₁, Th₂, …, Thₙ. Half of the comparators 162₁ to 162ₙ (the upper half in the figure) operate on the positive side of the signal amplitude, and the other half (the lower half in the figure) operate on the negative side. The binarized signal from a positive-side comparator becomes "1" when the amplitude exceeds the threshold in the positive direction. The binarized signal from a negative-side comparator becomes "1" when the amplitude exceeds the threshold in the negative direction.

The output signals of the comparators 162₁ to 162ₙ, that is, the signals binarized with the respective thresholds, are input to the synchronization circuits 163₁, 163₂, …, 163ₙ, where they are synchronized with the reference clock generated by the clock generation circuit 164. The synchronized binarized signals are input to the counting circuits 165₁, 165₂, …, 165ₙ. Each of the counting circuits 165₁ to 165ₙ is cleared to zero and starts counting the reference clock at a transition from "0" to "1" of the binarized signal input from the corresponding synchronization circuit 163₁ to 163ₙ, and stops counting when the binarized signal transitions from "1" to "0". In other words, each of the counting circuits 165₁ to 165ₙ measures the time interval between the transition points from "1" to "0" of its synchronized binarized signal.
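A software analogue of one positive-side comparator and its counter may make the measurement easier to picture. The sketch assumes that the signal is sampled once per reference clock tick and, following the closing sentence of the paragraph above, counts the ticks between successive "1" to "0" transitions. The function names are hypothetical, and the hardware synchronization step is implicit in the sampling.

```python
import math
from typing import List, Sequence


def binarize(samples: Sequence[float], threshold: float) -> List[int]:
    """Positive-side comparator: 1 while the amplitude exceeds the threshold, else 0."""
    return [1 if s > threshold else 0 for s in samples]


def falling_edge_intervals(bits: Sequence[int]) -> List[int]:
    """Clock ticks between successive 1 -> 0 transitions of the binarized signal."""
    edges = [n for n in range(1, len(bits)) if bits[n - 1] == 1 and bits[n] == 0]
    return [later - earlier for earlier, later in zip(edges, edges[1:])]


# Example: a 125 Hz tone sampled at a 16 kHz reference clock for 0.1 s.
clock_hz = 16000
tone_hz = 125
samples = [math.sin(2 * math.pi * tone_hz * n / clock_hz) for n in range(clock_hz // 10)]
counts = falling_edge_intervals(binarize(samples, 0.5))
```

For this clean tone every interval equals one pitch period in clock ticks. With real speech, comparators at different thresholds give slightly different or spurious counts, which is why several are run in parallel and their outputs are checked before estimation.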

The count values of the counting circuits 165₁ to 165ₙ are input to the estimation circuit 167 via the judgment circuits 166₁, 166₂, …, 166ₙ. Each of the judgment circuits 166₁ to 166ₙ judges whether the count value of the corresponding counting circuit 165₁ to 165ₙ is within a plausible range and reports the result to the estimation circuit 167. For example, the pitch frequency of the human voice is roughly 100 Hz to 250 Hz, so if the reference clock frequency is 16 kHz, a plausible count value lies between 64 and 160, and count values within that range can be judged valid. Alternatively, the differences between count values can be computed, and count values whose differences are at or below a predetermined threshold can be judged valid. The criterion is not limited to these, but in any case the plausible range of the count value can be predicted in advance, so validity can be judged easily without cumbersome computation.
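The range check follows directly from the clock frequency and the expected pitch frequency range: the count for one period is the clock frequency divided by the pitch frequency. A minimal sketch, with hypothetical function names:

```python
from typing import List, Sequence, Tuple


def valid_count_range(clock_hz: float, f_min_hz: float, f_max_hz: float) -> Tuple[float, float]:
    """Smallest and largest plausible counts for one pitch period."""
    # The highest pitch frequency gives the shortest period, hence the smallest count.
    return clock_hz / f_max_hz, clock_hz / f_min_hz


def filter_valid(counts: Sequence[int], lo: float, hi: float) -> List[int]:
    """Keep only the count values that fall inside the plausible range."""
    return [c for c in counts if lo <= c <= hi]
```

With a 16 kHz clock and a 100 Hz to 250 Hz pitch range, `valid_count_range(16000, 100, 250)` gives the 64 to 160 range quoted in the text.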

The estimation circuit 167 estimates the pitch period or pitch frequency of the input speech signal using the count values judged valid by the judgment circuits 166₁ to 166ₙ. Various estimation algorithms are possible; for example, the mean or the minimum of the set of valid count values is taken as the pitch period.
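The two estimation rules named above, the mean and the minimum of the valid counts, can be sketched as follows. Returning `None` when no count is valid is an assumption of this example, as are the function names.

```python
from statistics import mean
from typing import Optional, Sequence


def estimate_pitch_period(valid_counts: Sequence[int], method: str = "mean") -> Optional[float]:
    """Estimate the pitch period in clock ticks from the valid count values."""
    if not valid_counts:
        return None  # no valid measurement in this section (assumed convention)
    if method == "mean":
        return mean(valid_counts)
    if method == "min":
        return min(valid_counts)
    raise ValueError("method must be 'mean' or 'min'")


def period_to_frequency(period_ticks: float, clock_hz: float) -> float:
    """Convert a pitch period in clock ticks to a pitch frequency in Hz."""
    return clock_hz / period_ticks
```

For example, valid counts of 126, 128, and 130 at a 16 kHz clock give a mean period of 128 ticks, that is, a pitch frequency of 125 Hz.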

FIG. 37 shows signal waveforms of this embodiment. (a) is the waveform of the reference clock, and (b) is the waveform of the speech signal after passing through the low-pass filter 161. (c) is the waveform of the binarized signal from one positive-side comparator 162ᵢ, and (d) is that binarized signal after synchronization. (e) is the waveform of the binarized signal from a negative-side comparator 162ⱼ, and (f) is that signal after synchronization. The counting circuits 165ᵢ and 165ⱼ measure the time of the interval from (ア) to (イ) of waveform (d), or of the interval from (ウ) to (エ) of waveform (f).

The configuration of FIG. 36 can also be modified as follows. The output of the low-pass filter 161 is separated into positive and negative parts by half-wave rectification circuits, each part is smoothed to obtain its effective value (voltage), and these values multiplied by predetermined factors are used as the amplitude thresholds of the positive-side and negative-side comparators 162₁ to 162ₙ, so that the amplitude thresholds vary with the level of the input signal. This improves the stability of pitch extraction against fluctuations in the amplitude of the input speech. In addition, an effective value check circuit may be provided to judge whether the positive-side and negative-side effective values obtained through the half-wave rectification circuits are plausible, and when the positive-side or negative-side effective value falls to or below a predetermined threshold, the estimation circuit 167 may be instructed to inhibit pitch extraction for that section.
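The level-dependent thresholds can be sketched in software as below. The sketch assumes that the root-mean-square value of each half-wave over the section stands in for the rectify-and-smooth circuit; the function names, the factor list, and the `None` return used to signal that pitch extraction is inhibited are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def half_wave_effective_values(samples: Sequence[float]) -> Tuple[float, float]:
    """Effective (RMS) value of the positive and the negative half-wave of a section."""
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    pos = math.sqrt(sum(s * s for s in samples if s > 0) / n)
    neg = math.sqrt(sum(s * s for s in samples if s < 0) / n)
    return pos, neg


def comparator_thresholds(
    samples: Sequence[float], factors: Sequence[float], min_effective: float
) -> Optional[Tuple[List[float], List[float]]]:
    """Scale the effective values into positive-side and negative-side thresholds.

    Returns None when either effective value is too small, meaning that
    pitch extraction should be inhibited for this section.
    """
    pos_eff, neg_eff = half_wave_effective_values(samples)
    if pos_eff <= min_effective or neg_eff <= min_effective:
        return None
    positive = [f * pos_eff for f in factors]
    negative = [-f * neg_eff for f in factors]
    return positive, negative
```

Because the thresholds scale with the signal level, a quiet and a loud utterance of the same sound are binarized at the same relative points of the waveform.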

FIG. 38 shows another embodiment of pitch extraction. When the input speech signal is passed through a low-pass filter to remove high-frequency components other than the pitch frequency, it is appropriate to set the cutoff frequency of the filter near the pitch frequency of the input speech. However, the pitch frequency of speech changes from moment to moment, and the pitch frequencies of male and female voices differ greatly, so it is difficult to fix a single cutoff frequency. In particular, methods that are strongly affected by the cutoff frequency of the low-pass filter, such as the zero-crossing count, produce many extraction errors such as double pitch and half pitch. The embodiment of FIG. 38 reduces such double-pitch and half-pitch errors and enables highly accurate pitch extraction by dynamically changing the cutoff frequency of the low-pass filter over time.

The operation of FIG. 38 is described below. The input speech signal is fed on one path to the low-pass filter 171, which removes high-frequency components. On the other path it is fed to the power detection circuit (I) 172, which obtains the short-time average power. The low-pass filter 171 is one whose cutoff frequency changes with the value of the control signal E, and it can be realized with a voltage-controlled filter, a switched-capacitor filter with a variable clock, or the like.

The speech signal from which high-frequency components have been removed by the low-pass filter 171 is input to the pitch extraction circuit 175, which obtains the pitch frequency. The pitch extraction circuit 175 corresponds to the part of FIG. 36 excluding the low-pass filter 161. The output of the low-pass filter 171 is also input to the power detection circuit (II) 173, which obtains its short-time average power. Let A be the output of the power detection circuit (I) 172 and B the output of the power detection circuit (II) 173. The control circuit 174 outputs the control signal E so as to raise the cutoff frequency of the low-pass filter 171 when B/A becomes smaller than a predetermined target value C, and to lower the cutoff frequency when B/A becomes larger than C.

For a low-pass filter whose cutoff frequency rises as the control signal E increases, the control circuit 174 may be configured as shown in FIG. 39: the arithmetic circuit 181 computes C − B/A, and its output signal D is passed through the integration circuit 182 to give the control signal E of the low-pass filter 171.
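The control law of FIG. 39 amounts to an integral controller acting on the error D = C − B/A. A discrete-time sketch follows; the gain, the time step, and the function name are assumptions introduced for illustration.

```python
def control_step(e: float, a: float, b: float, target_c: float, gain: float, dt: float) -> float:
    """One update of the control signal E.

    a: short-time power before the low-pass filter (output of circuit (I))
    b: short-time power after the low-pass filter (output of circuit (II))
    """
    if a <= 0:
        raise ValueError("input power must be positive")
    d = target_c - b / a        # arithmetic circuit: D = C - B/A
    return e + gain * d * dt    # integration circuit: accumulate D over time
```

When too little power passes the filter (B/A below C), D is positive and E grows, which raises the cutoff frequency; when too much passes, E shrinks and the cutoff frequency falls. The loop settles where B/A equals C.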

[Effects of the invention]

The speaker verification method of the present invention can be expected to provide the following effects.

(1) By dividing the speech section into blocks, creating feature quantities from various acoustic parameters, and performing speaker verification, more speaker-specific information than before can be captured in the feature quantities, and the verification rate improves.

(2) As preprocessing for block division, the speech uttered at registration is divided into blocks in advance by its power or the like, and the input speech signal is divided into blocks based on the reference blocks obtained by this preprocessing. This yields stable block division and a higher verification rate.

(3) By applying to block division a duration-control-type state transition model that uses the state of the first moment of the spectrum of each speech frame, the blocks can be roughly associated with one another, which enables stable block division. The amount of computation is also far smaller than for block division by DP matching or the like.

(4) Because the pitch period is used as a feature quantity together with the spectrum and the like, and the histogram of the pitch period is used as its feature quantity, both the pitch period values and the way they vary over time are reflected in the feature quantity itself. The pitch is thus converted into a feature quantity that is effective for speaker verification, which improves the verification rate. Furthermore, because the histogram is smoothed and normalized before being used as the dictionary feature quantity of the pitch period, the pitch period feature quantity is stable, which also improves the verification rate.

(5) When the feature quantities of the unknown speaker are compared with the registered dictionary feature quantities and the unknown speaker is judged to be the registered speaker if the similarity exceeds a predetermined threshold, predetermined thresholds are set for the spectral distance and the pitch distance of each block, cases are distinguished according to whether the spectrum and the pitch of each block each exceed their thresholds, a score is assigned to each case, and the scores are considered over all blocks. The spectral and pitch distances of each block are thereby reflected in the overall decision. Furthermore, by setting a plurality of predetermined thresholds for the spectrum and the pitch of each block, distinguishing cases according to the threshold range into which each falls, and assigning scores accordingly, misrecognition near the thresholds can be reduced.

(6) By using for pitch extraction a simple pitch extraction method built from a low-pass filter, counters, and the like, the verification rate can be improved without enlarging the system. In that case, dynamically changing the cutoff frequency of the low-pass filter over time reduces extraction errors such as double pitch and half pitch and enables highly accurate pitch extraction, which further improves the verification rate.

[Brief description of the drawings]

FIG. 1 is a block diagram of a first embodiment of the present invention; FIG. 2 is a processing flowchart of the part of FIG. 1 related to block division; FIG. 3 is a diagram showing a specific example of block division according to FIG. 1; FIG. 4 is a block diagram of a second embodiment of the present invention; FIG. 5 is a diagram showing a specific example of block division according to FIG. 4; FIG. 6 is a block diagram of a third embodiment of the present invention; FIG. 7 is a diagram showing a specific example of block division according to FIG. 6; FIG. 8 is a block diagram of a fourth embodiment of the present invention; FIG. 9 is a diagram showing a specific example of block division according to FIG. 8; FIG. 10 is a block diagram of a fifth embodiment of the present invention; FIG. 11 is a processing flowchart of the threshold adjustment unit in FIG. 10; FIG. 12 is a diagram showing a specific example of block division according to FIG. 10; FIG. 13 is a block diagram of a sixth embodiment of the present invention; FIG. 14 is a diagram showing the processing procedure of one embodiment of the block division preprocessing and block division in FIG. 13; FIG. 15 is a diagram showing the processing procedure of another embodiment of the same; FIG. 16 is a diagram showing a specific example corresponding to FIG. 15; FIG. 17 is a diagram showing the processing procedure of yet another embodiment of the same; FIG. 18 is a diagram showing a specific example corresponding to FIG. 17; FIG. 19 is a diagram showing the processing procedure of still another embodiment of the same; FIG. 20 is a block diagram of a seventh embodiment; FIG. 21 is a diagram showing a specific example of block division according to FIG. 20; FIG. 22 is a flowchart showing an example of the procedure for setting division points in FIG. 20; FIG. 23 is a diagram explaining the duration-control-type state transition model using the first moment of the spectrum of the input speech; FIG. 24 is a diagram showing the result of re-division for FIG. 21; FIG. 25 is a block diagram of an eighth embodiment of the present invention; FIG. 26 is a block diagram of a ninth embodiment of the present invention; FIG. 27 is a block diagram of a tenth embodiment of the present invention; FIG. 28 is a block diagram of an eleventh embodiment of the present invention; FIG. 29 is a block diagram of a twelfth embodiment of the present invention; FIG. 30 is a processing flow diagram for explaining another embodiment of the feature generation unit in FIG. 29; FIG. 31 is a diagram showing a specific example of the processing of FIG. 30; FIG. 32 is a block diagram of a thirteenth embodiment of the present invention; FIG. 33 is a diagram showing a specific example of the scores in FIG. 32; FIG. 34 is a block diagram of a fourteenth embodiment of the present invention; FIG. 35 is a diagram showing a specific example of the scores in FIG. 34; FIG. 36 is a diagram showing one embodiment of pitch extraction; FIG. 37 is a signal waveform diagram of the parts of FIG. 36; FIG. 38 is a diagram showing another embodiment of pitch extraction; and FIG. 39 is a diagram showing a specific example of the control circuit in FIG. 38.

Reference numerals: 10: microphone; 2: controlled object; 11: acoustic parameter conversion unit; 12: speech section detection unit; 13: voiced/silent discrimination unit; 14: voiced section extraction unit (block division unit); 15: feature generation unit; 16: standard pattern storage unit; 17: distance calculation unit; 18: decision unit; 63, 74: block division preprocessing unit; 75: reference block setting unit; 76: re-division processing unit; 121-1, 141-1, 151-1: spectrum extraction unit; 121-2, 141-2, 151-2: pitch extraction unit; 161, 171: low-pass filter.

Continued from front page

(31) Priority claim number: Japanese Patent Application No. 63-217585
(32) Priority date: August 31, 1988
(33) Priority claim country: Japan (JP)
(31) Priority claim number: Japanese Patent Application No. 63-265054
(32) Priority date: October 20, 1988
(33) Priority claim country: Japan (JP)
(56) References: JP-A-60-48098 (JP, A); JP-A-56-138798 (JP, A); JP-A-61-138299 (JP, A); JP-A-56-158386 (JP, A); JP-A-60-238897 (JP, A); JP-A-61-21499 (JP, A); JP-A-56-119198 (JP, A); Yasunaga Niimi, "Information Science Course E.19.3 Speech Recognition," Kyoritsu Shuppan Co., Ltd. (1979), p. 211, l. 9
(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 531; G10L 3/00 513; G10L 3/00 515; G10L 9/00 301; JICST file (JOIS)

Claims

(57) [Claims]

1. A speaker verification method in which a feature amount registered by a registered speaker is compared with a feature amount input by an unknown speaker, and the unknown speaker is judged to be the registered speaker when the similarity exceeds a certain threshold, the method comprising: means for dividing the input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for detecting a voice section from the input voice signal and dividing the voice section into a plurality of blocks on the time axis according to the values of the acoustic parameters; and means for generating a speaker-specific feature amount for each block, registering the feature amount for each speaker at the time of registration, and comparing the registered feature amount with the feature amount of the unknown speaker at the time of verification; characterized in that the block dividing means classifies the input voice into voiced, unvoiced, and silent sections using the power, the rate of change of the spectrum, and the pitch rate of the input voice, extracts the voiced sections, and treats each voiced section as one block.

2. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for detecting a voice section from the input voice signal and dividing the voice section into a plurality of blocks on the time axis; and means for generating a speaker-specific feature amount for each block, registering the feature amount for each speaker at the time of registration, and comparing the feature amount of an unknown speaker with the registered feature amount at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount registered by the registered speaker and the feature amount input by the unknown speaker exceeds a certain threshold; characterized in that pre-processing means is provided which, at the time of registration, divides the uttered voice into blocks in advance according to its power, power dips, spectrum change rate, and the like, and the input voice is divided into blocks based on a reference block.

3. The speaker verification method according to claim 2, wherein only the first voice input at the time of registration is divided into blocks using the power, power dips, spectrum change rate, and the like of the voice, and the voice sections of the second and subsequent input voices are then divided into blocks.

4. The speaker verification method according to claim 3, wherein DP matching with the second and subsequent voices is performed using the first voice at the time of registration as a reference pattern, and the voice sections of the second and subsequent voices that correspond to the blocks of the first voice are divided as blocks.
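The block transfer by DP matching described in claim 4 can be illustrated with a minimal sketch. It assumes a plain dynamic time warping recursion with Euclidean frame distance; the function names `dtw_path` and `transfer_blocks`, the step pattern, and the inclusive `(start, end)` block representation are illustrative assumptions, not details taken from the patent.

```python
def dtw_path(ref, inp):
    """Return the DP-matching (DTW) alignment path as (ref_frame, inp_frame) pairs."""
    def dist(a, b):
        # Euclidean distance between two frame feature vectors (assumed metric).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def transfer_blocks(ref, ref_blocks, inp):
    """Map inclusive (start, end) frame ranges of the reference voice onto inp."""
    path = dtw_path(ref, inp)
    result = []
    for start, end in ref_blocks:
        matched = [j for i, j in path if start <= i <= end]
        result.append((min(matched), max(matched)))
    return result


# First registered voice (reference) and a slower second utterance.
reference = [[0.0], [0.0], [5.0], [5.0]]
second = [[0.0], [0.0], [0.0], [5.0], [5.0], [5.0]]
blocks = transfer_blocks(reference, [(0, 1), (2, 3)], second)
```

In this sketch, adjacent transferred blocks may share a boundary frame when one input frame aligns with reference frames on both sides of a block boundary; a practical system would need a rule for resolving such ties.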

5. The speaker verification method according to claim 3, wherein a feature amount for each block is generated by, for example, averaging in the time axis direction the acoustic parameters within each block of the first voice at the time of registration, and, using this as a reference pattern, the voice sections of the second and subsequent voices that correspond to the blocks of the first voice are divided into blocks using a duration-control-type state transition model or the like.

6. The speaker verification method according to claim 3, wherein the second and subsequent voices are divided in advance into a plurality of semi-blocks according to the power, power dips, spectrum change rate, and the like of the voice, the feature amount within each semi-block is calculated, and block division is performed by DP matching with the in-block feature amounts of the first voice.

7. The speaker verification method according to claim 2, wherein all voices uttered at the time of registration are each divided into blocks using their power, power dips, spectrum change rate, and the like, a representative number of blocks is obtained, and the voices at the time of registration and the input voice at the time of verification are divided into blocks in accordance with that number of blocks.

8. The speaker verification method according to claim 7, wherein DP matching is performed between a voice having the representative number of blocks, used as a reference pattern, and a voice having a different number of blocks, and the voice sections corresponding to the blocks of the representative voice are divided as blocks.

9. The speaker verification method according to claim 7, wherein a feature amount for each block is generated by, for example, averaging in the time direction the acoustic parameters within each block of a voice having the representative number of blocks, and, using these as a reference pattern, block division of the other voices corresponding to the representative blocks is performed using a duration-control-type state transition model or the like.

10. The speaker verification method according to claim 7, wherein an input voice is divided in advance into a plurality of semi-blocks according to the power, power dips, spectrum change rate, and the like of the voice, the feature amount within each semi-block is calculated, and block division is performed by DP matching with the per-block feature amounts of the voice indicating the representative number of blocks.

11. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for detecting a voice section from the input voice signal and dividing the voice section into a plurality of blocks on the time axis; and means for generating a speaker-specific feature amount for each block, registering the feature amount for each speaker at the time of registration, and comparing the feature amount of an unknown speaker with the registered feature amount at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount registered by the registered speaker and the feature amount input by the unknown speaker exceeds a certain threshold; characterized in that a spectrum is used as the acoustic parameter; at the time of registration, each voice is divided into a plurality of blocks, the blocks of the optimum voice are used as reference blocks, and all input voices at the time of registration are re-divided into blocks based on the reference blocks; and, at the time of verification, the unknown voice is divided into blocks based on the reference blocks created at the earlier registration.

12. The speaker verification method according to claim 11, wherein, to obtain the reference blocks at the time of registration, all voices uttered at the time of registration are divided into a plurality of blocks by comparing the first moment of the spectrum with a predetermined threshold, and the blocks of the voice having the smallest number of divisions are used as the reference blocks.
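The reference-block selection in claim 12 can be sketched as follows. The sketch assumes that the first moment is the amplitude-weighted mean channel index of each frame's spectrum and that a new block starts whenever the moment crosses the threshold; both are illustrative readings, and the names `first_moment`, `divide_by_moment`, and `choose_reference` are hypothetical.

```python
def first_moment(spectrum):
    """Amplitude-weighted mean channel index (channels numbered from 1)."""
    total = sum(spectrum)
    if total == 0:
        return 0.0
    return sum(k * s for k, s in enumerate(spectrum, start=1)) / total


def divide_by_moment(frames, threshold):
    """Split a non-empty frame sequence wherever the first moment crosses the threshold.

    Returns inclusive (start, end) frame ranges.
    """
    blocks = []
    start = 0
    prev = first_moment(frames[0]) >= threshold
    for t in range(1, len(frames)):
        cur = first_moment(frames[t]) >= threshold
        if cur != prev:
            blocks.append((start, t - 1))
            start, prev = t, cur
    blocks.append((start, len(frames) - 1))
    return blocks


def choose_reference(utterances, threshold):
    """Pick the block division of the utterance with the fewest blocks."""
    divisions = [divide_by_moment(u, threshold) for u in utterances]
    return min(divisions, key=len)


low = [4.0, 1.0, 0.0, 0.0]   # energy in low channels, first moment 1.2
high = [0.0, 0.0, 1.0, 4.0]  # energy in high channels, first moment 3.8
utterance_a = [low, low, high, high, low]
utterance_b = [low, high, low, high]
reference_blocks = choose_reference([utterance_a, utterance_b], threshold=2.5)
```

Here `utterance_a` yields three blocks and `utterance_b` four, so the division of `utterance_a` is chosen as the reference.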

13. The speaker verification method according to claim 11, wherein the average value of the first moment within each of the reference blocks adopted at the time of registration is obtained, and the voice is divided into blocks according to a model in which the average of the first moment is used as the state of each block.

14. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for dividing a voice section into a plurality of blocks on the time axis; and means for generating a speaker-specific feature amount for each block, registering the feature amount for each speaker at the time of registration, and comparing the feature amount of an unknown speaker with the registered feature amount for each corresponding block at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount registered by the registered speaker and the feature amount input by the unknown speaker exceeds a threshold; characterized in that the average value in the time axis direction of the spectrum of each frame in the block (the average spectrum in the block) and the slope of the least-squares fit line of that average spectrum are used as the feature amounts of the block.
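The two block features named in claim 14 can be computed as in the following minimal sketch. It assumes each frame is a list of spectral channel values and that the least-squares line is fitted against the channel index; the function name `block_features` is hypothetical.

```python
def block_features(frames):
    """Return (average spectrum, least-squares slope of the average spectrum)."""
    n_frames = len(frames)
    n_channels = len(frames[0])

    # Average each channel over all frames in the block (time-axis average).
    mean_spec = [sum(f[k] for f in frames) / n_frames for k in range(n_channels)]

    # Ordinary least-squares slope of mean_spec against channel index 0..n-1.
    xs = range(n_channels)
    x_mean = sum(xs) / n_channels
    y_mean = sum(mean_spec) / n_channels
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, mean_spec))
    slope = sxy / sxx if sxx else 0.0
    return mean_spec, slope


mean_spec, slope = block_features([[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]])
```

For the two example frames the average spectrum rises by one unit per channel, so the fitted slope is 1.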

15. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for detecting a voice section from the input voice signal and dividing the voice section into a plurality of blocks on the time axis; and means for generating a speaker-specific feature amount for each block, registering the feature amount for each speaker at the time of registration, and comparing the feature amount of an unknown speaker with the registered feature amount for each corresponding block at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount registered by the registered speaker and the feature amount input by the unknown speaker exceeds a threshold; characterized in that the positions of the peaks and valleys of the spectral outline of each frame in the block are extracted, the number of appearances of peaks and valleys in the block is accumulated for each frequency, and the accumulated counts are used as the feature amount of the block.
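The peak and valley accumulation in claim 15 can be sketched as follows. The sketch assumes a peak (valley) is a channel strictly greater (smaller) than both neighbours in the frame's spectral outline, which is one simple definition and not necessarily the one used in the patent; `peak_valley_counts` is a hypothetical name.

```python
def peak_valley_counts(frames):
    """Count, per frequency channel, how often a peak or valley occurs in a block."""
    n_channels = len(frames[0])
    peaks = [0] * n_channels
    valleys = [0] * n_channels
    for spec in frames:
        # Edge channels have only one neighbour and are skipped.
        for k in range(1, n_channels - 1):
            if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]:
                peaks[k] += 1
            elif spec[k] < spec[k - 1] and spec[k] < spec[k + 1]:
                valleys[k] += 1
    return peaks, valleys


block = [
    [1.0, 3.0, 1.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 1.0, 2.0],
]
peaks, valleys = peak_valley_counts(block)
```

The two count vectors together form a per-frequency histogram that can be compared between the registered and unknown speaker for each corresponding block.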

16. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters (spectrum, pitch period, etc.) for each frame; means for detecting a voice section from the input voice signal, dividing the voice section into a plurality of blocks on the time axis, and generating a speaker-specific feature amount using the acoustic parameters for each block; and means for registering the feature amount as a dictionary for each speaker at the time of registration and comparing the feature amount of an unknown speaker with the registered dictionary feature amount at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount input by the unknown speaker and the dictionary feature amount registered by the registered speaker exceeds a predetermined threshold; characterized in that the pitch period is used, and a pitch period histogram for each block serves as the speaker-specific feature amount.

17. The speaker verification method according to claim 16, wherein, in creating the pitch-period feature amount for each block, at the time of registration the pitch histograms in the corresponding blocks of all uttered voice samples are summed, the sum is smoothed in the pitch period direction, and the value normalized by the total power is used as the dictionary feature amount for pitch; and, at the time of verification, the pitch histogram in the block is smoothed in the pitch period direction.
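The dictionary feature of claim 17 can be illustrated with a minimal sketch. Several details are assumptions made only for illustration: uniform histogram bins, a three-point (1, 2, 1) smoothing kernel, and "normalized by the total power" read as division by the Euclidean norm of the histogram. All function names are hypothetical.

```python
def pitch_histogram(pitch_periods, n_bins, max_period):
    """Histogram of pitch periods in one block, using uniform bins over (0, max_period]."""
    hist = [0.0] * n_bins
    for p in pitch_periods:
        if 0 < p <= max_period:
            idx = min(int(p * n_bins / max_period), n_bins - 1)
            hist[idx] += 1.0
    return hist


def smooth(hist):
    """Smooth along the pitch-period axis with a (1, 2, 1) / 4 kernel."""
    n = len(hist)
    out = []
    for i in range(n):
        left = hist[i - 1] if i > 0 else 0.0
        right = hist[i + 1] if i < n - 1 else 0.0
        out.append((left + 2.0 * hist[i] + right) / 4.0)
    return out


def normalize(hist):
    """Scale so that the sum of squares (the assumed 'power') equals 1."""
    power = sum(h * h for h in hist) ** 0.5
    return [h / power for h in hist] if power else list(hist)


def dictionary_feature(block_pitches_per_sample, n_bins, max_period):
    """Sum the histograms of one block over all registration samples, then smooth and normalize."""
    total = [0.0] * n_bins
    for pitches in block_pitches_per_sample:
        h = pitch_histogram(pitches, n_bins, max_period)
        total = [a + b for a, b in zip(total, h)]
    return normalize(smooth(total))


# Pitch periods (in ms) of the same block in two registration utterances.
feature = dictionary_feature([[5.0, 5.2], [5.1]], n_bins=4, max_period=8.0)
```

At verification, the same `pitch_histogram` and `smooth` steps would be applied to the unknown speaker's block before it is compared with the dictionary feature.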

18. A speaker verification method comprising: means for dividing an input voice signal into frames of a predetermined time unit and converting it into acoustic parameters for each frame; means for detecting a voice section from the input voice signal, dividing the voice section into a plurality of blocks on the time axis, and generating a speaker-specific feature amount using the acoustic parameters for each block; and means for registering the feature amount as a dictionary for each speaker at the time of registration and comparing the feature amount of an unknown speaker with the registered dictionary feature amount at the time of verification; the unknown speaker being judged to be the registered speaker when the similarity between the feature amount input by the unknown speaker and the dictionary feature amount registered by the registered speaker exceeds a predetermined threshold; characterized in that the spectrum and the pitch are used as speaker-specific feature amounts; a predetermined threshold is set for each of the spectrum distance and the pitch distance of each block; a score is given to each according to whether the spectrum distance and the pitch distance of each block exceed their respective thresholds; and whether the unknown speaker is the registered speaker is determined by comparing the scores individually with a predetermined threshold, or by comparing their sum over all blocks with a predetermined threshold.

19. The speaker verification method according to claim 18, wherein a plurality of thresholds are set for each of the spectrum distance and the pitch distance of each block, a score is given for each block according to which of the ranges delimited by the thresholds the spectrum distance and the pitch distance fall in, and the scores are summed over all blocks and compared with a predetermined threshold to determine whether the unknown speaker is the registered speaker.
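The multi-threshold scoring of claims 18 and 19 can be sketched as follows. The threshold values, the score assigned to each range, and the convention that a smaller distance earns a higher score are illustrative assumptions; `range_score` and `verify` are hypothetical names.

```python
def range_score(distance, thresholds, scores):
    """Return the score of the range that the distance falls in.

    thresholds must be ascending; scores has len(thresholds) + 1 entries,
    the last one applying to distances above every threshold.
    """
    for limit, score in zip(thresholds, scores):
        if distance <= limit:
            return score
    return scores[-1]


def verify(spec_dists, pitch_dists, spec_thresholds, pitch_thresholds,
           scores, accept_threshold):
    """Sum per-block spectrum and pitch scores and compare with the acceptance threshold."""
    total = 0
    for d_spec, d_pitch in zip(spec_dists, pitch_dists):
        total += range_score(d_spec, spec_thresholds, scores)
        total += range_score(d_pitch, pitch_thresholds, scores)
    return total, total >= accept_threshold


# Three blocks; a small distance earns 2 points, a medium one 1, a large one 0.
total, accepted = verify(
    spec_dists=[0.5, 1.5, 3.0],
    pitch_dists=[0.2, 0.4, 2.5],
    spec_thresholds=[1.0, 2.0],
    pitch_thresholds=[0.3, 0.6],
    scores=[2, 1, 0],
    accept_threshold=5,
)
```

With a single threshold per distance, the same functions reduce to the pass/fail scoring of claim 18.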

20. The speaker verification method according to any one of claims 16 to 19, wherein, in extracting the pitch period, the input voice signal is passed through a low-pass filter to remove high-frequency components, the filtered signal is binarized with a plurality of thresholds, the time interval between points is measured, and the pitch period or pitch frequency is estimated from the measured time interval.
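The pitch extraction steps of claim 20 can be illustrated with a minimal software sketch. It swaps in a moving-average filter for the low-pass filter, takes the binarization thresholds as fractions of the filtered peak amplitude, measures intervals between rising edges of each binarized signal, and uses the median interval as the estimate. All of these choices, and all function names, are assumptions made for illustration; the patent's circuit may differ.

```python
import math
from statistics import median


def low_pass(signal, width):
    """Moving-average filter used here as a stand-in low-pass filter."""
    out = []
    acc = 0.0
    for i, x in enumerate(signal):
        acc += x
        if i >= width:
            acc -= signal[i - width]
        out.append(acc / min(i + 1, width))
    return out


def rising_edges(signal, level):
    """Indices where the signal binarized at `level` switches from 0 to 1."""
    binary = [1 if x > level else 0 for x in signal]
    return [i for i in range(1, len(binary)) if binary[i] and not binary[i - 1]]


def estimate_pitch_period(signal, sample_rate, width=5, fractions=(0.3, 0.5, 0.7)):
    """Estimate the pitch period in seconds, or None if no interval is found."""
    smoothed = low_pass(signal, width)
    peak = max(smoothed)
    intervals = []
    for frac in fractions:
        edges = rising_edges(smoothed, frac * peak)
        intervals += [b - a for a, b in zip(edges, edges[1:])]
    if not intervals:
        return None
    return median(intervals) / sample_rate


# Synthetic test signal: a 100 Hz tone plus a weaker 1600 Hz component.
rate = 8000
wave = [
    math.sin(2 * math.pi * 100 * n / rate) + 0.2 * math.sin(2 * math.pi * 1600 * n / rate)
    for n in range(800)
]
period = estimate_pitch_period(wave, rate)
```

The pitch frequency follows as the reciprocal of the returned period.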

21. The speaker verification method according to claim 20, wherein, in extracting the pitch period, the cutoff frequency of the low-pass filter is changed so that the ratio between the short-time average power of the input voice waveform and the short-time average power of the voice waveform passed through the low-pass filter becomes constant.