JPH10228296A

JPH10228296A - Acoustic signal separation method

Info

Publication number: JPH10228296A
Application number: JP9031813A
Authority: JP
Inventors: Kunio Kashino; 邦夫柏野; Hiroshi Murase; 洋村瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-02-17
Filing date: 1997-02-17
Publication date: 1998-08-25
Anticipated expiration: 2017-02-17
Also published as: JP3501199B2

Abstract

(57) [Abstract] [Problem] To separate the signals of individual sound sources with high accuracy even when many sound sources are present and they vary in diverse ways. [Solution] Onsets of sounds are detected from the power fluctuation of the input acoustic signal, and the input acoustic signal is segmented at those onsets (13). For each segmented input acoustic signal z, stored waveforms are selected as candidates from the storage means 14 (15). A stored waveform qualifies when its fundamental frequency falls within a certain range of the fundamental frequency of z. Next, the impulse responses h_n of FIR filters are obtained that minimize the mean squared error between z and the sum of the FIR-filtered candidates. Each candidate is then filtered with its h_n, and if the resulting power is large, the sound source of that candidate is judged to be contained in z.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to an acoustic signal separation method. The method takes an acoustic signal in which sounds from a plurality of sound sources are mixed, and it separates and extracts the sound of each individual source contained in that signal.

[0002]

[Prior Art] One known acoustic signal separation method separates sound sources with a filter device, such as a comb filter, that passes only specific frequency bands. However, this method cannot separate sources properly when several of them share a frequency band. Separation is therefore generally difficult when many sound sources are present.

[0003] Another known method performs frequency analysis on the input acoustic signal and then separates the signal by clustering, based on features of the power spectrum. However, this method processes the data bottom-up. As a result, it cannot handle signals properly when noise is mixed in or when many sound sources are present.

[0004] A further known method stores models of sound sources in the device, for example as power spectra. It separates the acoustic signal by selecting the models that fit the input signal and matching them against it. However, because the models are fixed, this method cannot cope with the diversity and variation of sound sources. Consequently, none of the above methods can be expected to separate acoustic signals adequately when many sound sources are present and those sources are diverse and variable.

[0005]

[Problem to Be Solved by the Invention] An object of the present invention is to provide an acoustic signal separation method that separates acoustic signals adequately even when many sound sources are present and those sources are diverse and variable. In other words, the method should separate acoustic signals with higher accuracy than known methods.

[0006]

[Means for Solving the Problem] According to the present invention, the input acoustic signal is first divided in time. For each segmented input acoustic signal, every waveform that may be contained in it is selected from the waveforms held in a waveform storage means; these are the candidate waveforms. Filter coefficients are then determined that minimize the mean squared error between the segmented input acoustic signal and the sum of the filtered candidate waveforms. Finally, each candidate waveform is filtered with the coefficients so obtained.

[0007]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the functional configuration of an acoustic signal separation apparatus to which the method of the invention is applied. As an application example, the description takes the case of separating a musical performance into the parts played by each instrument.

[0008] The acoustic signal separation apparatus 10 receives the acoustic signal waveform of a mixed sound at input terminal 11. It outputs an acoustic signal waveform for each sound source at output terminal 12. The input acoustic signal (waveform) is sampled at, for example, 48 kHz or 96 kHz and is input as a time series of the digital sample values. The waveform segmentation means 13 detects the onset components of the sounds contained in the acoustic signal from input terminal 11, and the signal is divided in time at those onsets. Alternatively, the signal may be divided at fixed time intervals.

[0009] The waveform storage means 14 stores in advance templates of the sound source waveforms that the apparatus 10 is designed to handle. For each segment, the candidate waveform selection means 15 analyzes basic sound features of the input acoustic signal waveform, such as the fundamental frequency and the power envelope. Based on the results, it selects, from the waveforms stored in the waveform storage means 14, those that may be contained in the input acoustic signal waveform.

[0010] The coefficient determination means 16 filters each selected waveform, sums the results, and determines the filter coefficients that minimize the mean squared error between this sum and the input acoustic signal waveform. The determined coefficients are set in the filter operation means 17, which filters each of the waveforms selected by the candidate waveform selection means 15. The result of each filter operation is output at output terminal 12 as the separated output for the corresponding sound source.

[0011] The processing in means 13, 15, 16, and 17 is described in detail below. As shown in FIG. 2, the waveform segmentation means 13 first reads the input acoustic signal (step 101). It then detects the onsets of the sounds the signal contains, based on features such as its power fluctuation (step 102). Next, it outputs the portion of the input acoustic signal from the previously detected onset to the currently detected onset as a segmented acoustic signal (step 103). It then checks whether the input acoustic signal is still arriving (step 104). If so, the processing from step 102 onward is repeated; if the input has ended, the processing terminates.
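A minimal sketch of the onset detection and segmentation of steps 101 to 104 follows. It works on a whole recording at once rather than on a stream, and it relies on several assumptions not found in the source:

- Power is measured over non-overlapping frames.
- An onset is declared when a frame's power is at least `ratio` times that of the previous frame.
- The values of `frame_len`, `ratio`, and `floor`, and all function names, are hypothetical.

```python
def frame_powers(x, frame_len):
    """Mean power of consecutive non-overlapping frames of x."""
    return [sum(v * v for v in x[i:i + frame_len]) / frame_len
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def detect_onsets(x, frame_len=256, ratio=4.0, floor=1e-6):
    """Step 102 (hypothetical rule): an onset is the start of any frame whose power
    exceeds `floor` and is at least `ratio` times the previous frame's power."""
    p = frame_powers(x, frame_len)
    return [j * frame_len for j in range(1, len(p))
            if p[j] > floor and p[j] >= ratio * max(p[j - 1], floor)]


def segment(x, onsets):
    """Step 103: cut x from each detected onset to the next (plus the lead-in
    before the first onset and the tail after the last one)."""
    bounds = [0] + list(onsets) + [len(x)]
    return [x[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```

A real system would smooth the power curve and tune the threshold to the material. The simple ratio test only shows where the segment boundaries come from.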

[0012] As shown in FIG. 3, the candidate waveform selection means 15 first reads a segmented input acoustic signal produced by the waveform segmentation means 13 (step 201). It then extracts the frequency components of the segmented input acoustic signal (step 202) and sound features such as the fundamental frequency and the power envelope (step 203). These features are used to select the stored waveforms of sounds that may be contained in the segmented input acoustic signal.

The stored waveforms are held in advance in the waveform storage means 14 and are examined one by one (steps 204 to 208). First, the means checks whether any stored waveform remains unexamined (step 204). If so, one unexamined stored waveform is selected (step 205). Next, the fundamental frequency of that stored waveform is compared with the frequencies of the components extracted in step 202 to check whether it falls within a certain range (step 206). If it does not, the stored waveform is unlikely to be contained in the segmented input acoustic signal, and the process returns to step 204.

The range is determined, for example, as follows. Arrange the fundamental frequencies of the stored waveforms in ascending order. For a given fundamental frequency, the range extends from halfway down to the next lower fundamental frequency up to halfway up to the next higher one. Stored waveforms whose fundamental frequency matches an extracted component within this range become candidates. For example, when a stored waveform is provided for each semitone, successive semitones differ in frequency by about 6%. Waveforms whose fundamental frequency lies within ±3% of an extracted frequency are therefore taken as candidates.

If the frequency falls within the range in step 206, the means further checks whether the features are inconsistent (step 207), for example because the sound lies outside the range the instrument can produce. If there is an inconsistency, the stored waveform is unlikely to be contained in the segmented input acoustic signal, and the process returns to step 204. If there is none, the stored waveform is likely to be contained in the segmented input acoustic signal. It is added to the candidate waveforms (step 208), and the process returns to step 204. When no unexamined stored waveform remains in step 204, the candidate waveforms found so far are output (step 209) and the processing ends.
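A short sketch of the range test of steps 204 to 209 follows. It is hypothetical in three ways:

- The stored waveforms are given as a dictionary mapping a waveform name to its fundamental frequency.
- The consistency check of step 207 is a caller-supplied predicate `is_plausible`.
- At the two ends of the sorted list, where only one neighbour exists, the gap to that neighbour is mirrored.

On a semitone grid, 2^(1/12) ≈ 1.0595. The accepted band therefore comes out at roughly −2.8% to +3.0% around each stored fundamental frequency, consistent with the ±3% quoted above.

```python
def acceptance_ranges(grid):
    """For each fundamental frequency in the sorted grid, the band from halfway
    to the next lower value up to halfway to the next higher value. At the ends
    the single available gap is mirrored (an assumption of this sketch)."""
    f = sorted(grid)
    if len(f) < 2:
        raise ValueError("need at least two distinct stored fundamental frequencies")
    ranges = []
    for i, f0 in enumerate(f):
        below = f[i - 1] if i > 0 else f0 - (f[1] - f[0])
        above = f[i + 1] if i + 1 < len(f) else f0 + (f[-1] - f[-2])
        ranges.append((f0, (below + f0) / 2, (f0 + above) / 2))
    return ranges


def select_candidates(stored, segment_freqs, is_plausible=lambda name, f0: True):
    """stored: {waveform name: fundamental frequency}; segment_freqs: frequencies
    of the components extracted from the segment (step 202)."""
    bounds = {f0: (lo, hi) for f0, lo, hi in acceptance_ranges(set(stored.values()))}
    candidates = []
    for name, f0 in stored.items():                       # steps 204-205
        lo, hi = bounds[f0]
        if not any(lo <= f < hi for f in segment_freqs):  # step 206
            continue
        if not is_plausible(name, f0):                    # step 207
            continue
        candidates.append(name)                           # step 208
    return candidates                                     # step 209
```

Several instruments can share a fundamental frequency. For that reason the ranges are built over the distinct frequencies, and every stored waveform at a matching frequency becomes a candidate.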

[0013] As shown in FIG. 4, the coefficient determination means 16 first reads a segmented input acoustic signal produced by the waveform segmentation means 13 (step 301). It then reads the candidate waveforms selected by the candidate waveform selection means 15 (step 302). Next, it sets up simultaneous equations for the filter coefficients (step 303). These coefficients minimize the mean squared error between the segmented input acoustic signal and the sum of the filtered candidate waveforms. If FIR filters are used, the waveform obtained by filtering a candidate waveform can be written as

$$y_n(k) = \sum_{m=0}^{M-1} h_n(m)\, r_n(k-m) \qquad (1)$$

where:
- $k$ is the sampled time;
- $n$ is the index counting the candidate waveforms;
- $y_n(k)$ is the value at time $k$ of the filtered result;
- $h_n$ is the impulse response of the FIR filter;
- $r_n$ is the candidate waveform;
- $M$ is the filter order.

The mean squared error between the sum of the filtered candidate waveforms and the segmented input acoustic signal can be written as

$$J = E\left[\left\{ z(k) - \sum_{n=0}^{N-1} y_n(k) \right\}^2\right] \qquad (2)$$

where $z(k)$ is the value of the segmented input acoustic signal waveform at time $k$, $N$ is the number of candidate waveforms, and $E$ denotes the time average. A necessary condition for minimizing $J$ is that the partial derivative $\partial J / \partial h_i(l)$ be zero for every $i = 0, \ldots, N-1$ and $l = 0, \ldots, M-1$. This condition yields the $N \times M$ simultaneous linear equations

$$\sum_{n=0}^{N-1} \sum_{m=0}^{M-1} E\left[ r_i(k-l)\, r_n(k-m) \right] h_n(m) = E\left[ r_i(k-l)\, z(k) \right] \qquad (3)$$

Equations (3) are set up in step 303 and solved in step 304. Because the number of unknowns equals the number of equations, they can be solved by inverting the coefficient matrix. The obtained coefficients are output in step 305.
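Equations (1) to (3) can be checked numerically with a small pure-Python sketch. The sketch makes two assumptions:

- Samples outside the segment are zero.
- $E$ is the plain average over the segment's K samples.

It solves (3) by Gaussian elimination, which gives the same solution as inverting the coefficient matrix. The function names are illustrative only.

```python
def sample(x, k):
    """x(k), taken as zero outside the recorded segment (assumption)."""
    return x[k] if 0 <= k < len(x) else 0.0


def fir_filter(r, h):
    """Equation (1): y(k) = sum_m h(m) r(k - m), one output per input sample."""
    return [sum(h[m] * sample(r, k - m) for m in range(len(h))) for k in range(len(r))]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda row: abs(aug[row][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("coefficient matrix is (nearly) singular")
        aug[c], aug[p] = aug[p], aug[c]
        for row in range(c + 1, n):
            f = aug[row][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[row][j] -= f * aug[c][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        x[row] = (aug[row][n] - sum(aug[row][j] * x[j] for j in range(row + 1, n))) / aug[row][row]
    return x


def fir_coefficients(z, cands, M):
    """Build and solve the N*M normal equations (3); returns h[n][m]."""
    K = len(z)
    idx = [(n, m) for n in range(len(cands)) for m in range(M)]  # unknown h_n(m)
    # Row (i, l), column (n, m): E[r_i(k - l) r_n(k - m)], E = average over k < K.
    A = [[sum(sample(cands[i], k - l) * sample(cands[n], k - m) for k in range(K)) / K
          for (n, m) in idx] for (i, l) in idx]
    b = [sum(sample(cands[i], k - l) * z[k] for k in range(K)) / K for (i, l) in idx]
    h = solve(A, b)
    return [h[n * M:(n + 1) * M] for n in range(len(cands))]
```

As a quick sanity check, build z exactly as a sum of FIR-filtered candidates; the solver should then recover the generating coefficients up to rounding. A practical implementation would exploit the correlation structure of the matrix instead of building it term by term.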

[0014] As shown in FIG. 5, the filter operation means 17 first reads the filter coefficients obtained by the coefficient determination means 16 (step 401). It then reads the candidate waveforms selected by the candidate waveform selection means 15 (step 402). Next, it performs the filter operation of equation (1) for the FIR case (step 403) and outputs the resulting waveforms (step 404). These waveforms are the signal waveforms separated for each sound source.

[0015] As described above, the present invention selects candidates from the stored waveforms. It then determines the filter coefficients so that the squared error between the segmented input acoustic signal and the sum of the filtered candidate waveforms is minimized. The resulting coefficients therefore match the characteristics of the segmented input acoustic signal. For any selected candidate that is not contained in the segmented input acoustic signal, the resulting filter passes essentially nothing. Conversely, a candidate stored waveform may be close to a sound source waveform present in the segmented input acoustic signal. In that case, the coefficients reflect how the candidate waveform is deformed relative to that source waveform. Filtering the candidate then absorbs this deformation, and a large output is obtained.

[0016] For example, FIGS. 6A and 6B show the note F4 played with almost the same strength on a piano made by Company A and on one made by Company B. Both figures cover the same time interval, 100 ms to 130 ms after the onset. The two waveforms are similar overall but differ from each other. Processing the waveform of FIG. 6B with a 40th-order FIR filter brings it considerably closer to the waveform of FIG. 6A, as shown in FIG. 6C.

[0017] Accordingly, when each candidate stored waveform is filtered by the filter operation means 17, two cases arise. For a candidate identical to a sound source waveform in the segmented input acoustic signal, the average output power is large and depends on that source's mixing ratio. For a candidate not contained in the segmented input acoustic signal, the output is zero. Moreover, a candidate stored waveform may differ somewhat from the corresponding source waveform in the segmented input acoustic signal. Even then, the difference is corrected adaptively and the filtered output is large.
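Steps 403 and 404, together with the power-based decision described in paragraph [0017] and in the abstract, might be sketched as below. The source says only that a candidate is judged present when its filtered output power is large. The relative threshold `rel_threshold`, a fraction of the largest candidate power, is a hypothetical choice for this illustration, as are the function names.

```python
def fir_filter(r, h):
    """Equation (1) for one candidate; samples before the segment start count as zero."""
    return [sum(h[m] * r[k - m] for m in range(len(h)) if k - m >= 0)
            for k in range(len(r))]


def mean_power(y):
    """Time-averaged power of a waveform."""
    return sum(v * v for v in y) / len(y) if y else 0.0


def detect_sources(cands, filters, rel_threshold=0.1):
    """Filter each candidate with its own coefficients (step 403) and flag as present
    those whose average output power reaches rel_threshold times the largest one
    (a hypothetical decision rule)."""
    outs = {name: fir_filter(r, filters[name]) for name, r in cands.items()}
    powers = {name: mean_power(y) for name, y in outs.items()}
    top = max(powers.values(), default=0.0)
    present = [name for name, p in powers.items() if top > 0 and p >= rel_threshold * top]
    return outs, powers, present
```

The filtered waveforms in `outs` are the per-source outputs of step 404. The power check only labels which of them carry a source.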

[0018] For the processing in the coefficient determination means 16 to be effective, the fundamental frequency and phase of the candidate stored waveform r should match those of the sound source contained in the segmented input acoustic signal z. This is because the filter used in the coefficient determination means 16 cannot change the frequency of a signal. It is therefore advisable to perform waveform synchronization, which continuously aligns the phase of r with the phase of the corresponding source component in z. This waveform synchronization is performed, for example, as follows.

[0019] As shown in FIG. 7, the segmented input acoustic waveform, that is, the reference waveform z, is read (step 301). Frequency analysis is performed on it by a method such as a bandpass filter bank (step 302). The frequency analysis yields a power representation on the time-frequency plane, in which the local maxima (local peaks) of power along the frequency direction are found (step 303). Next, local peaks that are continuous in time are connected into sequences (step 304). Such a time-connected sequence of local peaks is called a frequency component, and it represents one of the periodicities present in the original waveform. These frequency components are output as periodicity information (step 305). The candidate stored waveform r is processed in the same way to obtain its periodicity information.
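A rough sketch of the local-peak picking and linking of steps 303 and 304 follows. The power representation is assumed to be a list of frames, each a list of band powers. A peak joins an existing track when its band index differs by at most `max_jump` from the track's last peak. This greedy nearest-neighbour rule, the parameters, and the names are assumptions of the sketch, not details from the source.

```python
def local_peaks(frame, min_power=0.0):
    """Step 303: band indices that are strict local maxima along frequency."""
    return [f for f in range(1, len(frame) - 1)
            if frame[f] > frame[f - 1] and frame[f] > frame[f + 1] and frame[f] > min_power]


def link_peaks(P, max_jump=1, min_power=0.0):
    """Step 304: chain peaks of successive frames into frequency components.
    P is a list of frames (lists of band powers); each returned component is a
    list of (frame index, band index) pairs."""
    finished, active = [], []
    for t, frame in enumerate(P):
        peaks = local_peaks(frame, min_power)
        used, still_active = set(), []
        for track in active:
            last = track[-1][1]
            match = min((p for p in peaks if p not in used and abs(p - last) <= max_jump),
                        key=lambda p: abs(p - last), default=None)
            if match is None:
                finished.append(track)        # the component ends here
            else:
                used.add(match)
                track.append((t, match))
                still_active.append(track)
        still_active += [[(t, p)] for p in peaks if p not in used]  # new components
        active = still_active
    return finished + active
```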

[0020] Next, as shown in FIG. 8, a fundamental frequency that is nearly the same in the segmented input acoustic signal waveform z and the candidate stored waveform r is selected from the periodicities present in each. The center frequency of a bandpass filter is set to that frequency. The bandpass filter is applied to z to obtain an output waveform (step 402), and the time series of the phase of this output is stored (step 403).

The bandpass filter output is close to a sine wave. Each zero-crossing time is therefore obtained from the sample values at times k and k+1, just before and after a sign change in the sinusoidal time series. The zero crossings give the period of the sine wave, and from that the phase value (phase angle) at each time is obtained.

Next, the same bandpass filter used in step 402 is applied to the candidate stored waveform to obtain an output waveform (step 404), and the time series of its phase is stored (step 405). Finally, the difference between the two phase time series stored in steps 403 and 405 is computed, and the resulting phase-difference time series is output (step 406).
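The zero-crossing phase estimate of steps 402 to 406 can be sketched as follows. The sketch assumes the bandpass-filtered signals are already available as nearly sinusoidal sample lists. Its remaining conventions are also assumptions, not details taken from the source:

- Upward zero crossings are located by linear interpolation between samples k and k + 1.
- The interval between successive crossings is treated as one period, with phase 0 at each crossing.
- The phase difference is wrapped to [−π, π).

```python
import math


def upward_zero_crossings(x):
    """Fractional sample times where x goes from negative to non-negative,
    located by linear interpolation between samples k and k + 1."""
    return [k + x[k] / (x[k] - x[k + 1])
            for k in range(len(x) - 1) if x[k] < 0 <= x[k + 1]]


def phase_series(x):
    """Phase angle in [0, 2*pi) at each sample between the first and last upward
    zero crossing (None elsewhere); each crossing is taken as phase 0."""
    zc = upward_zero_crossings(x)
    phase = [None] * len(x)
    for t0, t1 in zip(zc, zc[1:]):  # one period between successive crossings
        for k in range(math.ceil(t0), math.ceil(t1)):
            phase[k] = 2 * math.pi * (k - t0) / (t1 - t0)
    return phase


def phase_difference(z_band, r_band):
    """Per-sample phase of z minus phase of r, wrapped to [-pi, pi) (step 406)."""
    out = []
    for a, b in zip(phase_series(z_band), phase_series(r_band)):
        if a is None or b is None:
            out.append(None)
        else:
            out.append((a - b + math.pi) % (2 * math.pi) - math.pi)
    return out
```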

[0021] This phase-difference time series is converted into a time-difference time series by equation (4):

$$\delta t(k) = \frac{1}{2\pi f}\, \delta p(k) \qquad (4)$$

where $k$ is the time, $\delta t(k)$ is the time difference at time $k$, $f$ is the center frequency of the bandpass filter, and $\delta p(k)$ is the phase difference at time $k$ in the phase-difference time series.

[0022] Each sample value of the candidate stored waveform time series r is delayed or advanced according to the corresponding time difference in the time-difference time series. The result is a candidate stored waveform whose instantaneous phase is synchronized with the fundamental frequency of the segmented input acoustic signal z.
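Equation (4) and the sample shifting of paragraph [0022] might be combined as below. With f expressed in cycles per sample, δt(k) comes out in samples. Two choices in the sketch are assumptions, not details from the source:

- The phase difference is taken as the phase of z minus the phase of r, so the synchronized candidate is r(k + δt(k)).
- Values of r between samples are evaluated by linear interpolation.

```python
import math


def phase_to_time(dp, f):
    """Equation (4): dt(k) = dp(k) / (2*pi*f). With f in cycles per sample,
    dt is in samples; None entries (phase unknown) are passed through."""
    return [None if d is None else d / (2 * math.pi * f) for d in dp]


def value_at(x, t):
    """Linearly interpolated value of x at fractional index t, clamped to the ends."""
    t = min(max(t, 0.0), len(x) - 1.0)
    i = min(int(t), len(x) - 2)
    frac = t - i
    return x[i] * (1.0 - frac) + x[i + 1] * frac


def synchronize(r, dt):
    """Shift each sample of r by its time difference: r_sync(k) = r(k + dt(k)).
    Positive dt advances r, negative dt delays it; None leaves the sample as is."""
    return [r[k] if d is None else value_at(r, k + d) for k, d in enumerate(dt)]
```

Because each sample gets its own shift, the correction can follow gradual pitch or phase drift within a segment rather than applying one global delay.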

[0023]

[Effects of the Invention] This section describes an experiment evaluating the recognition accuracy obtained with the present invention. As shown in FIG. 9, each test pattern consisted of three single notes sounding simultaneously. All patterns belonged to class 2: in each, at least one pair of simultaneous notes has fundamental frequencies related by an integer multiple of 1.5. To create the patterns, single notes of natural instruments (flute, piano, and violin) were first recorded in a studio at every semitone (16 bit, 48 kHz) and stored on a computer. Notes were then selected at random, under the constraints of class 2 and MIDI note numbers 60 to 74, and added together.

[0024] The recognition rate R is defined by equation (5):

$$R = 100 \left( \frac{\mathrm{right} - \mathrm{wrong}}{\mathrm{total}} \cdot \frac{1}{2} + \frac{1}{2} \right) \qquad (5)$$

where:
- right is the number of notes in the output whose pitch and timbre are both correctly recognized;
- wrong is the number of notes in the output whose pitch, timbre, or both are incorrect;
- total is the total number of notes in the input (the correct answer).

Based on the results of a preliminary experiment, the FIR filter order was set to 40 for the template filtering ON condition. Template filtering here refers to the filtering of equation (1), and template filtering OFF means that the FIR filter order was set to 1.
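Equation (5) is simple to evaluate, and the sketch below just makes its scaling explicit. Three reference points follow from the formula:

- An empty output (right = wrong = 0) scores 50.
- A perfect output scores 100.
- An output with no correct notes and as many wrong notes as there are input notes scores 0.

The function name is illustrative.

```python
def recognition_rate(right, wrong, total):
    """Equation (5): R = 100 * ((right - wrong) / total * 1/2 + 1/2)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * ((right - wrong) / total * 0.5 + 0.5)
```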

[0025] In this experiment, the original templates could not be the same waveforms used to generate the test patterns, nor recordings of the same individual instruments. Either choice would make the waveforms match too closely for a fair evaluation. The template waveforms and the test-pattern waveforms were therefore recorded from different individual instruments, as shown in FIG. 10.

[0026] FIG. 11 shows the experimental results. In this table, the condition in the lower-right cell (template filtering OFF, phase synchronization OFF) corresponds to sound source identification with a simple matched filter. The results thus clearly show that the processing with the adaptive templates of the present invention is more effective than the matched filter. The recognition rate is higher still when phase synchronization is performed.

[0027] In addition to the benchmark test, a music recognition test was performed on live performances. The test piece was "Hotaru no Hikari" (Auld Lang Syne), played on a violin, a flute, and a piano. These were individual instruments different from those in FIG. 10. The recognition rate R of the sound source identification processing was measured, and FIG. 12 shows the results. The values in the figure are recognition rates for the sound source identification processing only. The qualitative trend of the results is the same as in the benchmark test, indicating the effectiveness of the method of the present invention.

[0028] As described above, the present invention performs acoustic signal separation with higher accuracy than known methods. This holds even when many sound sources are present and those sources are diverse and variable.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an example functional configuration of an acoustic signal separation apparatus to which the method of the present invention is applied.

FIG. 2 is a flowchart showing the processing procedure of the waveform segmentation means 13.

FIG. 3 is a flowchart showing the processing procedure of the candidate waveform selection means 15.

FIG. 4 is a flowchart showing the processing procedure of the coefficient determination means 16.

FIG. 5 is a flowchart showing the processing procedure of the filter operation means 17.

FIG. 6 shows examples of a candidate waveform, a segmented input acoustic signal waveform, and the segmented input acoustic signal waveform after adaptive filtering.

FIG. 7 is a flowchart showing the procedure for acquiring the synchronization information of a waveform.

FIG. 8 is a flowchart showing the procedure for acquiring a phase-difference time series.

FIG. 9 shows an example of the single-note patterns used in the benchmark test.

FIG. 10 shows the musical instruments used in the experiment.

FIG. 11 shows the results of the benchmark test.

FIG. 12 shows the results of the sound recognition test.

Claims

[Claims]

1. An acoustic signal separation method comprising:
- a step of dividing an input acoustic signal in time;
- a step of obtaining candidate waveforms by selecting, from stored waveforms held in a waveform storage means, all waveforms that may be contained in each segmented input acoustic signal;
- a step of obtaining filter coefficients for filter operations such that the error between the segmented input acoustic signal and the sum of the results of the filter operations applied to the candidate waveforms is minimized; and
- a step of performing the filter operation with the obtained filter coefficients on each of the candidate waveforms.

2. The acoustic signal separation method according to claim 1, wherein the step of obtaining the candidate waveforms includes:
- a step of extracting a fundamental frequency of the segmented input acoustic signal; and
- a step of selecting, from the waveform storage means, stored waveforms whose fundamental frequencies fall within a predetermined range of the extracted fundamental frequency.

3. The acoustic signal separation method according to claim 1, wherein the dividing step detects onsets of sounds contained in the input acoustic signal and divides the signal into the intervals between adjacent detected onsets.

4. The acoustic signal separation method according to any one of claims 1 to 3, wherein the fundamental frequency component of each candidate waveform is phase-synchronized with the fundamental frequency component of the segmented input acoustic signal, and the phase-synchronized candidate waveforms are used in the step of obtaining the filter coefficients.

5. The acoustic signal separation method according to claim 4, wherein the phase synchronization is performed by:
- acquiring a time series of the phase difference between the fundamental frequency component of each candidate waveform and the fundamental frequency component of the segmented input acoustic signal;
- obtaining a time series of time shift amounts by converting each phase difference in that time series, including its sign, into a time difference; and
- shifting the sample at the corresponding time of the corresponding candidate waveform according to each time shift amount in that time series.