JP4625933B2

JP4625933B2 - Sound analyzer and program

Info

Publication number: JP4625933B2
Application number: JP2006237269A
Authority: JP
Inventors: 真孝後藤; 琢哉藤島; 慶太有元
Original assignee: Yamaha Corp; National Institute of Advanced Industrial Science and Technology AIST
Current assignee: Yamaha Corp; National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2006-09-01
Filing date: 2006-09-01
Publication date: 2011-02-02
Anticipated expiration: 2026-09-01
Also published as: JP2008058753A

Description

The present invention relates to a sound analysis apparatus and program that estimate the pitch (used in this specification to mean the fundamental frequency) of melody and bass sounds in musical audio signals, such as those recorded on commercially available CDs (compact discs), that simultaneously contain a singing voice and multiple kinds of instrument sounds.

Estimating the pitch of one particular sound source from a monaural audio signal in which the sounds of many sources are mixed is very difficult. One fundamental reason is that, in the time-frequency domain, the frequency components of one sound overlap those of other sounds playing at the same time. For example, in typical popular music performed with a singing voice, keyboard instruments (piano, etc.), guitar, bass guitar, and drums, part of the harmonic structure of the singing voice carrying the melody (in particular its fundamental frequency component) frequently overlaps the harmonic components of the keyboard instruments and guitar, the higher-order harmonic components of the bass guitar, and the noise components contained in sounds such as the snare drum. Methods that track each frequency component locally therefore do not work reliably on complex mixtures. Some methods estimate the harmonic structure on the assumption that the fundamental frequency component is present, but they have the major drawback that they cannot handle the missing fundamental phenomenon. They also fail when frequency components of other simultaneous sounds overlap the fundamental frequency component.

For these reasons, pitch estimation techniques have conventionally existed only for audio signals containing a single sound, or a single sound accompanied by aperiodic noise; there was no technique for estimating pitch in a signal in which multiple sounds are mixed, such as the audio recorded on a commercially available CD.

In recent years, however, a technique has been proposed that uses a statistical approach to appropriately estimate the pitch of each sound contained in a mixture. This is the technique of Patent Document 1.

In the technique of Patent Document 1, frequency components belonging to the band presumed to contain the melody and frequency components belonging to the band presumed to contain the bass are extracted separately from the input audio signal by band-pass filters (BPFs), and the fundamental frequencies of the melody and the bass are each estimated from the frequency components of the corresponding band.

More specifically, the technique of Patent Document 1 prepares sound models, each having a probability distribution that corresponds to the harmonic structure of a sound, and regards the frequency components of the melody band and of the bass band as a mixture distribution formed by the weighted sum of sound models for various fundamental frequencies. The weight of each sound model is then estimated with the EM (Expectation-Maximization) algorithm.

The EM algorithm is an iterative algorithm for maximum likelihood estimation in probabilistic models with hidden variables, and it finds a locally optimal solution. The sound model with the largest weight can be regarded as the harmonic structure that is most dominant at that time, so the fundamental frequency of that dominant harmonic structure is taken as the pitch. Because this method does not depend on the presence of the fundamental frequency component, it handles the missing fundamental phenomenon appropriately.

Non-Patent Document 1 discloses a technique that extends the technique of Patent Document 1 in the following ways.

<Extension 1: Multiple sound models>
In the technique of Patent Document 1, only one sound model was prepared for each fundamental frequency. In practice, however, sounds with different harmonic structures may appear one after another at the same fundamental frequency. Multiple sound models are therefore prepared for each fundamental frequency, and the input audio signal is modeled as a mixture distribution of them.

<Extension 2: Estimation of sound model parameters>
In the technique of Patent Document 1, the ratios between the magnitudes of the harmonic components of a sound model were fixed (that is, an idealized sound model was assumed). These ratios do not necessarily match the harmonic structures found in real-world mixtures, which left room for improving accuracy. The ratios of the harmonic components of each sound model are therefore added to the model parameters and estimated at each time by the EM algorithm.

<Extension 3: Introduction of prior distributions for the model parameters>
In the technique of Patent Document 1, no prior knowledge was assumed about the weights of the sound models (the probability density function of the fundamental frequency). Depending on the application, however, there may be a demand for a fundamental frequency with fewer false detections, even at the cost of specifying in advance which frequency the fundamental frequency lies near. For example, for performance analysis or vibrato analysis, the approximate fundamental frequency at each time can be prepared as prior knowledge by singing or playing an instrument while listening to the piece over headphones, and the more accurate fundamental frequency in the actual piece is then to be obtained. The maximum likelihood framework used in Patent Document 1 for estimating the model parameters (the weights of the sound models) is therefore extended to maximum a posteriori (MAP) estimation based on prior distributions over the model parameters. A prior distribution is also introduced for the ratios of the magnitudes of the harmonic components of each sound model, which were added to the model parameters in <Extension 2>.
Patent Document 1: Japanese Patent No. 3413634
Non-Patent Document 1: Masataka Goto, "Real-time music scene description system: Overall concept and extension of pitch estimation method," IPSJ SIG Technical Report 2000-MUS-37-2, Vol. 2000, No. 94, pp. 9-16, October 16, 2000

According to the technique disclosed in Non-Patent Document 1, introducing Extension 1 can be expected to improve the accuracy with which the fundamental frequency of each sound is estimated: when, for example, a sound source can produce several sounds with different harmonic structures, sound models corresponding to each of those harmonic structures can be prepared. However, using a large number of sound models to raise the estimation accuracy requires a great deal of effort to create them, and storage capacity for the many sound models must be secured in the sound analysis apparatus.

The present invention has been made in view of the circumstances described above, and its object is to provide a sound analysis apparatus and program that need to store only a relatively small number of sound models and can nevertheless estimate the fundamental frequency with high accuracy.

The present invention provides a sound analysis apparatus, and a computer program that causes a computer to function as the sound analysis apparatus, comprising: storage means that stores a plurality of sound models, each defining the harmonic structure of one of a plurality of kinds of sounds produced by a musical instrument; interpolation means that orders the stored sound models by their fundamental frequencies, applies interpolation based on fundamental frequency to the ordered sound models, and generates a plurality of sound models corresponding to fundamental frequencies between those of the ordered sound models; probability density function estimation means that, using the stored sound models and the sound models generated by the interpolation means, forms a mixture distribution as a weighted sum of sound models having various harmonic structures and fundamental frequencies, optimizes the weight of each sound model so that the mixture distribution matches the distribution of the frequency components of the input audio signal, and takes the optimized weights as an estimate of the probability density function of the fundamental frequencies of the sound sources underlying the input audio signal; and fundamental frequency estimation means that estimates and outputs the fundamental frequency of the sound of one or more sound sources in the input audio signal on the basis of that probability density function.

According to this invention, sound models are interpolated on the basis of fundamental frequency, and the fundamental frequency is estimated with the set of sound models supplemented in this way. The fundamental frequency can therefore be estimated with high accuracy even when relatively few sound models are stored in the storage means.
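The interpolation of sound models by fundamental frequency might be sketched as follows. This is a minimal illustration only: it assumes that each stored sound model is represented by a fundamental frequency and a list of harmonic magnitudes summing to 1, and that the interpolation is linear between the two neighbouring stored models. Neither the representation nor the linear rule is stated in the text above, and the name `interpolate_models` is hypothetical.

```python
def interpolate_models(stored, target_f0):
    """Return harmonic magnitudes for target_f0 from a few stored models.

    stored: non-empty list of (f0, harmonic_weights) pairs; every weight
    list has the same length and sums to 1 (assumed representation).
    Linear interpolation between neighbours is an assumption of this sketch.
    """
    # Order the stored models by fundamental frequency.
    ordered = sorted(stored, key=lambda model: model[0])
    # Outside the stored range, reuse the nearest stored model.
    if target_f0 <= ordered[0][0]:
        return list(ordered[0][1])
    if target_f0 >= ordered[-1][0]:
        return list(ordered[-1][1])
    # Find the two neighbouring models and blend their harmonic magnitudes.
    for (f_lo, c_lo), (f_hi, c_hi) in zip(ordered, ordered[1:]):
        if f_lo <= target_f0 <= f_hi and f_hi > f_lo:
            a = (target_f0 - f_lo) / (f_hi - f_lo)
            return [(1.0 - a) * lo + a * hi for lo, hi in zip(c_lo, c_hi)]


# Two stored models, given out of order on purpose.
stored_models = [(6000.0, [0.2, 0.8]), (4800.0, [0.6, 0.4])]
midway = interpolate_models(stored_models, 5400.0)
```

Because the result is a convex combination of two weight lists that each sum to 1, the interpolated weights also sum to 1 and can be used directly as a sound model.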

Embodiments of the present invention are described below with reference to the drawings.

<A. First Embodiment>
<Overall configuration>
FIG. 1 shows the processing performed by the sound analysis program according to the first embodiment of the present invention. The sound analysis program is installed and executed on a computer, such as a personal computer, that has an audio signal acquisition function: for example, a sound pickup function that captures audio signals from the environment, a playback function that reproduces musical audio signals from a recording medium such as a CD, or a communication function that obtains musical audio signals over a network. A computer executing the sound analysis program according to this embodiment functions as the sound analysis apparatus according to this embodiment.

The sound analysis program according to this embodiment estimates the pitch of a particular sound source in a monaural musical audio signal obtained through the audio signal acquisition function. As the most important example, the melody line and the bass line are estimated here. The melody is the series of single notes that stands out most clearly from the rest, and the bass is the series of the lowest single notes in the ensemble; the trajectories of their changes over time are called the melody line Dm(t) and the bass line Db(t), respectively. Letting Fi(t) (i = m, b) denote the fundamental frequency F0 at time t and Ai(t) the amplitude, these are expressed as follows.
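The two expressions announced here do not appear in this text. Given the definitions of Fi(t) and Ai(t), one natural reading pairs each line with its fundamental frequency and amplitude at time t. It is offered as an assumption and not as the original equations:

```latex
\begin{align}
D_m(t) &= \{\, F_m(t),\ A_m(t) \,\} \\
D_b(t) &= \{\, F_b(t),\ A_b(t) \,\}
\end{align}
```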

As means for obtaining the melody line Dm(t) and bass line Db(t) from the input audio signal, the sound analysis program includes the following processes: instantaneous frequency calculation 1, frequency component candidate extraction 2, frequency band limitation 3, melody line estimation 4a, bass line estimation 4b, and sound model interpolation 5. Melody line estimation 4a and bass line estimation 4b each include estimation 41 of the probability density function of the fundamental frequency and sequential tracking 42 of the fundamental frequency by a multi-agent model. In this embodiment, the processing of instantaneous frequency calculation 1, frequency component candidate extraction 2, frequency band limitation 3, melody line estimation 4a, and bass line estimation 4b is basically the same as that disclosed in Patent Document 1 and Non-Patent Document 1 cited above. The distinguishing feature of this embodiment is the addition of sound model interpolation 5. Each process of the sound analysis program according to this embodiment is described below.

<Instantaneous frequency calculation 1>
In this process, the input audio signal is fed to a filter bank consisting of multiple BPFs, and the instantaneous frequency, which is the time derivative of the phase, is calculated for the output signal of each BPF (see Flanagan, J.L. and Golden, R.M.: Phase Vocoder, The Bell System Technical J., Vol. 45, pp. 1493-1509 (1966)). Here, Flanagan's method is used: the output of the short-time Fourier transform (STFT) is interpreted as the filter bank output, so that the instantaneous frequency is calculated efficiently. When the STFT of the input audio signal x(t) with window function h(t) is given by equations (3) and (4), the instantaneous frequency λ(ω, t) is obtained from equation (5).
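The idea of taking the instantaneous frequency as the time derivative of the STFT phase can be illustrated with a small discrete-time sketch. It is not the formula of equations (3) to (5): it replaces the derivative by the phase advance between two frames one sample apart, and it uses a Hann window instead of the window described in the text. The function names `stft_bin` and `instantaneous_frequency` are hypothetical.

```python
import cmath
import math


def stft_bin(x, start, n_fft, k):
    """DFT bin k of a Hann-windowed frame of x beginning at sample `start`."""
    acc = 0j
    for m in range(n_fft):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * m / n_fft)  # Hann window (assumed)
        acc += w * x[start + m] * cmath.exp(-2j * math.pi * k * m / n_fft)
    return acc


def instantaneous_frequency(x, start, n_fft, k, fs):
    """Approximate the time derivative of the phase at bin k, in Hz.

    The derivative is replaced by the phase advance between two frames
    that are one sample apart, so no phase unwrapping is needed for
    components below fs / 2.
    """
    x0 = stft_bin(x, start, n_fft, k)
    x1 = stft_bin(x, start + 1, n_fft, k)
    return cmath.phase(x1 * x0.conjugate()) * fs / (2.0 * math.pi)


fs = 8000.0
tone = [math.cos(2.0 * math.pi * 440.0 * n / fs) for n in range(600)]
# Bin 28 of a 512-point frame is centred at 437.5 Hz, close to the tone.
lam = instantaneous_frequency(tone, 0, 512, 28, fs)
```

Although the filter is centred at 437.5 Hz, the value returned for this bin lies very close to the actual 440 Hz of the tone, which is the property the next processing step relies on.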

Here, h(t) is a window function that provides time-frequency localization (for example, a time window created by convolving a Gaussian function, which provides optimal time-frequency localization, with a second-order cardinal B-spline function).

The wavelet transform may also be used to calculate the instantaneous frequency. The STFT is used here to reduce the amount of computation, but a single STFT alone gives poor time resolution or frequency resolution in some frequency bands. A multirate filter bank is therefore constructed (see Vetterli, M.: A Theory of Multirate Filter Banks, IEEE Trans. on ASSP, Vol. ASSP-35, No. 3, pp. 356-372 (1987)) to obtain reasonably good time-frequency resolution under the constraint that processing can run in real time.

<Frequency component candidate extraction 2>
In this process, candidate frequency components are extracted on the basis of the mapping from the center frequency of each filter to its instantaneous frequency (see Charpentier, F.J.: Pitch detection using the short-term phase spectrum, Proc. of ICASSP 86, pp. 113-116 (1986)). Consider the mapping from the center frequency ω of an STFT filter to the instantaneous frequency λ(ω, t) of its output. If a frequency component of frequency ψ is present, ψ lies at a fixed point of this mapping, and the instantaneous frequency is almost constant in its neighborhood. The set Ψ_f^(t) of instantaneous frequencies of all frequency components can therefore be extracted by the following equation.
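The fixed-point search can be sketched on a discrete set of filter center frequencies. The sketch assumes that a fixed point is detected where λ(ω, t) - ω changes sign from non-negative to negative as ω increases, which is what a nearly constant λ around ψ implies, and that the crossing is located by linear interpolation. Both the detection rule and the name `fixed_point_frequencies` are assumptions of this illustration.

```python
def fixed_point_frequencies(centers, inst_freqs):
    """Return frequencies psi where lambda(psi) - psi crosses zero downward.

    centers    : filter center frequencies in ascending order
    inst_freqs : instantaneous frequency measured at each center frequency
    """
    found = []
    for k in range(len(centers) - 1):
        d0 = inst_freqs[k] - centers[k]
        d1 = inst_freqs[k + 1] - centers[k + 1]
        if d0 >= 0.0 and d1 < 0.0:  # sign change with negative slope
            frac = d0 / (d0 - d1)   # linear interpolation of the zero
            found.append(centers[k] + frac * (centers[k + 1] - centers[k]))
    return found


# Every filter near a 441 Hz component reports roughly 441 Hz.
centers = [400.0, 420.0, 440.0, 460.0, 480.0]
inst = [441.0, 441.0, 441.0, 441.0, 441.0]
peaks = fixed_point_frequencies(centers, inst)
```

In this toy input the only downward crossing lies between the 440 Hz and 460 Hz filters, and the interpolated fixed point is 441 Hz.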

Because the power of each of these frequency components is obtained as the value of the STFT power spectrum at the corresponding frequency in Ψ_f^(t), the power distribution function Ψ_p^(t)(ω) of the frequency components can be defined as follows.

<Frequency band limitation 3>
In this process, the frequency band is limited by weighting the extracted frequency components. Two BPFs are prepared, one for the melody line and one for the bass line. The melody-line BPF passes most of the main fundamental and harmonic components of a typical melody line while attenuating, to some extent, the frequency band in which overlap near the fundamental frequency occurs frequently. The bass-line BPF passes most of the main fundamental and harmonic components of a typical bass line while attenuating, to some extent, the frequency band in which other parts dominate the bass line.

In the following, frequency on a logarithmic scale is expressed in cents (a unit that originally measures pitch difference, that is, an interval), and a frequency fHz expressed in Hz is converted to a frequency fcent expressed in cents as follows.

One equal-tempered semitone corresponds to 100 cents, and one octave to 1200 cents.
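A conversion with these properties can be sketched as below. The text fixes only the scale (1200 cents per octave); the 0-cent reference frequency `REF_HZ`, set here to 440 × 2^(3/12 - 5) Hz (about 16.35 Hz), is an assumption of this sketch, as are the function names.

```python
import math

# Assumed 0-cent reference frequency, about 16.35 Hz; not given in the text.
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cent(f_hz):
    """Convert a frequency in Hz to cents above the assumed reference."""
    return 1200.0 * math.log2(f_hz / REF_HZ)


def cent_to_hz(f_cent):
    """Inverse conversion, from cents above the reference back to Hz."""
    return REF_HZ * 2.0 ** (f_cent / 1200.0)


octave = hz_to_cent(880.0) - hz_to_cent(440.0)
semitone = hz_to_cent(440.0 * 2.0 ** (1.0 / 12.0)) - hz_to_cent(440.0)
```

Whatever reference is chosen, differences in cents depend only on frequency ratios, so `octave` comes out as 1200 and `semitone` as 100 up to rounding.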

Let BPFi(x) (i = m, b) denote the frequency response of the BPF at frequency x cents, and let Ψ'_p^(t)(x) denote the power distribution function of the frequency components. The frequency components that pass through the BPF can then be written BPFi(x) Ψ'_p^(t)(x), where Ψ'_p^(t)(x) is the same function as Ψ_p^(t)(ω) except that its frequency axis is expressed in cents. In preparation for the next stage, the probability density function p_Ψ^(t)(x) of the frequency components that have passed through the BPF is defined as follows.

Here, Pow^(t) is the total power of the frequency components that have passed through the BPF, as given by the following equation.
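On a discrete frequency grid, the band limitation and normalization described above can be sketched as follows. The sketch assumes that the probability density function is the BPF-weighted power divided by its total, with the integral replaced by a sum over grid points; the function name `band_limited_pdf` is hypothetical.

```python
def band_limited_pdf(power, bpf_response):
    """Weight the power distribution by the BPF response and normalize it.

    power[j]        : power of the frequency components at grid point j
    bpf_response[j] : BPF frequency response at grid point j
    """
    weighted = [b * p for b, p in zip(bpf_response, power)]
    total = sum(weighted)  # discrete counterpart of Pow^(t)
    if total <= 0.0:
        raise ValueError("no power passed the band-pass filter")
    return [v / total for v in weighted]


power = [4.0, 2.0, 1.0, 1.0]
bpf = [0.0, 1.0, 1.0, 0.5]
pdf = band_limited_pdf(power, bpf)
```

The strongest component in `power` lies outside the pass band and therefore receives zero probability, while the remaining values are rescaled to sum to 1.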

<Estimation 41 of the probability density function of the fundamental frequency>
In estimation 41, a probability density function of the fundamental frequency is obtained for the candidate frequency components that have passed through the BPF; it represents how dominant each harmonic structure is relative to the others. The processing in estimation 41 is as disclosed in Non-Patent Document 1.

To realize Extension 1 and Extension 2 described above, estimation 41 assumes that there are Mi kinds of sound models for each fundamental frequency (i indicates melody (i = m) or bass (i = b)). The sound model p(x | F, m, μ^(t)(F, m)) with fundamental frequency F, model type m (the m-th kind), and model parameters μ^(t)(F, m) is defined as follows.

This sound model represents at which frequencies, and how strongly, the harmonic components appear when the fundamental frequency is F. Hi is the number of harmonic components, including the fundamental frequency component, and Wi^2 is the variance of the Gaussian distribution G(x; x0, σ). c^(t)(h | F, m) is the magnitude of the h-th harmonic component of the m-th sound model with fundamental frequency F, and satisfies the following equation.

When the m-th sound model is used as the sound model for a fundamental frequency F in estimating the probability density function of the fundamental frequency, the weights c^(t)(h | F, m) of that sound model are weights defined in advance so that they sum to 1, as shown in equation (16) above.
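A sound model of this kind can be sketched as a weighted sum of Gaussians on the cent axis. The sketch assumes that the h-th harmonic of a fundamental at F cents is centred at F + 1200 log2(h) cents (that is, at h times the fundamental in Hz) and that every Gaussian has standard deviation Wi; the exact defining equation is not reproduced in this text, and the function names are hypothetical.

```python
import math


def gaussian(x, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def tone_model(x_cent, f0_cent, harmonic_weights, sigma_cent):
    """Density at x_cent of a harmonic sound model with fundamental f0_cent.

    harmonic_weights[h - 1] plays the role of c(h | F, m) and must sum to 1;
    sigma_cent plays the role of Wi.
    """
    total = 0.0
    for h, c in enumerate(harmonic_weights, start=1):
        # Harmonic h sits 1200 * log2(h) cents above the fundamental (assumed).
        total += c * gaussian(x_cent, f0_cent + 1200.0 * math.log2(h), sigma_cent)
    return total


weights = [0.5, 0.3, 0.2]
at_second_harmonic = tone_model(6000.0, 4800.0, weights, 17.0)
between_harmonics = tone_model(5400.0, 4800.0, weights, 17.0)
```

Because the harmonic weights sum to 1 and each Gaussian integrates to 1, the model itself integrates to 1 over the cent axis, which is what allows it to be used as a component of a mixture distribution.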

Estimation 41 uses these sound models and regards the probability density function p_Ψ^(t)(x) of the frequency components as having been generated from a mixture distribution model p(x | θ^(t)) of the sound models p(x | F, m, μ^(t)(F, m)), defined as follows.

Here, Fhi and Fli are the upper and lower limits of the allowable fundamental frequency, and w^(t)(F, m) is the weight of a sound model, satisfying the following equation.

Because the number of sound sources in a real-world mixture cannot be assumed in advance, it is important to build the model so that all possible fundamental frequencies are considered simultaneously, as in equation (17). If the model parameters θ^(t) can be estimated such that the observed probability density function p_Ψ^(t)(x) appears to have been generated from the model p(x | θ^(t)), the weight w^(t)(F, m) represents how dominant each harmonic structure is relative to the others, and can therefore be interpreted as the probability density function p_F0^(t)(F) of the fundamental frequency, as in the following equation.

Next, to realize Extension 3 described above, the prior distribution p_0i(θ^(t)) of θ^(t) is given, as in equation (23), by the product of equations (24) and (25).

Here, p_0i(w^(t)) and p_0i(μ^(t)) are unimodal prior distributions that take their maximum values at the most likely parameters w_0i^(t)(F, m) and μ_0i^(t)(F, m), respectively. Z_w and Z_μ are normalization factors. β_wi^(t) and β_μi^(t)(F, m) are parameters that determine how strongly the prior distributions emphasize their maxima; when they are 0, the prior distributions are non-informative (uniform). D_w(w_0i^(t); w^(t)) and D_μ(μ_0i^(t)(F, m); μ^(t)(F, m)) are the following Kullback-Leibler information (K-L information).

It follows that, when the probability density function p_Ψ^(t)(x) has been observed, the problem to be solved is to estimate the parameters θ^(t) of the model p(x | θ^(t)) on the basis of the prior distribution p_0i(θ^(t)). The maximum a posteriori (MAP) estimator of θ^(t) based on this prior distribution is obtained by maximizing the following expression.

Because this maximization problem is difficult to solve analytically, θ^(t) is estimated with the EM (Expectation-Maximization) algorithm mentioned earlier. The EM algorithm is an iterative algorithm that performs maximum likelihood estimation from incomplete observed data (here, p_Ψ^(t)(x)) by alternately applying an E-step (expectation step) and an M-step (maximization step). In this embodiment, the probability density function p_Ψ^(t)(x) of the frequency components that have passed through the BPF is regarded as a mixture distribution formed by the weighted sum of sound models p(x | F, m, μ^(t)(F, m)) for various fundamental frequencies F, and the EM algorithm is iterated to find the most likely parameters θ^(t) (= {w^(t)(F, m), μ^(t)(F, m)}). In each iteration, the old parameter estimate θ_old^(t) (= {w_old^(t)(F, m), μ_old^(t)(F, m)}) is updated to obtain a new, more likely estimate θ_new^(t) (= {w_new^(t)(F, m), μ_new^(t)(F, m)}). The final estimate at the preceding time t-1 is used as the initial value of θ_old^(t). The recurrence equations for obtaining the new estimate θ_new^(t) from the old estimate θ_old^(t) are as follows; their derivation is described in detail in Non-Patent Document 1.
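The recurrence equations themselves do not appear in this text. One form consistent with the surrounding description is shown below as an assumption to be checked against Non-Patent Document 1: it reduces to the maximum likelihood estimates when β_wi^(t) = 0 and β_μi^(t)(F, m) = 0, and approaches the most likely prior parameters as the β values grow. The symbol c_0i^(t)(h | F, m), denoting the harmonic ratio contained in μ_0i^(t)(F, m), is introduced here for illustration.

```latex
\begin{align}
w_{\mathrm{new}}^{(t)}(F,m) &=
  \frac{w_{\mathrm{ML}}^{(t)}(F,m) + \beta_{wi}^{(t)}\, w_{0i}^{(t)}(F,m)}
       {1 + \beta_{wi}^{(t)}} \\
c_{\mathrm{new}}^{(t)}(h \mid F,m) &=
  \frac{w_{\mathrm{ML}}^{(t)}(F,m)\, c_{\mathrm{ML}}^{(t)}(h \mid F,m)
        + \beta_{\mu i}^{(t)}(F,m)\, c_{0i}^{(t)}(h \mid F,m)}
       {w_{\mathrm{ML}}^{(t)}(F,m) + \beta_{\mu i}^{(t)}(F,m)}
\end{align}
```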

In equations (29) and (30) above, w_ML^(t)(F, m) and c_ML^(t)(h | F, m) are the estimates obtained with the non-informative prior distributions β_wi^(t) = 0 and β_μi^(t)(F, m) = 0, that is, the maximum likelihood estimates, and are given by the following equations.
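The maximum likelihood update of the weights can be sketched on a discrete grid as a standard EM step for mixture weights. The sketch covers only the weights, keeps the harmonic ratios of the sound models fixed, and flattens the pair (F, m) into a single model index k; these simplifications and the name `em_weight_update` are assumptions of this illustration, not the equations of the specification.

```python
def em_weight_update(observed_pdf, model_pdfs, weights_old):
    """One maximum-likelihood EM step for the mixture weights.

    observed_pdf[j]  : observed density at grid point j (sums to 1 over the grid)
    model_pdfs[k][j] : sound model k evaluated at grid point j
    weights_old[k]   : current weight of sound model k (sums to 1)
    """
    n_models = len(model_pdfs)
    weights_new = [0.0] * n_models
    for j, p_obs in enumerate(observed_pdf):
        mix = sum(weights_old[k] * model_pdfs[k][j] for k in range(n_models))
        if mix <= 0.0:
            continue  # no model explains this grid point
        for k in range(n_models):
            # E-step: share of grid point j attributed to model k.
            # M-step: accumulate that share weighted by the observed density.
            weights_new[k] += p_obs * weights_old[k] * model_pdfs[k][j] / mix
    return weights_new


# Two toy models with disjoint support on a four-point grid.
models = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
observed = [0.4, 0.4, 0.1, 0.1]
updated = em_weight_update(observed, models, [0.5, 0.5])
```

With disjoint models a single step already moves the weights to the share of observed density each model covers (0.8 and 0.2 here); with overlapping models the step is repeated until the weights settle.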

Through these iterative calculations, the probability density function p_F0^(t)(F) of the fundamental frequency, which takes the prior distributions into account, is obtained from w^(t)(F, m) by equation (22). The ratios c^(t)(h | F, m) of the magnitudes of the harmonic components of every sound model p(x | F, m, μ^(t)(F, m)) are also obtained, so that Extension 1 through Extension 3 are realized.

To determine the most dominant fundamental frequency Fi(t), it suffices to find the frequency that maximizes the probability density function p_F0^(t)(F) of the fundamental frequency (obtained from equation (22) as the final estimate after iterating equations (29) to (32)), as shown in the following equation.

Fi(t) = argmax_F p_F0^(t)(F)

The frequency obtained in this way is taken as the pitch. In the present embodiment, this is the processing performed as the fundamental frequency estimation means, which estimates and outputs the fundamental frequency of the sound of one or more sound sources in the input acoustic signal on the basis of the probability density function of the fundamental frequency.
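The selection of the most dominant fundamental frequency can be sketched in a few lines of Python. The dictionary representation of the probability density function (candidate frequency in cents mapped to density) and the numbers are assumptions for illustration only.

```python
def dominant_f0(pdf):
    """Return the candidate frequency with the largest probability density.

    pdf -- dict mapping a candidate fundamental frequency to its density
    """
    return max(pdf, key=pdf.get)


# Made-up density over three candidate fundamental frequencies (in cents).
pdf_t = {3600: 0.10, 4800: 0.55, 6000: 0.35}
pitch = dominant_f0(pdf_t)
```

Here `pitch` is 4800, the candidate with the largest density.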

<Sequential tracking 42 of the fundamental frequency by the multi-agent model>
In the probability density function of the fundamental frequency, when several peaks corresponding to the fundamental frequencies of simultaneously sounding notes compete, those peaks may be selected one after another as the maximum of the probability density function, so a result obtained in this simple way may be unstable. Therefore, in the processing performed as the fundamental frequency estimation means in the present embodiment, in order to estimate the fundamental frequency from a global viewpoint, the trajectories of a plurality of peaks are tracked over time through the temporal change of the probability density function of the fundamental frequency, and the most dominant and stable fundamental frequency trajectory among them is selected. A multi-agent model is introduced to control this tracking process dynamically and flexibly.

The multi-agent model consists of one feature detector and a plurality of agents (see FIG. 2). The feature detector picks up the salient peaks in the probability density function of the fundamental frequency. Each agent is basically driven by those peaks and tracks a trajectory. In other words, the multi-agent model is a general-purpose framework for temporally tracking salient features in the input. Specifically, the following processing is performed at each time.

(1) After the probability density function of the fundamental frequency has been obtained, the feature detector detects a plurality of salient peaks (peaks exceeding a threshold that changes dynamically according to the maximum peak). For each salient peak, it then evaluates how promising the peak is, also taking into account the total power Pow^(t) of the frequency components. This is realized by regarding the current time as a time several frames ahead and tracking the peak trajectory by looking ahead to that time.

(2) When agents have already been generated, they interact with one another so that each salient peak is assigned exclusively to the agent whose trajectory is close to it. When several agents are candidates for the assignment, the peak is assigned to the most reliable agent.

(3) If the most promising and salient peak has not yet been assigned, a new agent that tracks that peak is generated.

(4) Each agent holds a cumulative penalty and is terminated when the penalty exceeds a certain threshold.

(5) An agent to which no salient peak has been assigned receives a fixed penalty and tries to find the next peak to track directly in the probability density function of the fundamental frequency. If it cannot find such a peak either, it receives a further penalty. Otherwise, the penalty is reset.

(6) Each agent evaluates its own reliability as a weighted sum of a measure of how promising and salient the currently assigned peak is and its reliability at the previous time.

(7) The fundamental frequency Fi(t) at time t is determined from the agent that has high reliability and a large total power along the trajectory of the peak it is tracking. The amplitude Ai(t) is determined by extracting the harmonic components and the like of the fundamental frequency Fi(t) from Ψ_p^(t)(ω).
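Steps (1) to (7) can be illustrated with a heavily simplified Python sketch. It omits the look-ahead evaluation and the power term Pow^(t) of step (1) and the direct peak search of step (5), and it selects the output agent by reliability alone. The names `Agent`, `detect_peaks` and `step` and all numeric parameters (`ratio`, `max_jump`, `alpha`, `penalty_limit`) are assumptions made for this example and do not come from the specification.

```python
class Agent:
    """Tracks one peak trajectory (simplified stand-in for an agent)."""

    def __init__(self, freq, salience):
        self.freq = freq              # frequency of the peak being tracked
        self.reliability = salience   # self-evaluated reliability
        self.penalty = 0.0            # cumulative penalty


def detect_peaks(pdf, ratio=0.5):
    """Feature detector: local maxima above a threshold tied to the top peak."""
    freqs = sorted(pdf)
    peaks = []
    for i, f in enumerate(freqs):
        left = pdf[freqs[i - 1]] if i > 0 else 0.0
        right = pdf[freqs[i + 1]] if i + 1 < len(freqs) else 0.0
        if pdf[f] > left and pdf[f] >= right:
            peaks.append(f)
    if not peaks:
        return []
    threshold = ratio * max(pdf[f] for f in peaks)
    # Most salient peak first.
    return sorted((f for f in peaks if pdf[f] >= threshold), key=lambda f: -pdf[f])


def step(agents, pdf, max_jump=100, alpha=0.7, penalty_limit=3.0):
    """Process one frame and return the frequency of the most reliable agent."""
    assigned = set()
    for rank, peak in enumerate(detect_peaks(pdf)):
        near = [a for a in agents
                if id(a) not in assigned and abs(a.freq - peak) <= max_jump]
        if near:
            # (2) Give the peak to the most reliable nearby agent.
            agent = max(near, key=lambda a: a.reliability)
            agent.freq = peak
            agent.penalty = 0.0       # simplification: reset on any assignment
            # (6) Weighted sum of previous reliability and current salience.
            agent.reliability = alpha * agent.reliability + (1 - alpha) * pdf[peak]
        elif rank == 0:
            # (3) The most salient peak is unclaimed: create a new agent.
            agent = Agent(peak, pdf[peak])
            agents.append(agent)
        else:
            continue
        assigned.add(id(agent))
    for agent in agents:
        if id(agent) not in assigned:
            agent.penalty += 1.0      # (5) no peak assigned in this frame
    # (4) Remove agents whose cumulative penalty exceeds the limit.
    agents[:] = [a for a in agents if a.penalty <= penalty_limit]
    # (7) Simplified: choose by reliability only.
    return max(agents, key=lambda a: a.reliability).freq if agents else None


# Two made-up frames in which the competing peak briefly becomes the maximum.
frames = [
    {4700: 0.05, 4800: 0.50, 4900: 0.05, 5900: 0.05, 6000: 0.30, 6100: 0.05},
    {4700: 0.05, 4800: 0.40, 4900: 0.05, 5900: 0.05, 6000: 0.45, 6100: 0.05},
]
agents = []
tracked = [step(agents, pdf) for pdf in frames]
raw_argmax = [max(pdf, key=pdf.get) for pdf in frames]
```

In the second frame the raw maximum jumps to 6000, while the tracked output stays at 4800 because the agent following that peak has accumulated higher reliability.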

<Improvement in the present embodiment (sound model interpolation processing 5)>
In general, the spectral shape of the sound produced by a musical instrument varies depending on the pitch (fundamental frequency). Therefore, to improve the estimation accuracy of the fundamental frequency, it is preferable to record sounds with various fundamental frequencies from the instrument and to execute the estimation 41 of the probability density function of the fundamental frequency using many sound models created from those sounds. However, using such a large number of sound models in the estimation 41 raises two problems: creating them requires a great many man-hours, and storage capacity for them must be secured in the sound analysis apparatus. The present embodiment therefore makes the following improvement. The storage device of the sound analysis apparatus stores only a relatively small number of representative sound models corresponding to various fundamental frequencies, each in association with its fundamental frequency. When the sound analysis program is executed, a large number of sound models are generated from the relatively small number of representative sound models stored in this storage device and are passed to the estimation 41 of the probability density function of the fundamental frequency.

In the present embodiment, the sound model interpolation processing 5 shown in FIG. 1 is added to the sound analysis program as the means for generating a large number of sound models from the relatively small number of representative sound models stored in the storage device and passing them to the estimation 41 of the probability density function of the fundamental frequency. This sound model interpolation processing orders the plural types of sound models stored in the storage device according to their fundamental frequencies, applies interpolation based on the fundamental frequency to the ordered sound models, and generates plural types of sound models corresponding to fundamental frequencies intermediate between the ordered sound models. The sound analysis program according to the present embodiment is configured to execute the sound model interpolation processing 5 at the start of its execution and to pass both the representative sound models stored in the storage device and the sound models obtained by the sound model interpolation processing 5 to the estimation 41 of the probability density function of the fundamental frequency.

FIG. 3 shows a specific example of the selection of representative sound models and of the sound model interpolation processing 5. In this example, a representative fret is chosen at every fifth fret among all the frets of a guitar, a sound model of the guitar sound produced when each representative fret is pressed with a finger is created, and these are stored in the storage device of the sound analysis apparatus as the representative sound models. The sound models of the guitar sounds corresponding to the intermediate frets lying between the representative frets are generated by the sound model interpolation processing 5. In the sound model interpolation processing 5, the h-th harmonic component (h = 1 to Hi) of the sound model corresponding to an intermediate fret is generated from the h-th harmonic component (h = 1 to Hi) of the sound model corresponding to the representative fret on the lower side of that intermediate fret and the h-th harmonic component (h = 1 to Hi) of the sound model corresponding to the representative fret on its higher side. Various forms of the sound model interpolation processing 5 are conceivable. In one preferred form, let Fa be the fundamental frequency and c(h | Fa, ma) the h-th harmonic component of the sound model corresponding to the lower representative fret, let Fb be the fundamental frequency and c(h | Fb, mb) the h-th harmonic component of the sound model corresponding to the higher representative fret, and let Fc be the fundamental frequency and c(h | Fc, mc) the h-th harmonic component of the sound model corresponding to the intermediate fret. The h-th harmonic component c(h | Fc, mc) of the sound model corresponding to the intermediate fret is then obtained by the linear interpolation shown in the following equation.

The estimation 41 of the probability density function of the fundamental frequency uses both the sound models corresponding to the intermediate frets obtained in this way and the representative sound models corresponding to the representative frets that were originally stored in the storage device.
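The interpolation equation itself is not reproduced in this text. As a hedged sketch, the following Python code assumes ordinary linear interpolation of each harmonic component with respect to the fundamental frequency, that is, c(h | Fc, mc) = ((Fb − Fc)·c(h | Fa, ma) + (Fc − Fa)·c(h | Fb, mb)) / (Fb − Fa). Whether the specification interpolates on a linear or a logarithmic frequency axis is not known from this passage. The helper `fret_frequency` (equal temperament), the open-string frequency and the harmonic ratios are made-up values for illustration.

```python
def interpolate_model(f_a, c_a, f_b, c_b, f_c):
    """Linearly interpolate harmonic components for an intermediate f_c.

    c_a, c_b -- harmonic components (h = 1..H) of the lower and upper
                representative sound models with fundamentals f_a < f_b
    """
    if not f_a < f_c < f_b:
        raise ValueError("f_c must lie strictly between f_a and f_b")
    t = (f_c - f_a) / (f_b - f_a)
    return [(1.0 - t) * a + t * b for a, b in zip(c_a, c_b)]


def fret_frequency(open_freq, fret):
    """Fundamental frequency at a fret, assuming equal temperament."""
    return open_freq * 2.0 ** (fret / 12.0)


# Representative models at frets 0 and 5 of a string tuned to 110 Hz
# (the harmonic ratios are invented for this example).
f_a = fret_frequency(110.0, 0)
f_b = fret_frequency(110.0, 5)
c_a = [0.5, 0.3, 0.2]
c_b = [0.7, 0.2, 0.1]

# Sound models for the intermediate frets 1 to 4.
models = {
    fret: interpolate_model(f_a, c_a, f_b, c_b, fret_frequency(110.0, fret))
    for fret in range(1, 5)
}
```

Because the two interpolation weights sum to 1, harmonic ratios that sum to 1 at both representative frets also sum to 1 at every intermediate fret.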

The fundamental frequencies at which representative sound models are created are preferably chosen with attention to the overtone characteristics determined by the structure of the instrument. Specifically, in fundamental frequency regions where the overtone structure changes rapidly, it is effective to create representative sound models densely and store them in the storage device.
For example, on a guitar the overtone structure changes continuously up to the highest fret of a given string, but to produce a note a semitone higher, another string capable of higher notes must be played, and the overtone structure becomes discontinuous at that point. Moreover, the overtone structure changes gently near the open string and the low frets, whereas near the high frets it changes greatly even for a difference of one fret. To reflect this, it is effective to choose the fundamental frequencies for representative sound models at coarse intervals in the low range and more densely toward the high range. On a piano, the string structure differs by frequency band: the lowest range has one string per note, the low range has two strings, the high range has three strings, and the highest range has three strings but, unlike the others, no mute mechanism. Accordingly, the timbre, that is, the overtone structure, also changes discontinuously at specific frequencies. If representative sound models are placed densely at such discontinuities, the estimation accuracy of the fundamental frequency can be improved even with a small number of sound models.

According to the present embodiment described above, it is possible to reduce the amount of sound model data to be stored while representing the sound source characteristics, which differ from one pitch range to another, in greater detail as sound models, and furthermore to adjust the shape of the sound models optimally to the actual input sound by adjusting a small number of parameters.

<B. Second Embodiment>
In the first embodiment, considering that the spectral shape of a sound differs depending on the fundamental frequency, sound models corresponding to a larger variety of fundamental frequencies were generated by interpolation from a relatively small number of representative sound models. In the present embodiment, in the estimation 41 of the probability density function of the fundamental frequency, a fundamental frequency range is set for each type of sound model (the representative sound models and the sound models obtained by the sound model interpolation processing 5) in accordance with the fundamental frequency that the sound model originally had. The weight value for the sound model at frequencies outside this range is restricted, and the weight values for the sound models are then optimized. The details are as follows.

First, for the estimation of the probability density function of the fundamental frequency by the EM algorithm, an applicable fundamental frequency range is defined for each type of sound model. Various methods are conceivable for determining the lower limit Flm and the upper limit Fhm of this range. For example, when the m-th type of sound model p(x | F, m, μ^(t)(F, m)) is the sound model of a guitar sound at a certain fret position, Flm may be set to the frequency midway between the fundamental frequency at that fret position and the fundamental frequency at the adjacent lower fret position, and Fhm may be set to the frequency midway between the fundamental frequency at that fret position and the fundamental frequency at the adjacent higher fret position. Alternatively, the interval between Flm and Fhm may be made wider so that the applicable ranges of sound models with adjacent fundamental frequencies overlap.
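The midpoint rule for Flm and Fhm can be sketched as follows. The arithmetic midpoint and the treatment of the lowest and highest models (left unbounded on their outer side) are assumptions of this example; the text does not specify them. The function name `applicable_ranges` is hypothetical.

```python
def applicable_ranges(fundamentals):
    """Return (Flm, Fhm) for each sound model.

    fundamentals -- fundamental frequencies of the sound models,
                    sorted in ascending order
    """
    ranges = []
    for i, f in enumerate(fundamentals):
        # Lower limit: midpoint to the adjacent lower model, if any.
        low = (fundamentals[i - 1] + f) / 2.0 if i > 0 else float("-inf")
        # Upper limit: midpoint to the adjacent higher model, if any.
        high = ((f + fundamentals[i + 1]) / 2.0
                if i + 1 < len(fundamentals) else float("inf"))
        ranges.append((low, high))
    return ranges


# Made-up fundamentals in Hz.
ranges = applicable_ranges([100.0, 200.0, 400.0])
```

To make adjacent ranges overlap, as the text allows, each limit could be moved outward by a margin.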

In the present embodiment, during the EM algorithm, the m-th type of sound model p(x | F, m, μ^(t)(F, m)) is prevented from being used to estimate the probability density at fundamental frequencies outside its applicable range (that is, fundamental frequencies with F < Flm or F > Fhm). Specifically, if no measure were taken, the m-th type of sound model p(x | F, m, μ^(t)(F, m)) could be used in the EM algorithm as a sound model for many different fundamental frequencies F. Therefore, when the recurrence formulas (29) to (32) are iterated, the initial value of the weight w(F, m) is set to 0 for those sound models p(x | F, m, μ^(t)(F, m)) whose fundamental frequency F satisfies F < Flm or F > Fhm.

In this way, the m-th type of sound model p(x | F, m, μ^(t)(F, m)) is never used to estimate the probability density at a fundamental frequency F with F < Flm or F > Fhm. This processing is performed for all types of sound models.
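Why a zero initial value excludes a model can be seen in a small Python sketch. It assumes a maximum likelihood weight update of the multiplicative form w_new ∝ w_old × (responsibility term), as in a plain EM for mixture weights. It does not reproduce recurrence formulas (29) to (32), and the behavior when a prior distribution is included is not covered. The names `initial_weights` and `em_step` and all numbers are made up for this example.

```python
def initial_weights(f_grid, ranges):
    """Uniform weights inside each model's range (Flm, Fhm), zero outside."""
    inside = {(f, m): lo <= f <= hi
              for m, (lo, hi) in enumerate(ranges) for f in f_grid}
    share = 1.0 / sum(inside.values())
    return {key: (share if ok else 0.0) for key, ok in inside.items()}


def em_step(weights, densities, observed):
    """One multiplicative EM update of the weights w(F, m)."""
    new = dict.fromkeys(weights, 0.0)
    for x, p_x in enumerate(observed):
        mix = sum(w * densities[key][x] for key, w in weights.items())
        if mix > 0.0:
            for key, w in weights.items():
                # A weight that is 0 contributes 0 here, so it stays 0.
                new[key] += p_x * w * densities[key][x] / mix
    total = sum(new.values())
    return {key: value / total for key, value in new.items()}


f_grid = [100.0, 200.0]
ranges = [(0.0, 150.0), (150.0, 300.0)]   # (Flm, Fhm) for m = 0 and m = 1
weights = initial_weights(f_grid, ranges)
densities = {                             # invented densities over three bins
    (100.0, 0): [0.6, 0.3, 0.1],
    (200.0, 0): [0.2, 0.5, 0.3],
    (100.0, 1): [0.5, 0.4, 0.1],
    (200.0, 1): [0.1, 0.3, 0.6],
}
observed = [0.3, 0.3, 0.4]
for _ in range(10):
    weights = em_step(weights, densities, observed)
```

After the iterations, the weights for (200.0, 0) and (100.0, 1) are still exactly 0, so those out-of-range combinations contribute nothing to the estimated density.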

According to this aspect, by individually defining the applicable fundamental frequency range (Flm, Fhm) of each sound model in accordance with the fundamental frequency range of the corresponding sound that the sound source can produce, an appropriate estimate of the fundamental frequency can be made that takes the fundamental frequency range of each sound into account.

<Other embodiments>
Embodiments of the present invention have been described above, but other embodiments are also possible. Examples are as follows.

(1) The sound model interpolation processing 5 is not limited to linear interpolation; generally known interpolation methods such as zero-order interpolation and spline interpolation can be widely used. Furthermore, since the representative sound models themselves may contain errors, the sound models between the representative sound models may be obtained by using autoregression.

(2) Using many types of sound models with different shapes contributes to improving the estimation accuracy of the fundamental frequency, but no improvement can be expected from using many sound models of similar shape. Therefore, in the sound model interpolation processing 5, interpolation may be performed not for every interval between representative sound models but only for those intervals in which the shape of the sound model changes by more than a certain amount.

(3) In the sound model interpolation processing 5, the sound models may be interpolated by different interpolation methods depending on the frequency region. For example, the sound models may be obtained by linear interpolation in frequency regions where the shape of the sound model changes gently with the fundamental frequency, and by higher-order interpolation in frequency regions where the shape changes relatively sharply with the fundamental frequency. Alternatively, instead of changing the interpolation method itself, the parameters of the interpolation may be changed according to the frequency region.

(4) In each of the above embodiments, the final fundamental frequency is determined by having the multi-agent model track the fundamental frequency obtained by the estimation 41 of the probability density function of the fundamental frequency. However, when the probability of erroneous estimation in the estimation 41 is low and a highly reliable result is obtained, the tracking by the multi-agent model may be omitted.

(5) In each of the above embodiments, "Extension 2" (estimation of the sound model parameters) is introduced into the sound analysis apparatus in addition to "Extension 1" (multiplexing of sound models). However, "Extension 2" may be omitted, and in the estimation of the probability density function of the fundamental frequency, for example, only recurrence formula (29) of recurrence formulas (29) and (30) may be evaluated iteratively so that only the weights w(F, m) of the sound models are updated.

(6) In each of the above embodiments, "Extension 3" (introduction of a prior distribution on the model parameters) is introduced into the sound analysis apparatus, but the sound analysis apparatus may be configured without it.
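Variation (3) above, in which the interpolation method depends on the frequency region, might look like the following Python sketch for a single harmonic component. Linear interpolation is used below a switching frequency and a quadratic (three-point Lagrange) interpolation above it, as one possible "higher-order" choice. The switching frequency `f_switch`, the function names and the data are assumptions for illustration. A practical version would also need to keep interpolated values non-negative.

```python
def linear(points, f):
    """Linear interpolation between two (frequency, value) points."""
    (f0, y0), (f1, y1) = points
    return y0 + (y1 - y0) * (f - f0) / (f1 - f0)


def quadratic(points, f):
    """Lagrange polynomial through three (frequency, value) points."""
    total = 0.0
    for i, (fi, yi) in enumerate(points):
        term = yi
        for j, (fj, _) in enumerate(points):
            if i != j:
                term *= (f - fj) / (fi - fj)
        total += term
    return total


def interpolate_by_region(points, f, f_switch):
    """Interpolate one harmonic component at fundamental frequency f.

    points   -- (fundamental, value) pairs of the representative models,
                sorted by fundamental
    f_switch -- hypothetical boundary between the gentle and steep regions
    """
    if f < f_switch:
        # Gentle region: linear interpolation between the bracketing pair.
        for left, right in zip(points, points[1:]):
            if left[0] <= f <= right[0]:
                return linear([left, right], f)
        raise ValueError("f is outside the range of the representative models")
    # Steep region: quadratic through the three nearest representatives.
    nearest = sorted(points, key=lambda p: abs(p[0] - f))[:3]
    return quadratic(nearest, f)


# Made-up values of one harmonic component at four representative fundamentals.
points = [(100.0, 0.50), (200.0, 0.48), (300.0, 0.40), (400.0, 0.20)]
gentle = interpolate_by_region(points, 150.0, f_switch=250.0)
steep = interpolate_by_region(points, 350.0, f_switch=250.0)
```

At 350 Hz the quadratic interpolation yields a slightly higher value than the linear result of 0.30, because it follows the accelerating fall of the component across the three nearest representatives.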

FIG. 1 is a diagram showing the processing of the sound analysis program according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the sequential tracking of the fundamental frequency by the multi-agent model, which consists of one feature detector and a plurality of agents.
FIG. 3 is a diagram showing an example of the selection of representative sound models and the content of the sound model interpolation processing in the same embodiment.

Explanation of symbols

1 ... calculation of instantaneous frequency, 2 ... calculation of frequency component candidates, 3 ... frequency band limiting, 4a ... melody line estimation, 4b ... bass line estimation, 41 ... estimation of the probability density function of the fundamental frequency, 42 ... sequential tracking of the fundamental frequency by the multi-agent model, 5 ... sound model interpolation processing.

Claims

A sound analysis apparatus comprising:
storage means for storing a plurality of types of sound models, each defining the harmonic structure of one of a plurality of types of sounds generated from a musical instrument and each corresponding to a fundamental frequency unique to that sound;
interpolation means for ordering the plurality of types of sound models stored in the storage means according to their fundamental frequencies, applying interpolation processing based on the fundamental frequency to the ordered sound models, and generating a plurality of types of sound models corresponding to fundamental frequencies intermediate between the ordered sound models;
probability density function estimating means for using the plurality of types of sound models stored in the storage means and the plurality of types of sound models generated by the interpolation means to form a mixed distribution by weighted addition of a plurality of sound models having various harmonic structures and fundamental frequencies, optimizing the weight value for each sound model so that the mixed distribution becomes the distribution of the frequency components of an input acoustic signal, and estimating the optimized weight values of the sound models as the probability density function of the fundamental frequency of the sound of the sound source from which the input acoustic signal originated; and
fundamental frequency estimating means for estimating and outputting the fundamental frequency of the sound of one or a plurality of sound sources in the input acoustic signal on the basis of the probability density function of the fundamental frequency.

The sound analysis apparatus according to claim 1, wherein the probability density function estimating means sets, for each type of sound model, a fundamental frequency range in accordance with the fundamental frequency of that sound model, and optimizes the weight value for each sound model while restricting the weight value for that sound model at frequencies outside the set fundamental frequency range.

2. The sound analysis apparatus according to claim 1, wherein the fundamental frequency estimation means detects a plurality of peaks in the probability density function and, based on the reliability of each peak, outputs a fundamental frequency having high reliability and high power.

The sound analysis apparatus according to any one of claims 1 to 3, wherein the interpolation means executes the interpolation processing by an interpolation calculation method that differs depending on a frequency domain.

A computer program causing a computer to function as:
storage means for storing a plurality of types of sound models, each defining the harmonic structure of one of a plurality of types of sounds generated from a musical instrument and each corresponding to a fundamental frequency unique to that sound;
interpolation means for ordering the plurality of types of sound models stored in the storage means according to their fundamental frequencies, applying interpolation processing based on the fundamental frequency to the ordered sound models, and generating a plurality of types of sound models corresponding to fundamental frequencies intermediate between the ordered sound models;
probability density function estimating means for using the plurality of types of sound models stored in the storage means and the plurality of types of sound models generated by the interpolation means to form a mixed distribution by weighted addition of a plurality of sound models having various harmonic structures and fundamental frequencies, optimizing the weight value for each sound model so that the mixed distribution becomes the distribution of the frequency components of an input acoustic signal, and estimating the optimized weight values of the sound models as the probability density function of the fundamental frequency of the sound of the sound source from which the input acoustic signal originated; and
fundamental frequency estimating means for estimating and outputting the fundamental frequency of the sound of one or a plurality of sound sources in the input acoustic signal on the basis of the probability density function of the fundamental frequency.