JP5696828B2 - Signal processing device

Info

Publication number: JP5696828B2
Application number: JP2010003792A
Authority: JP
Inventors: 広臣四童子
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2010-01-12
Filing date: 2010-01-12
Publication date: 2015-04-08
Anticipated expiration: 2030-01-12
Also published as: JP2011145326A

Description

The present invention relates to a signal processing device that determines the content of an input audio signal.

In recent years, multi-channel audio devices have become widespread. A multi-channel audio device reproduces audio signals with more channels than two-channel stereo (multi-channel), such as 5.1 channels, and outputs these signals from a plurality of speakers installed at various places in a room, thereby reproducing audio with a three-dimensional spread (Patent Document 1).

In conventional multi-channel audio signals, the kind of content allocated to each channel (channel assignment) was almost uniform. That is, speech such as dialogue was assigned to the center channel, musical tones such as BGM to the front left and right channels, and other sounds such as environmental sounds and sound effects to the surround left and right channels.

A multi-channel audio device has a sound field control function that creates the acoustics of a virtual space such as a hall by adding reflected sound and reverberation to the reproduced audio signal. However, if effects such as reflected sound and reverberation are strongly added to speech such as dialogue, intelligibility drops and it becomes hard to hear what the performers are saying. For this reason, the sound field control amount of the channel on which speech is reproduced is set smaller than that of the other channels.

In the conventional content described above, speech such as dialogue is generally allocated to the center channel. Conventional multi-channel audio devices were therefore preset so that the sound field control amount of the center channel is small and that of the other channels is large or medium.

However, with the start of terrestrial digital broadcasting and similar developments, the multi-channel audio content that can be played at home has diversified, and more content departs from the channel assignment of conventional movies. In other words, content in which speech is assigned to a front or surround channel rather than the center channel is increasing.

When such multi-channel audio content is reproduced with the conventional sound field control settings, strong reflected sound and reverberation are applied to speech such as dialogue, and intelligibility drops. Conversely, when a musical tone such as BGM is reproduced on the center channel, no sound field effect is applied to the BGM, so it fails to enhance the atmosphere.

It is therefore conceivable to detect what kind of sound is being reproduced on which channel and to adjust the sound field control amount of each channel accordingly. In particular, it is conceivable to detect the channel on which speech is being reproduced and to reduce the sound field control amount of that channel. Methods such as those of Patent Documents 2 and 3 have been proposed for detecting speech in an audio signal.

Patent Document 2 describes detecting speech using the autocorrelation function of the time waveform of voiced sound. Patent Document 3 describes obtaining, by instantaneous frequency analysis, the degree to which harmonic structure components occupy an acoustic signal, and detecting speech segments on that basis.

Patent Document 1: JP-A-8-275300

Patent Document 2: Japanese Examined Patent Publication No. 4-55320

Patent Document 3: Japanese Patent No. 3892379

However, the method of Patent Document 2 detects speech solely from the autocorrelation function of the time waveform, so it erroneously detects as speech a signal that is periodic but has no harmonic structure, such as a sine wave. The method of Patent Document 3 requires instantaneous frequency analysis over the entire frequency band for every frame, so the amount of computation is enormous.
Moreover, neither method can distinguish among speech, single-note musical tones, ensemble musical tones, and other sounds.

An object of the present invention is to provide a signal processing device that can accurately detect the content of an audio signal with processing that is as simple as possible.

The present invention comprises: a musical tone determination unit that determines whether an audio signal is a musical tone by comparing the energy of the scale frequency components of the audio signal with the energy of its full-band components; a harmonicity determination unit that, when the musical tone determination unit has not determined the audio signal to be a musical tone, determines whether the audio signal is a harmonic sound or an other sound by determining whether the audio signal has harmonicity; and a speech/musical tone determination unit that, when the harmonicity determination unit has determined the audio signal to be a harmonic sound, determines whether the audio signal is speech or a musical tone based on whether the pitch frequency of the audio signal matches a scale frequency, or on whether the pitch frequency fluctuates.
In the above invention, the harmonicity determination unit may determine the presence or absence of harmonicity based on the autocorrelation function of the frequency spectrum obtained by a short-time Fourier transform, and the speech/musical tone determination unit may include means for obtaining an accurate pitch frequency by performing instantaneous frequency analysis only in the vicinity of the approximate pitch frequency obtained from that autocorrelation function.

According to the present invention, the content of an audio signal (speech, musical tone, and so on) can be determined with relatively simple processing.

FIG. 1: Block diagram of an audio device including a signal processing unit according to an embodiment of the present invention.
FIG. 2: Diagram showing examples of channel assignment of a multi-channel audio signal.
FIG. 3: Block diagram of the signal processing unit.
FIG. 4: Flowchart showing the processing of the content determination unit of the signal processing unit.
FIG. 5: Diagram explaining the musical tone determination process of the content determination unit.
FIG. 6: Flowchart showing the harmonicity determination process of the content determination unit.
FIG. 7: Diagram showing frequency spectra and autocorrelation functions of various audio signals.
FIG. 8: Flowchart showing the speech/musical tone determination process of the content determination unit.
FIG. 9: Diagram explaining the correlation between STFT frequency bins and the instantaneous frequency.

<Configuration of the audio device>
FIG. 1 is a block diagram of an audio device including a signal processing unit according to an embodiment of the present invention. The audio device has a content playback device 2, an audio amplifier 1, and a plurality of speakers 3. The audio amplifier 1 has a signal processing unit 4 and an amplifier circuit 5.

The content playback device 2 is, for example, a DVD player that plays DVDs such as movies, or a television tuner that receives satellite or terrestrial television broadcasts. The content playback device 2 inputs a multi-channel (for example, 5.1-channel) audio signal to the audio amplifier 1. The signal processing unit 4 of the audio amplifier 1 applies processing such as equalization and sound field control to the multi-channel audio signal input from the content playback device 2 and then passes it to the amplifier circuit 5. The amplifier circuit 5 amplifies each channel of the multi-channel audio signal individually and outputs it to the speaker 3 corresponding to that channel.

The speakers 3 are installed at various places in the listening room. The sound of each channel is emitted from its speaker 3, forming a spacious sound field in the listening room.

<Examples of content channel assignment>
The channel assignment of the multi-channel audio signal input from the content playback device 2 to the audio amplifier 1 is described here with reference to FIG. 2.

FIG. 2(A) shows the channel assignment of the multi-channel audio signal of typical movie content. This embodiment is described using a 5.1-channel audio signal as an example. A 5.1-channel audio signal consists of a center channel C, a front left channel FL, a front right channel FR, a surround (rear) left channel SL, a surround (rear) right channel SR, and a subwoofer channel SW. Because the subwoofer channel SW is formed by collecting the deep bass signals of the other channels, five channels are input from the content playback device 2. The following therefore describes the channel assignment of these five channels: center channel C, front left channel FL, front right channel FR, surround left channel SL, and surround right channel SR.

In typical content, speech such as dialogue is assigned to the center channel C, musical tones such as BGM to the front left and right channels FL and FR, and other sounds (sound effects, environmental sounds, and the like) to the surround left and right channels SL and SR.

In general, the amount of sound field effect applied to speech (the sound field control amount) is kept small so that what is being said does not become unclear. For musical tones such as BGM, the sound field control amount is made large so that the sound is rich. For other sounds such as environmental sounds and sound effects, the sound field control amount is set to a medium level. Accordingly, the sound field control amount is set to "small" for the center channel C, "large" for the front left and right channels FL and FR, and "medium" for the surround left and right channels SL and SR.

FIG. 2(B), by contrast, shows an example of channel assignment for content other than typical movie content, for example the multi-channel audio signal of a digital television broadcast. In this example, the center channel C is silent, speech such as dialogue together with BGM is assigned to the front left channel FL, musical tones such as BGM to the front right channel FR, and other sounds to the surround left and right channels SL and SR.

In this case, the sound field control amount of the center channel C is arbitrary (with no input signal, the sound field effect is effectively zero), that of the front left and right channels FL and FR is set to "small", and that of the surround left and right channels SL and SR is set to "medium".

That is, speech and musical tones are mixed in the front left channel FL, but in this case speech takes priority and the sound field control amount is set to "small". The front right channel FR carries only musical tones, but an imbalance in sound field control between the left and right channels could give the listener an unstable impression, so its sound field control amount is set to "small" to match the front left channel FL. Alternatively, the sound field control amount of the front right channel FR may be set to "large" to suit the musical tone, or to "medium" as a compromise between the two.
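The assignment rules described for FIG. 2 can be sketched as a small lookup with two extra rules: speech takes priority in a mixed channel, and a left/right pair is matched to the smaller of its two amounts. This is only an illustrative sketch. The names `channel_amount` and `assign_amounts`, the string labels, and the choice to always balance a pair downward are assumptions, since the text also allows "large" or "medium" for the front right channel.

```python
# Illustrative sketch only; names and labels are hypothetical, not from the source.
ORDER = ["small", "medium", "large"]
PAIRS = (("FL", "FR"), ("SL", "SR"))


def channel_amount(contents):
    """Map the set of content types found in one channel to a control amount.

    Speech takes priority over musical tones, which take priority over other
    sounds. A silent channel returns None because its amount is arbitrary.
    """
    if "speech" in contents:
        return "small"
    if "musical tone" in contents:
        return "large"
    if "other sound" in contents:
        return "medium"
    return None


def assign_amounts(channels):
    """channels maps C, FL, FR, SL, SR to a set of detected content types."""
    amounts = {name: channel_amount(c) for name, c in channels.items()}
    # Assumed balancing rule: both channels of a pair take the smaller amount.
    for left, right in PAIRS:
        present = [a for a in (amounts[left], amounts[right]) if a is not None]
        if present:
            amounts[left] = amounts[right] = min(present, key=ORDER.index)
    return amounts


fig_2b = {
    "C": set(),
    "FL": {"speech", "musical tone"},
    "FR": {"musical tone"},
    "SL": {"other sound"},
    "SR": {"other sound"},
}
print(assign_amounts(fig_2b))
```

With the FIG. 2(B) assignment, the sketch yields "small" for FL and FR and "medium" for SL and SR, and leaves C undetermined.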

<Configuration of the signal processing unit>
FIG. 3 shows a configuration example of the signal processing unit 4. The signal processing unit 4 is a functional unit that performs various kinds of processing such as equalization and applying sound field effects, but FIG. 3 shows only the components that apply the sound field effect. The input unit 10 consists of five input units: a center channel input unit 10C, a front left channel input unit, a front right channel input unit, a surround left channel input unit, and a surround right channel input unit. The audio signal of the corresponding channel (C, FL, FR, SL, SR) is input to each.
In the following, for components that, like the input unit 10, are provided in parallel for the five channels, the per-channel description is omitted.

The audio signal input from the input unit 10 is passed to the content determination unit 14 and the delay unit 11. The content determination unit 14 is provided in parallel for the five channels and determines the content of the audio signal of each channel. Here, "content" means information indicating whether the audio signal is speech, a musical tone, or an other sound.

The content determination unit 14 distinguishes speech, musical tones, and other sounds by measuring the presence or absence of harmonic structure, the modulation spectrum, the overtone structure, the rate of frequency change, and so on. The details of its determination process are described later.

The delay unit 11 delays the audio signal by the time the content determination unit 14 needs to determine its content. This eliminates the lag that would otherwise occur in sound field control based on the determination result of the content determination unit 14.

The determination result of the content determination unit 14 is input to the coefficient control unit 15. The coefficient control unit 15 determines the sound field control amount for the audio signal of each channel according to its content, following rules such as those shown in FIG. 2. It then outputs, for each channel, a coefficient that scales the audio signal to the input level corresponding to that sound field control amount. (The original text attributes this last step to the content determination unit 14, but the coefficient multiplication unit 16 receives its coefficients from the coefficient control unit 15.) The coefficient is input to the coefficient multiplication unit 16.

The coefficient multiplication unit 16 multiplies the audio signal delayed by the delay unit 11 by the coefficient input from the coefficient control unit 15 and passes the result to the addition unit 17. The coefficient multiplication unit 16 is provided in parallel for the five channels. The addition unit 17 sums the five channel audio signals, each multiplied by its coefficient. The level of the summed audio signal is adjusted by the level control unit 18, after which the sound field effect generation unit 19 applies a sound field effect including early reflections and reverberation.

The higher the level of the audio signal input to the sound field effect generation unit 19, the louder the sound field effect sound (reflected sound and reverberation) that it generates. The coefficients generated by the coefficient control unit 15 therefore control the degree of sound field effect applied to the audio signal of each channel.

Based on the sound field data 20, the sound field effect generation unit 19 reproduces the acoustics of a hall, a room, or the like. That is, it generates the early reflections and reverberation that occur in such a space. This processing includes filtering to simulate the changes in frequency characteristics caused by propagation through space and by reflection, generation of early reflections by delay and coefficient multiplication, and generation of late reverberation.

The sound field effect sound generated by the sound field effect generation unit 19 is added to the dry audio signal via the coefficient multiplication unit 21 and the addition unit 12. The coefficient multiplication unit 21 and the addition unit 12 are also provided in parallel for the five channels. In general, speech is more intelligible when no sound field effect sound is added to the channel that outputs it, so the coefficient multiplication unit 21 sets the gain with which the sound field effect sound is added to the speech channel to 0.

The coefficients input to the coefficient multiplication unit 21 may also be set by the coefficient control unit 15. The coefficient may be "0" for the channel that outputs speech and "1" for the other channels, or it may be varied per channel to intermediate values between "0" and "1".

With this control, each channel receives a wide, rich sound field effect while it is reproducing anything other than dialogue, and when dialogue is reproduced the amount of sound field effect applied to it is reduced to avoid excessive reverberation. A rich sound field effect and clear dialogue can thus be achieved together.

<Processing of the content determination unit 14>
The content determination process of the content determination unit 14 is described with reference to FIGS. 4 to 9. This process is executed once per frame (40 ms). The harmonicity determination process (S4) uses data from the three frames before and after the current frame in addition to the current frame itself, so the determination is delayed by three frames. The delay unit 11 delays the audio signal by exactly this determination delay.

FIG. 4 is a flowchart showing the overall content determination process. First, the musical tone determination process is performed (S1). This process measures the proportion of the frequency components of the audio signal that lie at scale frequencies. Its details are described later with reference to FIG. 5. If the musical tone determination process determines that the signal is a musical tone (YES in S2), "musical tone" is output as the content determination result (S3) and the process ends.

If the musical tone determination process does not determine the signal to be a musical tone (NO in S2), the harmonicity determination process is performed (S4). This process determines whether the audio signal has harmonicity, that is, whether it has a spectral structure consisting of a fundamental and overtone components at integer multiples of the fundamental. Its details are described later with reference to FIG. 6. If the harmonicity determination process finds no harmonicity (NO in S5), "other sound" is output as the content determination result (S6). If it finds harmonicity (YES in S5), the audio signal is considered to be either speech or a musical tone, so the speech/musical tone determination process (S7) is performed.
This is because speech and musical tones have harmonicity, whereas sounds such as environmental sounds and sound effects do not.

The speech/musical tone determination process calculates an accurate fundamental frequency (pitch) and determines whether the audio signal is a musical tone or speech based on whether this pitch matches a scale frequency, or whether the pitch is free of large fluctuations. The details of this process are described later with reference to FIG. 8 (the original text cites FIG. 7 here, but FIG. 8 is the flowchart of this process). If the result is speech, "speech" is output as the content determination result (S9). If the result is a musical tone, "musical tone" is output as the content determination result (S10).
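The overall flow of FIG. 4 is a three-stage cascade in which each later, more expensive stage runs only when the earlier one is inconclusive. The sketch below shows only that control flow. The three stage functions are passed in as hypothetical callables (`is_musical_tone`, `harmonic_pitch`, `pitch_is_musical`), because their internals are described separately, and representing "no harmonicity" as `None` is an assumption of this sketch.

```python
def classify_frame(frame, is_musical_tone, harmonic_pitch, pitch_is_musical):
    """Return "musical tone", "speech", or "other sound" for one frame.

    is_musical_tone(frame)  -> bool           (S1/S2, scale-energy test)
    harmonic_pitch(frame)   -> float or None  (S4/S5, None means no harmonicity)
    pitch_is_musical(pitch) -> bool           (S7, scale match or stable pitch)
    """
    if is_musical_tone(frame):          # S2: YES
        return "musical tone"           # S3, later stages are skipped
    pitch = harmonic_pitch(frame)       # S4
    if pitch is None:                   # S5: NO
        return "other sound"            # S6
    if pitch_is_musical(pitch):         # S7
        return "musical tone"           # S10
    return "speech"                     # S9


# Usage with stand-in stage functions.
print(classify_frame([0.0], lambda f: False, lambda f: 150.0, lambda p: False))
```

Passing the stages as callables makes the short-circuit visible: when the first test succeeds, the harmonicity and pitch stages are never called.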

FIG. 5(A) is a flowchart showing the musical tone determination process. This process determines whether the audio signal is a musical tone (in particular, an ensemble musical tone) by measuring how much of the energy of the full frequency band of the audio signal is accounted for by the scale frequency components.

First, the energy of the scale frequency components in the audio signal and the energy of the full frequency band are measured (S20). FIG. 5(B) shows a block diagram of the functional unit that measures the energy of the audio signal. The energy of the scale frequency components is the sum of the energies of the twelve notes of a specific octave. The specific octave may be one in which the melody is played, for example the octave C3 to B3. For this purpose, twelve band-pass filters (BPFs) are provided, one for each semitone from C to B. The frequency component that passes through each filter is integrated to obtain the energy of that component, and these energies are summed. This sum is the energy of the scale frequency components. Separately, the audio signal is integrated directly to obtain the energy of the full frequency band.

The energy of the scale frequency components obtained in S20 is compared with the energy of the full-band components (S21). If the proportion of energy accounted for by the scale frequency components is equal to or greater than a predetermined value (YES in S22), "musical tone" is output as the determination result (S23). If the proportion is below the predetermined value (NO in S22), the process ends without outputting a determination result.

Because whether the audio signal is a musical tone can thus be determined with only several BPF operations and integrations, when this process determines the audio signal to be a musical tone, the processing from S4 onward in FIG. 4 can be skipped, which greatly reduces the processing load. In addition, this musical tone determination process can easily detect even an ensemble of several instruments, which does not show clear harmonicity, because many components still appear at the scale frequencies.
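The energy comparison of S20 to S22 can be sketched as follows. Note the substitutions: instead of twelve band-pass filters followed by integrators, this sketch uses the Goertzel algorithm to measure the energy at each semitone frequency, which plays a similar role but is not the circuit of FIG. 5(B). The sample rate, the one-second analysis window, the threshold of 0.5, and the reading of "C3 to B3" as MIDI notes 48 to 59 with A4 = 440 Hz are all assumptions, since the source states none of them.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for this sketch


def goertzel_power(samples, freq_hz, sample_rate=SAMPLE_RATE):
    """Squared DFT magnitude |X(f)|^2 of samples at freq_hz (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def scale_frequencies(low_midi=48, high_midi=59, a4_hz=440.0):
    """Equal-tempered semitone frequencies, assumed C3..B3 (MIDI 48..59)."""
    return [a4_hz * 2.0 ** ((m - 69) / 12.0) for m in range(low_midi, high_midi + 1)]


def scale_energy_ratio(samples, sample_rate=SAMPLE_RATE):
    """Energy at the twelve scale frequencies divided by full-band energy."""
    total = sum(x * x for x in samples)          # full-band energy
    if total == 0.0:
        return 0.0
    n = len(samples)
    # 2|X(f)|^2 / N approximates the time-domain energy of a real sinusoid at f.
    scale = sum(2.0 * goertzel_power(samples, f, sample_rate) / n
                for f in scale_frequencies())
    return scale / total


def is_musical_tone(samples, threshold=0.5, sample_rate=SAMPLE_RATE):
    """S22: True when the scale components hold at least `threshold` of the energy."""
    return scale_energy_ratio(samples, sample_rate) >= threshold


def sine(freq_hz, seconds=1.0, sample_rate=SAMPLE_RATE):
    n = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


print(is_musical_tone(sine(220.0)))   # tone on a scale frequency (A3 here)
print(is_musical_tone(sine(1000.0)))  # tone outside the analyzed octave
```

A tone at a scale frequency puts nearly all of its energy into one of the twelve measurements, so the ratio is close to 1, while a tone outside the analyzed octave gives a ratio near 0.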

FIG. 6 is a flowchart showing the harmonicity determination process. In this process, the audio signal is subjected to a short-time Fourier transform (STFT), and the autocorrelation of the resulting frequency spectrum is computed to determine the presence or absence of harmonicity and the peak frequency (the approximate pitch frequency).

The STFT is performed on five frames of data: the current frame together with the two frames before and after it. The frequency spectrum P(T) of the current frame is then taken to be the average of the STFT result of the current frame, that of the previous frame, and that of the next frame. Because the STFT of the next frame itself needs data up to two frames beyond it, the frequency spectrum of the current frame becomes available three frames after the current frame.

Averaging the frequency spectra of several frames in this way emphasizes frequency components that persist over time. Noise components such as background sounds are not persistent and so are not emphasized in the spectrum, whereas persistent components such as speech and musical tones have their harmonic components emphasized. As a result, even when low-level speech or musical tones are buried in background sound in the audio signal, they can be detected and their peak frequency measured.
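The averaging step can be illustrated with a toy example. The function below averages the magnitude spectra of the previous, current, and next frames bin by bin. The function name `averaged_spectrum` and the four-bin spectra are made up for illustration; real spectra would come from the STFT described above.

```python
def averaged_spectrum(spectra, t):
    """Average the spectra of frames t-1, t and t+1 bin by bin.

    spectra is a list of per-frame magnitude spectra of equal length.
    """
    prev, cur, nxt = spectra[t - 1], spectra[t], spectra[t + 1]
    return [(a + b + c) / 3.0 for a, b, c in zip(prev, cur, nxt)]


# Toy data: bin 2 holds a weak but persistent component, while a stronger
# noise burst lands in a different bin in every frame.
frames = [
    [0.0, 3.0, 2.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [3.0, 0.0, 2.0, 0.0],
]
print(averaged_spectrum(frames, 1))
```

In every individual frame the noise burst is the largest value, yet after averaging the persistent component in bin 2 is the largest, which is the emphasis effect the text describes.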

In FIG. 6, the short-time Fourier transform is first performed by the method described above to obtain the frequency spectrum P(T) of the current frame (time T) (S31). FIG. 7(A) shows an example of the FFT result. This example is the spectrum of a signal containing speech only.

Next, the autocorrelation of this frequency spectrum is computed (S32). FIGS. 7(B), 7(C), and 7(D) show examples of autocorrelation functions. FIG. 7(B) is the autocorrelation function of the speech-only frequency spectrum shown in FIG. 7(A), and the autocorrelation appears clearly. FIG. 7(C) is an example of the autocorrelation function of the frequency spectrum of an audio signal containing both speech and non-speech components. Because speech occupies only a narrow frequency band, autocorrelation appears where the frequency difference is small but is disturbed where the frequency difference is large. FIG. 7(D) is an example of the autocorrelation function of the frequency spectrum of an other sound. Because other sounds have no harmonicity, their frequency spectrum shows no autocorrelation at all.

The first peak of the autocorrelation function is detected, and the frequency difference at that peak is taken as the peak frequency Fa (S33). If the peak frequency Fa cannot be detected, as in the example of FIG. 7(D) (NO in S34), a determination result of "no harmonicity" is output (S39) and this process ends.
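Steps S32 to S34 can be sketched as follows. This is a minimal illustration under stated assumptions: the mean removal, the normalization, the peak threshold of 0.3, the bin spacing `bin_hz`, and the function names are all hypothetical choices, since the text does not specify how the autocorrelation or its first peak is computed.

```python
def spectrum_autocorrelation(spectrum):
    """Autocorrelation of a magnitude spectrum as a function of bin lag.

    The mean is removed first and the result is normalized so that
    lag 0 equals 1 (both are assumptions of this sketch).
    """
    n = len(spectrum)
    mean = sum(spectrum) / n
    centered = [s - mean for s in spectrum]
    energy = sum(c * c for c in centered)
    if energy == 0.0:
        return [0.0] * n
    return [
        sum(centered[k] * centered[k + lag] for k in range(n - lag)) / energy
        for lag in range(n)
    ]


def first_peak_lag(acf, threshold=0.3):
    """Smallest lag >= 1 that is a local maximum above the threshold.

    Returns None when no such peak exists (the "no harmonicity" case).
    """
    for lag in range(1, len(acf) - 1):
        if acf[lag] > threshold and acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None


# Toy harmonic spectrum: a partial every 4 bins.
harmonic = [1.0 if k % 4 == 0 else 0.0 for k in range(32)]
# Toy structureless spectrum.
flat = [1.0] * 32

harmonic_lag = first_peak_lag(spectrum_autocorrelation(harmonic))
flat_lag = first_peak_lag(spectrum_autocorrelation(flat))

bin_hz = 10.0  # hypothetical STFT bin spacing in Hz
fa_hz = harmonic_lag * bin_hz  # peak frequency Fa as a frequency difference
```

For the harmonic toy spectrum the first peak falls at a lag of 4 bins, which corresponds to the spacing of the partials; for the flat spectrum no peak is found.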

If the peak frequency Fa is detected, it is compared with the pitch frequency F(T-3) of the immediately preceding frame (T-3) (S35). If the difference is equal to or less than a predetermined value (that is, the two substantially coincide) (YES in S36), the peak frequency Fa is taken as the pitch frequency F(T-2) of the current frame (T-2) of the audio signal (S37). The determination result "harmonicity present" and the pitch frequency F(T-2) are then output (S38), and the harmonicity determination process ends.

On the other hand, if the difference between Fa and F(T-3) is greater than the predetermined value (NO in S36), a determination result of "no harmonicity" is output (S39) and the processing ends.

The harmonicity of speech and musical sounds does not appear and vanish momentarily but persists over a plurality of frames. Therefore, harmonicity is determined to be present only when the current peak frequency substantially coincides with the pitch frequency of the previous frame, which prevents false detection.
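The decision logic of steps S34 to S39 can be sketched in Python as follows. The function name, the tolerance of 5 Hz standing in for the unspecified "predetermined value", and the treatment of a missing previous pitch as a mismatch are assumptions of this sketch, not details given in the text.

```python
def judge_harmonicity(peak_freq, prev_pitch, tolerance_hz=5.0):
    """Return (is_harmonic, pitch) for the current frame.

    peak_freq  : peak frequency Fa in Hz, or None if no peak was found
    prev_pitch : pitch frequency of the preceding frame in Hz, or None
    """
    if peak_freq is None:
        # NO in S34: no peak in the autocorrelation function.
        return (False, None)
    if prev_pitch is None:
        # Assumption: without a previous pitch there is nothing to match.
        return (False, None)
    if abs(peak_freq - prev_pitch) <= tolerance_hz:
        # YES in S36: Fa becomes the pitch of the current frame (S37, S38).
        return (True, peak_freq)
    # NO in S36: the peak did not persist across frames (S39).
    return (False, None)


stable = judge_harmonicity(202.0, 200.0)
jumped = judge_harmonicity(260.0, 200.0)
no_peak = judge_harmonicity(None, 200.0)
```

Requiring agreement with the previous frame means that an isolated spurious peak in a single frame is not reported as harmonic.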

FIG. 8 is a flowchart showing the speech/musical sound determination process. In this process, an accurate pitch frequency of the frame currently under determination is calculated, and whether the audio signal is a musical sound or speech is determined based on whether this pitch frequency matches a scale frequency or whether the pitch frequency is free of large fluctuation.

First, the instantaneous frequency is analyzed using the STFT frequency spectrum obtained in the process of FIG. 6 and the pitch frequency F(T-2), whose precision is limited to the STFT frequency resolution, and the accurate pitch frequency obtained from this analysis is denoted Fe(T-2) (S50). That is, instantaneous frequency analysis is not performed over the entire frequency band but only in the vicinity of the approximate pitch frequency F(T-2) obtained by the STFT. This greatly reduces the amount of processing required for the instantaneous frequency analysis.

The instantaneous frequency is obtained as the time derivative φ′ of the phase φ of the signal component waveform in each frequency bin of the STFT. Normally, the instantaneous frequency φ′ substantially coincides with the frequency of each bin and shows a linear relationship as in FIG. 9(A). However, it is known that when the signal waveform of the frame subjected to the STFT contains a frequency component Fe with strong power, the instantaneous frequency φ′ of the frequency bins near Fe takes a substantially constant value. In this case, the vertical-axis value at the point where the approximate pitch frequency F obtained by the STFT intersects the substantially horizontal portion of the above relationship curve can be estimated to be the accurate pitch frequency Fe. In this way, an accurate pitch frequency Fe(T-2) with a precision of 0.2 Hz can be obtained.
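One common way to approximate the phase derivative φ′ is to take the phase difference of the same bin between two frames separated by a short hop. The sketch below uses that approximation, which is an assumption here because the text does not say how φ′ is computed. The sample rate, frame size, hop, Hann window, test tone, and function names are likewise hypothetical.

```python
import cmath
import math


def bin_phase(signal, start, size, k):
    """Phase of DFT bin k for a Hann-windowed frame starting at `start`."""
    acc = 0j
    for n in range(size):
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / size)
        acc += window * signal[start + n] * cmath.exp(-2j * math.pi * k * n / size)
    return cmath.phase(acc)


def instantaneous_frequency(signal, sample_rate, size, hop, k):
    """Estimate the instantaneous frequency (Hz) seen in bin k."""
    bin_freq = k * sample_rate / size
    # Phase advance expected over one hop if the component sat exactly on the bin.
    expected = 2.0 * math.pi * bin_freq * hop / sample_rate
    delta = bin_phase(signal, hop, size, k) - bin_phase(signal, 0, size, k)
    # Wrap the deviation from the expected advance into [-pi, pi).
    deviation = (delta - expected + math.pi) % (2.0 * math.pi) - math.pi
    return bin_freq + deviation * sample_rate / (2.0 * math.pi * hop)


sample_rate = 8000.0  # hypothetical values: bin spacing is 31.25 Hz
size, hop = 256, 64
true_freq = 443.0
signal = [math.sin(2.0 * math.pi * true_freq * n / sample_rate)
          for n in range(size + hop)]

# Analyze only the bins around the coarse estimate (bin 14 = 437.5 Hz).
estimates = [instantaneous_frequency(signal, sample_rate, size, hop, k)
             for k in (13, 14, 15)]
```

In this toy case the three neighboring bins all report approximately the same frequency, close to 443 Hz, rather than their own bin frequencies, which corresponds to the flat portion of the curve described above. Only three bins are evaluated, in line with restricting the analysis to the vicinity of the approximate pitch.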

This accurate pitch frequency Fe(T-2) is compared with the scale frequencies (S51). In this step, the frequencies of all 12 semitones in the octave range in which musical sounds can exist are compared with Fe(T-2). If Fe(T-2) substantially matches one of them (YES in S52), the audio signal is regarded as a musical sound, a determination result of musical sound is output (S56), and the process ends. If Fe(T-2) does not match any scale frequency, the accurate pitch frequency Fe(T-3) obtained for the previous frame is compared with the current accurate pitch frequency Fe(T-2) (S53). If the two substantially match (YES in S54), the pitch frequency shows almost no fluctuation, so a determination result of musical sound is output (S56). If they do not match (NO in S54), the pitch fluctuates, so a determination result of speech is output (S55).

This is because a musical sound has a stable frequency, whereas speech has intonation and therefore relatively large pitch variation (fluctuation). If no accurate pitch Fe exists for the previous frame (because this process was not performed for that frame), S54 yields a mismatch. Note that not only human voices but also animal cries can be determined to be speech by this speech/musical sound determination process.
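The decision of steps S51 to S56 can be sketched as follows. The sketch assumes 12-tone equal temperament with A4 = 440 Hz as the scale and tolerances expressed in cents; the text specifies neither the tuning reference nor the thresholds, so these values and the function names are hypothetical.

```python
import math


def nearest_scale_frequency(freq, a4=440.0):
    """Nearest 12-tone equal-tempered pitch to freq (A4 = 440 Hz assumed)."""
    semitones = round(12.0 * math.log2(freq / a4))
    return a4 * 2.0 ** (semitones / 12.0)


def cents_between(f1, f2):
    """Absolute pitch distance between two frequencies in cents."""
    return abs(1200.0 * math.log2(f1 / f2))


def classify_speech_or_music(pitch, prev_pitch,
                             scale_tol_cents=10.0, drift_tol_cents=10.0):
    """Classify a harmonic frame as "music" or "speech".

    pitch      : accurate pitch Fe of the current frame in Hz
    prev_pitch : accurate pitch Fe of the previous frame in Hz, or None
    """
    # S51/S52: does the pitch sit on a scale frequency?
    if cents_between(pitch, nearest_scale_frequency(pitch)) <= scale_tol_cents:
        return "music"
    # S53/S54: is the pitch stable from frame to frame?
    # A missing previous pitch counts as a mismatch.
    if prev_pitch is not None and cents_between(pitch, prev_pitch) <= drift_tol_cents:
        return "music"
    return "speech"


on_scale = classify_speech_or_music(440.0, None)
off_scale_stable = classify_speech_or_music(452.0, 452.5)
off_scale_moving = classify_speech_or_music(452.0, 430.0)
```

The second example shows why the stability test matters: a pitch of 452 Hz lies almost half a semitone away from the nearest scale tone, yet a steady pitch is still classified as a musical sound.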

An audio signal that is not determined to be a musical sound by the musical sound determination process (S1 in FIG. 4, or the process of FIG. 5) but is determined to be a musical sound by this speech/musical sound determination process is, for example, a musical sound in which the scale tones carry little energy because it is a single-note performance such as a solo flute, or the sound of an instrument, such as a folk instrument, whose pitch does not match the Western 12-tone scale.

In this embodiment, the content determination unit 14 is provided for every channel and the content of every channel is determined. However, it is not always necessary to determine the content of all channels; the content of only some channels (for example, the center channel) may be determined. Likewise, it is not necessary to distinguish all of speech, musical sound, and other sounds; only some of these (for example, speech) may be determined.

《Additional Notes》
In the above embodiment, a sound field effect that adds early reflections and reverberation to the audio signal has been described, but the signal processing of the present invention is not limited to sound field effects.

In addition, the above embodiment has been described using a 5.1-channel multi-channel audio signal as an example, but the number of channels of the multi-channel audio signal is not limited to 5.1.

DESCRIPTION OF SYMBOLS
1 Audio amplifier
4 Signal processing unit
14 Content determination unit
15 Coefficient control unit
16 Coefficient multiplication unit
19 Sound field effect generation unit

Claims

A signal processing apparatus comprising:
a musical sound determination unit that determines whether an audio signal is a musical sound by comparing the energy of the scale frequency components of the audio signal with the energy of its entire-band components;
a harmonicity determination unit that, when the musical sound determination unit does not determine the audio signal to be a musical sound, determines whether the audio signal is a harmonic sound or another sound by determining whether the audio signal has harmonicity; and
a speech/musical sound determination unit that, when the harmonicity determination unit determines the audio signal to be a harmonic sound, determines whether the audio signal is speech or a musical sound based on whether the pitch frequency of the audio signal matches a scale frequency or on the presence or absence of fluctuation in the pitch frequency.

The signal processing apparatus according to claim 1, wherein the harmonicity determination unit determines the presence or absence of harmonicity based on the autocorrelation function of a frequency spectrum obtained by short-time Fourier transform, and
the speech/musical sound determination unit includes means for obtaining an accurate pitch frequency by performing instantaneous frequency analysis only in the vicinity of an approximate pitch frequency obtained based on the autocorrelation function.