JP2021135361A

JP2021135361A - Sound processing device, sound processing program and sound processing method

Info

Publication number: JP2021135361A
Application number: JP2020030596A
Authority: JP
Inventors: 尚也川畑; Naoya Kawabata
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2020-02-26
Filing date: 2020-02-26
Publication date: 2021-09-13

Abstract

To provide a sound processing device, a sound processing program, and a sound processing method that reduce the discomfort of a listener who hears a masker signal generated from a voice signal. SOLUTION: A sound processing device includes: means for dividing a microphone input signal, supplied from a microphone that picks up the voice spoken by a target speaker, into frames of a prescribed length; means for combining the frame-divided microphone input signals into long-time frames of a prescribed length; means for accumulating the generated long-time frame signals; means for carrying out frame signal selection processing that selects, from the accumulated past frame-divided microphone input signals, the signals to be used in generating a masker signal; means for restricting the frames that can be selected during the frame signal selection processing; and means for generating and outputting, from the selected signals, a masker signal that makes the voice spoken by the target speaker difficult to hear. SELECTED DRAWING: Figure 1

Description

The present invention relates to an acoustic processing device, an acoustic processing program, and an acoustic processing method, and can be applied, for example, to sound masking processing used as a technique for preventing the content of a conversation from leaking to third parties around the speaker.

In recent years, it has become a problem that, when a speaker talks with a conversation partner at a reception counter, service window, meeting space, or similar location in a facility used by an unspecified number of people (for example, a hospital, pharmacy, or bank), the content of the conversation leaks to surrounding third parties.

Preventing conversation content from leaking to third parties is called speech privacy, and the sound masking effect is used to achieve it.

The sound masking effect is a phenomenon in which, while a certain sound (hereinafter also called the "target sound") is audible, the presence of another sound with acoustic characteristics (for example, frequency characteristics, pitch, and formants) close to those of the target sound makes the target sound difficult to hear (masked). The sound that does the masking is generally called the "masker", and the sound that is masked (the target sound) is also called the "maskee".

Patent Document 1 proposes a sound masking device that uses this sound masking effect to prevent conversation content from leaking to third parties (that is, to protect speech privacy).

Japanese Unexamined Patent Publication No. 2006-243178

However, the voice processing method of Patent Document 1 extracts the spectral envelope of the microphone's input voice signal, deforms it to generate a deformed spectral envelope, and combines the result with the spectral fine structure to obtain the signal used for masker signal generation. Because the masker signal is generated by deforming the speaker's voice signal, it becomes an artificial sound and may be unpleasant to hear.

Furthermore, because the voice processing method of Patent Document 1 generates the masker signal by deforming the microphone's input voice signal, the words in the input voice signal and the words in the masker signal are similar, so a person who hears both the voice signal and the masker signal perceives an unpleasant echo-like sound.

In view of the above problems, an acoustic processing device, an acoustic processing program, and an acoustic processing method that reduce the discomfort of a listener who hears the generated masker signal are desired.

The acoustic processing device of the first aspect of the present invention comprises: (1) frame dividing means for dividing a microphone input signal, supplied from a microphone that picks up the voice spoken by a target speaker, into frames of a predetermined length; (2) long-time frame signal creating means for combining the microphone input signals frame-divided by the frame dividing means into a long-time frame of a predetermined length; (3) input signal storage means for accumulating the long-time frame signals generated by the long-time frame signal creating means; (4) frame signal selecting means for performing frame signal selection processing that selects, from the past frame-divided microphone input signals accumulated in the input signal storage means, a signal to be used for generating a masker signal; (5) frame selection limiting means for limiting the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and (6) masker signal generating means for generating and outputting, using the signal to be used for generating the masker signal, the masker signal that makes the voice spoken by the target speaker difficult to hear.

The acoustic processing program of the second aspect of the present invention causes a computer to function as: (1) frame dividing means for dividing a microphone input signal, supplied from a microphone that picks up the voice spoken by a target speaker, into frames of a predetermined length; (2) long-time frame signal creating means for converting the microphone input signals frame-divided by the frame dividing means into a time frame of a predetermined length; (3) input signal storage means for accumulating the long-time frame signals generated by the long-time frame signal creating means; (4) frame signal selecting means for performing frame signal selection processing that selects, from the past frame-divided microphone input signals accumulated in the input signal storage means, a signal to be used for generating a masker signal; (5) frame selection limiting means for limiting the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and (6) masker signal generating means for generating and outputting, using the signal to be used for generating the masker signal, the masker signal that makes the voice spoken by the target speaker difficult to hear.

The third aspect of the present invention is an acoustic processing method performed by an acoustic processing device, wherein: (1) the acoustic processing device has frame dividing means, long-time frame signal creating means, input signal storage means, frame selection limiting means, frame signal selecting means, and masker signal generating means; (2) the frame dividing means divides a microphone input signal, supplied from a microphone that picks up the voice spoken by a target speaker, into frames of a predetermined length; (3) the long-time frame signal creating means combines the microphone input signals frame-divided by the frame dividing means into a long-time frame of a predetermined length; (4) the input signal storage means accumulates the long-time frame signals generated by the long-time frame signal creating means; (5) the frame signal selecting means performs frame signal selection processing that selects, from the past frame-divided microphone input signals accumulated in the input signal storage means, a signal to be used for generating a masker signal; (6) the frame selection limiting means limits the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and (7) the masker signal generating means generates and outputs, using the signal to be used for generating the masker signal, the masker signal that makes the voice spoken by the target speaker difficult to hear.

According to the present invention, it is possible to provide an acoustic processing device, an acoustic processing program, and an acoustic processing method that reduce the discomfort of a listener who hears the voice signal and the generated masker signal.

A block diagram showing the functional configuration of the sound masking device according to the first embodiment.
A block diagram showing an example of the hardware configuration of the sound masking device according to the first embodiment.
An illustration of how the masker signal generated by the sound masking device according to the first embodiment is output.
A block diagram showing the functional configuration of the sound masking device according to the second embodiment.
A block diagram showing the functional configuration of the sound masking device according to the third embodiment.
A block diagram showing the functional configuration of the sound masking device according to the fourth embodiment.

(A) First Embodiment
Hereinafter, a first embodiment of the acoustic processing device, acoustic processing program, and acoustic processing method of the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the acoustic processing device, acoustic processing program, and acoustic processing method of the present invention are applied to a sound masking device.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the functional configuration of the sound masking device 100 according to the first embodiment.

The sound masking device 100 includes a microphone 101, a microphone amplifier 102, an AD converter 103, a DA converter 104, a speaker amplifier 105, a speaker 106, and a sound masking processing unit 200.

The microphone 101 converts air vibrations such as human voices and other sounds into an electric signal.

The microphone amplifier 102 amplifies the electric signal picked up by the microphone 101.

The AD converter 103 converts the electric signal (analog signal) amplified by the microphone amplifier 102 into a digital signal. Hereinafter, the digital signal output from the AD converter 103 is called the "microphone input signal".

The sound masking processing unit 200 generates a masker signal from the microphone input signal and outputs it.

The DA converter 104 converts the output signal (digital signal) from the sound masking processing unit 200 into an electric signal (analog signal).

The speaker amplifier 105 amplifies the electric signal output from the DA converter 104.

The speaker 106 converts the electric signal into air vibrations and outputs them as sound.

Next, the detailed configuration of the sound masking processing unit 200 will be described.

The sound masking processing unit 200 includes a frame dividing unit 201, a long-time frame signal creation unit 202, a DB (database) writing unit 203, an input signal DB 204, a frame signal DB 205, a frame selection limiting unit 206, a frame signal selection unit 207, a masker signal generation unit 208, a sound input terminal IN, and a sound output terminal OUT.

The sound input terminal IN is an interface (audio interface) through which the microphone input signal is input to the sound masking processing unit 200.

The frame dividing unit 201 divides the microphone input signal input to the sound masking processing unit 200 into frames (hereinafter called "divided frames") of a predetermined length (hereinafter denoted "frame length L1"). The frame length L1 is preferably a length generally suited to speech analysis; for example, L1 may be 100 to 200 msec. The frame dividing unit 201 then outputs the divided frame signals (hereinafter called "divided frame signals").

The long-time frame signal creation unit 202 combines the divided frame signals into a frame (hereinafter called a "long-time frame") of a predetermined length (hereinafter denoted "frame length L2") and outputs it. The frame length L2 (the length over which divided frames are combined, that is, the number of divided frames combined) is preferably long enough for the speech to be recognized as a word or sentence (a length at which a human listener can judge the speech to be a word or sentence). For example, L2 may be the length of three to five combined divided frames (for example, L2 = L1 × 3 to L2 = L1 × 5), may be a length in sound segment units (for example, one to two morae), or may be a length in time units (for example, any length in the range of 300 to 1000 msec). The long-time frame signal creation unit 202 then outputs the combined frame signal (hereinafter called the "long-time frame signal").

The DB writing unit 203 writes the long-time frame signal to the frame signal DB 205 of the input signal DB 204.

The input signal DB 204 is storage means that accumulates (holds) past long-time frame signals, one entry per long-time frame. The data format inside the input signal DB 204 is not limited, but here the input signal DB 204 is assumed to consist of the frame signal DB 205, which accumulates the past long-time frame signals.

The frame selection limiting unit 206 determines the number of frames to be excluded from selection (hereinafter called the "limit frame count") and outputs it.

The frame signal selection unit 207 selects, from the past long-time frame signals accumulated in the frame signal DB 205 of the input signal DB 204, a plurality of frames older than the limit frame count given by the frame selection limiting unit 206 as the signals to be used for generating the masker signal (hereinafter called "masker element signals"), and outputs the selected frames.
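The interplay between the frame selection limiting unit 206 and the frame signal selection unit 207 can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_masker_frames`, the oldest-to-newest indexing, the use of uniform random sampling, and the choice to return nothing while too few old frames exist are all hypothetical and are not taken from the text above, which does not fix the selection rule.

```python
import random


def select_masker_frames(num_stored, limit_frames, num_select, rng=None):
    """Pick indices of stored long-time frames to use as masker element signals.

    Indices run from 0 (oldest) to num_stored - 1 (newest). The newest
    `limit_frames` entries are excluded so that recently spoken audio is
    never reused. Random sampling is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    # Only frames older than the newest `limit_frames` entries are eligible.
    eligible = list(range(max(0, num_stored - limit_frames)))
    if len(eligible) < num_select:
        # Assumption: select nothing until enough old frames have accumulated.
        return []
    return rng.sample(eligible, num_select)


picks = select_masker_frames(num_stored=10, limit_frames=3, num_select=4,
                             rng=random.Random(0))
```

With 10 stored frames and a limit frame count of 3, only indices 0 to 6 can be chosen, so the three most recently stored long-time frames never appear in the masker.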

Based on the selection result of the frame signal selection unit 207, the masker signal generation unit 208 reads the selected masker element signals, a plurality of frames in all, from the frame signal DB 205 of the input signal DB 204, generates a masker signal from them, and outputs it.

The sound output terminal OUT is an interface (audio interface) that outputs the masker signal generated by the masker signal generation unit 208 to the DA converter 104.

The sound masking processing unit 200 may be implemented entirely in hardware (for example, built with a dedicated board or a DSP (Digital Signal Processor)) or in software running on a computer. For example, the sound masking processing unit 200 may be configured by installing programs (including the acoustic processing program according to the embodiment) on a computer having a memory and a processor.

In this embodiment, the AD converter 103 and the DA converter 104 are placed outside the sound masking processing unit 200, but they may instead be built into the sound masking processing unit 200.

FIG. 2 shows the configuration used when the sound masking processing unit 200 is implemented in software (on a computer).

The sound masking processing unit 200 shown in FIG. 2 is implemented in software on a computer 300. Programs (including the acoustic processing program of the embodiment) are installed on the computer 300. The computer 300 may be dedicated to the acoustic processing program or may be shared with programs providing other functions.

The computer 300 shown in FIG. 2 has a processor 301, a primary storage unit 302, a secondary storage unit 303, the sound input terminal IN, and the sound output terminal OUT. The sound input terminal IN and the sound output terminal OUT are the same elements as those shown in FIG. 1.

The primary storage unit 302 is storage means that serves as the working memory of the processor 301; for example, a fast memory such as DRAM (Dynamic Random Access Memory) is used.

The secondary storage unit 303 is storage means that records various data such as the OS (Operating System) and program data (including the data of the acoustic processing program according to the embodiment); for example, a non-volatile memory such as flash memory, an HDD (Hard Disk Drive), or an SSD (Solid State Drive) is used.

In the computer 300 of this embodiment, when the processor 301 starts up, it reads the OS and programs (including the acoustic processing program according to the embodiment) recorded in the secondary storage unit 303, loads them into the primary storage unit 302, and executes them. The specific configuration of the computer 300 is not limited to that of FIG. 2, and various configurations can be used. For example, if the primary storage unit 302 is a non-volatile memory, the secondary storage unit 303 may be omitted.

FIG. 3 shows examples of the positional relationship (the placement of the speaker 106) among the microphone 101, the speaker U1 who speaks into the microphone 101 (hereinafter called the "target speaker"), a person U2 other than the target speaker who is standing behind the target speaker U1 (the person from whom the voice of the target speaker U1 is to be masked; hereinafter called the "masking target person"), and the speaker 106.

In FIG. 3, the directivity of the direct sound DS (Direct Sound) output from the speaker 106 is drawn with a dotted line. In FIG. 3(a), the directivity of the reflected sound RS (Reflected Sound), produced when the direct sound is reflected by the floor FR (FLOOR), is drawn with a dash-dot line.

In FIG. 3(a), the speaker 106 is installed in front of the target speaker U1 at about knee height, with its diaphragm facing downward at an oblique angle to the surface of the floor FR, so that the direct sound DS is reflected by the floor FR and the reflected sound RS travels toward the masking target person U2 behind the target speaker U1. The masker signal radiated from the speaker 106 is output toward the surface of the floor FR and is reflected when it reaches the floor. The reflected masker signal thus reaches the masking target person U2 behind the target speaker U1. The direct sound of the voice uttered by the target speaker U1 also reaches the masking target person U2, but it is masked by the masker signal.

As described above, a wide variety of installation methods can be used for the speaker 106, as long as the speaker is placed so that the masker signal is not heard by the target speaker U1 and is heard by the masking target person U2. For example, as shown in FIG. 3(b), if there is space behind the target speaker U1, the speaker 106 may be installed there with its diaphragm pointed directly at the masking target person U2. As shown in FIG. 3(c), if there is space to embed the speaker 106 in the floor FR near the masking target person U2, the speaker 106 may be embedded in the floor FR with its diaphragm pointed directly at the masking target person U2 to output the masker signal. Likewise, as shown in FIG. 3, if there is space to install the speaker 106 in the ceiling CE (CEILING) near the masking target person U2, the speaker 106 may be installed in the ceiling CE with its diaphragm pointed directly at the masking target person U2 to output the masker signal.

(A-2) Operation of the First Embodiment
Next, the operation of the sound masking device 100 of the first embodiment configured as described above (the acoustic processing method of the embodiment) will be described.

When the sound masking device 100 starts operating and the target speaker U1 speaks into the microphone 101, the voice is input to the microphone 101.

The voice input to the microphone 101 is converted into an electric signal (analog signal), amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD converter 103, and input to the sound input terminal IN of the sound masking processing unit 200 as the microphone input signal x(n). In the microphone input signal x(n), n is the discrete time index of the input signal.

Once the microphone input signal x(n) begins to arrive at the sound input terminal IN of the sound masking processing unit 200, it is passed to the frame dividing unit 201.

The frame dividing unit 201 divides the microphone input signal x(n) into divided frames of frame length L1. For example, the frame dividing unit 201 divides the microphone input signal x(n) into divided frames according to equation (1).
x_fram(l; m) = x(l · L1 + m) ... (1)

In equation (1), x_fram(l; m) is the divided frame signal, l is the frame number, and m is the time within the divided frame (m = 0, 1, 2, ..., L1 − 1).
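Equation (1) can be illustrated with a minimal Python sketch. The function name `split_into_frames` and the decision to drop a trailing partial frame are assumptions made for this illustration only; the text does not say how an incomplete final frame is handled.

```python
def split_into_frames(x, frame_len):
    """Divide x(n) into divided frames so that
    frames[l][m] == x[l * frame_len + m], as in equation (1)."""
    # Assumption: samples that do not fill a whole frame are dropped.
    n_frames = len(x) // frame_len
    return [x[l * frame_len:(l + 1) * frame_len] for l in range(n_frames)]


x = list(range(10))                 # stand-in for the microphone input signal x(n)
frames = split_into_frames(x, 4)    # L1 = 4 samples, chosen only for readability
print(frames)                       # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real system L1 would be the number of samples corresponding to 100 to 200 msec at the sampling rate in use.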

The frame dividing unit 201 outputs the divided frame signal x_fram(l; m) to the long-time frame signal creation unit 202.

The long-time frame signal creation unit 202 combines the divided frame signals x_fram(l; m) into a frame of length L2. The specific method by which the long-time frame signal creation unit 202 creates a long-time frame is not limited, and various methods can be used. For example, when combining divided frame signals in units of divided frames, the long-time frame signal creation unit 202 may create the long-time frame signal x_fram_long(s) according to equation (2).
x_fram_long(i · L1 + m) = x_fram(l − ((I − 1) − i); m) ... (2)

In equation (2), i is an index (i = 0, 1, 2, ..., I − 1) and I is the number of divided frames used for the long-time frame signal (hereinafter called the "number of frames used"), with I = L2 / L1.
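As a hedged illustration of equation (2), the sketch below concatenates the I most recent divided frames ending at frame l. The function name `build_long_frame`, the parameter name `num_used` (standing in for I), and the error raised when too few frames exist are assumptions for this example.

```python
def build_long_frame(frames, l, num_used):
    """Build x_fram_long from the num_used (= I) most recent divided frames
    ending at frame l, so that
    long_frame[i * L1 + m] == frames[l - ((num_used - 1) - i)][m]  (equation (2))."""
    if l < num_used - 1:
        # Assumption: refuse to build until enough past frames are available.
        raise ValueError("not enough past divided frames")
    long_frame = []
    for i in range(num_used):
        long_frame.extend(frames[l - ((num_used - 1) - i)])
    return long_frame


divided = [[0, 1], [2, 3], [4, 5], [6, 7]]        # L1 = 2
long_frame = build_long_frame(divided, l=3, num_used=3)
print(long_frame)                                  # [2, 3, 4, 5, 6, 7]
```

Here L2 = I · L1 = 6, and the oldest of the three combined frames lands at the start of the long-time frame.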

Alternatively, the long-time frame signal x_fram_long(s) may be created according to equations (3) and (4).
x_fram_long(s) = x_fram_long(s + L1) ... (3)
x_fram_long(L2 − L1 + m) = x_fram(l; m) ... (4)

In equation (3), s is the time within the long-time frame (s = 0, 1, 2, ..., L2 − L1 − 1). Equation (3) shifts the long-time frame signal x_fram_long(s) forward by the divided frame length L1, and equation (4) stores the divided frame signal x_fram(l; m) at the end of the long-time frame signal x_fram_long(s). The long-time frame signal creation unit 202 may also combine the divided frame signals x_fram(l; m) in time units.
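The shift-and-append update of equations (3) and (4) can be sketched as below. The function name `push_divided_frame` and the zero-initialized buffer are assumptions for illustration; the update itself follows the two equations directly.

```python
def push_divided_frame(long_frame, divided_frame):
    """Update the long-time frame buffer in place with one new divided frame."""
    L1 = len(divided_frame)
    L2 = len(long_frame)
    # Equation (3): shift the existing contents forward by L1 samples.
    # Ascending s is safe in place because each read index s + L1 is ahead of s.
    for s in range(L2 - L1):
        long_frame[s] = long_frame[s + L1]
    # Equation (4): store the newest divided frame in the last L1 samples.
    for m in range(L1):
        long_frame[L2 - L1 + m] = divided_frame[m]
    return long_frame


buf = [0] * 6                       # L2 = 6; zero start-up contents are an assumption
for frame in ([1, 2], [3, 4], [5, 6], [7, 8]):   # L1 = 2
    push_divided_frame(buf, frame)
print(buf)                          # [3, 4, 5, 6, 7, 8]
```

Unlike equation (2), this form produces an updated long-time frame after every divided frame, so consecutive long-time frames overlap by L2 − L1 samples.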

The long-time frame signal creation unit 202 then outputs the long-time frame signal x_fram_long(s) to the DB writing unit 203.

The DB writing unit 203 writes the long-time frame signal x_fram_long(s) to the frame signal DB 205 of the input signal DB 204. For example, the DB writing unit 203 writes the long-time frame signal x_fram_long(s) to DB_singal(j; t) of the frame signal DB 205 in the input signal DB 204 according to equations (5) and (6).

In equation (5), t is the time index within the long-time frame (t = 0, 1, 2, ..., L2−1), j is an index that is incremented each time a long-time frame signal is written into the frame signal DB 205 of the input signal DB 204 (j = 0, 1, 2, ..., DB_LEN−1; primary key; identifier of the long-time frame), and DB_LEN is the database length. As shown in equations (5) and (6), the long-time frame signal x_fram_long(s) is written into DB_singal(j; t) of the frame signal DB 205.
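Equations (5) and (6) are not reproduced in this text, so the following is only a sketch of one way the described write could behave. It assumes that one long-time frame is stored per row and that the index j wraps around to 0 after DB_LEN−1 (a ring buffer). The class name `FrameSignalDB`, the attribute names, and the wrap-around rule are all assumptions.

```python
from typing import List, Sequence


class FrameSignalDB:
    """Hypothetical fixed-length store holding one long-time frame per row."""

    def __init__(self, db_len: int, frame_len: int) -> None:
        self.db_len = db_len                              # DB_LEN
        self.frame_len = frame_len                        # L2
        self.rows: List[List[float]] = [[0.0] * frame_len for _ in range(db_len)]
        self.j = 0                                        # next row to write
        self.count = 0                                    # rows written so far (capped at DB_LEN)

    def write(self, x_fram_long: Sequence[float]) -> None:
        if len(x_fram_long) != self.frame_len:
            raise ValueError("long-time frame has the wrong length")
        self.rows[self.j] = list(x_fram_long)             # DB_singal(j; t) = x_fram_long(t)
        self.j = (self.j + 1) % self.db_len               # assumed wrap-around increment of j
        self.count = min(self.count + 1, self.db_len)


# Hypothetical usage: DB_LEN = 3, L2 = 2, four writes so the oldest row is overwritten.
db = FrameSignalDB(db_len=3, frame_len=2)
for k in range(4):
    db.write([float(k), float(k)])
```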

入力信号ＤＢ２０４は、過去の各長時間フレーム信号を蓄積（保持）する記憶手段である。 The input signal DB 204 is a storage means for accumulating (holding) each past long-time frame signal.

As described above, the input signal DB 204 of this embodiment includes the frame signal DB 205. Each long-time frame signal x_fram_long(s) is accumulated in the frame signal DB 205.

The frame selection limiting unit 206 determines the number of limited frames. The specific method by which the frame selection limiting unit 206 limits the frame signals to be selected is not limited, and various methods can be applied. For example, the frame selection limiting unit 206 determines, according to equation (7), the number of limited frames that prevents the frame signal selection unit 207, described later, from selecting long-time frame signals that have only just been accumulated in the input signal DB 204.
Limit_Fream_NUM = a × DB_LEN … (7)

In equation (7), Limit_Fream_NUM is the number of limited frames (Limit_Fream_NUM < DB_LEN), and a is the ratio to the database length DB_LEN, taking a value of 0.1 or more and 0.5 or less. In equation (7), a value of a close to 0.1 (for example, a = 0.1) is desirable when a small number of limited frames is wanted, and a value of a close to 0.5 (for example, a = 0.5) is desirable when a large number of limited frames is wanted.

The method by which the frame selection limiting unit 206 determines the number of limited frames is not limited, and various methods can be applied. For example, the frame selection limiting unit 206 may set Limit_Fream_NUM to a fixed value Fream_NUM_CONST whose length is such that the frame signal and pitch information that have only just been accumulated (the frame signal and pitch information within the most recent predetermined time) are not selected by the frame signal selection unit 207 (for example, Limit_Fream_NUM = 10). The fixed value may also be set from a fixed time value TIME_CONST (for example, Fream_NUM_CONST = (fs・TIME_CONST) / L2). Alternatively, a suitable value determined in advance by experiment or the like may be set as Limit_Fream_NUM, namely a value at which the masker sound based on the masker signal does not sound like an echo (does not become an unpleasant sound) at the position of the masking target person U2 while its masking effect is maintained (for example, a suitable value may be set as Limit_Fream_NUM after the sound masking device 100 and the speaker 106 have been installed in the actual environment).
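The two numerical rules for the number of limited frames, equation (7) and the TIME_CONST-based rule, can be sketched as follows. The function names are hypothetical. Truncation to an integer is an assumption, since the text does not state how a fractional result is rounded, and TIME_CONST is assumed to be in seconds with fs as the sampling frequency in Hz.

```python
def limit_from_ratio(a: float, db_len: int) -> int:
    """Equation (7): Limit_Fream_NUM = a x DB_LEN, with 0.1 <= a <= 0.5."""
    if not 0.1 <= a <= 0.5:
        raise ValueError("a is expected to lie between 0.1 and 0.5")
    return int(a * db_len)              # truncation is an assumption


def limit_from_time(fs: int, time_const: float, L2: int) -> int:
    """Fream_NUM_CONST = (fs * TIME_CONST) / L2, i.e. the number of long-time frames per TIME_CONST."""
    return int((fs * time_const) / L2)  # truncation is an assumption


# Hypothetical values: DB_LEN = 200, fs = 16000 Hz, TIME_CONST = 0.5 s, L2 = 800 samples.
limit_a = limit_from_ratio(0.5, 200)
limit_t = limit_from_time(16000, 0.5, 800)
```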

そして、フレーム選択制限部２０６は、制限フレーム数Ｌｉｍｉｔ＿Ｆｒｅａｍ＿ＮＵＭをフレーム信号選択部２０７に出力する。 Then, the frame selection limiting unit 206 outputs the limited number of frames Limit_Fream_NUM to the frame signal selection unit 207.

The frame signal selection unit 207 selects, as masker element signals, a plurality of the past long-time frame signals accumulated in the frame signal DB 205 of the input signal DB 204, from the frames before the number of limited frames Limit_Fream_NUM determined by the frame selection limiting unit 206. The specific method by which the frame signal selection unit 207 selects the masker element signals is not limited, and various methods can be applied. The frame signal selection unit 207 selects frames according to, for example, equation (8).

In equation (8), T(p) is a selected frame, p (p = 0, 1, ..., SEL_NUM−1) is the index of the selected frame T(p), SEL_NUM (SEL_NUM <= DB_LEN−1) is the number of masker element signals to be selected, and j is the database index of equation (6). Equation (8) selects the long-time frame signals held in the frame signal DB 205 of the input signal DB 204 from the frames before the number of limited frames Limit_Fream_NUM, newest first, and assigns to T(p) the database index number at which each selected long-time frame signal is held.

また、マスカー素片信号の選択手法は、例えば、マスカー素片信号を制限フレーム数Ｌｉｍｉｔ＿Ｆｒｅａｍ＿ＮＵＭより前のフレームからランダムに選択しても良い。 Further, as a method for selecting the masker element signal, for example, the masker element signal may be randomly selected from the frames before the limited number of frames Limit_Fream_NUM.
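Equation (8) is not reproduced in this text, so the sketch below only illustrates the two selection strategies as described: newest first, and random. It assumes that the database is full, that `j_latest` is the index of the most recently written row, and that indices wrap modulo DB_LEN. The function names and these conventions are assumptions.

```python
import random
from typing import List


def select_newest_first(j_latest: int, db_len: int, limit: int, sel_num: int) -> List[int]:
    """Skip the `limit` newest rows, then take `sel_num` rows from newest to oldest."""
    if limit + sel_num > db_len:
        raise ValueError("not enough selectable frames")
    return [(j_latest - limit - p) % db_len for p in range(sel_num)]


def select_random(j_latest: int, db_len: int, limit: int, sel_num: int,
                  rng: random.Random) -> List[int]:
    """Draw `sel_num` distinct rows at random from all rows older than the limited frames."""
    candidates = [(j_latest - limit - k) % db_len for k in range(db_len - limit)]
    if sel_num > len(candidates):
        raise ValueError("not enough selectable frames")
    return rng.sample(candidates, sel_num)


# Hypothetical values: DB_LEN = 10, newest row 9, Limit_Fream_NUM = 3, SEL_NUM = 4.
T = select_newest_first(9, 10, 3, 4)
T_random = select_random(9, 10, 3, 4, random.Random(0))
```

In both variants the three newest rows (9, 8, and 7 in this example) are never returned, which is the property the number of limited frames is meant to guarantee.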

以上のように、フレーム信号選択部２０７は、フレーム信号ＤＢ２０５に保持されている長時間フレーム信号から複数フレーム選択し、選択したフレームＴ（ｐ）をマスカー信号生成部２０８に出力する。 As described above, the frame signal selection unit 207 selects a plurality of frames from the long-time frame signals held in the frame signal DB 205, and outputs the selected frame T (p) to the masker signal generation unit 208.

The masker signal generation unit 208 reads a plurality of frames of masker element signals from the frame signal DB 205 of the input signal DB 204 based on the frames T(p) selected by the frame signal selection unit 207, and generates and outputs a masker signal. The specific method by which the masker signal generation unit 208 generates the masker signal is not limited, and various methods can be applied. The masker signal generation unit 208 generates the masker signal h(l; t) according to, for example, equation (9).

Equation (9) reads the plurality of masker element signals selected by the frame signal selection unit 207 from the frame signal DB 205 of the input signal DB 204 and superimposes the read masker element signals to generate the masker signal h(l; t).
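The superposition described for equation (9) can be sketched as a sample-wise sum of the selected rows. Equation (9) itself is not reproduced in this text, so the plain unweighted sum, the function name `generate_masker`, and the row-list representation of the database are assumptions.

```python
from typing import List, Sequence


def generate_masker(db_rows: Sequence[Sequence[float]], T: Sequence[int]) -> List[float]:
    """Superimpose the long-time frames stored at the indices T(p) into one masker frame."""
    L2 = len(db_rows[0])
    h = [0.0] * L2
    for idx in T:                       # one masker element signal per selected index
        for t in range(L2):
            h[t] += db_rows[idx][t]     # unweighted sum (assumption)
    return h


# Hypothetical database with three rows of length L2 = 2.
rows = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
h = generate_masker(rows, T=[0, 2])
```

A practical system would probably also scale the sum to avoid clipping; the text does not say whether it does.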

そして、マスカー信号生成部２０８は、（１０）式に従い、マスカー信号ｈ（ｌ；ｔ）をオーバーラップ加算処理して出力信号ｙ（ｎ）とし、音出力端子ＯＵＴから出力する。

Then, the masker signal generation unit 208 performs overlap addition processing of the masker signal h (l; t) to obtain an output signal y (n) according to the equation (10), and outputs the output signal y (n) from the sound output terminal OUT.
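Equation (10) is not reproduced in this text. The following is a generic overlap-add sketch and not necessarily the patent's formula: it assumes that one masker frame of length L2 is produced every L1 samples, so that consecutive frames overlap by L2 − L1 samples, and that no window is applied. The class name `OverlapAdd` is hypothetical.

```python
from typing import List, Sequence


class OverlapAdd:
    """Generic overlap-add: add each frame into an accumulator and emit L1 samples per call."""

    def __init__(self, L1: int, L2: int) -> None:
        self.L1 = L1
        self.acc: List[float] = [0.0] * L2

    def process(self, h: Sequence[float]) -> List[float]:
        for t, value in enumerate(h):
            self.acc[t] += value                              # add the new frame onto the overlapping tail
        out = self.acc[:self.L1]                              # the first L1 samples are complete
        self.acc = self.acc[self.L1:] + [0.0] * self.L1       # advance the accumulator by one hop
        return out


# Hypothetical values: hop L1 = 2, frame length L2 = 4.
ola = OverlapAdd(L1=2, L2=4)
y_first = ola.process([1.0, 1.0, 1.0, 1.0])
y_second = ola.process([1.0, 1.0, 1.0, 1.0])
```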

サウンドマスキング処理部２００の音出力端子ＯＵＴから出力されるマスカー信号は、ＤＡ変換器１０４でデジタル信号からアナログ信号に変換され、スピーカアンプ１０５で増幅されてからスピーカ１０６から出力される。 The masker signal output from the sound output terminal OUT of the sound masking processing unit 200 is converted from a digital signal to an analog signal by the DA converter 104, amplified by the speaker amplifier 105, and then output from the speaker 106.

(A-3) Effects of the First Embodiment
According to the first embodiment, the following effects can be obtained.

In the sound masking device 100 of the first embodiment, a plurality of frames are selected as masker element signals from the past long-time frame signals accumulated in the frame signal DB of the input signal DB, from the frames before the number of limited frames, and a masker signal (masking sound) is generated. As a result, new long-time frame signals are no longer selected as masker element signals, so the masker signal does not sound like an echo, and the generated masker signal becomes more comfortable to listen to (the discomfort for the masking target person U2 is reduced).

(B) Second Embodiment
Hereinafter, a second embodiment of the sound processing device, the sound processing program, and the sound processing method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device.

(B-1) Configuration of the Second Embodiment
FIG. 4 is a block diagram showing the functional configuration of the sound masking device 100A according to the second embodiment. In FIG. 4, parts identical or corresponding to those in FIG. 1 are denoted by the same or corresponding reference numerals.

以下では、第２の実施形態について、第１の実施形態との差異を中心に説明し、第１の実施形態と重複する部分については説明を省略する。 In the following, the second embodiment will be mainly described with respect to the difference from the first embodiment, and the description of the part overlapping with the first embodiment will be omitted.

第２の実施形態のサウンドマスキング装置１００Ａでは、サウンドマスキング処理部２００がサウンドマスキング処理部２００Ａに置き換わっている点で、第１の実施形態と異なっている。 The sound masking device 100A of the second embodiment is different from the first embodiment in that the sound masking processing unit 200 is replaced with the sound masking processing unit 200A.

The sound masking processing unit 200A differs from the first embodiment in that a voice section determination unit 209, a DB accumulation determination unit 210, and a masker signal generation determination unit 211 are added, and in that the DB writing unit 203, the frame signal selection unit 207, and the masker signal generation unit 208 are replaced by a DB writing unit 203A, a frame signal selection unit 207A, and a masker signal generation unit 208A.

The sound masking device 100A of the second embodiment differs from the sound masking device 100 of the first embodiment in the following points: the method of accumulating long-time frame signals in the frame signal DB 205 of the input signal DB 204 differs because the voice section determination unit 209 and the DB accumulation determination unit 210 are added and the DB writing unit 203 is replaced by the DB writing unit 203A; the method of generating the masker signal differs because the masker signal generation determination unit 211 is added; the method of selecting masker element signals differs because the frame signal selection unit 207 is replaced by the frame signal selection unit 207A; and the method of generating the masker signal differs because the masker signal generation unit 208 is replaced by the masker signal generation unit 208A. The detailed configuration of the sound masking processing unit 200A is described below.

サウンドマスキング処理部２００Ａは、フレーム分割部２０１、音声区間判定部２０９、ＤＢ蓄積判定部２１０、長時間フレーム信号作成部２０２、ＤＢ書込み部２０３Ａ、入力信号ＤＢ２０４、フレーム信号ＤＢ２０５、フレーム選択制限部２０６、フレーム信号選択部２０７Ａ、マスカー信号生成部２０８Ａ、音入力端子ＩＮ、及び音出力端子ＯＵＴを有する。 The sound masking processing unit 200A includes a frame division unit 201, a voice section determination unit 209, a DB accumulation determination unit 210, a long-time frame signal creation unit 202, a DB writing unit 203A, an input signal DB 204, a frame signal DB 205, and a frame selection restriction unit 206. , A frame signal selection unit 207A, a masker signal generation unit 208A, a sound input terminal IN, and a sound output terminal OUT.

音声区間判定部２０９は、分割フレーム信号が音声区間か非音声区間（音声区間以外の区間）かを判定し、判定結果を出力する。 The voice section determination unit 209 determines whether the divided frame signal is a voice section or a non-voice section (a section other than the voice section), and outputs a determination result.

The DB accumulation determination unit 210 determines whether or not to accumulate the long-time frame signal in the DB based on the determination result of the voice section determination unit 209, and outputs the determination result.

ＤＢ書込み部２０３Ａは、ＤＢ蓄積判定部２１０の判定結果を基に、長時間フレーム信号を入力信号ＤＢ２０４のフレーム信号ＤＢ２０５に書込む。 The DB writing unit 203A writes the long-time frame signal into the frame signal DB 205 of the input signal DB 204 based on the determination result of the DB accumulation determination unit 210.

マスカー信号生成判定部２１１は、音声区間判定部２０９の判定結果を基に、マスカー信号を生成するか否かを判定し、判定結果を出力する。 The masker signal generation determination unit 211 determines whether or not to generate a masker signal based on the determination result of the voice section determination unit 209, and outputs the determination result.

Based on the determination result of the masker signal generation determination unit 211, the frame signal selection unit 207A selects, as masker element signals, a plurality of the past long-time frame signals accumulated in the input signal DB 204, from the frames before the number of limited frames determined by the frame selection limiting unit 206, and outputs the selected frames.

In the second embodiment, the masker signal generation determination unit 211 may be omitted, as in the first embodiment.

(B-2) Operation of the Second Embodiment
Next, the operation of the sound masking device 100A in the second embodiment having the above configuration (the sound processing method according to the embodiment) will be described in detail.

第２の実施形態に係るサウンドマスキング装置１００Ａにおけるサウンドマスキング処理の基本的な動作は、第１の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100A according to the second embodiment is the same as the sound masking process described in the first embodiment.

The following description focuses on the operations of the voice section determination unit 209, the DB accumulation determination unit 210, the DB writing unit 203A, the masker signal generation determination unit 211, and the frame signal selection unit 207A, which differ from the first embodiment.

フレーム分割部２０１は、マイク入力信号ｘ（ｎ）を処理フレームごとに分割し、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）を音声区間判定部２０９、長時間フレーム信号作成部２０２に出力する。 The frame division unit 201 divides the microphone input signal x (n) into processing frames, and outputs the divided frame signal x_fram (l; m) to the voice section determination unit 209 and the long-time frame signal creation unit 202.

The voice section determination unit 209 uses the divided frame signal x_fram(l; m) to determine whether the frame is a voice section or a non-voice section. The specific method by which the voice section determination unit 209 makes this determination is not limited, and various methods can be applied. For example, the voice section determination unit 209 may make the determination according to equations (11) and (12).

In equations (11) and (12), x_fram_amp(l) is the average amplitude value of the divided frame signal, VAD(l) is the voice section determination result, and TH is the threshold used for the voice section determination. Equation (11) obtains the average amplitude value x_fram_amp(l) of the divided frame signal x_fram(l; m). Equation (12) determines that the frame is a voice section and assigns 1 to the voice section determination result VAD(l) if the average amplitude value x_fram_amp(l) obtained by equation (11) is larger than the threshold TH, and determines that the frame is a non-voice section and assigns 0 to VAD(l) if it is smaller than the threshold TH.
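The amplitude-based decision described for equations (11) and (12) can be sketched as follows. The equations themselves are not reproduced in this text, so taking the mean of absolute sample values as the "average amplitude" and treating a value exactly equal to TH as non-voice are assumptions, as are the function names.

```python
from typing import Sequence


def average_amplitude(x_fram: Sequence[float]) -> float:
    """x_fram_amp(l): mean absolute amplitude of one divided frame (assumed form of equation (11))."""
    return sum(abs(v) for v in x_fram) / len(x_fram)


def voice_activity(x_fram: Sequence[float], TH: float) -> int:
    """VAD(l): 1 for a voice section, 0 for a non-voice section (equation (12))."""
    return 1 if average_amplitude(x_fram) > TH else 0


# Hypothetical frames and threshold.
vad_loud = voice_activity([0.5, -0.5], TH=0.1)
vad_quiet = voice_activity([0.0, 0.01], TH=0.1)
```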

The threshold TH in equation (12) only needs to allow the presence or absence of voice to be determined, and various methods can be widely applied. For example, as shown in equation (13), the frames of a predetermined length (hereinafter, "frame length L3") from the time the sound masking device 100A starts operating (hereinafter, "initial frames") may be regarded as a silent section, and the average amplitude value of the initial frames may be used as a fixed threshold TH. Alternatively, according to equation (14), a threshold TH(l) that varies for each divided frame may be used, obtained by applying a time constant filter to the average amplitude value x_fram_amp(l) of the divided frame signal x_fram(l; m).

（１４）式で、ｂは時定数フィルタの係数であり、０以上、１以下の値となる。（１４）式において、閾値の更新を遅くしたい場合、ｂは１に近い値が望ましく（例えばｂ＝０．９等の値）、閾値の更新を速くしたい場合、ｂは０に近い値が望ましい（例えばｂ＝０．１等の値）。 In equation (14), b is the coefficient of the time constant filter, which is a value of 0 or more and 1 or less. In equation (14), when it is desired to delay the update of the threshold value, b is preferably a value close to 1 (for example, a value such as b = 0.9), and when it is desired to speed up the update of the threshold value, b is preferably a value close to 0. (For example, a value such as b = 0.1).
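Equations (13) and (14) are not reproduced in this text. The sketch below assumes that the fixed threshold is the mean absolute amplitude over the initial frames and that the time constant filter is a first-order recursive average, TH(l) = b・TH(l−1) + (1 − b)・x_fram_amp(l). That form is consistent with the statement that b close to 1 slows the update, but it remains an assumption, as do the function names.

```python
from typing import Sequence


def initial_threshold(initial_frames: Sequence[Sequence[float]]) -> float:
    """Fixed TH: mean absolute amplitude over the initial frames treated as silence (assumed form of (13))."""
    samples = [abs(v) for frame in initial_frames for v in frame]
    return sum(samples) / len(samples)


def update_threshold(prev_th: float, x_fram_amp: float, b: float) -> float:
    """Adaptive TH(l) from a first-order recursive average (assumed form of (14))."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b is expected to lie between 0 and 1")
    return b * prev_th + (1.0 - b) * x_fram_amp


# Hypothetical values: a previous threshold of 1.0 and a silent frame.
th_fixed = initial_threshold([[0.5, -0.5], [1.0, -1.0]])
th_slow = update_threshold(1.0, 0.0, b=0.9)    # stays close to the previous threshold
th_fast = update_threshold(1.0, 0.0, b=0.1)    # moves quickly toward the new amplitude
```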

なお、音声区間判定部２０９において、音声区間か非音声区間かの判定の手段は、種々の方法を広く適用することができ、例えば、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）の自己相関を求めて音声区間か非音声区間か求める等の方法で判定しても良い。 In the voice section determination unit 209, various methods can be widely applied as a means for determining whether the voice section or the non-voice section is used. For example, the autocorrelation of the divided frame signal x_fram (l; m) is obtained. It may be determined by a method such as determining whether it is a voice section or a non-voice section.

そして、音声区間判定部２０９は、音声区間判定結果ＶＡＤ（ｌ）をＤＢ蓄積判定部２１０とマスカー信号生成判定部２１１に出力する。 Then, the voice section determination unit 209 outputs the voice section determination result VAD (l) to the DB accumulation determination unit 210 and the masker signal generation determination unit 211.

The DB accumulation determination unit 210 determines whether or not to accumulate the divided frame signal x_fram(l; m) in the frame signal DB 205 of the input signal DB 204 based on the voice section determination result VAD(l) of the voice section determination unit 209. The DB accumulation determination unit 210 makes this determination according to, for example, equation (15).

In equation (15), DB_flag(l) is the result of determining whether or not to accumulate. Equation (15) determines that the signal is to be accumulated in the DB and assigns 1 to the determination result DB_flag(l) when the voice section determination result VAD(l) is 1, and determines that the signal is not to be accumulated in the DB and assigns 0 to DB_flag(l) when VAD(l) is 0.

そして、ＤＢ蓄積判定部２１０は、ＤＢに蓄積するか否かの判定結果ＤＢ＿ｆｌａｇ（ｌ）をＤＢ書込み部２０３Ａに出力する。 Then, the DB accumulation determination unit 210 outputs the determination result DB_flag (l) as to whether or not to accumulate in the DB to the DB writing unit 203A.

Only when the determination result DB_flag(l) of the DB accumulation determination unit 210 is 1 does the DB writing unit 203A write the long-time frame signal x_fram_long(s) into the frame signal DB 205 of the input signal DB 204, for example according to equations (5) and (6). When the determination result DB_flag(l) of the DB accumulation determination unit 210 is 0, the long-time frame signal x_fram_long(s) is not written into the frame signal DB 205 of the input signal DB 204.
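The gated write described above can be sketched as follows. Equation (15) is not reproduced in this text; the sketch follows its verbal description (DB_flag(l) equals VAD(l)) and uses a plain Python list as a stand-in for the frame signal DB 205. The function names are hypothetical.

```python
from typing import List, Sequence


def db_flag(vad: int) -> int:
    """DB_flag(l): accumulate (1) only when the frame was judged to be a voice section."""
    return 1 if vad == 1 else 0


def write_if_voice(db: List[List[float]], x_fram_long: Sequence[float], vad: int) -> bool:
    """Append the long-time frame to the stand-in DB only when DB_flag(l) is 1."""
    if db_flag(vad) == 1:
        db.append(list(x_fram_long))
        return True
    return False


# Hypothetical usage: one voice frame and one non-voice frame.
frame_db: List[List[float]] = []
wrote_voice = write_if_voice(frame_db, [0.3, -0.2], vad=1)
wrote_silence = write_if_voice(frame_db, [0.0, 0.0], vad=0)
```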

マスカー信号生成判定部２１１は、音声区間判定部２０９の音声区間判定結果ＶＡＤ（ｌ）を基に、マスカー信号を生成するか否かを判定する。判定手段は、例えば、（１６）式に従い、マスカー信号を生成するか否かを判定する。

The masker signal generation determination unit 211 determines whether or not to generate a masker signal based on the voice section determination result VAD (l) of the voice section determination unit 209. The determination means determines, for example, whether or not to generate a masker signal according to the equation (16).

（１６）式で、ｍａｓｋ＿ｆｌａｇ（ｌ）はマスカー信号を生成するか否かの判定結果である。 In equation (16), mask_flag (l) is a determination result of whether or not to generate a masker signal.

Equation (16) determines that the masker signal is to be generated and assigns 1 to the determination result mask_flag(l) when the voice section determination result VAD(l) is 1, and determines that the masker signal is not to be generated and assigns 0 to mask_flag(l) when VAD(l) is 0.

そして、マスカー信号生成判定部２１１は、マスカー信号を生成するか否かの判定結果ｍａｓｋ＿ｆｌａｇ（ｌ）をフレーム信号選択部２０７Ａに出力する。 Then, the masker signal generation determination unit 211 outputs the determination result mask_flag (l) as to whether or not to generate the masker signal to the frame signal selection unit 207A.

Only when the determination result mask_flag(l) output from the masker signal generation determination unit 211, indicating whether or not to generate the masker signal, is 1 does the frame signal selection unit 207A select frames, for example according to equations (7) and (8). When the determination result mask_flag(l) is 0, the frame signal selection unit 207A does not select frames.

As described above, only when the determination result mask_flag(l) indicating whether or not to generate the masker signal is 1 does the frame signal selection unit 207A select a plurality of frames from the past long-time frame signals held in the frame signal DB 205 and output the selected frames T(p) to the masker signal generation unit 208A.

Based on the determination result mask_flag(l) of the masker signal generation determination unit 211 and the frames T(p) selected by the frame signal selection unit 207A, the masker signal generation unit 208A reads a plurality of frames of past long-time frame signals from the frame signal DB 205 of the input signal DB 204 as masker element signals, and generates and outputs a masker signal. The masker signal generation unit 208A generates the masker signal ha(l; t) according to, for example, equation (17).

（１７）式は、マスカー信号生成判定部２１１の判定結果ｍａｓｋ＿ｆｌａｇ（ｌ）が１のときのみ、マスカー信号ｈ（ｌ；ｓ）を生成し、ｈａ（ｌ；ｔ）に代入し、マスカー信号を生成するか否かの判定結果ｍａｓｋ＿ｆｌａｇ（ｌ）が０のときは、ｈａ（ｌ；ｔ）に０（無音）を代入するという式である。 In the equation (17), the masker signal h (l; s) is generated and substituted into ha (l; t) only when the determination result mask_flag (l) of the masker signal generation determination unit 211 is 1, and the masker signal is used. When the determination result mask_flag (l) of whether or not to generate is 0, 0 (silence) is substituted for ha (l; t).
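The gating described for equation (17) can be sketched as below. The equation itself is not reproduced in this text; the sketch follows its verbal description, passing the generated masker frame through when mask_flag(l) is 1 and outputting silence otherwise. The function name `gate_masker` is hypothetical.

```python
from typing import List, Sequence


def gate_masker(h: Sequence[float], mask_flag: int) -> List[float]:
    """ha(l; t): the masker frame when mask_flag(l) is 1, otherwise zeros (silence)."""
    if mask_flag == 1:
        return list(h)
    return [0.0] * len(h)


# Hypothetical masker frame.
ha_on = gate_masker([0.2, -0.1, 0.4], mask_flag=1)
ha_off = gate_masker([0.2, -0.1, 0.4], mask_flag=0)
```

Because the silent frames still pass through the overlap-add stage, the output length per frame is the same whether or not a masker is generated.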

以上のように、マスカー信号生成部２０８Ａは、マスカー信号を生成するか否かの判定結果ｍａｓｋ＿ｆｌａｇ（ｌ）が１のときのみ、マスカー信号を生成する。そして、マスカー信号生成部２０８Ａは、（１８）式に従い、マスカー信号ｈａ（ｌ；ｔ）をオーバーラップ加算処理して出力信号ｙ（ｎ）として音出力端子ＯＵＴに出力する。

As described above, the masker signal generation unit 208A generates the masker signal only when the determination result mask_flag (l) of whether or not to generate the masker signal is 1. Then, the masker signal generation unit 208A performs overlap addition processing of the masker signal ha (l; t) according to the equation (18) and outputs the output signal y (n) to the sound output terminal OUT.

(B-3) Effects of the Second Embodiment
According to the second embodiment, the following effects can be obtained in comparison with the first embodiment.

In the sound masking device 100A of the second embodiment, the voice of the target speaker U1 is accumulated in the frame signal DB 205 of the input signal DB 204 only when a voice section is detected. Because only voice sections are accumulated in the input signal DB 204, the masker signal can be generated from voice alone, and a high masking effect can be maintained.

Further, the sound masking device 100A of the second embodiment generates the masker signal only when a voice section is detected, so the masker signal is generated and output only while the voice of the target speaker U1 is being input.

(C) Third Embodiment
Hereinafter, a third embodiment of the sound processing device, the sound processing program, and the sound processing method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device.

(C-1) Configuration of the Third Embodiment
FIG. 5 is a block diagram showing the functional configuration of the sound masking device 100B according to the third embodiment. In FIG. 5, parts identical or corresponding to those in FIGS. 1 and 4 described above are denoted by the same or corresponding reference numerals.

以下では、第３の実施形態について、第２の実施形態との差異を中心に説明し、第２の実施形態と重複する部分については説明を省略する。 In the following, the third embodiment will be mainly described with respect to the difference from the second embodiment, and the description of the part overlapping with the second embodiment will be omitted.

第３の実施形態のサウンドマスキング装置１００Ｂでは、サウンドマスキング処理部２００Ａがサウンドマスキング処理部２００Ｂに置き換わっている点で、第２の実施形態と異なっている。 The sound masking device 100B of the third embodiment is different from the second embodiment in that the sound masking processing unit 200A is replaced with the sound masking processing unit 200B.

サウンドマスキング処理部２００Ｂでは、ピッチ推定部２１２が追加され、さらに、ＤＢ蓄積判定部２１０とマスカー信号生成判定部２１１がそれぞれＤＢ蓄積判定部２１０Ｂとマスカー信号生成判定部２１１Ｂに置き換わっている点で、第２の実施形態と異なっている。 In the sound masking processing unit 200B, the pitch estimation unit 212 is added, and the DB accumulation determination unit 210 and the masker signal generation determination unit 211 are replaced by the DB accumulation determination unit 210B and the masker signal generation determination unit 211B, respectively. It is different from the second embodiment.

The sound masking device 100B of the third embodiment differs from the sound masking device 100A of the second embodiment in that the pitch of the divided frame signal is estimated because the pitch estimation unit 212 is added, in that the method of accumulating long-time frame signals in the frame signal DB 205 of the input signal DB 204 differs because the DB accumulation determination unit 210 is replaced by the DB accumulation determination unit 210B, and in that the method of generating the masker signal differs because the masker signal generation determination unit 211 is replaced by the masker signal generation determination unit 211B.

次に、サウンドマスキング処理部２００Ｂの詳細な構成を説明する。 Next, the detailed configuration of the sound masking processing unit 200B will be described.

サウンドマスキング処理部２００Ｂは、フレーム分割部２０１、音声区間判定部２０９、ピッチ推定部２１２、ＤＢ蓄積判定部２１０Ｂ、長時間フレーム信号作成部２０２、ＤＢ書込み部２０３Ａ、入力信号ＤＢ２０４、フレーム信号ＤＢ２０５、マスカー信号生成判定部２１１Ｂ、フレーム選択制限部２０６、フレーム信号選択部２０７Ａ、マスカー信号生成部２０８Ａ、音入力端子ＩＮ、及び音出力端子ＯＵＴを有する。 The sound masking processing unit 200B includes a frame division unit 201, a voice section determination unit 209, a pitch estimation unit 212, a DB accumulation determination unit 210B, a long-time frame signal creation unit 202, a DB writing unit 203A, an input signal DB 204, and a frame signal DB 205. It has a masker signal generation determination unit 211B, a frame selection limiting unit 206, a frame signal selection unit 207A, a masker signal generation unit 208A, a sound input terminal IN, and a sound output terminal OUT.

The pitch estimation unit 212 estimates the pitch (the height of the voice) of the divided frame signal only when the frame is determined to be a voice section, based on the voice section determination result output from the voice section determination unit 209, and outputs the pitch estimation result (hereinafter referred to as the "estimated value of the pitch").

The DB accumulation determination unit 210B determines, based on the estimated value of the pitch from the pitch estimation unit 212, whether or not to accumulate the divided frame signal and the estimated value of the pitch in the frame signal DB 205 of the input signal DB 204, and outputs the determination result.

マスカー信号生成判定部２１１Ｂは、ピッチ推定部２１２のピッチの推定値を基に、マスカー信号を生成するか否かを判定し、判定結果を出力する。 The masker signal generation determination unit 211B determines whether or not to generate a masker signal based on the estimated value of the pitch of the pitch estimation unit 212, and outputs the determination result.

なお、第３の実施形態において、第１の実施形態と同様にマスカー信号生成判定部２１１Ｂを除外した構成としても良い。 In the third embodiment, the masker signal generation determination unit 211B may be excluded as in the first embodiment.

(C-2) Operation of the Third Embodiment
Next, the operation of the sound masking device 100B of the third embodiment having the above configuration (the sound processing method according to the embodiment) will be described in detail.

第３の実施形態に係るサウンドマスキング装置１００Ｂにおけるサウンドマスキング処理の基本的な動作は、第１、及び第２の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100B according to the third embodiment is the same as the sound masking process described in the first and second embodiments.

以下では、第２の実施形態と異なる点であるピッチ推定部２１２、ＤＢ蓄積判定部２１０Ｂ、マスカー信号生成判定部２１１Ｂにおける処理動作を中心に詳細に説明する。 Hereinafter, the processing operations in the pitch estimation unit 212, the DB accumulation determination unit 210B, and the masker signal generation determination unit 211B, which are different from the second embodiment, will be described in detail.

音声区間判定部２０９は、例えば、（１３）式から（１５）式に従い、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間か非音声区間かを判定し、音声区間判定結果ＶＡＤ（ｌ）をピッチ推定部２１２に出力する。 For example, the voice section determination unit 209 determines whether the divided frame signal x_fram (l; m) is a voice section or a non-voice section according to the equations (13) to (15), and determines the voice section determination result VAD (l). Output to the pitch estimation unit 212.

ピッチ推定部２１２は、音声区間判定部２０９で音声区間と判定された分割フレーム（ＶＡＤ（ｌ）＝１の分割フレーム）のみ、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）のピッチを推定する。ピッチ推定部２１２がピッチを推定する具体的手法については限定されないものであり種々の方式を適用することができる。ピッチ推定部２１２は、例えば、（１９）式から（２１）式に従い、ピッチを推定するようにしても良い。

The pitch estimation unit 212 estimates the pitch of the divided frame signal x_fram (l; m) only for the divided frame (divided frame with VAD (l) = 1) determined by the voice section determination unit 209 as the voice section. The specific method for estimating the pitch by the pitch estimation unit 212 is not limited, and various methods can be applied. The pitch estimation unit 212 may estimate the pitch according to the equations (19) to (21), for example.

In equation (19), τ (τ = 0, 1, ..., L1−1) is the delay amount of the autocorrelation. In equation (20), tmp_pitch(l) is a variable that temporarily holds the estimated value of the pitch. In equation (21), pitch(l) is the estimated value of the pitch. In equations (20) and (21), fs is the sampling frequency.

（１９）式では、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）の自己相関関数ｘ＿ｆｒａｍ＿ｃｏｒｒ（ｌ；ｉ）を求めている。 In the equation (19), the autocorrelation function x_fram_corr (l; i) of the divided frame signal x_fram (l; m) is obtained.

In equation (20), the pitch is estimated by finding the delay amount τ that maximizes the autocorrelation function x_fram_corr(l; i) and dividing it by the sampling frequency fs, and the result is temporarily assigned to tmp_pitch(l). In equation (21), when the voice section determination result VAD(l) is 1 (a voice section), that estimate is assigned to the estimated value of the pitch pitch(l); when VAD(l) is 0 (a non-voice section), 0 is assigned to pitch(l).
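The formula images for equations (19) to (21) are not reproduced in this text, so the following is only a minimal sketch of an autocorrelation pitch estimator of the kind described above, not the patent's exact formulas. Several things here are assumptions: the function name `estimate_pitch`, the search band `f_min`/`f_max` (used to skip the trivial maximum at τ = 0), and the conversion `fs / τ`, chosen so that the result is in Hz and can be compared with thresholds such as 100 Hz.

```python
import math


def estimate_pitch(frame, fs, vad, f_min=60.0, f_max=400.0):
    """Autocorrelation pitch estimate in Hz for one divided frame.

    Returns 0.0 when vad != 1 (non-voice section), mirroring pitch(l) = 0.
    f_min / f_max bound the lag search and are assumptions of this sketch.
    """
    if vad != 1:
        return 0.0
    n = len(frame)
    lag_lo = max(1, int(fs / f_max))      # shortest lag considered
    lag_hi = min(n - 1, int(fs / f_min))  # longest lag considered
    best_lag, best_corr = 0, float("-inf")
    for tau in range(lag_lo, lag_hi + 1):
        # Autocorrelation of the frame at delay tau.
        corr = sum(frame[m] * frame[m + tau] for m in range(n - tau))
        if corr > best_corr:
            best_lag, best_corr = tau, corr
    # Convert the best lag (in samples) to a frequency in Hz.
    return fs / best_lag if best_lag > 0 else 0.0


fs = 8000
frame = [math.sin(2 * math.pi * 200 * m / fs) for m in range(400)]
print(estimate_pitch(frame, fs, vad=1))  # a 200 Hz test tone
print(estimate_pitch(frame, fs, vad=0))  # non-voice section
```

Restricting the lag range is a common practical choice; without it the autocorrelation maximum always falls at zero delay.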

なお、ピッチ推定部２１２におけるピッチの推定手法は限定されないものであり種々の手法を適用することができる。ピッチ推定部２１２では、例えば、分割フレーム信号ｘ＿ｆｒａｍ（ｌ；ｍ）を離散フーリエ変換や高速フーリエ変換を行ってからケプストラム分析することでピッチを算出するようにしても良い。 The pitch estimation method in the pitch estimation unit 212 is not limited, and various methods can be applied. In the pitch estimation unit 212, for example, the pitch may be calculated by performing discrete Fourier transform or fast Fourier transform on the divided frame signal x_fram (l; m) and then performing cepstrum analysis.

そして、ピッチ推定部２１２は、ピッチの推定値ｐｉｔｃｈ（ｌ）をＤＢ蓄積判定部２１０Ｂとマスカー信号生成判定部２１１Ｂに出力する。 Then, the pitch estimation unit 212 outputs the pitch estimation value pitch (l) to the DB accumulation determination unit 210B and the masker signal generation determination unit 211B.

ＤＢ蓄積判定部２１０Ｂは、ピッチ推定部２１２のピッチの推定値ｐｉｔｃｈ（ｌ）を基に、長時間フレーム信号ｘ＿ｆｒａｍ＿ｌｏｎｇ（ｓ）を入力信号ＤＢ２０４のフレーム信号ＤＢ２０５に蓄積するか否かを判定する。ＤＢ蓄積判定部２１０Ｂは、例えば、（２２）式に従い、蓄積するか否かを判定するようにしても良い。

The DB accumulation determination unit 210B determines whether or not to accumulate the long-time frame signal x_fram_long (s) in the frame signal DB 205 of the input signal DB 204 based on the pitch estimation value pitch (l) of the pitch estimation unit 212. For example, the DB accumulation determination unit 210B may determine whether or not to accumulate according to the equation (22).

In equation (22), DB_flag(l) is the result of determining whether or not to accumulate, and TH_PITCH is the threshold used for that determination. According to equation (22), if the estimated value of the pitch pitch(l) is larger than the threshold TH_PITCH, it is determined that the frame is to be accumulated in the DB and 1 is assigned to DB_flag(l); if pitch(l) is smaller than TH_PITCH, it is determined that the frame is not to be accumulated and 0 is assigned to DB_flag(l).

The threshold TH_PITCH need only make it possible to determine whether or not to accumulate in the DB, and it can be set in various ways. For example, TH_PITCH = 0 may be used so that a frame is accumulated whenever pitch(l) from the pitch estimation unit 212 is other than 0, or TH_PITCH = 100 may be used so that the lower limit of the fundamental frequency of the human voice (for example, 100 Hz) serves as the threshold.
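As an illustration of the threshold decision described for equation (22), the following sketch returns DB_flag(l) from a pitch estimate. The function name `db_accumulate_flag` is hypothetical, and the text does not say what happens when pitch(l) equals TH_PITCH; this sketch assumes that case yields 0.

```python
def db_accumulate_flag(pitch, th_pitch=100.0):
    """Return 1 if the frame should be accumulated in the DB, else 0.

    Equality (pitch == th_pitch) is treated as "do not accumulate";
    that choice is an assumption, since the text leaves it unstated.
    """
    return 1 if pitch > th_pitch else 0


# With TH_PITCH = 0, every frame with a nonzero pitch estimate is stored.
print([db_accumulate_flag(p, th_pitch=0.0) for p in (0.0, 85.0, 180.0)])
# With TH_PITCH = 100, only frames above 100 Hz are stored.
print([db_accumulate_flag(p, th_pitch=100.0) for p in (0.0, 85.0, 180.0)])
```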

そして、ＤＢ蓄積判定部２１０Ｂは、ＤＢに蓄積するか否かの判定結果ＤＢ＿ｆｌａｇ（ｌ）をＤＢ書込み部２０３Ａに出力する。 Then, the DB accumulation determination unit 210B outputs the determination result DB_flag (l) as to whether or not to accumulate in the DB to the DB writing unit 203A.

マスカー信号生成判定部２１１Ｂは、ピッチ推定部２１２のピッチの推定値ｐｉｔｃｈ（ｌ）を基に、マスカー信号を生成するか否かを判定する。判定手段は、例えば、（２３）式に従い、マスカー信号を生成するか否かを判定する。

The masker signal generation determination unit 211B determines whether or not to generate a masker signal based on the pitch estimation value pitch (l) of the pitch estimation unit 212. The determination means determines, for example, whether or not to generate a masker signal according to the equation (23).

In equation (23), mask_flag(l) is the result of determining whether or not to generate a masker signal, and TH2_PITCH is the threshold used for determining whether or not to generate the masker signal.

According to equation (23), if the estimated value of the pitch pitch(l) is larger than the threshold TH2_PITCH, it is determined that a masker signal is to be generated and 1 is assigned to the determination result mask_flag(l); if pitch(l) is smaller than TH2_PITCH, it is determined that no masker signal is to be generated and 0 is assigned to mask_flag(l).

The threshold TH2_PITCH need only make it possible to determine whether or not to generate a masker signal, and values obtained by various methods can be used. For example, TH2_PITCH = 0 may be used so that a masker signal is generated whenever pitch(l) from the pitch estimation unit 212 is other than 0, or TH2_PITCH = 100 may be used so that the lower limit of the fundamental frequency of the human voice (for example, 100 Hz) serves as the threshold. TH2_PITCH may also be set equal to the threshold TH_PITCH of equation (22) used in the DB accumulation determination unit 210B (TH2_PITCH = TH_PITCH).
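The image for equation (23) is not reproduced in this text. Given that mask_flag(l) is the masker generation determination result and TH2_PITCH its threshold, a form consistent with the description would be the following. This is a reconstruction under that assumption, and the equality case is not stated in the text.

```latex
\mathrm{mask\_flag}(l) =
\begin{cases}
1, & \mathrm{pitch}(l) > \mathrm{TH2\_PITCH} \\
0, & \text{otherwise}
\end{cases}
```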

そして、マスカー信号生成判定部２１１Ｂ、マスカー信号を生成するか否かの判定結果ｍａｓｋ＿ｆｌａｇ（ｌ）をフレーム信号選択部２０７Ａに出力する。 Then, the masker signal generation determination unit 211B outputs the determination result mask_flag (l) as to whether or not to generate the masker signal to the frame signal selection unit 207A.

(C-3) Effects of the Third Embodiment
According to the third embodiment, the following effects can be obtained in comparison with the first and second embodiments.

The sound masking device 100B of the third embodiment estimates the pitch of the voice of the target speaker U1 and uses the estimated value of the pitch in the DB accumulation determination unit 210B and the masker signal generation determination unit 211B. As a result, only voiced sound within voice sections is accumulated in the input signal DB 204, so the masker signal can be generated from voice alone and a high masking effect can be maintained.

第３の実施形態のサウンドマスキング装置１００Ｂでは、音声区間で有声音と判定されるときのみマスカー信号を生成するので、対象話者Ｕ１の音声が入力されているときだけマスカー信号を生成し、出力している。これにより、第３の実施形態のサウンドマスキング装置１００Ｂでは、音声が入力されたときのみマスカー信号が出力されるように構成することができる。 Since the sound masking device 100B of the third embodiment generates a masker signal only when it is determined to be a voiced sound in the voice section, the masker signal is generated and output only when the voice of the target speaker U1 is input. doing. As a result, the sound masking device 100B of the third embodiment can be configured so that the masker signal is output only when the sound is input.

(D) Fourth Embodiment
Hereinafter, a fourth embodiment of the sound processing device, the sound processing program, and the sound processing method according to the present invention will be described in detail with reference to the drawings. This embodiment describes an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device.

(D-1) Configuration of the Fourth Embodiment
FIG. 6 is a block diagram showing the functional configuration of the sound masking device 100C according to the fourth embodiment. In FIG. 6, parts that are the same as or correspond to those in FIG. 5 described above are given the same or corresponding reference signs.

以下では、第４の実施形態について、第３の実施形態との差異を中心に説明し、第３の実施形態と重複する部分については説明を省略する。 In the following, the fourth embodiment will be mainly described with respect to the difference from the third embodiment, and the description of the part overlapping with the third embodiment will be omitted.

第４の実施形態のサウンドマスキング装置１００Ｃでは、サウンドマスキング処理部２００Ｂがサウンドマスキング処理部２００Ｃに置き換わっている点で、第３の実施形態と異なっている。 The sound masking device 100C of the fourth embodiment is different from the third embodiment in that the sound masking processing unit 200B is replaced with the sound masking processing unit 200C.

The sound masking processing unit 200C differs from the third embodiment in that a long-time pitch estimation information creation unit 213 and a pitch estimation information DB 214 are added, and in that the DB writing unit 203A, the input signal DB 204, and the frame signal selection unit 207A are replaced by a DB writing unit 203C, an input signal DB 204C, and a frame signal selection unit 207C, respectively.

The sound masking device 100C of the fourth embodiment differs from the sound masking device 100B of the third embodiment in three points: estimated values of the pitch are now accumulated in the input signal DB, because the long-time pitch estimation information creation unit 213 and the pitch estimation information DB 214 are added and the input signal DB is replaced by the input signal DB 204C; the accumulation method of the input signal DB 204C is different, because of the replacement by the DB writing unit 203C; and the method of selecting the masker element signal is different, because of the replacement by the frame signal selection unit 207C.

次に、サウンドマスキング処理部２００Ｃの詳細な構成を説明する。 Next, the detailed configuration of the sound masking processing unit 200C will be described.

The sound masking processing unit 200C has a frame division unit 201, a voice section determination unit 209, a pitch estimation unit 212, a DB accumulation determination unit 210B, a long-time frame signal creation unit 202, a long-time pitch estimation information creation unit 213, an input signal DB 204C, a frame signal DB 205, a pitch estimation information DB 214, a frame selection limiting unit 206, a frame signal selection unit 207C, a masker signal generation unit 208A, a sound input terminal IN, and a sound output terminal OUT.

長時間ピッチ推定情報作成部２１３は、ＤＢ蓄積判定部２１０Ｂの判定結果を基に、ピッチ推定部２１２で推定されたピッチの推定値に基づいて長時間フレームのピッチ推定情報（以下、「長時間ピッチ推定情報」と呼ぶ）を作成し、作成した長時間ピッチ推定情報を出力する。 The long-time pitch estimation information creation unit 213 is based on the determination result of the DB accumulation determination unit 210B, and based on the estimated value of the pitch estimated by the pitch estimation unit 212, the pitch estimation information of the long-time frame (hereinafter, "long-time"). "Pitch estimation information") is created, and the created long-time pitch estimation information is output.

Based on the determination result of the DB accumulation determination unit 210B as to whether or not to accumulate in the DB, the DB writing unit 203C writes the long-time frame signal to the frame signal DB 205 of the input signal DB 204C and writes the long-time pitch estimation information to the pitch estimation information DB 214 of the input signal DB 204C.

入力信号ＤＢ２０４Ｃは、過去の長時間フレーム信号と過去の長時間ピッチ推定情報を長時間フレーム毎に対応づけて蓄積（保持）する記憶手段である。入力信号ＤＢ２０４Ｃ内のデータ形式については限定されないものであるが、ここで、入力信号ＤＢ２０４Ｃは、少なくとも、過去の長時間フレーム信号を蓄積したフレーム信号ＤＢ２０５と、過去の長時間ピッチ推定情報を蓄積したピッチ推定情報ＤＢ２１４とを有しているものとする。 The input signal DB204C is a storage means that stores (holds) the past long-time frame signal and the past long-time pitch estimation information in association with each other for each long-time frame. The data format in the input signal DB204C is not limited, but here, the input signal DB204C has accumulated at least the frame signal DB205 that has accumulated the past long-time frame signal and the past long-time pitch estimation information. It is assumed that the pitch estimation information DB 214 is provided.

フレーム信号選択部２０７Ｃは、マスカー信号生成判定部２１１Ｂの判定結果と、ピッチ推定部２１２のピッチの推定値と入力信号ＤＢ２０４Ｃのピッチ推定情報ＤＢ２１４に蓄積されている過去の長時間ピッチ推定情報との比較結果を基に、入力信号ＤＢ２０４Ｃに蓄積されている過去の長時間フレーム信号をフレーム選択制限部２０６の制限フレーム数より前のフレームからマスカー素片信号として選択し、選択したフレームを出力する。 The frame signal selection unit 207C contains the determination result of the masker signal generation determination unit 211B, the pitch estimation value of the pitch estimation unit 212, and the past long-time pitch estimation information stored in the pitch estimation information DB 214 of the input signal DB 204C. Based on the comparison result, the past long-time frame signal stored in the input signal DB204C is selected as the masker element signal from the frames before the limited number of frames of the frame selection limiting unit 206, and the selected frame is output.

In the fourth embodiment, the masker signal generation determination unit 211B may be excluded as in the second embodiment.

(D-2) Operation of the Fourth Embodiment
Next, the operation of the sound masking device 100C of the fourth embodiment having the above configuration (the sound processing method of the embodiment) will be described in detail.

第４の実施形態に係るサウンドマスキング装置１００Ｃにおけるサウンドマスキング処理の基本的な動作は、第３の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100C according to the fourth embodiment is the same as the sound masking process described in the third embodiment.

以下では、第４の実施形態において、第３の実施形態と異なる点である長時間ピッチ推定情報作成部２１３、ＤＢ書込み部２０３Ｃ、入力信号ＤＢ２０４Ｃのピッチ推定情報ＤＢ２１４、及びフレーム信号選択部２０７Ｃにおける処理動作を中心に詳細に説明する。 In the following, in the fourth embodiment, the long-time pitch estimation information creation unit 213, the DB writing unit 203C, the pitch estimation information DB 214 of the input signal DB 204C, and the frame signal selection unit 207C, which are different from the third embodiment. The processing operation will be described in detail.

The pitch estimation unit 212 estimates the pitch of the divided frame signal x_fram(l; m) and outputs the estimated value of the pitch pitch(l) to the DB accumulation determination unit 210B, the long-time pitch estimation information creation unit 213, the masker signal generation determination unit 211B, and the frame signal selection unit 207C.

The long-time pitch estimation information creation unit 213 combines the estimated values of the pitch pitch(l) of the divided frame signals x_fram(l; m) estimated by the pitch estimation unit 212, and creates the long-time pitch estimation information for the long-time frame signal x_fram_long(s). The long-time pitch estimation information creation unit 213 may, for example, combine the estimated values of the pitch according to equation (24) to create the long-time pitch estimation information pitch_long(i).
pitch_long(i) = pitch(l − ((I − 1) − i)) … (24)

（２４）式は、長時間フレーム信号を作成するときに使用されていた分割フレーム信号のピッチの推定値ｐｉｔｃｈ（ｌ）を結合して長時間ピッチ推定情報ｐｉｔｃｈ＿ｌｏｎｇ（ｉ）を作成するという式となっている。 Equation (24) is an equation (24) in which the estimated value pitch (l) of the pitch of the divided frame signal used when creating the long-time frame signal is combined to create the long-time pitch estimation information pitch_long (i). It has become.
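Equation (24) can be read as collecting the pitch estimates of the most recent I divided frames, oldest first. The sketch below illustrates that reading; the function name `make_pitch_long` and the use of a Python list as the pitch history are assumptions for illustration, with the newest estimate pitch(l) taken to be the last list element.

```python
def make_pitch_long(pitch_history, num_frames):
    """Build pitch_long(i) = pitch(l - ((I - 1) - i)) for i = 0 .. I-1.

    pitch_history holds pitch(0) .. pitch(l); num_frames is I, the number
    of divided frames combined into one long-time frame.
    """
    l = len(pitch_history) - 1  # index of the newest divided frame
    if l < num_frames - 1:
        raise ValueError("not enough divided frames for one long-time frame")
    return [pitch_history[l - ((num_frames - 1) - i)] for i in range(num_frames)]


history = [0.0, 110.0, 120.0, 0.0, 130.0]
print(make_pitch_long(history, 3))  # the last three estimates, oldest first
```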

そして、長時間ピッチ推定情報作成部２１３は、作成した長時間ピッチ推定情報ｐｉｔｃｈ＿ｌｏｎｇ（ｉ）をＤＢ書込み部２０３Ｃに出力する。 Then, the long-time pitch estimation information creation unit 213 outputs the created long-time pitch estimation information pitch_long (i) to the DB writing unit 203C.

Based on the determination result DB_flag(l) output from the DB accumulation determination unit 210B, the DB writing unit 203C writes the long-time frame signal x_fram_long(s) and the long-time pitch estimation information pitch_long(i) to the frame signal DB 205 and the pitch estimation information DB 214 of the input signal DB 204C, respectively, in association with each other.

Only when the determination result DB_flag(l) output from the DB accumulation determination unit 210B is 1 does the DB writing unit 203C write the long-time frame signal x_fram_long(s) to DB_singal(j; t) of the frame signal DB 205 and, at the same time, write the long-time pitch estimation information pitch_long(i) to DB_pitch(j; i) of the pitch estimation information DB 214. In that case the DB writing unit 203C performs the writes according to, for example, equations (5), (25), and (6). When DB_flag(l) is 0, the DB writing unit 203C writes neither the long-time frame signal x_fram_long(s) nor the long-time pitch estimation information pitch_long(i) to the frame signal DB 205 and the pitch estimation information DB 214 of the input signal DB 204C.
DB_pitch(j; i) = pitch_long(i) … (25)
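The gated, paired write can be sketched as follows. Equations (5) and (6) are not shown in this section, so how the DB index j advances is unknown here; this sketch simply appends, which is an assumption. The class name `InputSignalDB` and its attribute names are hypothetical.

```python
class InputSignalDB:
    """Toy stand-in for input signal DB 204C: two stores kept in step."""

    def __init__(self):
        self.frame_signal_db = []  # plays the role of frame signal DB 205
        self.pitch_info_db = []    # plays the role of pitch estimation information DB 214

    def write(self, db_flag, x_fram_long, pitch_long):
        """Store both items under the same index j only when db_flag == 1."""
        if db_flag != 1:
            return False  # nothing is written for rejected frames
        self.frame_signal_db.append(list(x_fram_long))
        self.pitch_info_db.append(list(pitch_long))  # DB_pitch(j; i) = pitch_long(i)
        return True


db = InputSignalDB()
db.write(1, [0.1, 0.2, 0.3, 0.4], [120.0, 125.0])
db.write(0, [0.0, 0.0, 0.0, 0.0], [0.0, 0.0])  # skipped: DB_flag(l) = 0
print(len(db.frame_signal_db), len(db.pitch_info_db))
```

Writing both stores in one call keeps index j of the frame store and the pitch store aligned, which the later frame selection relies on.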

The input signal DB 204C accumulates (holds) the long-time frame signal x_fram_long(s) and the long-time pitch estimation information pitch_long(i) in the frame signal DB 205 and the pitch estimation information DB 214 of the input signal DB 204C, respectively, in association with each other.

上述の通り、この実施形態の入力信号ＤＢ２０４Ｃには、フレーム信号ＤＢ２０５とピッチ情報ＤＢ２１４とが含まれている。ここでは、フレーム信号ＤＢ２０５に各長時間フレーム信号ｘ＿ｆｒａｍ＿ｌｏｎｇ（ｓ）が記録され、ピッチ情報ＤＢ２１４には、長時間ピッチ推定情報ｐｉｔｃｈ＿ｌｏｎｇ（ｉ）が記録されることになる。 As described above, the input signal DB 204C of this embodiment includes the frame signal DB 205 and the pitch information DB 214. Here, each long-time frame signal x_frame_long (s) is recorded in the frame signal DB 205, and the long-time pitch estimation information pitch_long (i) is recorded in the pitch information DB 214.

Based on the determination result mask_flag(l) output from the masker signal generation determination unit 211B as to whether or not to generate a masker signal, the frame signal selection unit 207C compares the estimated value of the pitch from the pitch estimation unit 212 (the pitch based on the current divided frame) with the past long-time pitch estimation information accumulated in the pitch information DB 214 of the input signal DB 204C, and selects as masker element signals the long-time frame signals whose pitch information is close to the estimated value of the pitch from the pitch estimation unit 212. The frame signal selection unit 207C selects the masker element signals used for masker signal generation using the input signal DB 204C, the estimated value of the pitch pitch(l), and the limit frame number Limit_Fream_NUM.

The frame signal selection unit 207C selects frames only when the determination result mask_flag(l) of the masker signal generation determination unit 211B is 1, and selects no frame when mask_flag(l) is 0. When mask_flag(l) is 1, the frame signal selection unit 207C selects frames according to, for example, equations (26) to (29).

In equation (26), tmp_T(p) is a variable that temporarily holds the selected frames. In equation (27), DB_pitch_ave(i) is the average value of the past long-time pitch estimation information. In equation (28), Sub_pitch(j) is the absolute value of the difference between the estimated value of the pitch pitch(l) and the average value DB_pitch_ave(i) of the past long-time pitch estimation information. In equation (29), Tc(p) is the selected frame number, and p (p = 0, 1, ..., SEL_NUM−1) runs over the selected frames, SEL_NUM being the number of frames selected. The function small(x(k), p) in equation (29) outputs the index k_p of the p-th smallest element x(k_p) of the array x(k).

Equation (26) selects the long-time frame signals held in the frame signal DB 205 of the input signal DB 204C, starting from the frames older than the limit frame number Limit_Fream_NUM and proceeding from newest to oldest, and assigns to tmp_T(p) the database index numbers at which the selected long-time frame signals are held. Equation (27) calculates, for each tmp_T(p), the average value DB_pitch_ave(i) of the past long-time pitch estimation information accumulated in the pitch estimation information DB 214. Equation (28) calculates the absolute value Sub_pitch(i) of the difference between the estimated value of the pitch pitch(l) from the pitch estimation unit 212 and the average value DB_pitch_ave(i). Equation (29) selects a plurality of frames as masker element signals, taking the indexes j (long-time frames) in ascending order of Sub_pitch(i).
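The images for equations (26) to (29) are not reproduced in this text, so the following is a hedged sketch of the four steps as described, not the patent's exact formulas. Assumed for illustration: the function name `select_frames`, a pitch DB stored as a list ordered oldest to newest, every entry older than the newest `limit_frame_num` entries being a candidate, and ties resolved in favour of the newer frame.

```python
def select_frames(db_pitch, pitch_now, limit_frame_num, sel_num):
    """Pick sel_num DB indexes whose average pitch is closest to pitch_now.

    db_pitch[j] is the long-time pitch estimation information of entry j
    (a non-empty list), with larger j meaning a more recent entry.
    """
    # Step (26): candidates older than the limit, newest first.
    newest_allowed = len(db_pitch) - limit_frame_num
    tmp_t = list(range(newest_allowed - 1, -1, -1))

    scored = []
    for j in tmp_t:
        # Step (27): average pitch of the stored long-time frame.
        ave = sum(db_pitch[j]) / len(db_pitch[j])
        # Step (28): absolute difference from the current pitch estimate.
        scored.append((abs(pitch_now - ave), j))

    # Step (29): take the sel_num smallest differences (stable sort, so
    # equal differences keep the newest-first order).
    scored.sort(key=lambda item: item[0])
    return [j for _, j in scored[:sel_num]]


db_pitch = [[100, 100], [150, 150], [200, 200], [120, 120], [180, 180]]
print(select_frames(db_pitch, pitch_now=125.0, limit_frame_num=1, sel_num=2))
```

Excluding the newest entries reflects the role of the frame selection limiting unit 206, which keeps very recent speech out of the masker.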

The frame signal selection unit 207C may calculate, for each past long-time frame accumulated in the pitch information DB 214, a value indicating the pitch of the entire long-time frame based on the long-time pitch estimation information (hereinafter referred to as the "long-time frame pitch"), and may select the long-time frames to be used as masker element signals based on the result of comparing the estimated value of the pitch from the pitch estimation unit 212 with the long-time frame pitch. For example, the frame signal selection unit 207C may select, as a masker element signal, a long-time frame whose long-time frame pitch is close to the estimated value of the pitch from the pitch estimation unit 212.

以上のように、フレーム信号選択部２０７Ｃは、ピッチ推定部２１２のピッチの推定値と近いピッチ情報（近いピッチの値）を持つ長時間フレーム信号（インデックス）をマスカー素片信号として選択し、選択したフレームＴｃ（ｐ）を出力する。 As described above, the frame signal selection unit 207C selects and selects a long-time frame signal (index) having pitch information (near pitch value) close to the pitch estimation value of the pitch estimation unit 212 as the masker element signal. The frame Tc (p) is output.

(D-3) Effects of the Fourth Embodiment
According to the fourth embodiment, the following effects can be obtained in comparison with the first to third embodiments.

The sound masking device 100C of the fourth embodiment compares the pitch information accumulated in the pitch estimation information DB 214 with the estimated value of the pitch from the pitch estimation unit 212, and selects, from the frame signals accumulated in the frame signal DB 205, signals whose pitch is close to that estimate. As a result, in the sound masking device 100C of the fourth embodiment, the frequency characteristics of the masker signal become close to the pitch of the voice of the target speaker U1, and a higher masking effect can be maintained.

(E) Other Embodiments
The present invention is not limited to the above embodiments, and modified embodiments such as those exemplified below are also possible.

（Ｅ−１）例えば、本発明のサウンドマスキング装置を電話会議で周囲の対象者以外の人に対して、会話の内容が漏れることを防止する装置に搭載されるようにしても良い。この場合、サウンドマスキング装置において、対象話者Ｕ１は電話会議で発話している人となる。 (E-1) For example, the sound masking device of the present invention may be mounted on a device for preventing the content of a conversation from being leaked to a person other than the surrounding subjects in a conference call. In this case, in the sound masking device, the target speaker U1 is a person speaking in a conference call.

（Ｅ−２）上記の各実施形態において、サウンドマスキング装置の、サウンドマスキング部は、ネットワーク上の処理装置（例えば、サーバ等）で処理される構成としても良い。 (E-2) In each of the above embodiments, the sound masking unit of the sound masking device may be configured to be processed by a processing device (for example, a server or the like) on the network.

（Ｅ−３）上記の各実施形態において、サウンドマスキング装置には、オーディオデバイス（マイク、マイクアンプ、ＡＤ変換器、スピーカ、スピーカアンプ、及びＤＡ変換器）が含まれる構成として説明したが、サウンドマスキング装置についてオーディオデバイスを除外した構成として製造し、実際に使用する現場でオーディオデバイスを別途接続するようにしても良い。すなわち、サウンドマスキング装置には、少なくともサウンドマスキング処理部が含まれる構成としても良い。 (E-3) In each of the above embodiments, the sound masking device has been described as a configuration including an audio device (microphone, microphone amplifier, AD converter, speaker, speaker amplifier, and DA converter), but the sound has been described. The masking device may be manufactured as a configuration excluding the audio device, and the audio device may be connected separately at the actual site of use. That is, the sound masking device may be configured to include at least a sound masking processing unit.

100, 100A, 100B, 100C ... Sound masking device; 101 ... Microphone; 102 ... Microphone amplifier; 103 ... AD converter; 104 ... DA converter; 105 ... Speaker amplifier; 106 ... Speaker; 200, 200A, 200B, 200C ... Sound masking device; 201 ... Frame division unit; 202 ... Long-time frame signal creation unit; 203, 203A, 203C ... DB writing unit; 204, 204C ... Input signal DB; 205 ... Frame signal DB; 206 ... Frame selection restriction unit; 207, 207A, 207C ... Frame signal selection unit; 208, 208A ... Masker signal generation unit; 209 ... Voice section determination unit; 210, 210B ... DB accumulation determination unit; 211, 211B ... Masker signal generation determination unit; 212 ... Pitch estimation unit; 213 ... Long-time pitch estimation information creation unit; 214 ... Pitch estimation information DB; IN ... Sound input terminal; OUT ... Sound output terminal; 300 ... Computer; 301 ... Processor; 302 ... Primary storage unit; 303 ... Secondary storage unit.

Claims

A sound processing device comprising:
frame dividing means for dividing, into frames of a predetermined length, a microphone input signal supplied from a microphone that picks up the voice uttered by a target speaker;
long-time frame signal creating means for creating a long-time frame of a predetermined length by combining the microphone input signals frame-divided by the frame dividing means;
input signal storage means for accumulating the long-time frame signals created by the long-time frame signal creating means;
frame signal selecting means for performing frame signal selection processing that selects a signal to be used for generating a masker signal from the past frame-divided microphone input signals accumulated in the input signal storage means;
frame selection limiting means for limiting the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and
masker signal generating means for generating and outputting, using the signal to be used for generating the masker signal, the masker signal that makes the voice uttered by the target speaker difficult to hear.
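
For illustration only, the chain of means recited in this claim can be sketched in Python. Everything specific below is an assumption of the sketch and is not taken from the claims: the constant values, the function names, the restriction rule (the newest stored long-time frames are not selectable), random selection, and sample-wise averaging as the way the masker is generated.

```python
import random
from collections import deque

FRAME_LEN = 4        # samples per frame (hypothetical value)
FRAMES_PER_LONG = 2  # frames combined into one long-time frame (hypothetical)
EXCLUDED_RECENT = 1  # newest long-time frames that may not be selected (hypothetical)
NUM_SELECTED = 2     # long-time frames mixed into one masker block (hypothetical)


def divide_into_frames(samples, frame_len=FRAME_LEN):
    """Frame division: split samples into complete frames of frame_len samples."""
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def make_long_frames(frames, frames_per_long=FRAMES_PER_LONG):
    """Long-time frame creation: concatenate consecutive frames."""
    usable = len(frames) - len(frames) % frames_per_long
    return [
        [s for frame in frames[i:i + frames_per_long] for s in frame]
        for i in range(0, usable, frames_per_long)
    ]


def selectable_indices(stored_count, excluded_recent=EXCLUDED_RECENT):
    """Frame selection restriction: the newest excluded_recent entries are off limits."""
    return list(range(max(0, stored_count - excluded_recent)))


def generate_masker(store, rng, num_selected=NUM_SELECTED):
    """Pick allowed long-time frames at random and average them sample by sample.

    Returns None when the restriction leaves nothing to select.
    """
    allowed = selectable_indices(len(store))
    if not allowed:
        return None
    chosen = rng.sample(allowed, min(num_selected, len(allowed)))
    picks = [store[i] for i in chosen]
    return [sum(column) / len(picks) for column in zip(*picks)]


# Usage: 24 samples -> 6 frames -> 3 long-time frames held in a bounded store.
store = deque(maxlen=100)  # stand-in for the input signal storage means
samples = [float(n) for n in range(24)]
store.extend(make_long_frames(divide_into_frames(samples)))
masker = generate_masker(store, random.Random(0))
```

A bounded `deque` is used so that the oldest long-time frames are discarded automatically once the store is full; that retention policy is likewise an assumption of the sketch.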

The sound processing device according to claim 1, wherein the frame selection limiting means restricts the frame signal selecting means so that, when performing the frame signal selection processing, it does not select a signal from a predetermined time before.

The sound processing device according to claim 1 or 2, further comprising:
voice section determining means for determining whether the microphone input signal frame-divided by the frame dividing means is a voice section or a non-voice section;
input signal storage determining means for determining, based on the result of the voice section determining means, whether to accumulate the frame-divided microphone input signal in the input signal storage means; and
masker signal generation determining means for determining, based on the result of the voice section determining means, whether to generate the masker signal,
wherein the long-time frame signal creating means stores the microphone input signal frame-divided by the frame dividing means in the input signal storage means only when the signal is determined to be a voice section, and
the masker signal generating means generates the masker signal only when the signal is determined to be a voice section.
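
The gating recited in this claim can be illustrated with a short sketch. The claim does not say how a voice section is detected, so the mean-energy threshold below is purely an assumption, as are the function names, the threshold value, the `make_masker` callback, and the choice to store the frame before generating the masker.

```python
def is_voice_section(frame, energy_threshold=0.01):
    """Hypothetical voice section decision: mean squared amplitude above a threshold."""
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold


def process_frame(frame, store, make_masker):
    """Store the frame and generate a masker only for voice sections.

    store       -- list standing in for the input signal storage means
    make_masker -- hypothetical callback that builds a masker from the store
    Returns the masker, or None for a non-voice section.
    """
    if not is_voice_section(frame):
        return None  # non-voice section: nothing stored, no masker generated
    store.append(frame)
    return make_masker(store)


# Usage with a trivial callback that only reports how many frames are stored.
store = []
silent_result = process_frame([0.0, 0.0, 0.0, 0.0], store, len)
voiced_result = process_frame([0.5, -0.5, 0.5, -0.5], store, len)
```

Here the silent frame leaves the store untouched, while the voiced frame is stored and triggers masker generation.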

The sound processing device according to claim 1 or 2, further comprising:
pitch estimating means for estimating the pitch of the microphone input signal frame-divided by the frame dividing means;
input signal storage determining means for determining, based on the result of the pitch estimating means, whether to accumulate the frame-divided microphone input signal in the input signal storage means; and
masker signal generation determining means for determining, based on the result of the pitch estimating means, whether to generate the masker signal,
wherein the long-time frame signal creating means stores the microphone input signal frame-divided by the frame dividing means in the input signal storage means based on the pitch estimated by the pitch estimating means, and
the masker signal generating means generates the masker signal based on the pitch estimated by the pitch estimating means.

The sound processing device according to claim 4, further comprising:
pitch storage means capable of accumulating, for each frame, the pitch estimated by the pitch estimating means; and
pitch information creating means for accumulating the pitch estimated by the pitch estimating means in the pitch storage means,
wherein the signal to be used for generating the masker signal is selected, from frames earlier than the number of frames limited by the frame selection limiting means, using the past frame-divided microphone input signals accumulated in the input signal storage means, the past pitch information accumulated in the pitch storage means, and the pitch estimated by the pitch estimating means.

A sound processing program causing a computer to function as:
frame dividing means for dividing, into frames of a predetermined length, a microphone input signal supplied from a microphone that picks up the voice uttered by a target speaker;
long-time frame signal creating means for forming the microphone input signals frame-divided by the frame dividing means into a long-time frame of a predetermined length;
input signal storage means for accumulating the long-time frame signals created by the long-time frame signal creating means;
frame signal selecting means for performing frame signal selection processing that selects a signal to be used for generating a masker signal from the past frame-divided microphone input signals accumulated in the input signal storage means;
frame selection limiting means for limiting the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and
masker signal generating means for generating and outputting, using the signal to be used for generating the masker signal, the masker signal that makes the voice uttered by the target speaker difficult to hear.

A sound processing method performed by a sound processing device, wherein:
the sound processing device comprises frame dividing means, long-time frame signal creating means, input signal storage means, frame selection limiting means, frame signal selecting means, and masker signal generating means;
the frame dividing means divides, into frames of a predetermined length, a microphone input signal supplied from a microphone that picks up the voice uttered by a target speaker;
the long-time frame signal creating means creates a long-time frame of a predetermined length by combining the microphone input signals frame-divided by the frame dividing means;
the input signal storage means accumulates the long-time frame signals created by the long-time frame signal creating means;
the frame signal selecting means performs frame signal selection processing that selects a signal to be used for generating a masker signal from the past frame-divided microphone input signals accumulated in the input signal storage means;
the frame selection limiting means limits the frames to be selected when the frame signal selecting means performs the frame signal selection processing; and
the masker signal generating means generates and outputs, using the signal to be used for generating the masker signal, the masker signal that makes the voice uttered by the target speaker difficult to hear.