JP7287182B2

JP7287182B2 - SOUND PROCESSING DEVICE, SOUND PROCESSING PROGRAM AND SOUND PROCESSING METHOD

Info

Publication number: JP7287182B2
Application number: JP2019151513A
Authority: JP
Inventors: 尚也川畑; 祥剛大塩; 敬信西浦; 健太岩居
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2019-08-21
Filing date: 2019-08-21
Publication date: 2023-06-06
Anticipated expiration: 2039-08-21
Also published as: JP2021032989A

Description

The present invention relates to a sound processing device, a sound processing program, and a sound processing method, and can be applied, for example, to sound masking processing used as a technique for preventing the content of a conversation from leaking to third parties near the speaker.

In recent years, it has become a problem that when a speaker converses with a conversation partner at a reception counter, service window, meeting space, or the like in a facility where an unspecified number of people are present (for example, a hospital, pharmacy, or bank), the content of the conversation leaks to surrounding third parties.

Preventing the content of a conversation from leaking to third parties is called speech privacy, and the sound masking effect is used to achieve it.

The sound masking effect is a phenomenon in which, while a certain sound (hereinafter, the target sound) is being heard, the presence of another sound with acoustic characteristics close to those of the target sound (for example, frequency characteristics, pitch, or formants) makes the target sound difficult to hear (masks it). In general, the masking sound is called the masker, and the masked sound is called the maskee.

Patent Documents 1 and 2 propose sound masking devices that use this masking effect to prevent conversation content from leaking to third parties (that is, to protect speech privacy).

The sound masking device described in Patent Document 1 analyzes the acoustic features of the speaker's voice signal, which is the maskee signal, and generates a masker signal based on the analysis result, so that a high masking effect is obtained even when the speaker's voice signal changes.

The speech processing method described in Patent Document 2 extracts the spectral envelope and spectral fine structure of a voice signal and deforms the extracted spectral envelope to generate a deformed spectral envelope. It then combines the deformed spectral envelope with the extracted spectral fine structure to generate a deformed spectrum, and outputs a signal generated from the deformed spectrum as a masker signal so that the content of the conversation cannot be heard by third parties.

Patent Document 1: JP 2012-88577 A
Patent Document 2: JP 2006-243178 A

In the sound masking device described in Patent Document 1, voice signals of multiple people, including men and women, are stored in a database as general-purpose masker signals so that some masking effect can be expected even for unspecified speakers. A masker signal is then generated by changing the acoustic characteristics of the stored general-purpose masker signal based on the analysis of the acoustic features of the speaker's voice signal (for example, converting the pitch of the general-purpose masker signal to the pitch of the input voice signal, or converting the formants of the general-purpose masker sound to the formants of the input voice signal). As a result, the modified general-purpose masker signal may sound artificial and unpleasant. Furthermore, if the analysis result of the acoustic features is wrong, the acoustic features of the masker signal differ from those of the speaker's voice, so the masking effect is reduced and the content of the conversation cannot be masked.

The speech processing method described in Patent Document 2 likewise deforms the spectral envelope extracted from the voice signal to generate a deformed spectral envelope, and combines it with the extracted spectral fine structure to generate the masker signal. A masker signal generated by deforming the speaker's voice signal in this way therefore sounds artificial and may be unpleasant.

In addition, with both the sound masking device of Patent Document 1 and the speech processing method of Patent Document 2, if the generated masker signal is output in a way that reaches the speaker, the speaker also hears the masker signal, which interferes with the conversation and prevents it from proceeding smoothly.

In view of the above problems, there is a need for a sound processing device, a sound processing program, and a sound processing method that achieve a high masking effect without analyzing the acoustic features of the speaker who utters the speech (hereinafter, the "target speaker"), or even when the analysis result of the acoustic features is wrong. There is also a need for a sound processing device, a sound processing program, and a sound processing method that can mask the target speaker's speech without interfering with the target speaker's conversation.

A sound processing device according to a first aspect of the present invention comprises: (1) frame dividing means for dividing a microphone input signal, supplied from a microphone that picks up speech uttered by a target speaker, into frames of a predetermined length; (2) input signal accumulation means for accumulating the frame-divided microphone input signal; (3) signal selection means for selecting, from the past frame-divided microphone input signals accumulated in the input signal accumulation means, signals to be used for generating a masker signal, and outputting the selection result; (4) masker signal generation means for generating and outputting, using the signals selected for generating the masker signal, the masker signal that makes the speech uttered by the target speaker difficult to hear; and (5) pitch estimation means for estimating the pitch of the microphone input signal, wherein (6) the input signal accumulation means accumulates the microphone input signal by sorting it into one of a plurality of classes according to the pitch estimated by the pitch estimation means, and (7) the masker signal generation means generates the masker signal using microphone input signals, taken from the input signal accumulation means, of the class corresponding to the pitch estimated by the pitch estimation means.

A sound processing program according to a second aspect of the present invention causes a computer to function as: (1) frame dividing means for dividing a microphone input signal, supplied from a microphone that picks up speech uttered by a target speaker, into frames of a predetermined length; (2) input signal accumulation means for accumulating the frame-divided microphone input signal; (3) signal selection means for selecting, from the past frame-divided microphone input signals accumulated in the input signal accumulation means, signals to be used for generating a masker signal, and outputting the selection result; (4) masker signal generation means for generating and outputting, using the signals selected for generating the masker signal, the masker signal that makes the speech uttered by the target speaker difficult to hear; and (5) pitch estimation means for estimating the pitch of the microphone input signal, wherein (6) the input signal accumulation means accumulates the microphone input signal by sorting it into one of a plurality of classes according to the pitch estimated by the pitch estimation means, and (7) the masker signal generation means generates the masker signal using microphone input signals, taken from the input signal accumulation means, of the class corresponding to the pitch estimated by the pitch estimation means.

A sound processing method according to a third aspect of the present invention (1) has frame dividing means, input signal accumulation means, signal selection means, masker signal generation means, and pitch estimation means, wherein (2) the frame dividing means divides a microphone input signal, supplied from a microphone that picks up speech uttered by a target speaker, into frames of a predetermined length; (3) the input signal accumulation means accumulates the frame-divided microphone input signal; (4) the signal selection means selects, from the past frame-divided microphone input signals accumulated in the input signal accumulation means, signals to be used for generating a masker signal, and outputs the selection result; (5) the masker signal generation means generates and outputs, using the signals selected for generating the masker signal, the masker signal that makes the speech uttered by the target speaker difficult to hear; (6) the pitch estimation means estimates the pitch of the microphone input signal; (7) the input signal accumulation means accumulates the microphone input signal by sorting it into one of a plurality of classes according to the pitch estimated by the pitch estimation means; and (8) the masker signal generation means generates the masker signal using microphone input signals, taken from the input signal accumulation means, of the class corresponding to the pitch estimated by the pitch estimation means.

According to the present invention, the masker signal is generated from the target speaker's own accumulated past speech, that is, from signals whose acoustic characteristics have not been altered. A high masking effect can therefore be achieved without analyzing the acoustic features, or even when the analysis result of the acoustic features is wrong. Furthermore, the target speaker's speech can be masked without interfering with the target speaker's conversation.

FIG. 1 is a block diagram showing the functional configuration of a sound masking device according to a first embodiment.
FIG. 2 is a block diagram showing an example of the hardware configuration of the sound masking device according to the first embodiment.
FIG. 3 is a conceptual diagram of the masker signal generated by the sound masking device according to the first embodiment being output by reflection off the floor surface.
FIG. 4 is a conceptual diagram of the masker signal generated by the sound masking device according to the first embodiment being output.
FIG. 5 is a block diagram showing the functional configuration of a sound masking device according to a second embodiment.
FIG. 6 is a block diagram showing the functional configuration of a sound masking device according to a third embodiment.
FIG. 7 is a block diagram showing the functional configuration of a sound masking device according to a fourth embodiment.
FIG. 8 is a block diagram showing the configuration used when accumulating third-party voice signals in the third-party voice signal DB (database) of the sound masking device according to the fourth embodiment.
FIG. 9 is a block diagram showing the functional configuration of a sound masking device according to a fifth embodiment.
FIG. 10 is a block diagram showing the functional configuration of a sound masking device according to a sixth embodiment.

(A) First Embodiment
A first embodiment of the sound processing device, sound processing program, and sound processing method according to the present invention is described in detail below with reference to the drawings. This embodiment describes an example in which the sound processing device, sound processing program, and sound processing method of the present invention are applied to a sound masking device.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the functional configuration of a sound masking device 100 according to the first embodiment.

The sound masking device 100 has a microphone 101, a microphone amplifier 102, an AD converter 103, a speaker 104, a speaker amplifier 105, a DA converter 106, and a sound masking processing unit 200.

The microphone 101 converts air vibrations such as human speech and other sounds into an electrical signal.

The microphone amplifier 102 amplifies the input signal picked up by the microphone 101.

The AD converter 103 converts the input signal amplified by the microphone amplifier 102 from an analog signal to a digital signal. Hereinafter, the signal converted by the AD converter 103 is referred to as the "microphone input signal".

The sound masking processing unit 200 generates a masker signal from the current microphone input signal and past microphone input signals, and outputs it.

The DA converter 106 converts the sound signal output from the sound masking processing unit 200 from a digital signal to an analog signal.

The speaker amplifier 105 amplifies the analog signal.

The speaker 104 converts the electrical signal into air vibrations and outputs them as sound.

Next, the detailed configuration of the sound masking processing unit 200 is described.

The sound masking processing unit 200 has a frame dividing unit 201, an input signal DB (database) 202, a signal selection unit 203, a masker signal generation unit 204, a sound input terminal IN, and a sound output terminal OUT.

The sound input terminal IN is an interface (audio interface) for inputting the microphone input signal to the sound masking processing unit 200.

The frame dividing unit 201 divides the microphone input signal input to the sound masking processing unit 200 into frames of a predetermined length (processing frames) and outputs them. It is sufficient to divide the signal into lengths generally suitable for speech analysis; for example, the microphone input signal is divided into frames of 100 ms to 200 ms.
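The description does not state a sampling rate, so the number of samples per frame is not fixed by the text. As a rough illustration only, assuming a hypothetical 16 kHz sampling rate, the 100 ms to 200 ms frame lengths mentioned above would correspond to the sample counts computed below. The function name `frame_length_samples` and the constant `SAMPLE_RATE_HZ` are illustrative and do not appear in the source.

```python
def frame_length_samples(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the frame length M in samples for a frame lasting frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000


# 16 kHz is an assumed sampling rate, not a value taken from the description.
SAMPLE_RATE_HZ = 16_000

shortest = frame_length_samples(SAMPLE_RATE_HZ, 100)  # 100 ms frame
longest = frame_length_samples(SAMPLE_RATE_HZ, 200)   # 200 ms frame
```

Under this assumption the frame length M would lie between 1600 and 3200 samples; a different sampling rate scales both values proportionally.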

The input signal DB 202 is storage means that accumulates the frame-divided microphone input signal.

The signal selection unit 203 selects, from the past frame-divided microphone input signals accumulated in the input signal DB 202, the signals to be used for generating the masker signal (hereinafter, "masker element signals"), and outputs the selection result.

The masker signal generation unit 204 reads multiple frames of the selected masker element signals from the input signal DB 202, and generates and outputs the masker signal using those frames.

The sound output terminal OUT is an interface (audio interface) that outputs the generated masker signal to the DA converter 106.

The sound masking processing unit 200 may be implemented entirely in hardware (for example, using a dedicated board or a DSP (Digital Signal Processor)), or in software on a computer. For example, it may be configured by installing programs (including the sound processing program according to the embodiment) on a computer having a memory and a processor. In this embodiment, the AD converter 103 and the DA converter 106 are placed outside the sound masking processing unit 200, but they may instead be built into the sound masking processing unit 200.

FIG. 2 shows the configuration used when the sound masking processing unit 200 is implemented in software (on a computer).

The sound masking processing unit 200 shown in FIG. 2 is implemented in software using a computer 300. Programs (including the sound processing program of the embodiment) are installed on the computer 300. The computer 300 may be dedicated to the sound processing program, or may be shared with programs that provide other functions.

The computer 300 shown in FIG. 2 has a processor 301, a primary storage unit 302, a secondary storage unit 303, a sound input terminal IN, and a sound output terminal OUT. The sound input terminal IN and the sound output terminal OUT are the same elements as those shown in FIG. 1.

The primary storage unit 302 is storage means that functions as the working memory of the processor 301; for example, a fast memory such as DRAM (Dynamic Random Access Memory) is used.

The secondary storage unit 303 is storage means that records various data such as the OS (Operating System) and program data (including the data of the sound processing program according to the embodiment); for example, a non-volatile memory such as FLASH memory, an HDD (Hard Disk Drive), or an SSD (Solid State Drive) is used.

In the computer 300 of this embodiment, when the processor 301 starts up, it reads the OS and programs (including the sound processing program according to the embodiment) recorded in the secondary storage unit 303, loads them into the primary storage unit 302, and executes them. The specific configuration of the computer 300 is not limited to that of FIG. 2, and various configurations can be used. For example, if the primary storage unit 302 is a non-volatile memory (for example, FLASH memory), the secondary storage unit 303 may be omitted.

(A-2) Operation of the First Embodiment
Next, the operation of the sound masking device 100 of the first embodiment configured as described above (the sound processing method of the embodiment) is described in detail.

When the sound masking device 100 starts operating and its user (the target speaker U1 in FIG. 3) speaks into the microphone 101, a voice signal is input to the microphone 101.

The analog sound signal input to the microphone 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD converter 103, and input to the sound input terminal IN of the sound masking processing unit 200 as the microphone input signal x(n). Here, n is a parameter representing the discrete time of the input signal.

Once the microphone input signal x(n) begins to arrive at the sound input terminal IN of the sound masking processing unit 200, it is passed to the frame dividing unit 201.

The frame dividing unit 201 divides the microphone input signal x(n) into predetermined units, for example into processing frames according to equation (1).

In equation (1), x_fram(l; m) is the frame-divided microphone input signal, l is the frame number, m is the discrete time within a frame (m = 0, 1, 2, ..., M-1), and M is the frame length. The frame dividing unit 201 outputs the frame-divided microphone input signal x_fram(l; m) to the input signal DB 202.
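Equation (1) itself is not reproduced in this text. One form consistent with the symbol definitions, assuming non-overlapping consecutive frames, would be x_fram(l; m) = x(l·M + m). The sketch below implements that assumed form; the function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions, not details from the source.

```python
def split_into_frames(x, frame_len):
    """Split x(n) into complete frames so that frames[l][m] == x[l * frame_len + m].

    Assumes non-overlapping frames; a trailing partial frame is dropped.
    """
    num_frames = len(x) // frame_len
    return [x[l * frame_len:(l + 1) * frame_len] for l in range(num_frames)]


# Ten samples with M = 4 give two complete frames; samples 8 and 9 are left over.
frames = split_into_frames(list(range(10)), frame_len=4)
```

In a streaming implementation the leftover samples would normally be kept until the next frame is complete rather than discarded.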

The input signal DB 202 accumulates the frame-divided microphone input signal x_fram(l; m) frame by frame according to equations (2) and (3).

In equation (2), DB(i; m) is the input signal DB, i is the database index (i = 0, 1, 2, ..., I-1), m is the time within a frame (m = 0, 1, 2, ..., M-1), M is the frame length, and I is the database length. As shown in equation (3), i is incremented each time data is accumulated in the input signal DB.
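Equations (2) and (3) are not reproduced in this text. A reading consistent with the definitions would be DB(i; m) = x_fram(l; m) followed by an increment of i. Because i is defined over 0 to I-1 and the later selection step uses a remainder on division by I, the sketch below additionally assumes that i wraps around, so the database behaves as a ring buffer. The class name `InputSignalDB` and its attributes are hypothetical.

```python
class InputSignalDB:
    """Fixed-size frame store: DB(i; m) holds one frame per index i."""

    def __init__(self, num_slots, frame_len):
        self.num_slots = num_slots  # I, the database length
        self.frame_len = frame_len  # M, the frame length
        self.slots = [[0.0] * frame_len for _ in range(num_slots)]
        self.i = 0  # index that the next frame will be written to

    def store(self, frame):
        """Write one frame at index i, then increment i (assumed to wrap at I)."""
        if len(frame) != self.frame_len:
            raise ValueError("frame length does not match M")
        self.slots[self.i] = list(frame)
        self.i = (self.i + 1) % self.num_slots


db = InputSignalDB(num_slots=3, frame_len=2)
for frame in ([1, 2], [3, 4], [5, 6], [7, 8]):
    db.store(frame)  # the fourth frame overwrites the oldest one
```

With this convention, index i-1 (modulo I) always holds the most recently stored frame.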

The signal selection unit 203 selects masker element signals from the past frame-divided microphone input signals accumulated in the input signal DB 202. For example, it calculates the selection result T(k) as shown in equation (4).

In equation (4), k (k = 1, 2, ..., K) is a variable, K is the number of masker element signals selected (the number of voice signals added when generating the masker signal), and MOD(i-k, I) is the MOD function, which returns the remainder when i-k is divided by I. Because the remainder on division by I is returned, the selection result T(k) takes a value from 0 to I-1. For example, when K = 5 in equation (4), five frames of the microphone input signal accumulated in the input signal DB 202 are selected.
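From the definitions above, equation (4) appears to be T(k) = MOD(i-k, I) for k = 1, ..., K, although the equation itself is not reproduced in this text. The sketch below computes that selection. It assumes i is the index that will be written next, so that i-1 is the most recent stored frame; the function name `select_recent` is illustrative.

```python
def select_recent(i, num_slots, K):
    """Return [T(1), ..., T(K)] with T(k) = (i - k) mod I.

    Python's % operator already yields a value in 0..I-1 for a positive
    modulus, even when i - k is negative.
    """
    return [(i - k) % num_slots for k in range(1, K + 1)]


# With I = 8 and i = 2, the five most recent slots wrap around the end of the DB.
selection = select_recent(i=2, num_slots=8, K=5)
```

In languages where the remainder of a negative number can be negative (C, Java), the same computation would need an explicit adjustment such as `((i - k) % I + I) % I`.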

Various methods can be used to calculate the selection result T(k). For example, the masker element signals may be selected at random, as shown in equation (5).

In equation (5), rand(k) is a function that generates a non-negative integer random number for a natural number k. Equation (5) uses the MOD function to return the remainder when the random number generated by rand(k) is divided by I, so the selection result T(k) takes a value from 0 to I-1. The signal selection unit 203 outputs the selection result T(k) to the masker signal generation unit 204.
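Equation (5) is not reproduced in this text; from the description it appears to be T(k) = MOD(rand(k), I). A minimal sketch of that random selection follows. The source does not say how rand(k) is realized, so the use of Python's `random.Random` and the 31-bit range of the raw random integer are assumptions, as is the function name `select_random`.

```python
import random


def select_random(num_slots, K, rng):
    """Return [T(1), ..., T(K)], each the remainder of a random integer divided by I."""
    selection = []
    for _ in range(K):
        raw = rng.randrange(2 ** 31)        # stands in for rand(k): non-negative integer
        selection.append(raw % num_slots)   # MOD(rand(k), I) keeps T(k) in 0..I-1
    return selection


rng = random.Random(0)  # seeded only so that the example is repeatable
random_selection = select_random(num_slots=8, K=5, rng=rng)
```

Nothing in this sketch prevents the same index from being drawn twice; whether duplicates are acceptable is not stated in the source.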

Based on the selection result T(k) of the signal selection unit 203, the masker signal generation unit 204 reads K frames of masker element signals from the input signal DB 202, and generates and outputs the masker signal from those K frames. For example, the masker signal is generated by superimposing the K frames of masker element signals that were read, as shown in equation (6).

In equation (6), k (k = 1, 2, ..., K) is a variable, K is the number of masker element signals selected (the number of voice signals added when generating the masker signal), and h(l; m) is the masker signal. For example, when K = 5 in equation (6), the past five frames are read from the input signal DB 202 as masker element signals based on the selection result T(k), and the masker signal h(l; m) is generated by superimposing them.
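Equation (6) is not reproduced in this text. Reading "superimposing" as sample-wise addition, a consistent form would be h(l; m) = Σ_{k=1..K} DB(T(k); m). The sketch below implements that assumed form; the function name `generate_masker` is illustrative, and no gain normalization (for example, division by K) is applied because the source does not mention one.

```python
def generate_masker(db_slots, selection):
    """Sum the selected DB frames sample by sample: h(m) = sum over k of DB(T(k); m)."""
    frame_len = len(db_slots[0])
    return [sum(db_slots[t][m] for t in selection) for m in range(frame_len)]


# Three stored frames of length 2; frames 0 and 2 are selected and added together.
slots = [[1, 2], [10, 20], [100, 200]]
masker = generate_masker(slots, selection=[0, 2])
```

In practice the summed signal can exceed the range of a single frame, so some level control before output would probably be needed.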

Various methods can be used to generate the masker signal h(l; m). For example, as shown in equation (7), the masker signal h(l; m) may be generated by time-reversing the past frame-divided microphone input signals accumulated in the input signal DB 202 and then superimposing them, or by time-delaying those signals and then superimposing them.
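Equation (7) is not reproduced in this text. For the time-reversed variant, one plausible form is h(l; m) = Σ_{k=1..K} DB(T(k); M-1-m). The sketch below shows that variant together with a simple delayed variant. The per-frame delays, the zero padding at the start of a delayed frame, and all function names are assumptions made for illustration.

```python
def superimpose(frames):
    """Add equal-length frames sample by sample."""
    return [sum(f[m] for f in frames) for m in range(len(frames[0]))]


def masker_time_reversed(db_slots, selection):
    """Reverse each selected frame in time, then superimpose."""
    return superimpose([db_slots[t][::-1] for t in selection])


def delay_frame(frame, delay):
    """Shift a frame later by `delay` samples, zero-padding the start and truncating the end."""
    delay = min(delay, len(frame))
    return [0] * delay + list(frame[:len(frame) - delay])


def masker_time_delayed(db_slots, selection, delays):
    """Delay each selected frame by its own number of samples, then superimpose."""
    return superimpose([delay_frame(db_slots[t], d) for t, d in zip(selection, delays)])


slots = [[1, 2, 3], [10, 20, 30]]
reversed_masker = masker_time_reversed(slots, [0, 1])
delayed_masker = masker_time_delayed(slots, [0, 1], delays=[0, 1])
```

Both variants still use only the target speaker's own unmodified spectral content, which is the property the description relies on for the masking effect.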

The masker signal generation unit 204 then outputs the masker signal h(l; m) to the sound output terminal OUT of the sound masking processing unit 200 as the output signal y(n), according to equation (8).
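Equation (8) is not reproduced in this text. If it simply undoes the frame indexing assumed for equation (1), it would read y(l·M + m) = h(l; m). The sketch below concatenates masker frames into a sample stream under that assumption; the function name `frames_to_stream` is illustrative.

```python
def frames_to_stream(masker_frames):
    """Concatenate masker frames so that y[l * M + m] == masker_frames[l][m]."""
    y = []
    for h in masker_frames:
        y.extend(h)
    return y


y = frames_to_stream([[1, 2], [3, 4], [5, 6]])
```

This assumes non-overlapping output frames; an overlap-add scheme would need a different mapping.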

サウンドマスキング処理部２００の音出力端子ＯＵＴから出力される信号は、ＤＡ変換器１０６でデジタル信号からアナログ信号に変換され、スピーカアンプ１０５で増幅されてからスピーカ１０４から出力される。 A signal output from the sound output terminal OUT of the sound masking processing unit 200 is converted from a digital signal to an analog signal by the DA converter 106 , amplified by the speaker amplifier 105 and then output from the speaker 104 .

図３、図４は、マイク１０１と、マイク１０１に向かって発話する対象話者Ｕ１と、対象話者Ｕ１の後ろ側に立っている対象話者Ｕ1以外の人（対象話者Ｕ１の発話する音声をマスカー信号で聞き取りづらくする対象の人（以下、「マスキング対象者」と呼ぶ）Ｕ２と、スピーカ１０４との配置関係（スピーカ１０４の配置構成）の例について示した図である。図３、図４では、スピーカから出力される直接音ＤＳ（ＤｉｒｅｃｔＳｏｕｎｄ）の指向性を点線で図示している。また、図３では、直接音が床ＦＲに反射することにより発生する反射音ＲＳ（ＲｅｆｌｅｃｔｅｄＳｏｕｎｄ）の指向性を一点鎖線で図示している。 3 and 4 show a microphone 101, a target speaker U1 speaking into the microphone 101, and a person other than the target speaker U1 standing behind the target speaker U1 (a person speaking by the target speaker U1). 3A and 3B are diagrams showing an example of the arrangement relationship (arrangement configuration of the speaker 104) between a person U2 whose voice is to be made difficult to hear with the masker signal (hereinafter referred to as a “masking target person”) and the speaker 104. FIG. In Fig. 4, the directivity of the direct sound DS (Direct Sound) output from the speaker is indicated by a dotted line, and in Fig. 3, the reflected sound RS (Reflected Sound) generated by the reflection of the direct sound on the floor FR. Sound) is illustrated by a dashed line.

図３では、スピーカ１０４は、対象話者Ｕ１の前方で膝程度の高さに配置され、スピーカ１０４の振動面（指向性）が下方向で、床ＦＲの表面に対して斜め方向に設置されている。さらに、対象話者Ｕ１の後方の床ＦＲ部分に指向性が向けられた状態となっている。そして、スピーカ１０４から放射されたマスカー信号は図３に示すように、床ＦＲの表面に向けて出力され、床ＦＲに到達すると反射する。これにより、図３に示すようにマスカー信号が反射し、対象話者Ｕ１の後方にいるマスキング対象者Ｕ２にマスカー信号が伝わる。このとき、対象話者Ｕ１が発話する音声の直接音もマスキング対象者Ｕ２に伝わるが、マスカー信号によって、マスクされる。 In FIG. 3, the speaker 104 is placed in front of the target speaker U1 at a knee height, the vibration plane (directivity) of the speaker 104 is directed downward, and the speaker 104 is installed obliquely to the surface of the floor FR. ing. Furthermore, the directivity is directed toward the floor FR portion behind the target speaker U1. Then, as shown in FIG. 3, the masker signal radiated from the speaker 104 is output toward the surface of the floor FR, and is reflected upon reaching the floor FR. As a result, the masker signal is reflected as shown in FIG. 3 and transmitted to the masking target person U2 behind the target speaker U1. At this time, the direct sound of the voice uttered by the target speaker U1 is also transmitted to the masking target U2, but is masked by the masker signal.

なお、スピーカ１０４の設置方法は、対象話者Ｕ１にマスカー信号が聞こえないように設置し、且つマスキング対象者Ｕ２にマスカー信号が聞こえるように設置できれば種々の設置方法を広く適用することができる。例えば、図４の（ａ）に示しているように、対象話者Ｕ１の後ろに設置できるスペースがあれば、直接スピーカ１０４の振動面をマスキング対象者Ｕ２に直接向けてマスカー信号を出力するようにしても良いし、図４の（ｂ）に示しているように、床ＦＲにスピーカ１０４を埋め込んで直接スピーカ１０４の振動面をマスキング対象者Ｕ２に直接向けてマスカー信号を出力するようにしても良いし、図４の（ｃ）に示しているように、天井ＣＥにスピーカ１０４を設置して直接スピーカ１０４の振動面をマスキング対象者Ｕ２に直接向けてマスカー信号を出力するようにしても良い。 A wide variety of installation methods can be applied to the speaker 104, provided that the speaker is installed so that the target speaker U1 cannot hear the masker signal and the masking target person U2 can hear it. For example, as shown in (a) of FIG. 4, if there is space to install the speaker 104 behind the target speaker U1, the vibration surface of the speaker 104 may be pointed directly at the masking target person U2 to output the masker signal. Alternatively, as shown in (b) of FIG. 4, the speaker 104 may be embedded in the floor FR with its vibration surface pointed directly at the masking target person U2, or, as shown in (c) of FIG. 4, the speaker 104 may be installed on the ceiling CE with its vibration surface pointed directly at the masking target person U2.

（Ａ－３）第１の実施形態の効果
第１の実施形態によれば、以下のような効果を奏することができる。 (A-3) Effects of First Embodiment According to the first embodiment, the following effects can be obtained.

第１の実施形態のサウンドマスキング装置１００は、対象話者Ｕ１の音声を入力信号ＤＢに蓄積し、入力信号ＤＢに蓄積されている過去のフレーム分割されたマイク入力信号を複数フレーム使用してマスカー信号を生成し、出力する。これにより、第１の実施形態のサウンドマスキング装置１００では、マスカー信号の音響特徴が対象話者Ｕ１の音声の音響特徴により近くなることから、マスキング効果が向上し、会話の内容が漏れることを防ぐことができる。言い換えると、第１の実施形態のサウンドマスキング装置１００では、入力信号ＤＢに蓄積されている対象話者Ｕ１の音声信号を用いてマスカー信号を生成することで、対象話者Ｕ１の音響特性の解析を行わなくても、マスカー信号の音響特徴が対象話者Ｕ１の音声信号の音響特徴に近いので、高いマスキング効果が得られる。 The sound masking device 100 of the first embodiment accumulates the voice of the target speaker U1 in the input signal DB, generates a masker signal from a plurality of frames of the past frame-divided microphone input signals accumulated in the input signal DB, and outputs it. As a result, the acoustic features of the masker signal become closer to those of the voice of the target speaker U1, which improves the masking effect and prevents the content of the conversation from leaking. In other words, because the sound masking device 100 of the first embodiment generates the masker signal from the voice signal of the target speaker U1 accumulated in the input signal DB, the acoustic features of the masker signal are close to those of the voice signal of the target speaker U1 even without analyzing the acoustic characteristics of the target speaker U1, so a high masking effect is obtained.

さらに、第１の実施形態のサウンドマスキング装置１００は、マスカー信号を再生するスピーカを、対象話者Ｕ1にマスカー信号が聞こえないように設置し、且つマスキング対象者Ｕ２にマスカー信号が聞こえるように設置することで、対象話者Ｕ1の会話を妨害せずに対象話者Ｕ1の発話する音声をマスキングすることができる。 Furthermore, in the sound masking apparatus 100 of the first embodiment, the speaker for reproducing the masker signal is installed so that the target speaker U1 cannot hear the masker signal and the masker target U2 can hear the masker signal. By doing so, the voice uttered by the target speaker U1 can be masked without interfering with the conversation of the target speaker U1.

（Ｂ）第２の実施形態
以下、本発明による音響処理装置、音響処理プログラム及び音響処理方法の第２の実施形態を、図面を参照しながら詳述する。この実施形態では、本発明の音響処理装置、音響処理プログラム及び音響処理方法を、サウンドマスキング装置に適用した例について説明する。 (B) Second Embodiment Hereinafter, a second embodiment of the sound processing device, sound processing program, and sound processing method according to the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device will be described.

（Ｂ－１）第２の実施形態の構成
図５は、第２の実施形態に係るサウンドマスキング装置１００Ａの機能的構成について示したブロック図である。図５では、図１と同一部分又は対応部分には、同一符号又は対応符号を付している。 (B-1) Configuration of Second Embodiment FIG. 5 is a block diagram showing the functional configuration of a sound masking device 100A according to the second embodiment. In FIG. 5, parts that are the same as or correspond to those in FIG. 1 are given the same or corresponding reference numerals.

以下では、第２の実施形態について、第１の実施形態との差異を中心に説明し、第１の実施形態と重複する部分については説明を省略する。 In the following, the second embodiment will be described with a focus on differences from the first embodiment, and descriptions of portions that overlap with the first embodiment will be omitted.

第２の実施形態のサウンドマスキング装置１００Ａでは、サウンドマスキング処理部２００がサウンドマスキング処理部２００Ａに置き換わっている点で、第１の実施形態と異なっている。サウンドマスキング処理部２００Ａでは、マスカー信号生成部２０４が、マスカー信号生成部２０４Ａに置き換わり、さらに、音声区間判定部２０５とＤＢ蓄積判定部２０６が追加されている点で、第１の実施形態と異なっている。 The sound masking device 100A of the second embodiment differs from the first embodiment in that the sound masking processing section 200 is replaced with a sound masking processing section 200A. The sound masking processing unit 200A differs from the first embodiment in that the masker signal generation unit 204 is replaced with a masker signal generation unit 204A, and a speech section determination unit 205 and a DB accumulation determination unit 206 are added. ing.

第２の実施形態のサウンドマスキング装置１００Ａのサウンドマスキング処理部２００Ａでは、音声区間判定部２０５とＤＢ蓄積判定部２０６が増えたことにより入力信号ＤＢに蓄積されるフレーム分割されたマイク入力信号とマスカー信号の生成方法が異なる点と、マスカー信号生成部２０４Ａになったことによりフレーム分割されたマイク入力信号の蓄積方法やマスカー信号の生成方法が異なる点が第１の実施形態のサウンドマスキング装置１００と異なる。 The sound masking processing unit 200A of the sound masking device 100A of the second embodiment differs from the sound masking device 100 of the first embodiment in two respects. First, the addition of the voice segment determination unit 205 and the DB accumulation determination unit 206 changes which frame-divided microphone input signals are accumulated in the input signal DB and how the masker signal is generated. Second, the use of the masker signal generation unit 204A changes the method of accumulating the frame-divided microphone input signals and the method of generating the masker signal.

音声区間判定部２０５は、フレーム分割されたマイク入力信号が音声区間か非音声区間（音声区間以外の区間）かを判定し、判定結果を出力する。 A voice segment determination unit 205 determines whether the frame-divided microphone input signal is a voice segment or a non-voice segment (a segment other than a voice segment), and outputs the determination result.

ＤＢ蓄積判定部２０６は、音声区間判定部２０５の音声区間判定の結果を基に、フレーム分割されたマイク入力信号が音声区間と判定された場合、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２に出力し、非音声区間と判定された場合、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２に出力しない。 If the frame-divided microphone input signal is determined to be in a voice segment based on the voice segment determination result of the voice segment determination unit 205, the DB accumulation determination unit 206 stores the frame-divided microphone input signal in the input signal DB 202. If it is determined as a non-speech section, the frame-divided microphone input signal is not output to the input signal DB 202 .

マスカー信号生成部２０４Ａは、音声区間判定の結果と選択結果を基に、選択されたマスカー素辺信号を入力信号ＤＢ２０２から複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成し出力する。 The masker signal generation unit 204A reads a plurality of frames of the selected masker side signal from the input signal DB 202 based on the result of the speech section determination and the selection result, and generates a masker signal from the read masker side signals of the plurality of frames. Generate and output.

（Ｂ－２）第２の実施形態の動作
次に、以上のような構成を有する第２の実施形態におけるサウンドマスキング装置１００Ａの動作（実施形態に係る音響処理方法）について詳細に説明する。 (B-2) Operation of Second Embodiment Next, the operation of the sound masking device 100A (acoustic processing method according to the embodiment) according to the second embodiment having the configuration described above will be described in detail.

第２の実施形態に係るサウンドマスキング装置１００Ａにおけるサウンドマスキング処理の基本的な動作は、第１の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100A according to the second embodiment is the same as the sound masking process described in the first embodiment.

以下では、第１の実施形態と異なる点である音声区間判定部２０５、ＤＢ蓄積判定部２０６、マスカー信号生成部２０４Ａにおける処理動作を中心に詳細に説明する。 In the following, a detailed description will be given centering on the processing operations in the speech section determination unit 205, the DB accumulation determination unit 206, and the masker signal generation unit 204A, which are different from the first embodiment.

フレーム分割部２０１は、マイク入力信号ｘ（ｎ）を処理フレームごとに分割し、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を音声区間判定部２０５とＤＢ蓄積判定部２０６に出力する。 Frame dividing section 201 divides microphone input signal x(n) into processing frames, and outputs frame-divided microphone input signal x_fram(l;m) to voice section determining section 205 and DB accumulation determining section 206 .

音声区間判定部２０５は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を用いて、音声区間か非音声区間かを判定する。音声区間か非音声区間かの判定手段は、例えば、（９）式と（１０）式に従い判定する。 The voice segment determination unit 205 uses the frame-divided microphone input signal x_fram(l;m) to determine whether it is a voice segment or a non-voice segment. The means for judging whether it is a speech segment or a non-speech segment makes a determination according to the equations (9) and (10), for example.

（９）式と（１０）式で、ｘ＿ｆｒａｍ（ｌ；ｍ）はフレーム分割したマイク入力信号、ｘ＿ｆｒａｍ＿ａｍｐ（ｌ）はフレーム分割したマイク入力信号の平均振幅値、ＶＡＤ（ｌ）は音声区間判定結果、ＴＨは音声区間の判定に用いられる閾値である。

In equations (9) and (10), x_fram(l;m) is the frame-divided microphone input signal, x_fram_amp(l) is the average amplitude of the frame-divided microphone input signal, VAD(l) is the voice segment determination result, and TH is the threshold used for the voice segment determination.

（９）式は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）の平均振幅値ｘ＿ｆｒａｍ＿ａｍｐ（ｌ）を求める式である。（１０）式は、（９）式で求めたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）の平均振幅値ｘ＿ｆｒａｍ＿ａｍｐ（ｌ）が閾値ＴＨより値が大きければ音声区間と判定し音声区間判定結果ＶＡＤ（ｌ）に１を代入し、閾値ＴＨより値が小さければ非音声区間と判定し音声区間判定結果ＶＡＤ（ｌ）に０を代入するという式である。 Formula (9) is a formula for obtaining the average amplitude value x_fram_amp(l) of the frame-divided microphone input signal x_fram(l;m). Equation (10) determines that it is a voice segment if the average amplitude value x_fram_amp(l) of the frame-divided microphone input signal x_fram(l;m) obtained by Equation (9) is greater than the threshold value TH. 1 is substituted for VAD(l), and if the value is smaller than the threshold value TH, it is determined as a non-speech section and 0 is substituted for the speech section determination result VAD(l).
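Equations (9) and (10) themselves are not reproduced in this text, so the following Python sketch only illustrates the behavior described above. It assumes that the "average amplitude" is the mean of the absolute sample values in the frame, and it treats the case where the average equals TH (which the description does not specify) as a non-voice segment. The function names are hypothetical.

```python
def frame_average_amplitude(frame):
    """Assumed form of equation (9): mean absolute amplitude of one frame."""
    return sum(abs(sample) for sample in frame) / len(frame)


def vad(frame, th):
    """Behavior described for equation (10): 1 for a voice segment, else 0.

    The description only covers "greater than TH" and "smaller than TH";
    equality is treated as non-voice here as an assumption.
    """
    return 1 if frame_average_amplitude(frame) > th else 0


if __name__ == "__main__":
    loud = [0.5, -0.5, 0.5, -0.5]
    quiet = [0.01, -0.01, 0.01, -0.01]
    print(vad(loud, th=0.1), vad(quiet, th=0.1))
```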

閾値ＴＨは、音声の有無を判定できれば良く、種々の方法を広く適用することができ、例えば、（１１）式に示すように、サウンドマスキング装置１００Ａが動作し始めた最初の数フレームを無音区間とし、その最初の数フレームの平均振幅値を閾値ＴＨとして使用する固定の閾値ＴＨを用いても良いし、（１２）式に示すように、ｘ＿ｆｒａｍ＿ａｍｐ（ｌ）に時定数フィルタを用いてフレーム毎に変動する閾値ＴＨ（ｌ）を用いても良い。

The threshold TH only needs to allow the presence or absence of voice to be determined, and a wide variety of methods can be applied. For example, as shown in equation (11), the first several frames after the sound masking device 100A starts operating may be treated as a silent segment, and the average amplitude of those frames may be used as a fixed threshold TH. Alternatively, as shown in equation (12), a threshold TH(l) that varies from frame to frame may be obtained by applying a time constant filter to x_fram_amp(l).

（１２）式で、ａは時定数フィルタの係数であり、０以上、１以下の値となる。（１２）式において、閾値の更新を遅くしたい場合ａは１に近い値が望ましく（例えばａ＝０．９等の値）、閾値の更新を速くしたい場合ａは０に近い値が望ましい（例えばａ＝０．１等の値）。 In equation (12), a is the coefficient of the time constant filter and takes a value from 0 to 1 inclusive. A value of a close to 1 (for example, a = 0.9) is desirable when the threshold should be updated slowly, and a value of a close to 0 (for example, a = 0.1) is desirable when the threshold should be updated quickly.
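Equations (11) and (12) are not reproduced in this text. The sketch below shows one form that is consistent with the description: a fixed threshold taken as the mean of the average amplitudes of the first few frames, and a first-order recursive (time constant) filter TH(l) = a·TH(l−1) + (1−a)·x_fram_amp(l). The recursive form is an assumption chosen so that a close to 1 gives a slow update and a close to 0 gives a fast update; the patent's actual equations may differ, for example by including a scaling margin. Function names are hypothetical.

```python
def fixed_threshold(initial_frame_amps):
    """Assumed form of equation (11): mean of x_fram_amp over the first frames,
    which are treated as silence."""
    return sum(initial_frame_amps) / len(initial_frame_amps)


def update_threshold(prev_th, frame_amp, a):
    """Assumed form of equation (12): first-order time constant filter.

    a must lie in [0, 1]; a near 1 keeps most of the previous threshold
    (slow update), a near 0 follows the current frame (fast update).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between 0 and 1 inclusive")
    return a * prev_th + (1.0 - a) * frame_amp


if __name__ == "__main__":
    th = fixed_threshold([0.01, 0.02, 0.03])
    for amp in [0.02, 0.02, 0.5]:
        th = update_threshold(th, amp, a=0.9)
        print(round(th, 4))
```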

なお、音声区間か非音声区間かの判定の手段は、種々の方法を広く適用することができ、例えば、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）の自己相関を求めて音声区間か非音声区間か求める等の方法で判定しても良い。音声区間判定部２０５は、音声区間か非音声区間かの判定結果をＤＢ蓄積判定部２０６とマスカー信号生成部２０４Ａに出力する。 A wide variety of methods can be applied to determine whether a frame is a voice segment or a non-voice segment; for example, the determination may be made by computing the autocorrelation of the frame-divided microphone input signal x_fram(l;m). The voice segment determination unit 205 outputs the determination result to the DB accumulation determination unit 206 and the masker signal generation unit 204A.

ＤＢ蓄積判定部２０６は、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたとき（ＶＡＤ(ｌ)＝１のとき）のみ、フレーム分割部２０１から出力されたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を、入力信号ＤＢ２０２に出力し、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が非音声区間と判定されたとき（ＶＡＤ(ｌ)＝０のとき）は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を出力しない。 Only when the voice segment determination unit 205 determines that the frame-divided microphone input signal x_fram(l;m) is a voice segment (VAD(l) = 1) does the DB accumulation determination unit 206 output the frame-divided microphone input signal x_fram(l;m) received from the frame dividing unit 201 to the input signal DB 202. When the voice segment determination unit 205 determines that x_fram(l;m) is a non-voice segment (VAD(l) = 0), the DB accumulation determination unit 206 does not output it.

マスカー信号生成部２０４Ａは、音声区間判定部２０５の音声区間判定結果ＶＡＤ(ｌ)と信号選択部２０３の選択結果Ｔ（ｋ）を基に、選択されたマスカー素辺信号を入力信号ＤＢ２０２から複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成し出力する。マスカー信号生成部２０４Ａは、（６）式と（１３）式に従い、マスカー信号を出力する。 The masker signal generation unit 204A extracts a plurality of selected masker edge signals from the input signal DB 202 based on the speech interval determination result VAD(l) of the speech interval determination unit 205 and the selection result T(k) of the signal selection unit 203. A frame is read, and a masker signal is generated and output from the read masker edge signals of a plurality of frames. The masker signal generation unit 204A outputs a masker signal according to formulas (6) and (13).

（１３）式で、ｈａ（ｌ；ｍ）はマスカー信号生成部２０４Ａで生成されるマスカー信号である。（１３）式は、音声区間判定部２０５で、マイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたとき（ＶＡＤ(ｌ)＝１のとき）のみ、信号選択部２０３の選択結果Ｔ（ｋ）を用いてマスカー素辺信号を入力信号ＤＢ２０２から複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号を使用してマスカー信号ｈ（ｌ；ｍ）を生成しｈａ（ｌ；ｍ）に代入し、マイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が非音声区間と判定されたとき（ＶＡＤ(ｌ)≠１のとき）は、ｈａ（ｌ；ｍ）に無音を代入する。 In equation (13), ha(l;m) is the masker signal generated by the masker signal generation unit 204A. According to equation (13), only when the voice segment determination unit 205 determines that the microphone input signal x_fram(l;m) is a voice segment (VAD(l) = 1), a plurality of frames of masker side signals are read from the input signal DB 202 using the selection result T(k) of the signal selection unit 203, the masker signal h(l;m) is generated from those frames, and h(l;m) is assigned to ha(l;m). When x_fram(l;m) is determined to be a non-voice segment (VAD(l) ≠ 1), silence is assigned to ha(l;m).
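The gating performed by the DB accumulation determination unit 206 and by equation (13) can be sketched as a single per-frame step. This is a minimal illustration under several assumptions that the text does not confirm: the input signal DB is modeled as a bounded `deque`, the current frame is stored before the masker is generated, the selection T(k) is simplified to "the most recent K stored frames", and h(l;m) is formed by plain summation without scaling. The function name `process_frame` is hypothetical.

```python
from collections import deque


def process_frame(frame, vad_flag, db, k_select):
    """One frame of the second-embodiment flow (illustrative only).

    Non-voice frames are neither stored nor masked: the masker is silence.
    Voice frames are stored, then the masker is the sample-wise sum of the
    most recent k_select stored frames.
    """
    frame_len = len(frame)
    if vad_flag != 1:
        return [0.0] * frame_len           # ha(l;m) = silence
    db.append(list(frame))                 # accumulate only voice frames
    chosen = list(db)[-k_select:]          # simplified stand-in for T(k)
    return [sum(f[m] for f in chosen) for m in range(frame_len)]


if __name__ == "__main__":
    db = deque(maxlen=4)                   # database length I = 4 (assumed)
    print(process_frame([1.0, 2.0], 0, db, k_select=2))
    print(process_frame([1.0, 2.0], 1, db, k_select=2))
    print(process_frame([3.0, 4.0], 1, db, k_select=2))
```

Because silence is returned whenever VAD(l) is not 1, the masker is emitted only while voice is being input, which matches the effect stated for the second embodiment.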

マスカー信号生成部２０４Ａは、（１４）式に従い、出力信号ｙ（ｎ）を音出力端子ＯＵＴに出力する。

The masker signal generation unit 204A outputs the output signal y(n) to the sound output terminal OUT in accordance with equation (14).

（Ｂ－３）第２の実施形態の効果
第２の実施形態によれば、以下のような効果を奏することができる。 (B-3) Effects of Second Embodiment According to the second embodiment, the following effects can be obtained.

第２の実施形態のサウンドマスキング装置１００Ａでは、音声区間と判定されたときのみ対象話者Ｕ１の音声を入力信号ＤＢ２０２に蓄積することで、対象話者Ｕ１の音声とは関係のない雑音が入力信号ＤＢ２０２に蓄積されてマスカー素辺信号として選択されることがなくなるので、対象話者Ｕ１の音声のみでマスカー信号を生成することができ、高いマスキング効果を維持できる。 In the sound masking device 100A of the second embodiment, by accumulating the speech of the target speaker U1 in the input signal DB 202 only when it is determined to be in the speech period, noise unrelated to the speech of the target speaker U1 is input. Since it is no longer stored in the signal DB 202 and selected as a masker side signal, it is possible to generate a masker signal using only the voice of the target speaker U1, thereby maintaining a high masking effect.

また、第２の実施形態のサウンドマスキング装置１００Ａでは、音声区間と判定されたときのみ、入力信号ＤＢに蓄積されている過去のフレーム分割されたマイク入力信号を複数フレーム使用してマスカー信号を生成し、出力している。これにより、音声が入力されたときのみマスカー信号が出力されるように構成することができる。 Further, in the sound masking device 100A of the second embodiment, a masker signal is generated using a plurality of frames of past frame-divided microphone input signals stored in the input signal DB only when it is determined to be in a speech period. and output. Thereby, it is possible to configure so that the masker signal is output only when the voice is input.

（Ｃ）第３の実施形態
以下、本発明による音響処理装置、音響処理プログラム及び音響処理方法の第３の実施形態を、図面を参照しながら詳述する。この実施形態では、本発明の音響処理装置、音響処理プログラム及び音響処理方法を、サウンドマスキング装置に適用した例について説明する。 (C) Third Embodiment Hereinafter, a third embodiment of the sound processing device, sound processing program and sound processing method according to the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device will be described.

（Ｃ－１）第３の実施形態の構成
図６は、第３の実施形態に係るサウンドマスキング装置１００Ｂの機能的構成について示したブロック図である。図６では、上述の図５と同一部分又は対応部分には、同一符号又は対応符号を付している。 (C-1) Configuration of Third Embodiment FIG. 6 is a block diagram showing the functional configuration of a sound masking device 100B according to the third embodiment. In FIG. 6, the same reference numerals or corresponding reference numerals are given to the same or corresponding portions as those in FIG.

以下では、第３の実施形態について、第１、及び第２の実施形態との差異を中心に説明し、第１と第２の実施形態と重複する部分については説明を省略する。 In the following, the third embodiment will be described with a focus on differences from the first and second embodiments, and descriptions of portions that overlap with the first and second embodiments will be omitted.

第３の実施形態のサウンドマスキング装置１００Ｂでは、サウンドマスキング処理部２００Ａがサウンドマスキング処理部２００Ｂに置き換わっている点で、第２の実施形態と異なっている。 A sound masking device 100B of the third embodiment differs from that of the second embodiment in that the sound masking processing section 200A is replaced with a sound masking processing section 200B.

サウンドマスキング処理部２００Ｂでは、入力信号ＤＢ２０２と信号選択部２０３とマスカー信号生成部２０４Ａが、それぞれ入力信号ＤＢ２０２Ａと信号選択部２０３Ａとマスカー信号生成部２０４Ｂに置き換わり、さらに、ピッチ推定部２０７とクラス判定部２０８が追加されている点で、第２の実施形態と異なっている。 The sound masking processing unit 200B differs from the second embodiment in that the input signal DB 202, the signal selection unit 203, and the masker signal generation unit 204A are replaced with the input signal DB 202A, the signal selection unit 203A, and the masker signal generation unit 204B, respectively, and in that a pitch estimation unit 207 and a class determination unit 208 are added.

第３の実施形態のサウンドマスキング装置１００Ｂでは、ピッチ推定部２０７とクラス判定部２０８が増えたことにより、フレーム分割されたマイク入力信号のピッチ推定、フレーム分割されたマイク入力信号の蓄積方法、マスカー信号の生成に使用する信号を選択する方法、マスカー信号の生成方法が異なる点が第２の実施形態と異なる。 Because the pitch estimation unit 207 and the class determination unit 208 are added, the sound masking device 100B of the third embodiment differs from the second embodiment in that it estimates the pitch of the frame-divided microphone input signal, and in its method of accumulating the frame-divided microphone input signals, its method of selecting the signals used to generate the masker signal, and its method of generating the masker signal.

ピッチ推定部２０７は、フレーム分割されたマイク入力信号と音声区間判定の結果からフレーム分割されたマイク入力信号のピッチ（音声の高さ）を推定し、推定したピッチを出力する。 The pitch estimator 207 estimates the pitch (speech pitch) of the frame-divided microphone input signal from the frame-divided microphone input signal and the voice segment determination result, and outputs the estimated pitch.

クラス判定部２０８は、ピッチ推定部２０７で推定したピッチの結果を基に、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積すると判定された場合にのみ、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａのピッチに応じたクラスに出力し、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積しないと判定された場合、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａのピッチに応じたクラスに出力しない。 Based on the pitch estimated by the pitch estimation unit 207, the class determination unit 208 outputs the frame-divided microphone input signal to the class of the input signal DB 202A that corresponds to the pitch only when it determines that the signal is to be accumulated in the input signal DB 202A. When it determines that the signal is not to be accumulated, it does not output the frame-divided microphone input signal to any class of the input signal DB 202A.

入力信号ＤＢ２０２Ａは、フレーム分割したマイク入力信号をピッチに応じたクラスごとに蓄積する記憶手段である。 The input signal DB 202A is storage means for accumulating frame-divided microphone input signals for each class corresponding to the pitch.

信号選択部２０３Ａは、クラスごとに蓄積されている過去のフレーム分割したマイク入力信号から、マスカー素辺信号を選択し、選択結果を出力する。 The signal selection unit 203A selects a masker bare edge signal from past frame-divided microphone input signals accumulated for each class, and outputs a selection result.

マスカー信号生成部２０４Ｂは、音声区間判定とピッチ推定の結果と選択結果を基に、選択されたマスカー素辺信号を入力信号ＤＢ２０２Ａのピッチに応じたクラスから複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成して出力する。 The masker signal generation unit 204B reads out a plurality of frames of the selected masker edge signal from the class corresponding to the pitch of the input signal DB 202A based on the result of the voice section determination and pitch estimation and the selection result, and converts the read-out plurality of frames. A masker signal is generated from the masker bare edge signal and output.

なお、第３の実施形態において、第１の実施形態と同様に音声区間判定部２０５を除外した構成としても良い。 In addition, in the third embodiment, the configuration may be such that the speech segment determination unit 205 is excluded, as in the first embodiment.

（Ｃ－２）第３の実施形態の動作
次に、以上のような構成を有する第３の実施形態におけるサウンドマスキング装置１００Ｂの動作（実施形態に係る音響処理方法）について詳細に説明する。 (C-2) Operation of Third Embodiment Next, the operation of the sound masking device 100B (acoustic processing method according to the embodiment) of the third embodiment having the configuration described above will be described in detail.

第３の実施形態に係るサウンドマスキング装置１００Ｂにおけるサウンドマスキング処理の基本的な動作は、第１、及び第２の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100B according to the third embodiment is the same as the sound masking process described in the first and second embodiments.

以下では、第２の実施形態と異なる点であるピッチ推定部２０７、クラス判定部２０８、入力信号ＤＢ２０２Ａ、信号選択部２０３Ａ、マスカー信号生成部２０４Ｂにおける処理動作を中心に詳細に説明する。 In the following, the processing operations of the pitch estimation unit 207, class determination unit 208, input signal DB 202A, signal selection unit 203A, and masker signal generation unit 204B, which are different from the second embodiment, will be described in detail.

フレーム分割部２０１は、マイク入力信号ｘ（ｎ）を処理フレームごとに分割し、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を音声区間判定部２０５、ＤＢ蓄積判定部２０６、ピッチ推定部２０７に出力する。 The frame dividing unit 201 divides the microphone input signal x(n) into processing frames, and divides the frame-divided microphone input signal x_fram(l;m) into the speech period determining unit 205, the DB accumulation determining unit 206, and the pitch estimating unit 207. output to

音声区間判定部２０５は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を用いて、音声区間か非音声区間かを判定し、音声区間か非音声区間かの判定結果をＤＢ蓄積判定部２０６、ピッチ推定部２０７、マスカー信号生成部２０４Ｂに出力する。 The speech section determination unit 205 uses the frame-divided microphone input signal x_fram(l;m) to determine whether it is a speech section or a non-speech section. , pitch estimation section 207 and masker signal generation section 204B.

ＤＢ蓄積判定部２０６は、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたときのみ、フレーム分割部２０１から出力されたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を、クラス判定部２０８、信号選択部２０３Ａ、マスカー信号生成部２０４Ｂに出力し、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が非音声区間と判定されたときは、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を出力しない。 Only when the voice segment determining unit 205 determines that the frame-divided microphone input signal x_fram(l;m) is a voice segment, the DB accumulation determining unit 206 performs frame-divided microphone input signal x_fram output from the frame dividing unit 201. (l; m) is output to the class determination unit 208, the signal selection unit 203A, and the masker signal generation unit 204B. , the frame-divided microphone input signal x_fram(l;m) is not output.

ピッチ推定部２０７は、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたときのみ、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）のピッチを推定する。ピッチの推定手段は、例えば、（１５）式に従い、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）の自己相関関数ｘ＿ｆｒａｍ＿ｃｏｒｒ（ｌ）を求め、（１６）式に従い自己相関関数ｘ＿ｆｒａｍ＿ｃｏｒｒ（ｌ）を用いて推定するようにしても良い。

The pitch estimation unit 207 estimates the pitch of the frame-divided microphone input signal x_fram(l;m) only when the voice segment determination unit 205 determines that x_fram(l;m) is a voice segment. As one example of pitch estimation, the autocorrelation function x_fram_corr(l) of the frame-divided microphone input signal x_fram(l;m) may be obtained according to equation (15), and the pitch may then be estimated from x_fram_corr(l) according to equation (16).

（１６）式で、ｐｉｔｃｈ（ｌ）は推定したピッチ、ｆｓはサンプリング周波数である。ピッチの推定手法は、種々の方法を広く適用することができ、例えば、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を離散フーリエ変換や高速フーリエ変換を行ってからケプストラム分析を行い、ピッチを算出しても良い。ピッチ推定部２０７は、推定したピッチｐｉｔｃｈ（ｌ）をクラス判定部２０８とマスカー信号生成部２０４Ｂに出力する。 In equation (16), pitch(l) is the estimated pitch and fs is the sampling frequency. A wide variety of methods can be applied to pitch estimation; for example, the pitch may be calculated by applying a discrete Fourier transform or a fast Fourier transform to the frame-divided microphone input signal x_fram(l;m) and then performing cepstrum analysis. The pitch estimation unit 207 outputs the estimated pitch pitch(l) to the class determination unit 208 and the masker signal generation unit 204B.
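Equations (15) and (16) are not reproduced in this text. The sketch below illustrates a standard autocorrelation pitch estimator that fits the description: compute the autocorrelation of the frame over a range of lags, take the lag with the largest value, and convert it to a frequency as fs divided by that lag. The lag search range (derived here from hypothetical `f_min` and `f_max` parameters) and the unnormalized autocorrelation are assumptions; the patent's equations may differ in detail.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of one frame at the given lag
    (assumed form of equation (15))."""
    return sum(frame[m] * frame[m + lag] for m in range(len(frame) - lag))


def estimate_pitch(frame, fs, f_min=100.0, f_max=500.0):
    """Assumed form of equation (16): pitch = fs / (lag of the autocorrelation peak).

    f_min and f_max bound the lag search and are illustrative parameters.
    The frame must be longer than fs / f_min samples for the full range.
    """
    lag_min = max(1, int(fs / f_max))
    lag_max = min(len(frame) - 1, int(fs / f_min))
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return fs / best_lag


if __name__ == "__main__":
    fs = 8000
    tone = [math.sin(2 * math.pi * 200 * m / fs) for m in range(400)]
    print(estimate_pitch(tone, fs))
```

The brute-force lag loop is quadratic in the frame length; it is kept here for readability rather than speed.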

クラス判定部２０８は、ピッチ推定部２０７で推定したピッチを基に、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積するか蓄積しないかを判定する。クラス判定部２０８において、入力信号ＤＢ２０２Ａに蓄積するか蓄積しないかの判定手法については限定されないものである。例えば、ピッチ推定部２０７で推定したピッチｐｉｔｃｈ（ｌ）が、１００Ｈｚ以下、１０１Ｈｚ～２００Ｈｚ、２０１Ｈｚ～３００Ｈｚ、３０１Ｈｚ～４００Ｈｚ、４０１Ｈｚ～５００Ｈｚ、５００Ｈｚ以上のように１００Ｈｚの間隔（グリッド）でクラス分けする。そして、１００Ｈｚ以下、又は５００Ｈｚ以上の時、入力信号ＤＢ２０２Ａに蓄積しないと判定し、それ以外のときは入力信号ＤＢ２０２Ａに蓄積すると判定するようにしても良い。また、例えば、入力信号ＤＢ２０２Ａでは、周波数があがるほどクラスの周波数間隔（グリッド）を広げるようにしても良い。 Based on the pitch estimated by the pitch estimation unit 207, the class determination unit 208 determines whether or not to accumulate the frame-divided microphone input signal in the input signal DB 202A. The method by which the class determination unit 208 makes this determination is not limited. For example, the pitch pitch(l) estimated by the pitch estimation unit 207 may be divided into classes at 100 Hz intervals (a grid): 100 Hz or less, 101 Hz to 200 Hz, 201 Hz to 300 Hz, 301 Hz to 400 Hz, 401 Hz to 500 Hz, and 500 Hz or more. The signal may then be determined not to be accumulated in the input signal DB 202A when the pitch is 100 Hz or less or 500 Hz or more, and to be accumulated otherwise. As another example, the frequency interval (grid) of the classes in the input signal DB 202A may be widened as the frequency increases.
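The 100 Hz grid example above can be sketched as a small mapping from estimated pitch to a class index. The function name `pitch_class`, the zero-based class numbering, and the handling of boundary values are assumptions: the text lists both "401 Hz to 500 Hz" and "500 Hz or more", so this sketch follows the stated rejection rule and treats exactly 500 Hz as not stored, and it assigns non-integer pitches such as 200.5 Hz to the next class up.

```python
import math


def pitch_class(pitch_hz, grid_hz=100.0, low_hz=100.0, high_hz=500.0):
    """Return a zero-based class index, or None if the frame is not stored.

    With the defaults: (100, 200] -> 0, (200, 300] -> 1,
    (300, 400] -> 2, (400, 500) -> 3; <= 100 Hz or >= 500 Hz -> None.
    """
    if pitch_hz <= low_hz or pitch_hz >= high_hz:
        return None
    return math.ceil((pitch_hz - low_hz) / grid_hz) - 1


if __name__ == "__main__":
    for p in [80.0, 150.0, 200.0, 201.0, 450.0, 500.0]:
        print(p, pitch_class(p))
```

A grid that widens with frequency, as the last sentence of the paragraph suggests, could be obtained by replacing the uniform division with a lookup over a list of class edges.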

クラス判定部２０８は、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積すると判定された場合にのみ、フレーム分割されたマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を入力信号ＤＢ２０２Ａのピッチに応じたクラスに出力し、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積しないと判定された場合、フレーム分割されたマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を入力信号ＤＢ２０２Ａのピッチに応じたクラスに出力しない。 Only when it is determined that the frame-divided microphone input signal is stored in the input signal DB 202A, the class determination unit 208 classifies the frame-divided microphone input signal x_fram(l;m) into a class according to the pitch of the input signal DB 202A. , and if it is determined not to store the frame-divided microphone input signal in the input signal DB 202A, the frame-divided microphone input signal x_fram(l;m) is not output to the class corresponding to the pitch of the input signal DB 202A. .

入力信号ＤＢ２０２Ａは、クラス判定部２０８からマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が出力されたときのみ、出力されたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を（１７）式と（１８）式に従い、ピッチに応じたクラスごとに入力信号ＤＢ２０２Ａに蓄積する。

Only when the class determination unit 208 outputs the microphone input signal x_fram(l;m) does the input signal DB 202A accumulate that frame-divided microphone input signal x_fram(l;m) in the class corresponding to its pitch, in accordance with equations (17) and (18).

（１７）式で、ＤＢ’（ｐ；ｉ；ｍ）は入力信号ＤＢ、ｍはフレーム内の離散的な時間（ｍ＝０、１、２、・・・、Ｍ－１）、ｉ（ｐｉｔｃｈ（ｌ））はデータベースのクラスごとのインデックス、Ｉはデータベース長である。ｉ（ｐｉｔｃｈ（ｌ））は（１８）式に示すように、クラスにデータが蓄積されるとインクリメントする。 In equation (17), DB' (p; i; m) is the input signal DB, m is the discrete time in the frame (m = 0, 1, 2, ..., M-1), i (pitch (l)) is an index for each class of the database, and I is the length of the database. i(pitch(l)) is incremented when data is accumulated in the class as shown in equation (18).
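Equations (17) and (18) are not reproduced in this text. The following sketch models the class-wise database as one fixed-length buffer per pitch class, each with its own write index that is incremented whenever a frame is stored in that class. Wrapping the index back to 0 when it reaches the database length I is an assumption, suggested by the MOD-based selection used later; the class name `ClassifiedFrameDB` and its attributes are hypothetical.

```python
class ClassifiedFrameDB:
    """Per-class frame store: frames[p][i] plays the role of DB'(p; i; m)."""

    def __init__(self, num_classes, db_length):
        self.db_length = db_length                       # I
        self.frames = [[None] * db_length for _ in range(num_classes)]
        self.index = [0] * num_classes                   # i(pitch(l)) per class

    def store(self, cls, frame):
        """Write one frame into class cls and advance that class's index."""
        i = self.index[cls]
        self.frames[cls][i] = list(frame)
        self.index[cls] = (i + 1) % self.db_length       # wrap-around is assumed


if __name__ == "__main__":
    db = ClassifiedFrameDB(num_classes=4, db_length=3)
    db.store(0, [0.1, 0.2])
    db.store(2, [0.3, 0.4])
    print(db.index)
```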

信号選択部２０３Ａは、入力信号ＤＢ２０２Ａにクラスごとに蓄積されている過去のフレーム分割したマイク入力信号からマスカー素辺信号を選択する。信号選択部２０３Ａは、例えば、（１９）式に示すように選択結果Ｔａ（ｋ）を選択する。 The signal selection unit 203A selects a masker bare edge signal from past frame-divided microphone input signals accumulated for each class in the input signal DB 202A. The signal selection unit 203A selects the selection result Ta(k) as shown in equation (19), for example.

（１９）式で、ｋ（ｋ＝１，２，・・・，Ｋ）は変数、Ｋはマスカー素辺信号の選択数（マスカー信号生成時における音声信号の加算回数）、ＭＯＤ（ｉ－ｋ，Ｉ）は、ｉ－ｋをＩで割ったときの剰余を返すＭＯＤ関数である。（１９）式は、Ｉで割ったときの剰余を返すことで、選択結果Ｔａ（ｋ）は０からＩ－１の値になる。 In equation (19), k (k = 1, 2, ..., K) is a variable, K is the number of masker side signals selected (the number of voice signals added together when the masker signal is generated), and MOD(i−k, I) is the MOD function, which returns the remainder when i−k is divided by I. Because equation (19) returns a remainder after division by I, the selection result Ta(k) takes a value from 0 to I−1.
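Based on the description of equation (19), the selection amounts to Ta(k) = MOD(i−k, I) for k = 1, ..., K, that is, the K slots written most recently before the current index i, wrapping around the end of the buffer. The sketch below is a minimal illustration; the function name is hypothetical. Python's `%` operator returns a non-negative result for a positive modulus, so it matches the stated range 0 to I−1 even when i−k is negative.

```python
def select_recent(i, k_count, db_length):
    """Ta(k) = MOD(i - k, I) for k = 1..K, as described for equation (19)."""
    return [(i - k) % db_length for k in range(1, k_count + 1)]


if __name__ == "__main__":
    # With write index i = 1 and database length I = 5, the three most
    # recent slots are 0, 4 and 3 (wrapping past the start of the buffer).
    print(select_recent(i=1, k_count=3, db_length=5))
```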

なお、選択結果Ｔａ（ｋ）を算出手法は、種々の方法を広く適用することができ、例えば、（２０）式に示すように、どのフレームを使用するかランダムに選択しても良い。 Various methods can be widely applied to the method of calculating the selection result Ta(k). For example, as shown in equation (20), which frame to use may be randomly selected.

（２０）式で、ｒａｎｄは自然数ｋに対して乱数を生成する関数である。（２０）式は、ＭＯＤ関数を使用してｒａｎｄ（ｋ）で生成した乱数をＩで割ったときの剰余を返すことで、選択結果Ｔａ（ｋ）は０からＩ－１の値になる。信号選択部２０３Ａは、選択結果Ｔａ（ｋ）をマスカー信号生成部２０４Ｂに出力する。

In equation (20), rand is a function that generates a random number for a natural number k. Equation (20) uses the MOD function to return the remainder when the random number generated by rand(k) is divided by I, so the selection result Ta(k) takes a value from 0 to I−1. The signal selection unit 203A outputs the selection result Ta(k) to the masker signal generation unit 204B.
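The random variant described for equation (20) can be sketched the same way: draw a random non-negative integer for each k and reduce it modulo I. The use of Python's `random.Random` as the random source, the 31-bit draw, and the function name are assumptions made for illustration; the patent does not specify the generator.

```python
import random


def select_random(k_count, db_length, rng=None):
    """Ta(k) = MOD(rand(k), I) for k = 1..K, as described for equation (20).

    rng is an optional random.Random instance so that results can be
    reproduced with a seed; a fresh generator is used otherwise.
    """
    rng = rng if rng is not None else random.Random()
    return [rng.getrandbits(31) % db_length for _ in range(k_count)]


if __name__ == "__main__":
    print(select_random(k_count=3, db_length=5, rng=random.Random(0)))
```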

マスカー信号生成部２０４Ｂは、音声区間判定部２０５の音声区間判定結果ＶＡＤ(ｌ)、ピッチ推定部２０７で推定したピッチｐｉｔｃｈ（ｌ）、信号選択部２０３Ａの選択結果Ｔａ（ｋ）を基に、マスカー素辺信号を入力信号ＤＢ２０２Ａのピッチに応じたクラスから複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成し出力する。マスカー信号生成部２０４Ｂは、（２１）式と（２２）式に従い、マスカー信号を出力する。 Based on the voice segment determination result VAD(l) of the voice segment determination unit 205, the pitch pitch(l) estimated by the pitch estimation unit 207, and the selection result Ta(k) of the signal selection unit 203A, the masker signal generation unit 204B reads a plurality of frames of masker side signals from the class of the input signal DB 202A that corresponds to the pitch, generates a masker signal from those frames, and outputs it. The masker signal generation unit 204B outputs the masker signal according to equations (21) and (22).

（２１）式で、ｈｂ（ｌ；ｍ）はマスカー信号、Ｆ０＿ＭＡＸはピッチの最大値である。（２２）式で、ｈ’（ｌ；ｍ）は入力信号ＤＢから生成されるマスカー信号、Ｋはマスカー素辺信号の選択数（マスカー信号生成時における音声の加算回数）である。（２１）式は、音声区間判定部２０５でマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたとき（ＶＡＤ(ｌ)＝１のとき）、かつ、ピッチ推定部２０７の推定したピッチｐｉｔｃｈ（ｌ）が０Ｈｚより大きく、Ｆ０＿ＭＡＸ以下のときのみ、マスカー信号ｈ’（ｌ；ｍ）を生成し、上記以外のときは無音を生成し、ｈｂ（ｌ；ｍ）に代入するという式である。（２２）式は、入力信号ＤＢ２０２Ａにピッチに応じたクラスごとに蓄積されている過去のフレーム分割したマイク入力信号を重畳して生成する方法である。 In equation (21), hb(l;m) is the masker signal and F0_MAX is the maximum pitch value. In equation (22), h'(l;m) is the masker signal generated from the input signal DB, and K is the number of masker side signals selected (the number of voice signals added together when the masker signal is generated). According to equation (21), the masker signal h'(l;m) is generated only when the voice segment determination unit 205 determines that the microphone input signal x_fram(l;m) is a voice segment (VAD(l) = 1) and the pitch pitch(l) estimated by the pitch estimation unit 207 is greater than 0 Hz and not greater than F0_MAX; otherwise silence is generated. The result is assigned to hb(l;m). Equation (22) generates the masker signal by superimposing the past frame-divided microphone input signals accumulated in the input signal DB 202A for each class corresponding to the pitch.
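Equations (21) and (22) are not reproduced in this text. The sketch below illustrates the described behavior: the masker is silence unless VAD(l) = 1 and 0 < pitch(l) ≤ F0_MAX, and otherwise it is the superposition of the selected past frames taken from the class that matches the current pitch. Plain summation without scaling, skipping of slots that have not been written yet, and the default F0_MAX of 500 Hz are all assumptions; the function and parameter names are hypothetical.

```python
def generate_masker(vad_flag, pitch_hz, class_frames, selection, frame_len,
                    f0_max=500.0):
    """Illustration of hb(l;m) as described for equations (21) and (22).

    class_frames: stored frames of the class matching the current pitch
                  (entries may be None if a slot has not been written yet).
    selection:    slot indices Ta(k), k = 1..K.
    """
    if vad_flag != 1 or not (0.0 < pitch_hz <= f0_max):
        return [0.0] * frame_len                  # silence
    masker = [0.0] * frame_len
    for slot in selection:
        frame = class_frames[slot]
        if frame is None:                         # empty slot: skip (assumed)
            continue
        for m in range(frame_len):
            masker[m] += frame[m]                 # superimpose selected frames
    return masker


if __name__ == "__main__":
    stored = [[1.0, 1.0], [2.0, 2.0], None]
    print(generate_masker(1, 150.0, stored, [0, 1, 2], frame_len=2))
```

Time reversal or time delay of the stored frames, which the text mentions as alternatives, could be added by transforming each `frame` before it is summed.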

なお、マスカー信号生成部２０４Ｂにおいて、マスカー信号の生成手法は、種々の方法を広く適用することができる。例えば、マスカー信号生成部２０４Ｂでは、入力信号ＤＢ２０２Ａのクラスごとに蓄積されている過去のフレーム分割したマイク入力信号を時間処理として時間反転して重畳してからマスカー信号ｈ’（ｌ；ｍ）を生成しても良いし、入力信号ＤＢ２０２に蓄積されている過去のフレーム分割したマイク入力信号を時間処理として時間遅延して重畳することでマスカー信号ｈ’（ｌ；ｍ）を生成しても良いし、過去のどのフレームを使用するかランダムに決定してマスカー信号ｈ’（ｌ；ｍ）を生成しても良い。 Various methods can be widely applied to the masker signal generation method in the masker signal generation unit 204B. For example, in the masker signal generation unit 204B, past frame-divided microphone input signals stored for each class in the input signal DB 202A are time-reversed and superimposed as time processing, and then the masker signal h'(l;m) is generated. Alternatively, the masker signal h'(l;m) may be generated by superimposing the past frame-divided microphone input signal accumulated in the input signal DB 202 with a time delay as time processing. However, which past frame to use may be randomly determined to generate the masker signal h'(l;m).

そして、マスカー信号生成部２０４Ｂは、（２３）式に従い、出力信号ｙ（ｎ）を音出力端子ＯＵＴに出力する。

Then, the masker signal generator 204B outputs the output signal y(n) to the sound output terminal OUT according to the expression (23).

（Ｃ－３）第３の実施形態の効果
第３の実施形態によれば、以下のような効果を奏することができる。 (C-3) Effects of Third Embodiment According to the third embodiment, the following effects can be obtained.

第３の実施形態のサウンドマスキング装置１００Ｂでは、対象話者Ｕ1の音声をピッチに応じたクラスごとに入力信号ＤＢ２０２Ａに蓄積し、ピッチに応じたクラスごとに入力信号ＤＢに蓄積されている過去のマイク入力信号を複数フレーム使用してマスカー信号を生成し出力する。これにより、第３の実施形態のサウンドマスキング装置１００Ｂでは、マスカー信号と対象話者Ｕ1の音声との音響特徴にさらに近づくので、よりマスキング効果を高めることができる。 In the sound masking device 100B of the third embodiment, the speech of the target speaker U1 is stored in the input signal DB 202A for each pitch class, and a masker signal is generated and output using multiple frames of the past microphone input signals stored in the input signal DB for each pitch class. As a result, the acoustic features of the masker signal come even closer to those of the target speaker U1's voice, so the masking effect can be further enhanced.

（Ｄ）第４の実施形態
以下、本発明による音響処理装置、音響処理プログラム及び音響処理方法の第４の実施形態を、図面を参照しながら詳述する。この実施形態では、本発明の音響処理装置、音響処理プログラム及び音響処理方法を、サウンドマスキング装置に適用した例について説明する。 (D) Fourth Embodiment Hereinafter, a fourth embodiment of the sound processing device, sound processing program and sound processing method according to the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device will be described.

（Ｄ－１）第４の実施形態の構成
図７は、第４の実施形態に係るサウンドマスキング装置１００Ｃの機能的構成について示したブロック図である。図７では、上述の図６と同一部分又は対応部分には、同一符号又は対応符号を付している。 (D-1) Configuration of the Fourth Embodiment FIG. 7 is a block diagram showing the functional configuration of a sound masking device 100C according to the fourth embodiment. In FIG. 7, the same reference numerals or corresponding reference numerals are given to the same or corresponding portions as those in FIG.

以下では、第４の実施形態について、第１から第３の実施形態との差異を中心に説明し、第１から第３の実施形態と重複する部分については説明を省略する。 In the following, the fourth embodiment will be described with a focus on the differences from the first to third embodiments, and descriptions of portions that overlap with the first to third embodiments will be omitted.

第４の実施形態のサウンドマスキング装置１００Ｃでは、サウンドマスキング処理部２００Ｂがサウンドマスキング処理部２００Ｃに置き換わっている点で、第３の実施形態と異なっている。 A sound masking device 100C of the fourth embodiment differs from that of the third embodiment in that the sound masking processing section 200B is replaced with a sound masking processing section 200C.

サウンドマスキング処理部２００Ｃでは、信号選択部２０３Ａとマスカー信号生成部２０４Ｂが信号選択部２０３Ｂとマスカー信号生成部２０４Ｃに置き換わり、さらに、第三者音声信号ＤＢ２０９と使用ＤＢ判定部２１０が追加されている点で、第１から第３の実施形態と異なっている。 In the sound masking processing unit 200C, the signal selection unit 203A and the masker signal generation unit 204B are replaced with the signal selection unit 203B and the masker signal generation unit 204C, and a third party audio signal DB 209 and a usage DB determination unit 210 are added. This point differs from the first to third embodiments.

第４の実施形態のサウンドマスキング装置１００Ｃでは、第三者音声信号ＤＢ２０９と使用ＤＢ判定部２１０が増えたことにより、第三者音声信号ＤＢ２０９に第三者音声信号を蓄積する方法、サウンドマスキング装置１００Ｃが動作した時に使用するＤＢ、マスカー信号の生成に使用する信号を選択する方法、マスカー生成方法が異なる点が第１から第３の実施形態と異なる。 Because the third-party audio signal DB 209 and the use DB determination unit 210 have been added, the sound masking device 100C of the fourth embodiment differs from the first to third embodiments in the method of accumulating third-party audio signals in the third-party audio signal DB 209, the DB used while the sound masking device 100C operates, the method of selecting the signals used to generate the masker signal, and the method of generating the masker signal.

第三者音声信号ＤＢ２０９は、例えば、事前にサンプルとなる音声信号（以下、「第三者音声信号」と呼ぶ）を蓄積しておき、蓄積した第三者の音声信号をフレーム分割し、フレーム分割された第三者音声信号をピッチに応じたクラスに分けて蓄積したデータベースである。 The third-party audio signal DB 209 is, for example, a database built by accumulating sample audio signals (hereinafter referred to as "third-party audio signals") in advance, dividing the accumulated third-party audio signals into frames, and storing the frame-divided third-party audio signals in classes according to pitch.

使用ＤＢ判定部２１０は、入力信号ＤＢ２０２Ａの各クラスに、フレーム分割されたマイク入力信号が所定量以上（十分）蓄積されているか否かを判定し、その判定結果を出力する。 The use DB determination unit 210 determines whether or not a predetermined amount or more (sufficient) of frame-divided microphone input signals is accumulated in each class of the input signal DB 202A, and outputs the determination result.

信号選択部２０３Ｂは、入力信号ＤＢ２０２Ａ、又は第三者音声信号ＤＢ２０９にクラスごとに蓄積されている過去のフレーム分割したマイク入力信号から、マスカー素辺信号を選択し、選択結果を出力する。 The signal selection unit 203B selects masker segment signals from the past frame-divided microphone input signals stored for each class in the input signal DB 202A or the third-party audio signal DB 209, and outputs the selection result.

マスカー信号生成部２０４Ｃは、音声区間判定とピッチ推定の結果と使用ＤＢ判定結果と選択結果を基に、入力信号ＤＢ２０２Ａに所定量以上蓄積されていると判定されたときは入力信号ＤＢ２０２Ａ、入力信号ＤＢ２０２Ａに所定量以上蓄積されていないと判定されたときは第三者音声信号ＤＢ２０９を選択し、マスカー素辺信号を選択されたデータベース（以下、選択したデータベースを「選択データベース」と呼ぶ）のピッチに応じたクラスから複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成して出力する。 Based on the results of the speech section determination and pitch estimation, the use DB determination result, and the selection result, the masker signal generation unit 204C selects the input signal DB 202A when it is determined that the predetermined amount or more has been accumulated in the input signal DB 202A, and selects the third-party audio signal DB 209 when it is determined that the predetermined amount has not been accumulated. It then reads multiple frames of masker segment signals from the class corresponding to the pitch in the selected database (hereinafter referred to as the "selected database"), generates a masker signal from the masker segment signals of the frames read, and outputs it.

なお、第４の実施形態において、ピッチ推定部２０７を除外し、入力信号ＤＢ２０２Ａ、又は第三者音声信号ＤＢ２０９においてクラス分けせずに蓄積するようにしても良い。また、第４の実施形態において、音声区間判定部２０５を除外するようにしても良い。 Note that, in the fourth embodiment, the pitch estimation unit 207 may be omitted, and signals may be stored in the input signal DB 202A or the third-party audio signal DB 209 without being divided into classes. The speech section determination unit 205 may also be omitted in the fourth embodiment.

（Ｄ－２）第４の実施形態の動作
次に、以上のような構成を有する第４の実施形態におけるサウンドマスキング装置１００Ｃの動作（実施形態に係る音響処理方法）について詳細に説明する。 (D-2) Operation of Fourth Embodiment Next, the operation of the sound masking device 100C (acoustic processing method according to the embodiment) according to the fourth embodiment having the configuration described above will be described in detail.

第４の実施形態に係るサウンドマスキング装置１００Ｃにおけるサウンドマスキング処理の基本的な動作は、第１から第３の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100C according to the fourth embodiment is the same as the sound masking process described in the first to third embodiments.

以下では、第１から第３の実施形態と異なる点である第三者音声信号ＤＢ２０９、使用ＤＢ判定部２１０、信号選択部２０３Ｂ、マスカー信号生成部２０４Ｃにおける処理動作を中心に詳細に説明する。 In the following, a detailed description will be given centering on the processing operations of the third-party audio signal DB 209, the use DB determination unit 210, the signal selection unit 203B, and the masker signal generation unit 204C, which are different from the first to third embodiments.

サウンドマスキング装置１００Ｃのサウンドマスキング処理部２００Ｃでは、サウンドマスキング処理を行う前に、第三者音声信号ＤＢ２０９へ音声信号の蓄積を行う。 The sound masking processing unit 200C of the sound masking device 100C accumulates the audio signal in the third party audio signal DB 209 before performing the sound masking process.

例えば、図８に示すように、事前に音声信号のサンプルを蓄積したデータベース（例えば、市販されている音声信号のデータベース等）により構成された第三者音声信号サンプルデータＡＳを、サウンドマスキング処理部２００Ｃに入力することで第三者音声信号ＤＢ２０９を構築する。 For example, as shown in FIG. 8, the third-party audio signal DB 209 is constructed by inputting, to the sound masking processing unit 200C, third-party audio signal sample data AS taken from a database in which audio signal samples have been accumulated in advance (for example, a commercially available audio signal database).

図８では、第三者音声信号サンプルデータＡＳに基づく音声信号をサウンドマスキング処理部２００Ｃに入力し、サウンドマスキング装置１００Ｃが動作を開始して、第三者音声信号サンプルデータＡＳに基づく音声信号について、上記の各実施形態と同様にフレーム分割、音声区間判定、ピッチ推定、ＤＢ蓄積判定、クラス判定を行い、第三者音声信号ＤＢ２０９に蓄積する。 In FIG. 8, an audio signal based on the third party audio signal sample data AS is input to the sound masking processing unit 200C, the sound masking device 100C starts operating, and the audio signal based on the third party audio signal sample data AS , frame division, voice section determination, pitch estimation, DB accumulation determination, and class determination are performed in the same manner as in the above embodiments, and stored in the third-party voice signal DB 209 .

なお、上記の各実施形態の入力信号ＤＢ２０２、２０２Ａの蓄積処理と同様の処理により、第三者音声信号ＤＢ２０９を構築するようにしても良い。 Note that the third party audio signal DB 209 may be constructed by the same process as the accumulation process of the input signal DBs 202 and 202A in each of the above embodiments.

また、第三者音声信号サンプルデータＡＳが記録されるデータ記録媒体は限定されないものである。 Also, the data recording medium on which the third party audio signal sample data AS is recorded is not limited.

さらに、第三者音声信号ＤＢ２０９を構築する際のサンプルとしては、予め録音された第三者音声信号サンプルデータＡＳではなく、マイク１０１、マイクアンプ１０２、及びＡＤ変換器１０３を音入力端子ＩＮに接続して、複数の人物に発話して蓄積（マイク１０１を介して第三者音声信号のサンプルを蓄積）するようにしても良いし、別のＰＣ等で処理して作成したデータ（第三者音声信号のサンプルデータ）を使用（例えば、通信やデータ記録媒体によりコピー）するようにしても良い。 Furthermore, the samples used to construct the third-party audio signal DB 209 need not be the prerecorded third-party audio signal sample data AS. The microphone 101, the microphone amplifier 102, and the AD converter 103 may be connected to the sound input terminal IN so that utterances by a plurality of people are accumulated (that is, samples of third-party audio signals are accumulated via the microphone 101), or data created by processing on another PC or the like (sample data of third-party audio signals) may be used (for example, copied via communication or a data recording medium).

そして、第三者音声信号ＤＢ２０９に第三者の音声信号に基づくデータが十分に蓄積（所定以上の量のデータが蓄積）された段階でサウンドマスキング装置１００Ｃは、第三者音声信号ＤＢ２０９の準備処理を終了し、サウンドマスキング処理が開始するまで一時停止する。 Then, once sufficient data based on third-party audio signals has been accumulated in the third-party audio signal DB 209 (that is, a predetermined amount of data or more has been accumulated), the sound masking device 100C ends the preparation process for the third-party audio signal DB 209 and pauses until the sound masking process starts.

なお、第三者音声信号ＤＢ２０９に第三者の音声信号に基づくデータが十分に蓄積（所定以上の量のデータが蓄積）された段階でサウンドマスキング装置１００Ｃは、第三者音声信号ＤＢ２０９の準備処理を終了し、サウンドマスキング処理を開始するようにしても良い。 It should be noted that when the third party audio signal DB 209 has sufficiently accumulated data based on the third party audio signal (a predetermined amount or more of data has been accumulated), the sound masking device 100C prepares the third party audio signal DB 209. Alternatively, the processing may be terminated and the sound masking processing may be started.

このとき、第三者音声信号ＤＢ２０９に所定以上の量のデータが蓄積されたか否かを判定する方法は限定されないものであるが、使用ＤＢ判定部２１０を用いた判定処理を行うようにしても良い。 At this time, the method of determining whether or not a predetermined amount or more of data has been accumulated in the third party audio signal DB 209 is not limited. good.

サウンドマスキング装置１００Ｃがサウンドマスキング処理を開始し、対象話者Ｕ１がマイク１０１に向かって音声を発話すると、マイク１０１に入力される。 When the sound masking device 100C starts the sound masking process and the target speaker U1 speaks toward the microphone 101, the speech is input to the microphone 101.

マイク１０１に入力されたアナログの音信号は、マイクアンプ１０２で増幅され、ＡＤ変換器１０３でアナログ信号からデジタル信号に変換され、サウンドマスキング処理部２００Ｃの音入力端子ＩＮにマイク入力信号ｘ（ｎ）として入力される。 An analog sound signal input to the microphone 101 is amplified by the microphone amplifier 102, converted from the analog signal to a digital signal by the AD converter 103, and is supplied to the sound input terminal IN of the sound masking processing unit 200C as the microphone input signal x(n). ).

サウンドマスキング処理部２００Ｃの音入力端子ＩＮにマイク入力信号ｘ（ｎ）が入力され始めると、フレーム分割部２０１に入力される。 When the microphone input signal x(n) starts to be input to the sound input terminal IN of the sound masking processing unit 200C, it is passed to the frame division unit 201.

フレーム分割部２０１は、マイク入力信号ｘ（ｎ）を、処理フレームごとに分割し、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を音声区間判定部２０５とＤＢ蓄積判定部２０６とピッチ推定部２０７に出力する。 The frame division unit 201 divides the microphone input signal x(n) into processing frames and outputs the frame-divided microphone input signal x_fram(l;m) to the speech section determination unit 205, the DB accumulation determination unit 206, and the pitch estimation unit 207.

音声区間判定部２０５は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を用いて、音声区間か非音声区間かを判定し、音声区間か非音声区間かの判定結果をＤＢ蓄積判定部２０６、ピッチ推定部２０７、マスカー信号生成部２０４Ｃに出力する。 The speech section determination unit 205 uses the frame-divided microphone input signal x_fram(l;m) to determine whether the frame is a speech section or a non-speech section, and outputs the determination result to the DB accumulation determination unit 206, the pitch estimation unit 207, and the masker signal generation unit 204C.

ＤＢ蓄積判定部２０６は、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたときのみ、フレーム分割部２０１から出力されたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を、クラス判定部２０８、信号選択部２０３Ｂ、マスカー信号生成部２０４Ｃに出力し、音声区間判定部２０５でフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が非音声区間と判定されたときは、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を出力しない。 Only when the speech section determination unit 205 determines that the frame-divided microphone input signal x_fram(l;m) is a speech section does the DB accumulation determination unit 206 output the frame-divided microphone input signal x_fram(l;m) received from the frame division unit 201 to the class determination unit 208, the signal selection unit 203B, and the masker signal generation unit 204C. When the speech section determination unit 205 determines that the frame is a non-speech section, the DB accumulation determination unit 206 does not output the frame-divided microphone input signal x_fram(l;m).

ピッチ推定部２０７は、音声区間判定部２０５でマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたときのみ、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）のピッチを推定し、推定したピッチをマスカー信号生成部２０４Ｃとクラス判定部２０８に出力する。 The pitch estimation unit 207 estimates the pitch of the frame-divided microphone input signal x_fram(l;m) only when the speech section determination unit 205 determines that the microphone input signal x_fram(l;m) is a speech section, and outputs the estimated pitch to the masker signal generation unit 204C and the class determination unit 208.

クラス判定部２０８は、ピッチ推定部２０７で推定したピッチを基に、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａに蓄積すると判定された場合にのみ、フレーム分割されたマイク入力信号を入力信号ＤＢ２０２Ａのピッチに応じたクラスに出力して蓄積する。 Based on the pitch estimated by the pitch estimation unit 207, the class determination unit 208 outputs the frame-divided microphone input signal to the class of the input signal DB 202A that corresponds to that pitch, and stores it there, only when it has been determined that the frame-divided microphone input signal is to be stored in the input signal DB 202A.

入力信号ＤＢ２０２Ａは、クラス判定部２０８からマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が出力されたときのみ、出力されたフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）を（１７）式と（１８）式に従い、ピッチに応じたクラスごとに入力信号ＤＢ２０２Ａに蓄積する。 Only when the class determination unit 208 outputs the microphone input signal x_fram(l;m) does the input signal DB 202A store that frame-divided microphone input signal x_fram(l;m), for each pitch class, according to equations (17) and (18).
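The per-class accumulation described above can be pictured with the following sketch. Equations (17) and (18) and the actual class boundaries are not given in this text, so everything concrete here is an assumption made for illustration: the number of classes, the uniform class width in Hz, the capacity of each class, and the use of a fixed-length buffer that discards the oldest frame.

```python
from collections import deque

NUM_CLASSES = 4        # assumed number of pitch classes
CLASS_WIDTH = 100.0    # assumed class width in Hz
FRAMES_PER_CLASS = 3   # assumed capacity I of each class


def pitch_class(pitch_hz):
    """Map a pitch in Hz to a class index (assumed uniform binning)."""
    return min(int(pitch_hz // CLASS_WIDTH), NUM_CLASSES - 1)


def make_db():
    """Create an empty DB: one bounded buffer of frames per pitch class."""
    return [deque(maxlen=FRAMES_PER_CLASS) for _ in range(NUM_CLASSES)]


def store_frame(db, vad, pitch_hz, frame):
    """Store a frame in its pitch class only for speech frames with a valid pitch.

    Returns the class index used, or None if the frame was not stored.
    """
    if vad != 1 or pitch_hz <= 0.0:
        return None
    p = pitch_class(pitch_hz)
    db[p].append(list(frame))  # the oldest frame is dropped once the class is full
    return p
```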

使用ＤＢ判定部２１０は、入力信号ＤＢ２０２Ａの各クラスに過去のフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が所定以上の量のデータが蓄積（十分蓄積）されているか判定し、判定結果を出力する。使用ＤＢ判定部２１０は、例えば、以下の（２４）式に従って、入力信号ＤＢ２０２Ａにフレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が所定以上の量が蓄積されているか否かを判定する。

The use DB determination unit 210 determines whether a predetermined amount or more of past frame-divided microphone input signals x_fram(l;m) has been accumulated (that is, sufficiently accumulated) in each class of the input signal DB 202A, and outputs the determination result. For example, the use DB determination unit 210 determines according to the following equation (24) whether a predetermined amount or more of the frame-divided microphone input signal x_fram(l;m) has been accumulated in the input signal DB 202A.

（２４）式で、ｆｌａｇ（ｌ）は、判定結果である。（２４）式は、所定以上の量のデータが蓄積されている場合は、判定結果ｆｌａｇ（ｌ）に１を代入し、所定以上の量のデータが蓄積（十分蓄積）されていない場合は判定結果ｆｌａｇ（ｌ）に０を代入する。 In equation (24), flag(l) is the determination result. Equation (24) assigns 1 to flag(l) when a predetermined amount of data or more has been accumulated, and assigns 0 to flag(l) when it has not (that is, when the data is not yet sufficient).

なお、使用ＤＢ判定部２１０において、入力信号ＤＢ２０２Ａに所定以上の量のデータが蓄積されているか否かの判断手法は、種々の方法を広く適用することができる。例えば、使用ＤＢ判定部２１０は、フレーム分割したマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が入力信号ＤＢに蓄積される回数をカウントし、カウント数が閾値を超えた場合、所定以上のデータが蓄積されていると判定しても良いし、クラス毎に蓄積される回数をカウントし、全てのクラスについてカウント数が閾値を超えた場合、十分蓄積されていると判定しても良い。 Note that the use DB determination unit 210 may use any of a wide variety of methods to determine whether a predetermined amount of data or more has been accumulated in the input signal DB 202A. For example, it may count the number of times a frame-divided microphone input signal x_fram(l;m) is stored in the input signal DB and determine that the predetermined amount or more has been accumulated when the count exceeds a threshold. Alternatively, it may count the number of times a frame is stored in each class and determine that sufficient data has been accumulated when the count exceeds the threshold for every class.
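The two counting strategies mentioned above can be sketched as follows. Equation (24) is not reproduced in this text, so this is only an illustration: the function name `use_db_flag`, the `per_class` switch, and the list of per-class counters are assumptions.

```python
def use_db_flag(counts, threshold, per_class=False):
    """Return flag(l): 1 if the input signal DB is judged sufficiently filled, else 0.

    counts    -- number of frames stored so far in each pitch class
    threshold -- count that must be exceeded (hypothetical tuning value)
    per_class -- if True, every class must exceed the threshold;
                 otherwise the total count across classes is compared
    """
    if per_class:
        return 1 if all(c > threshold for c in counts) else 0
    return 1 if sum(counts) > threshold else 0
```

With `counts = [5, 0, 0]` and `threshold = 4`, the total-count variant returns 1 while the per-class variant returns 0, which shows why the per-class check is the stricter of the two.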

また、使用ＤＢ判定部２１０において、入力信号ＤＢ２０２Ａに所定以上の量のデータが蓄積されているか否かの判断開始方法は、種々の方法を広く適用することができる。例えば、サウンドマスキング装置１００Ｃの動作が開始してから判定を開始しても良いし、サウンドマスキング装置１００Ｃの動作が開始して所定時間経過した時から判定を開始するようにしても良い。そして、使用ＤＢ判定部２１０は、信号選択部２０３Ｂに判定結果ｆｌａｇ（ｌ）を出力する。 Further, various methods can be widely applied as a method for starting determination in use DB determination section 210 as to whether or not a predetermined amount or more of data has been accumulated in input signal DB 202A. For example, the determination may be started after the operation of the sound masking device 100C is started, or the determination may be started when a predetermined time has passed since the operation of the sound masking device 100C is started. Then, the used DB determination unit 210 outputs the determination result flag(l) to the signal selection unit 203B.

信号選択部２０３Ｂは、使用ＤＢ判定部２１０から出力された判定結果ｆｌａｇ（ｌ）から入力信号ＤＢ２０２Ａ、又は第三者音声信号ＤＢ２０９にクラスごとに蓄積されている過去のフレーム分割したマイク入力信号からマスカー素辺信号を選択する。信号選択部２０３Ｂは、例えば、（２５）式に示すように選択結果Ｔｂ（ｋ）を選択する。 Based on the determination result flag(l) output from the use DB determination unit 210, the signal selection unit 203B selects masker segment signals from the past frame-divided microphone input signals stored for each class in the input signal DB 202A or the third-party audio signal DB 209. For example, the signal selection unit 203B determines the selection result Tb(k) as shown in equation (25).

（２５）式で、ｋ（ｋ＝１，２，・・・，Ｋ）は変数、Ｋはマスカー素辺信号の選択数（マスカー信号生成時における音声信号の加算回数）、ＭＯＤ（ｉ－ｋ，Ｉ）は、ｉ－ｋをＩで割ったときの剰余を返すＭＯＤ関数である。Ｉで割ったときの剰余を返すことで、選択結果Ｔｂ（ｋ）は０からＩ－１の値になる。（２５）式は、使用ＤＢ判定部２１０で、入力信号ＤＢ２０２Ａに所定量以上蓄積されていないと判定されたとき（ｆｌａｇ（ｌ）＝０のとき）は、第三者音声信号ＤＢ２０９からマスカー素辺信号を選択し、入力信号ＤＢ２０２Ａに所定量以上蓄積されていると判定されたとき（ｆｌａｇ（ｌ）＝０以外のとき）は、入力信号ＤＢ２０２Ａからマスカー素辺信号を選択するという式である。
In equation (25), k (k = 1, 2, ..., K) is a variable, K is the number of masker segment signals selected (the number of speech signals added when generating the masker signal), and MOD(i−k, I) is the MOD function, which returns the remainder when i−k is divided by I. Because the remainder of division by I is returned, the selection result Tb(k) takes a value from 0 to I−1. Equation (25) selects the masker segment signals from the third-party audio signal DB 209 when the use DB determination unit 210 determines that the predetermined amount has not been accumulated in the input signal DB 202A (flag(l) = 0), and from the input signal DB 202A when it determines that the predetermined amount or more has been accumulated (flag(l) ≠ 0).
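The index arithmetic and database choice described for equation (25) can be illustrated as below. The equation itself is not reproduced in this text, so the sketch rests on assumptions: i is taken to be the current write position within a class holding I frames, and the function names and the string labels for the two databases are hypothetical.

```python
def select_indices(i, K, I):
    """Return Tb(1..K) = MOD(i - k, I): the K most recent slots before position i.

    Python's % operator already yields a value in 0..I-1 for positive I,
    so negative differences wrap around to the end of the buffer.
    """
    return [(i - k) % I for k in range(1, K + 1)]


def select_source(flag):
    """Pick the database to read from, following the flag(l) rule in the text."""
    return "third_party_db" if flag == 0 else "input_db"
```

For example, `select_indices(1, 3, 5)` gives `[0, 4, 3]`: the slot just before position 1, then two slots reached by wrapping around a buffer of five frames.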

なお、選択結果Ｔｂ（ｋ）の算出手法は、種々の方法を広く適用することができ、例えば、（２６）式に示すように、どのフレームを使用するかランダムに選択しても良い。 Note that a wide variety of methods may be used to calculate the selection result Tb(k). For example, as shown in equation (26), the frames to use may be selected at random.

（２６）式で、ｒａｎｄは自然数ｋに対して乱数を生成する関数である。（２６）式は、ＭＯＤ関数を使用してｒａｎｄ（ｋ）で生成した乱数をＩで割ったときの剰余を返すことで、選択結果Ｔｂ（ｋ）は０からＩ－１の値になる。信号選択部２０３Ｂは、選択結果Ｔｂ（ｋ）をマスカー信号生成部２０４Ｃに出力する。

In equation (26), rand is a function that generates a random number for a natural number k. Equation (26) uses the MOD function to return the remainder when the random number generated by rand(k) is divided by I, so the selection result Tb(k) takes a value from 0 to I−1. The signal selection unit 203B outputs the selection result Tb(k) to the masker signal generation unit 204C.
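The random selection described for equation (26) might look like the following sketch. The equation is not reproduced in this text; the function name, the optional seeded generator, and the range of the raw random integers are assumptions made so that the example is reproducible.

```python
import random


def select_random_indices(K, I, rng=None):
    """Return K indices Tb(k) = MOD(rand(k), I), each in 0..I-1.

    rng is an optional random.Random instance (hypothetical parameter)
    that allows a fixed seed when repeatable behaviour is wanted.
    """
    rng = rng or random.Random()
    return [rng.randrange(2 ** 31) % I for _ in range(K)]
```

Taking the remainder of a large random integer mirrors the MOD-based description in the text; calling `rng.randrange(I)` directly would be the more idiomatic way to obtain the same range.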

マスカー信号生成部２０４Ｃは、音声区間判定部２０５の音声区間判定結果ＶＡＤ(ｌ)、ピッチ推定部２０７で推定したピッチｐｉｔｃｈ（ｌ）、信号選択部２０３Ｂの選択結果Ｔｂ（ｋ）、使用ＤＢ判定部２１０の判定結果ｆｌａｇ（ｌ）を基に、入力信号ＤＢ２０２Ａに所定量以上蓄積されていると判定されたときは入力信号ＤＢ２０２Ａ、入力信号ＤＢ２０２Ａに所定量以上蓄積されていないと判定されたときは第三者音声信号ＤＢ２０９を選択し、マスカー素辺信号を選択データベースのピッチに応じたクラスから複数フレーム読み出す。そして、読み出された複数フレームからマスカー信号を生成し出力する。マスカー信号生成部２０４Ｃは、例えば、（２７）式と（２８）式に従い、マスカー信号を出力する。 Based on the speech section determination result VAD(l) of the speech section determination unit 205, the pitch pitch(l) estimated by the pitch estimation unit 207, the selection result Tb(k) of the signal selection unit 203B, and the determination result flag(l) of the use DB determination unit 210, the masker signal generation unit 204C selects the input signal DB 202A when it is determined that the predetermined amount or more has been accumulated in the input signal DB 202A, and selects the third-party audio signal DB 209 when it is determined that the predetermined amount has not been accumulated. It reads multiple frames of masker segment signals from the class of the selected database that corresponds to the pitch, then generates and outputs a masker signal from the frames read. For example, the masker signal generation unit 204C outputs the masker signal according to equations (27) and (28).

（２７）式で、ｈｃ（ｌ；ｍ）はマスカー信号を、Ｆ０＿ＭＡＸはピッチの最大値を、（２８）式で、ＤＢ２（ｐ；ｌ；ｍ）は第三者音声信号ＤＢ、ｈ’’（ｌ；ｍ）は第三者音声信号ＤＢと入力信号ＤＢから生成されるマスカー信号、Ｋはマスカー素辺信号の選択数（マスカー信号生成時における音声の加算回数）である。（２７）式は、音声区間判定部２０５でマイク入力信号ｘ＿ｆｒａｍ（ｌ；ｍ）が音声区間と判定されたとき（ＶＡＤ(ｌ)＝１のとき）、かつ、ピッチ推定部２０７の推定したピッチｐｉｔｃｈ（ｌ）が０Ｈｚより大きく、Ｆ０＿ＭＡＸ以下のときのみ、マスカー信号ｈ’’（ｌ；ｍ）を生成し、上記以外のときは無音を生成しｈｃ（ｌ；ｍ）に代入するという式である。（２８）式は、使用ＤＢ判定部２１０で、入力信号ＤＢ２０２Ａに所定量以上蓄積されていないと判定されたとき（ｆｌａｇ（ｌ）＝０のとき）は、マスカー素辺信号を、第三者音声信号ＤＢ２０９から複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成し、入力信号ＤＢ２０２Ａに所定量以上蓄積されていると判定されたとき（ｆｌａｇ（ｌ）＝０以外のとき）は、マスカー素辺信号を入力信号ＤＢ２０２Ａから複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成する。 In equation (27), hc(l;m) is the masker signal and F0_MAX is the maximum pitch value. In equation (28), DB2(p;l;m) is the third-party audio signal DB, h''(l;m) is the masker signal generated from the third-party audio signal DB or the input signal DB, and K is the number of masker segment signals selected (the number of speech frames added when generating the masker signal). Equation (27) generates the masker signal h''(l;m) only when the speech section determination unit 205 determines that the microphone input signal x_fram(l;m) is a speech section (VAD(l) = 1) and the pitch pitch(l) estimated by the pitch estimation unit 207 is greater than 0 Hz and no greater than F0_MAX; in all other cases it generates silence. The result is assigned to hc(l;m). In equation (28), when the use DB determination unit 210 determines that the predetermined amount has not been accumulated in the input signal DB 202A (flag(l) = 0), multiple frames of masker segment signals are read from the third-party audio signal DB 209 and the masker signal is generated from them. When it determines that the predetermined amount or more has been accumulated (flag(l) ≠ 0), multiple frames of masker segment signals are read from the input signal DB 202A and the masker signal is generated from them.
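Putting the descriptions of equations (27) and (28) together, one possible reading is sketched below. The equations are not reproduced in this text, so the following is an assumption-laden illustration: each database is modelled as a mapping from pitch-class index p to a list of stored frames, the superposition is a plain sum, and the function name, argument order, and default F0_MAX value are hypothetical.

```python
def generate_masker(flag, vad, pitch_hz, p, tb, input_db, third_party_db,
                    frame_len, f0_max=400.0):
    """Return hc(l;m) for one frame.

    flag -- use-DB determination result (0: third-party DB, otherwise input DB)
    p    -- pitch-class index of the current frame
    tb   -- list of selected frame indices Tb(k) within that class
    """
    # Gate described for eq. (27): silence unless voiced with a valid pitch.
    if vad != 1 or not (0.0 < pitch_hz <= f0_max):
        return [0.0] * frame_len
    # Database choice described for eq. (28).
    db = third_party_db if flag == 0 else input_db
    frames = [db[p][t] for t in tb]
    # Superimpose the selected masker segment signals sample by sample.
    return [sum(samples) for samples in zip(*frames)]
```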

なお、マスカー信号生成部２０４Ｃにおいて、マスカー信号の生成手法は、種々の方法を広く適用することができる。例えば、マスカー信号生成部２０４Ｃでは、選択データベースのピッチに応じたクラスごとに蓄積されている過去のフレーム分割したマイク入力信号を時間処理として時間反転して重畳してからマスカー信号ｈ’’（ｌ；ｍ）を生成しても良いし、選択データベースのピッチに応じたクラスごとに蓄積されている過去のフレーム分割したマイク入力信号を時間処理として時間遅延して重畳することでマスカー信号ｈ’’（ｌ；ｍ）を生成しても良いし、過去のどのフレームを使用するかランダムに決定してマスカー信号ｈ’’（ｌ；ｍ）を生成しても良い。 Note that the masker signal generation unit 204C may use any of a wide variety of methods to generate the masker signal. For example, it may generate the masker signal h''(l;m) by time-reversing the past frame-divided microphone input signals stored for each pitch class in the selected database and then superimposing them, by superimposing those signals with a time delay, or by randomly deciding which past frames to use.

そして、マスカー信号生成部２０４Ｃは、（２９）式に従い、生成したマスカー信号ｈｃ（ｌ；ｍ）を出力信号ｙ（ｎ）として音出力端子ＯＵＴに出力する。

Then, the masker signal generator 204C outputs the generated masker signal hc(l;m) to the sound output terminal OUT as the output signal y(n) according to the equation (29).

（Ｄ－３）第４の実施形態の効果
第４の実施形態によれば、以下のような効果を奏することができる。 (D-3) Effects of Fourth Embodiment According to the fourth embodiment, the following effects can be obtained.

第４の実施形態のサウンドマスキング装置１００Ｃは、動作開始時には第三者音声信号ＤＢ２０９を使用してマスカー信号を生成して出力し、入力信号ＤＢ２０２Ａに入力信号が十分蓄積されたら、入力信号ＤＢ２０２Ａに蓄積されている過去のマイク入力信号を複数フレーム使用してマスカー信号を生成し出力する。これにより、サウンドマスキング装置１００Ｃでは、動作開始時から音響特徴が対象話者Ｕ1の音声の音響特徴に近いマスカー信号を生成できるので、よりマスキング効果を高めることができる。 The sound masking apparatus 100C of the fourth embodiment uses the third party audio signal DB 209 to generate and output a masker signal at the start of operation, and when the input signal DB 202A has accumulated enough input signals, the input signal DB 202A A masker signal is generated and output using a plurality of frames of accumulated past microphone input signals. As a result, the sound masking apparatus 100C can generate a masker signal whose acoustic features are close to the acoustic features of the voice of the target speaker U1 from the start of operation, so that the masking effect can be further enhanced.

（Ｅ）第５の実施形態
以下、本発明による音響処理装置、音響処理プログラム及び音響処理方法の第５の実施形態を、図面を参照しながら詳述する。この実施形態では、本発明の音響処理装置、音響処理プログラム及び音響処理方法を、サウンドマスキング装置に適用した例について説明する。 (E) Fifth Embodiment Hereinafter, a fifth embodiment of the sound processing device, sound processing program, and sound processing method according to the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device will be described.

（Ｅ－１）第５の実施形態の構成
図９は、第５の実施形態に係るサウンドマスキング装置１００Ｄの機能的構成について示したブロック図である。図９では、上述の図１と同一部分又は対応部分には、同一符号又は対応符号を付している。 (E-1) Configuration of Fifth Embodiment FIG. 9 is a block diagram showing the functional configuration of a sound masking device 100D according to the fifth embodiment. In FIG. 9, the same reference numerals or corresponding reference numerals are assigned to the same or corresponding portions as those in FIG.

以下では、第５の実施形態について、第１の実施形態との差異を中心に説明し、第１の実施形態と重複する部分については説明を省略する。 In the following, the fifth embodiment will be described with a focus on the differences from the first embodiment, and descriptions of portions that overlap with the first embodiment will be omitted.

第５の実施形態のサウンドマスキング装置１００Ｄでは、サウンドマスキング処理部２００がサウンドマスキング処理部２００Ｄに置き換わっている点で、第１の実施形態と異なっている。サウンドマスキング処理部２００Ｄでは、マスカー信号生成部２０４がマスカー信号生成部２０４Ｄに置き換わっている点で第１の実施形態と異なっている。 The sound masking device 100D of the fifth embodiment differs from the first embodiment in that the sound masking processing section 200 is replaced with a sound masking processing section 200D. The sound masking processor 200D differs from the first embodiment in that the masker signal generator 204 is replaced with a masker signal generator 204D.

第５の実施形態のサウンドマスキング装置１００Ｄは、マスカー信号生成部２０４Ｄのマスカー信号の生成方法が異なる点が第１の実施形態のサウンドマスキング装置１００と異なる。 The sound masking device 100D of the fifth embodiment differs from the sound masking device 100 of the first embodiment in that the method of generating the masker signal by the masker signal generator 204D is different.

マスカー信号生成部２０４Ｄは、選択されたマスカー素辺信号を入力信号ＤＢ２０２から複数フレーム読み出し、読み出された複数フレームのマスカー素辺信号からマスカー信号を生成し出力する。 The masker signal generation unit 204D reads a plurality of frames of the selected masker side signal from the input signal DB 202, generates a masker signal from the read masker side signals of the plurality of frames, and outputs the masker signal.

（Ｅ－２）第５の実施形態の動作
次に、以上のような構成を有する第５の実施形態におけるサウンドマスキング装置１００Ｄの動作（実施形態に係る音響処理方法）について詳細に説明する。 (E-2) Operation of Fifth Embodiment Next, the operation of the sound masking device 100D (acoustic processing method according to the embodiment) having the configuration described above according to the fifth embodiment will be described in detail.

第５の実施形態に係るサウンドマスキング装置１００Ｄにおけるサウンドマスキング処理の基本的な動作は、第１の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100D according to the fifth embodiment is the same as the sound masking process described in the first embodiment.

以下では、第１の実施形態と異なる点であるマスカー信号生成部２０４Ｄにおける処理動作を中心に詳細に説明する。 In the following, a detailed description will be given centering on the processing operation in the masker signal generator 204D, which is different from the first embodiment.

マスカー信号生成部２０４Ｄは、入力信号ＤＢ２０２に蓄積されている過去のフレーム分割したマイク入力信号を使用してマスカー信号を生成する。マスカー信号生成部２０４Ｄが行うマスカー信号の生成手法としては、例えば、入力信号ＤＢ２０２に蓄積されているマイク入力信号に所定量の遅延を与えて重畳することで疑似的にエコー（以下、「疑似エコー」と呼ぶ）を生成し、マスカー信号として使用する手法が挙げられる。 The masker signal generation unit 204D generates a masker signal using the past frame-divided microphone input signals stored in the input signal DB 202. One example of a masker signal generation method used by the masker signal generation unit 204D is to apply a predetermined delay to the microphone input signals stored in the input signal DB 202 and superimpose them, thereby producing an artificial echo (hereinafter referred to as a "pseudo echo") that is used as the masker signal.

マスカー信号生成部２０４Ｄは、疑似エコーを生成し、生成した疑似エコーをマスカー信号として出力する。疑似エコーは、例えば、（３０）式、（３１）式に従い、疑似エコーを生成する。

The masker signal generator 204D generates a pseudo echo and outputs the generated pseudo echo as a masker signal. A pseudo echo is generated according to, for example, equations (30) and (31).

（３０）式、（３１）式で、ｃ（ｃ＝１、２、・・・、Ｃ）はインデックスを、Ｃは疑似エコー生成時における音声の加算回数、ｐ（１≦ｐ≦（Ｍ－１））は疑似エコーを生成する時の入力信号ＤＢ２０２に蓄積されている過去のフレーム分割したマイク入力信号をどれだけ遅延させるかのパラメータ、αは減衰係数（０．０＜α＜１．０）である。（３１）式は、入力信号ＤＢ２０２に蓄積されている過去のフレーム分割したマイク入力信号を複数フレーム読み出しを時間的にずらして減衰係数を乗算してから重畳して生成される信号である。疑似エコーの遅延時間は、例えば、０．１［秒］から１．０［秒］（４８ｋＨｚサンプリングで約４８００［サンプル］から４８０００［サンプル］）程度としても良い。例えば、（３０）式で、Ｃ＝３、ｐ＝５０、α＝０．５のときは、入力信号ＤＢ２０２に蓄積されている過去１フレーム前のマイク入力信号と、入力信号ＤＢ２０２に蓄積されている過去２フレーム前のマイク入力信号を５０サンプル進めて減衰係数α（＝０．５）を乗算した信号と、入力信号ＤＢ２０２に蓄積されている過去３フレーム前のマイク入力信号を１００サンプル進めて、減衰係数α^２（＝０．２５）を乗算した信号を重畳することで疑似エコーｅ（ｌ；ｍ）を生成することを示す。 In equations (30) and (31), c (c = 1, 2, ..., C) is an index, C is the number of speech frames added when generating the pseudo echo, p (1 ≤ p ≤ M − 1) is a parameter that sets how far the past frame-divided microphone input signals stored in the input signal DB 202 are delayed when generating the pseudo echo, and α is the attenuation coefficient (0.0 < α < 1.0). Equation (31) gives the signal generated by reading out multiple frames of the past frame-divided microphone input signals stored in the input signal DB 202, shifting them in time, multiplying them by the attenuation coefficient, and superimposing them. The delay time of the pseudo echo may be, for example, about 0.1 s to 1.0 s (about 4,800 to 48,000 samples at 48 kHz sampling). For example, with C = 3, p = 50, and α = 0.5 in equation (30), the pseudo echo e(l;m) is generated by superimposing three signals: the microphone input signal from one frame earlier stored in the input signal DB 202; the microphone input signal from two frames earlier, advanced by 50 samples and multiplied by the attenuation coefficient α (= 0.5); and the microphone input signal from three frames earlier, advanced by 100 samples and multiplied by α^2 (= 0.25).
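The pseudo echo construction in the worked example above (C = 3, p = 50, α = 0.5) can be sketched as follows. Equations (30) and (31) are not reproduced in this text, so the details are assumptions: the c-th past frame is shifted by (c − 1)·p samples and scaled by α^(c − 1), the shift is implemented as a delay within the frame with zero fill (the text speaks of the signal being "advanced", so the shift direction in particular is a guess), and the function name is hypothetical.

```python
def pseudo_echo(past_frames, p, alpha):
    """Build one pseudo echo frame e(l;m) from C past frames.

    past_frames[c-1] is the frame stored c frames back; all frames share
    the same length M. Frame c is shifted by (c-1)*p samples and scaled
    by alpha**(c-1) before being added to the output.
    """
    M = len(past_frames[0])
    echo = [0.0] * M
    for c, frame in enumerate(past_frames, start=1):
        shift = (c - 1) * p
        gain = alpha ** (c - 1)
        for m in range(shift, M):
            echo[m] += gain * frame[m - shift]
    return echo
```

Feeding three unit impulses with `p=1` and `alpha=0.5` produces decaying copies at successive sample positions, which is the echo-like pattern the paragraph describes.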

なお、マスカー信号生成部２０４Ｄにおける疑似エコーの生成手法は、種々の方法を広く適用することができる。マスカー信号生成部２０４Ｄでは、例えば、（３２）式と（３３）式に示すように、入力信号ＤＢ２０２に蓄積されている過去のフレーム分割したマイク入力信号を時間処理として時間反転した信号を使用して疑似エコーｅ（ｌ；ｍ）を生成しても良いし、過去のどのフレームを使用するかランダムに決定して疑似エコーｅ（ｌ；ｍ）を生成しても良い。

Note that the masker signal generation unit 204D may use any of a wide variety of methods to generate the pseudo echo. For example, as shown in equations (32) and (33), it may generate the pseudo echo e(l;m) using signals obtained by time-reversing the past frame-divided microphone input signals stored in the input signal DB 202, or it may generate the pseudo echo e(l;m) by randomly deciding which past frames to use.

そして、マスカー信号生成部２０４Ｄは、（３４）式に従い、生成した疑似エコーｅ（ｌ；ｍ）を出力信号ｙ（ｎ）として音出力端子ＯＵＴに出力する。

Then, the masker signal generator 204D outputs the generated pseudo echo e(l;m) to the sound output terminal OUT as the output signal y(n) according to the equation (34).

（Ｅ－３）第５の実施形態の効果
第５の実施形態によれば、以下のような効果を奏することができる。 (E-3) Effects of Fifth Embodiment According to the fifth embodiment, the following effects can be obtained.

第５の実施形態のサウンドマスキング装置１００Ｄは、対象話者Ｕ１の音声を入力信号ＤＢに蓄積し、入力信号ＤＢに蓄積されている過去のフレーム分割されたマイク入力信号を複数フレーム使用して疑似エコーを生成し、疑似エコーをマスカー信号として出力する。これにより、サウンドマスキング装置１００Ｄでは、マスカー信号の音響特徴が対象話者Ｕ１の音声の音響特徴により近くなることから、マスキング効果が向上し、会話の内容が漏れることを防ぐことができる。言い換えると、第５の実施形態のサウンドマスキング装置１００Ｄでも、入力信号ＤＢに蓄積されている対象話者Ｕ１の音声信号を用いてマスカー信号を生成することで、対象話者Ｕ１の音響特性の解析を行わなくても、マスカー信号の音響特徴が対象話者Ｕ１の音声信号の音響特徴に近くなるので、高いマスキング効果が得られる。 The sound masking device 100D of the fifth embodiment accumulates the voice of the target speaker U1 in the input signal DB, generates a pseudo echo using a plurality of frames of the past frame-divided microphone input signal accumulated in the input signal DB, and outputs the pseudo echo as the masker signal. As a result, in the sound masking device 100D, the acoustic features of the masker signal come closer to the acoustic features of the voice of the target speaker U1, which improves the masking effect and prevents the content of the conversation from leaking. In other words, the sound masking device 100D of the fifth embodiment also generates the masker signal from the voice signal of the target speaker U1 accumulated in the input signal DB, so the acoustic features of the masker signal are close to those of the voice signal of the target speaker U1 even without analyzing the acoustic characteristics of the target speaker U1, and a high masking effect is obtained.

（Ｆ）第６の実施形態
以下、本発明による音響処理装置、音響処理プログラム及び音響処理方法の第６の実施形態を、図面を参照しながら詳述する。この実施形態では、本発明の音響処理装置、音響処理プログラム及び音響処理方法を、サウンドマスキング装置に適用した例について説明する。 (F) Sixth Embodiment Hereinafter, a sixth embodiment of the sound processing device, sound processing program, and sound processing method according to the present invention will be described in detail with reference to the drawings. In this embodiment, an example in which the sound processing device, the sound processing program, and the sound processing method of the present invention are applied to a sound masking device will be described.

（Ｆ－１）第６の実施形態の構成
図１０は、第６の実施形態に係るサウンドマスキング装置１００Ｅの機能的構成について示したブロック図である。図１０では、上述の図９と同一部分又は対応部分には、同一符号又は対応符号を付している。 (F-1) Configuration of Sixth Embodiment FIG. 10 is a block diagram showing the functional configuration of a sound masking device 100E according to the sixth embodiment. In FIG. 10, the same reference numerals or corresponding reference numerals are assigned to the same or corresponding portions as in FIG. 9 described above.

以下では、第６の実施形態について、第５の実施形態との差異を中心に説明し、第５の実施形態と重複する部分については説明を省略する。 In the following, the sixth embodiment will be described with a focus on its differences from the fifth embodiment, and descriptions of portions that overlap with the fifth embodiment will be omitted.

第６の実施形態のサウンドマスキング装置１００Ｅでは、サウンドマスキング処理部２００Ｄがサウンドマスキング処理部２００Ｅに置き換わっている点で、第５の実施形態と異なっている。サウンドマスキング処理部２００Ｅは、フレーム分割部２０１、第１の入力信号ＤＢ２１１、第２の入力信号ＤＢ２１２、第１の信号選択部２１３、第２の信号選択部２１４、第１のマスカー生成部２１５、第２のマスカー生成部２１６、及びマスカー信号ミキシング部２１７を有している。 A sound masking device 100E of the sixth embodiment differs from that of the fifth embodiment in that the sound masking processing section 200D is replaced with a sound masking processing section 200E. The sound masking processing unit 200E includes a frame division unit 201, a first input signal DB 211, a second input signal DB 212, a first signal selection unit 213, a second signal selection unit 214, a first masker generation unit 215, It has a second masker generator 216 and a masker signal mixer 217 .

第６の実施形態のサウンドマスキング装置１００Ｅでは、マスカー信号の生成方法が、第１の実施形態、及び第５の実施形態と異なっている。具体的には、サウンドマスキング処理部２００Ｅは、入力されたマイク入力信号から２種類のマスカー信号を生成し、重畳した信号をマスカー信号として出力する。 The sound masking device 100E of the sixth embodiment differs from the first and fifth embodiments in the method of generating the masker signal. Specifically, the sound masking processing unit 200E generates two types of masker signals from the input microphone input signal, and outputs the superimposed signal as the masker signal.

第１の入力信号ＤＢ２１１、第２の入力信号ＤＢ２１２は、第１の実施形態の入力信号ＤＢ２０２と同様のものであるため詳しい説明を省略する。また、第１の信号選択部２１３、第２の信号選択部２１４も、第１の実施形態の信号選択部２０３と名前が異なるだけで同様のものであるため詳しい説明を省略する。 Since the first input signal DB211 and the second input signal DB212 are similar to the input signal DB202 of the first embodiment, detailed description thereof will be omitted. Also, the first signal selection unit 213 and the second signal selection unit 214 are similar to the signal selection unit 203 of the first embodiment except that they have different names, so detailed description thereof will be omitted.

第１のマスカー生成部２１５は、後述する第２のマスカー生成部２１６と異なる方法で、第１の入力信号ＤＢ２１１からマスカー信号を生成し出力する。 The first masker generator 215 generates and outputs a masker signal from the first input signal DB 211 by a method different from that of the second masker generator 216, which will be described later.

第２のマスカー生成部２１６は、第１のマスカー生成部２１５と異なる方法で、第２の入力信号ＤＢ２１２からマスカー信号を生成し出力する。 The second masker generator 216 generates and outputs a masker signal from the second input signal DB 212 by a method different from that of the first masker generator 215 .

マスカー信号ミキシング部２１７は、各マスカー信号生成部から出力されたマスカー信号をミキシングして最終的に出力するマスカー信号を生成する。 The masker signal mixing unit 217 mixes the masker signals output from the respective masker signal generation units to generate a masker signal to be finally output.

第１の入力信号ＤＢ２１１と第２の入力信号ＤＢ２１２には、両方のＤＢに同様のデータ（例えば、第１の入力信号ＤＢ２１１と第２の入力信号ＤＢ２１２に第１の実施形態における入力信号ＤＢ２０２と同様のデータ）を蓄積するようにしても良いし、異なるデータ（例えば、第１の入力信号ＤＢ２１１は、第１の実施形態における入力信号ＤＢ２０２、第２の入力信号ＤＢ２１２は、第３の実施形態における入力信号ＤＢ２０２Ａと同様のデータ）を蓄積するようにしても良い。 The first input signal DB 211 and the second input signal DB 212 may both accumulate the same data (for example, both may accumulate the same data as the input signal DB 202 in the first embodiment), or they may accumulate different data (for example, the first input signal DB 211 may accumulate the same data as the input signal DB 202 in the first embodiment, and the second input signal DB 212 the same data as the input signal DB 202A in the third embodiment).
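
The configuration of the sound masking processing unit 200E can be pictured with a rough structural sketch. This is not the patent's implementation. The class `SoundMaskingSketch`, its fields, and the bounded `deque` buffers are hypothetical. The signal selection units 213 and 214 are reduced to "hand every stored frame to the generator", and both DBs store the same data, which is only one of the two options described above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

Frame = List[float]

@dataclass
class SoundMaskingSketch:
    gen1: Callable[[List[Frame]], Frame]   # stands in for the first masker generation unit 215
    gen2: Callable[[List[Frame]], Frame]   # stands in for the second masker generation unit 216
    mix: Callable[[Frame, Frame], Frame]   # stands in for the masker signal mixing unit 217
    capacity: int = 100                    # assumed DB size in frames
    db1: Deque[Frame] = field(init=False)  # first input signal DB 211
    db2: Deque[Frame] = field(init=False)  # second input signal DB 212

    def __post_init__(self) -> None:
        self.db1 = deque(maxlen=self.capacity)
        self.db2 = deque(maxlen=self.capacity)

    def process(self, frame: Frame) -> Frame:
        # Simplification: the new frame is stored before the maskers are generated.
        self.db1.append(frame)
        self.db2.append(frame)
        masker1 = self.gen1(list(self.db1))
        masker2 = self.gen2(list(self.db2))
        return self.mix(masker1, masker2)
```

Keeping the generators and the mixer as injected callables mirrors the block diagram: each unit can be swapped without touching the others.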

（Ｆ－２）第６の実施形態の動作
次に、以上のような構成を有する第６の実施形態におけるサウンドマスキング装置１００Ｅの動作（実施形態に係る音響処理方法）について詳細に説明する。 (F-2) Operation of Sixth Embodiment Next, the operation of the sound masking device 100E (acoustic processing method according to the embodiment) of the sixth embodiment having the configuration described above will be described in detail.

第６の実施形態に係るサウンドマスキング装置１００Ｅにおけるサウンドマスキング処理の基本的な動作は、第５の実施形態で説明したサウンドマスキング処理と同様である。 The basic operation of the sound masking process in the sound masking device 100E according to the sixth embodiment is the same as the sound masking process described in the fifth embodiment.

本発明の第６の実施形態に係るサウンドマスキング装置１００Ｅの動作を詳細に説明する。 The operation of the sound masking device 100E according to the sixth embodiment of the invention will be described in detail.

第１のマスカー生成部２１５は、第１の入力信号ＤＢ２１１に蓄積されている過去のフレーム分割したマイク入力信号を使用して第２のマスカー生成部２１６とは異なる方法でマスカー信号を生成する。 The first masker generator 215 uses the past frame-divided microphone input signal stored in the first input signal DB 211 to generate a masker signal by a method different from that of the second masker generator 216 .

第２のマスカー生成部２１６は、第２の入力信号ＤＢ２１２に蓄積されている過去のフレーム分割したマイク入力信号を使用して第１のマスカー生成部２１５とは異なる方法でマスカー信号を生成する。 The second masker generation unit 216 generates a masker signal by a method different from that of the first masker generation unit 215 using the past frame-divided microphone input signal stored in the second input signal DB 212 .

例えば、第１のマスカー生成部２１５は、（６）式、又は（７）式に示すようにマスカー信号ｈ（ｌ；ｍ）を生成し、第２のマスカー生成部２１６は、（３２）式、又は（３４）式に示すような疑似エコーｅ（ｌ；ｍ）をマスカー信号として生成するようにしても良い。 For example, the first masker generator 215 generates the masker signal h(l;m) as shown in equation (6) or (7), and the second masker generator 216 generates the masker signal h(l;m) as shown in equation (32) , or a pseudo echo e(l;m) as shown in equation (34) may be generated as a masker signal.

マスカー信号ミキシング部２１７は、第１のマスカー生成部２１５、及び第２のマスカー生成部２１６から出力されたマスカー信号をミキシングし、マスカー信号ｍｉｘ（ｌ；ｍ）として出力する。マスカー信号ミキシング部２１７は、例えば、（３５）式に基づいて、第１のマスカー生成部２１５、及び第２のマスカー生成部２１６から出力されたマスカー信号をミキシングするようにしても良い。 The masker signal mixing unit 217 mixes the masker signals output from the first masker generation unit 215 and the second masker generation unit 216, and outputs a masker signal mix(l;m). The masker signal mixing section 217 may mix the masker signals output from the first masker generation section 215 and the second masker generation section 216, for example, based on the equation (35).

（３５）式で、β（０．０≦β≦１．０）はどちらのマスカー信号を多く使用するかのパラメータである。第１のマスカー生成部２１５のマスカー信号を多く使用したい場合、βは１に近い値が望ましく（例えば、β＝０．９等の値）、第２のマスカー生成部２１６のマスカー信号を多く使用したい場合、βは０に近い値が望ましい（例えば、β＝０．１等の値）。 In equation (35), β (0.0 ≤ β ≤ 1.0) is a parameter that determines which masker signal is used more. To use more of the masker signal from the first masker generator 215, β is preferably close to 1 (for example, β = 0.9); to use more of the masker signal from the second masker generator 216, β is preferably close to 0 (for example, β = 0.1).
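
The role of β can be illustrated with a short sketch. Equation (35) is not reproduced in this text, so the convex combination below, with β weighting the first masker and (1 − β) weighting the second, is an assumption that matches the examples given (β = 0.9 favours the first masker, β = 0.1 the second). The function name `mix_maskers` is hypothetical.

```python
import numpy as np

def mix_maskers(h, e, beta):
    """Blend two masker frames: beta weights h, (1 - beta) weights e."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0.0, 1.0]")
    h = np.asarray(h, dtype=float)   # masker from the first masker generation unit 215
    e = np.asarray(e, dtype=float)   # masker (e.g. pseudo echo) from the second unit 216
    if h.shape != e.shape:
        raise ValueError("both maskers must have the same frame length")
    return beta * h + (1.0 - beta) * e
```

Because the two weights sum to 1, the mixed frame stays within the amplitude range spanned by the two inputs, so changing β shifts the balance without changing the overall level much.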

マスカー信号ミキシング部２１７は、（３６）式に従い、ミキシングしたマスカー信号ｍｉｘ（ｌ；ｍ）を出力信号ｙ（ｎ）として出力する。

The masker signal mixing unit 217 outputs the mixed masker signal mix(l;m) as the output signal y(n) according to the equation (36).

（Ｆ－３）第６の実施形態の効果
第６の実施形態によれば以下のような効果を奏することができる。 (F-3) Effects of Sixth Embodiment According to the sixth embodiment, the following effects can be obtained.

第６の実施形態のサウンドマスキング装置１００Ｅでは、対象話者Ｕ１の音声を第１の入力信号ＤＢ２１１及び第２の入力信号ＤＢ２１２に蓄積し、各入力信号ＤＢに蓄積されている過去のマイク入力信号を複数フレーム使用し、それぞれ異なる方法でマスカー信号を生成し、ミキシングする量を調節してミキシングし出力する。これにより、第６の実施形態のサウンドマスキング装置１００Ｅでは、対象話者Ｕ１にマスキング効果が高い方式のマスカー音のミキシング量を調節できるので、よりマスキング効果を高めることができる。 In the sound masking device 100E of the sixth embodiment, the voice of the target speaker U1 is accumulated in the first input signal DB 211 and the second input signal DB 212, and the past microphone input signals accumulated in each input signal DB are are used for a plurality of frames, masker signals are generated by different methods, the amount of mixing is adjusted, and the signals are mixed and output. As a result, the sound masking apparatus 100E of the sixth embodiment can adjust the mixing amount of the masker sound of the method having a high masking effect on the target speaker U1, so that the masking effect can be further enhanced.

（Ｇ）他の実施形態
本発明は、上記の各実施形態に限定されるものではなく、以下に例示するような変形実施形態も挙げることができる。 (G) Other Embodiments The present invention is not limited to the above-described embodiments, and modified embodiments such as those illustrated below can also be included.

（Ｇ－１）例えば、本発明のサウンドマスキング装置を電話会議で周囲の対象者以外の人に対して、会話の内容が漏れることを防止する装置に搭載されるようにしても良い。この場合、サウンドマスキング装置において、対象話者Ｕ１は電話会議で発話している人となる。 (G-1) For example, the sound masking device of the present invention may be installed in a device that prevents the content of a conversation in a teleconference from leaking to nearby people other than the intended participants. In this case, the target speaker U1 of the sound masking device is the person speaking in the teleconference.

（Ｇ－２）上記の各実施形態において、サウンドマスキング装置の、サウンドマスキング部は、ネットワーク上の処理装置（例えば、サーバ等）で処理される構成としても良い。 (G-2) In each of the above embodiments, the sound masking unit of the sound masking device may be configured to be processed by a processing device (for example, a server, etc.) on the network.

（Ｇ－３）上記の各実施形態において、サウンドマスキング装置には、オーディオデバイス（マイク、マイクアンプ、ＡＤ変換器、スピーカ、スピーカアンプ、及びＤＡ変換器）が含まれる構成として説明したが、サウンドマスキング装置についてオーディオデバイスを除外した構成として製造し、実際に使用する現場でオーディオデバイスを別途接続するようにしても良い。すなわち、サウンドマスキング装置には、少なくともサウンドマスキング処理部が含まれる構成としても良い。 (G-3) In each of the above embodiments, the sound masking device has been described as including audio devices (microphone, microphone amplifier, AD converter, speaker, speaker amplifier, and DA converter). However, the sound masking device may be manufactured without the audio devices, and the audio devices may be connected separately at the site of actual use. That is, the sound masking device may include at least the sound masking processing unit.

１００、１００Ａ、１００Ｂ、１００Ｃ、１００Ｄ、１００Ｅ…サウンドマスキング装置、１０１…マイク、１０２…マイクアンプ、１０３…ＡＤ変換器、１０４…スピーカ、１０５…スピーカアンプ、１０６…ＤＡ変換器、１０７…スピーカ、２００、２００Ａ、２００Ｂ、２００Ｃ、２００Ｄ、２００Ｅ…サウンドマスキング処理部、２０１…フレーム分割部、２０２、２０２Ａ…入力信号ＤＢ、２０３、２０３Ａ、２０３Ｂ…信号選択部、２０４、２０４Ａ、２０４Ｂ、２０４Ｃ、２０４Ｄ…マスカー信号生成部、２０５…音声区間判定部、２０６…ＤＢ蓄積判定部、２０７…ピッチ推定部、２０８…クラス判定部、２０９…第三者音声信号ＤＢ、２１０…使用ＤＢ判定部、２１１…第１の入力信号ＤＢ、２１２…第２の入力信号ＤＢ、２１３…第１の信号選択部、２１４…第２の信号選択部、２１５…第１のマスカー生成部、２１６…第２のマスカー生成部、２１７…マスカー信号ミキシング部、３００…コンピュータ、３０１…プロセッサ、３０２…一次記憶部、３０３…二次記憶部。 100, 100A, 100B, 100C, 100D, 100E... Sound masking device, 101... Microphone, 102... Microphone amplifier, 103... AD converter, 104... Speaker, 105... Speaker amplifier, 106... DA converter, 107... Speaker, 200, 200A, 200B, 200C, 200D, 200E... Sound masking processing unit, 201... Frame division unit, 202, 202A... Input signal DB, 203, 203A, 203B... Signal selection unit, 204, 204A, 204B, 204C, 204D... Masker signal generation unit, 205... Speech section determination unit, 206... DB accumulation determination unit, 207... Pitch estimation unit, 208... Class determination unit, 209... Third-party voice signal DB, 210... Usage DB determination unit, 211... First input signal DB, 212... Second input signal DB, 213... First signal selection unit, 214... Second signal selection unit, 215... First masker generation unit, 216... Second masker generation unit, 217... Masker signal mixing unit, 300... Computer, 301... Processor, 302... Primary storage unit, 303... Secondary storage unit.

Claims

a frame dividing means for dividing a microphone input signal supplied from a microphone for picking up a voice uttered by a target speaker into predetermined lengths;
input signal accumulation means for accumulating the frame-divided microphone input signal;
signal selection means for selecting a signal to be used for generating a masker signal from past frame-divided microphone input signals accumulated in the input signal accumulation means, and for outputting a selection result;
a masker signal generating means for generating and outputting the masker signal that makes the speech uttered by the target speaker difficult to hear, using the signal used to generate the masker signal;
pitch estimation means for estimating the pitch of the microphone input signal;
The input signal accumulation means sorts and accumulates the microphone input signal into one of a plurality of classes according to the pitch estimated by the pitch estimation means,
The masker signal generating means generates a masker signal using a microphone input signal of a class corresponding to the pitch estimated by the pitch estimating means from the input signal accumulating means.
An acoustic processing device characterized by:

2. The sound processing apparatus according to claim 1, further comprising a speaker for emitting said masker signal output by said masker signal generating means toward a person to be masked other than said target speaker.

The sound processing device of claim 1, further comprising a speaker arranged so that the masker signal output by the masker signal generating means is reflected by a reflecting surface and the reflected sound is directed toward a person to be masked other than the target speaker.

further comprising a speech section determination unit that determines whether the microphone input signal is a speech section or a non-speech section,
3. The sound processing apparatus according to claim 1, wherein the input signal accumulation means accumulates the microphone input signal only when the speech period is determined.

The sound processing apparatus according to any one of claims 1 to 4, further comprising:
third party signal storage means for storing a third party voice signal obtained by picking up a voice uttered by a third party different from the target speaker; and
accumulation determination means for determining whether or not a predetermined amount or more of the microphone input signal is accumulated in the input signal accumulation means,
wherein the masker signal generation means generates the masker signal using the third party voice signal stored in the third party signal storage means only while the accumulation determination means determines that the input signal accumulation means has not accumulated a predetermined amount or more of the microphone input signal.

The sound processing device according to claim 1, wherein
the input signal accumulation means accumulates microphone input signals divided into a plurality of frames, and
the masker signal generating means outputs, as the masker signal, a signal obtained by superimposing a plurality of frames of the microphone input signal accumulated in the input signal accumulation means, or a signal obtained by time-processing and superimposing a plurality of frames of the microphone input signal accumulated in the input signal accumulation means.

The sound processing device according to claim 1, wherein the masker signal generation means delays the microphone input signal stored in the input signal storage means by a predetermined amount to generate a pseudo echo, and outputs the generated pseudo echo as the masker signal.

The sound processing apparatus according to claim 1, wherein
the input signal accumulation means accumulates microphone input signals divided into a plurality of frames, and
the masker signal generating means:
generates, as a first masker signal, a signal obtained by superimposing a plurality of frames of the microphone input signal accumulated in the input signal accumulation means, or a signal obtained by temporally processing and superimposing the plurality of frames of the microphone input signal accumulated in the input signal accumulation means;
delays the microphone input signal accumulated in the input signal accumulation means by a predetermined amount to generate a pseudo echo, and generates the generated pseudo echo as a second masker signal; and
generates and outputs, as the masker signal, a signal obtained by superimposing the first masker signal and the second masker signal.

the computer,
a frame dividing means for dividing a microphone input signal supplied from a microphone for picking up a voice uttered by a target speaker into predetermined lengths;
input signal accumulation means for accumulating the frame-divided microphone input signal;
signal selection means for selecting a signal to be used for generating a masker signal from past frame-divided microphone input signals accumulated in the input signal accumulation means, and for outputting a selection result;
a masker signal generating means for generating and outputting the masker signal that makes the speech uttered by the target speaker difficult to hear, using the signal used to generate the masker signal;
functioning as pitch estimation means for estimating the pitch of the microphone input signal,
The input signal accumulation means sorts and accumulates the microphone input signal into one of a plurality of classes according to the pitch estimated by the pitch estimation means,
The masker signal generating means generates a masker signal using a microphone input signal of a class corresponding to the pitch estimated by the pitch estimating means from the input signal accumulating means.
A sound processing program characterized by:

In the acoustic processing method,
having frame division means, input signal accumulation means, signal selection means, masker signal generation means and pitch estimation means,
The frame dividing means divides a microphone input signal supplied from a microphone for picking up a voice uttered by a target speaker into predetermined lengths,
The input signal accumulation means accumulates the frame-divided microphone input signal,
The signal selection means selects a signal to be used for generating a masker signal from past frame-divided microphone input signals accumulated in the input signal accumulation means, and outputs a selection result;
The masker signal generating means uses the signal used to generate the masker signal to generate and output the masker signal that makes the speech uttered by the target speaker difficult to hear,
The pitch estimation means estimates the pitch of the microphone input signal,
The input signal accumulation means sorts and accumulates the microphone input signal into one of a plurality of classes according to the pitch estimated by the pitch estimation means,
The masker signal generating means generates a masker signal using a microphone input signal of a class corresponding to the pitch estimated by the pitch estimating means from the input signal accumulating means.
An acoustic processing method characterized by: