JP7567716B2

JP7567716B2 - Estimation method, estimation program, deep neural network device, and estimation device

Info

Publication number: JP7567716B2
Application number: JP2021136088A
Authority: JP
Inventors: 悠馬小泉; 昌弘安田; 浩平矢田部; 義紀升山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2024-10-16
Anticipated expiration: 2041-08-24
Also published as: JP2023030771A

Description

Article 30, paragraph 2 of the Patent Act applies.
[Disclosure 1] 1. Issue date: August 26, 2020. 2. Publication: The Acoustical Society of Japan (General Incorporated Foundation), 2020 Autumn Research Presentation Meeting (lecture proceedings), https://acoustics.jp/annualmeeting/past-meetings/ 3. Disclosed by: Yuma Koizumi, Noboru Harada, Kohei Yatabe, Yoshinori Masuyama, Yasuhiro Oikawa.
[Disclosure 2] 1. Dates held: September 9, 2020 to September 11, 2020 (date of public knowledge: September 10, 2020). 2. Name of meeting: The Acoustical Society of Japan (General Incorporated Foundation), 2020 Autumn Research Presentation Meeting (held online), https://acoustics.jp/annualmeeting/past-meetings/ 3. Disclosed by: Yuma Koizumi, Noboru Harada, Kohei Yatabe, Yoshinori Masuyama, Yasuhiro Oikawa.
[Disclosure 3] 1. Website posting date: October 28, 2020. 2. Website address: IEEE signal processing (vol15, No1, Jan 2021), https://ieeexplore.ieee.org/document/9242279 3. Disclosed by: Yuma Koizumi, Noboru Harada, Kohei Yatabe, Yoshinori Masuyama, Yasuhiro Oikawa.

The present invention relates to an acoustic signal restoration technique that restores a phase spectrum from an amplitude spectrum alone.

Many acoustic signal processing applications, such as speech synthesis and speech enhancement, convert the observed time-domain signal (waveform) into the time-frequency domain, for example with the short-time Fourier transform (STFT), and process it there. The STFT spectrum is complex-valued, so both the amplitude spectrogram and the phase spectrogram are required to restore the time signal from an STFT spectrogram. The phase spectrum, however, is difficult to handle. For this reason, speech synthesis and speech enhancement often estimate and control only the amplitude spectrum, substitute the minimum phase or the observed phase for the phase spectrum, and then convert back to a time signal.

The amplitude spectrogram and the phase spectrogram are not independent variables, so when one is controlled, the other must take a corresponding value. In speech synthesis and speech enhancement, inconsistency between amplitude and phase has therefore sometimes degraded the quality of the output sound.

The technique disclosed in Non-Patent Document 1 is known as a way to estimate, from an amplitude spectrogram, a phase spectrogram that is consistent with that amplitude. This technique, called the Griffin-Lim algorithm, estimates a consistent phase spectrogram from an amplitude spectrogram A by repeating the following steps:
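The two update equations themselves do not survive in this text. Judging from the symbol definitions given next, and from the later statements that the phase assignment unit corresponds to equation (2) and the conversion unit to equation (1), they presumably take the following form. The arrow notation is an assumption; \odot and \oslash stand for the ◎ and : operators of the text.

```latex
\begin{align}
X &\leftarrow G\,G^{\dagger} X \tag{1}\\
X &\leftarrow A \odot X \oslash |X| \tag{2}
\end{align}
```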

Here, X is a complex spectrogram with amplitude A, G is the STFT, G† is the inverse STFT, ◎ is element-wise multiplication, : is element-wise division, and |·| is the element-wise absolute value. Equations (1) and (2) are equivalent to solving the following optimization problem.
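The optimization problem is also missing from this text. Based on the description that follows (a Frobenius norm, and a set B of spectrograms with amplitude A), a plausible reconstruction is given below. The squared exponent is an assumption and does not change the minimizer.

```latex
\begin{equation}
\min_{X}\; \bigl\| X - G\,G^{\dagger} X \bigr\|_{\mathrm{Fro}}^{2}
\quad \text{s.t.} \quad X \in B \tag{3}
\end{equation}
```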

Here, ||·||_Fro denotes the Frobenius norm, and B is the set of spectrograms with amplitude A. As described above, when the minimum phase or the observed phase is substituted for the phase spectrum, applying the inverse STFT and the STFT of equation (1) to the complex spectrogram X does not return the original complex spectrogram X. Therefore, equation (2) fixes the amplitude to the given amplitude spectrogram A, and equation (3) finds the phase so that the result is a valid short-time Fourier transform representation.
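The alternation between equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of Non-Patent Document 1: the window, the frame length `N_FFT`, the hop size `HOP`, the random initial phase, and all function names are assumptions made for the example.

```python
import numpy as np

N_FFT, HOP = 256, 64                 # assumed analysis parameters
WIN = np.hanning(N_FFT + 1)[:-1]     # periodic Hann window (assumed)

def stft(x):
    """G: frame the waveform, window it, and take a one-sided FFT."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X):
    """G†: least-squares inverse by windowed overlap-add."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WIN
    x = np.zeros(N_FFT + HOP * (len(X) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(A, n_iter=50, seed=0):
    """Alternate equations (1) and (2), starting from a random phase."""
    rng = np.random.default_rng(seed)
    X = A * np.exp(2j * np.pi * rng.random(A.shape))
    for _ in range(n_iter):
        X = stft(istft(X))                 # (1): X <- G G† X
        # (2): X <- A ◎ X : |X|, written with the phase angle so that
        # bins with |X| = 0 do not cause a division by zero.
        X = A * np.exp(1j * np.angle(X))
    return X

def inconsistency(X):
    """Frobenius distance between X and G G† X, the quantity behind (3)."""
    return np.linalg.norm(X - stft(istft(X)))
```

Because step (2) is applied last, the returned spectrogram always has amplitude A; `inconsistency` gives a way to watch how far it still is from a valid STFT.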

The method disclosed in Non-Patent Document 1 is applicable to any acoustic signal, but it requires a very large number of iterations. The reason is that its optimization framework makes no assumptions at all about the statistical properties of the acoustic signal to be restored.

Non-Patent Document 2, on the other hand, proposes a method that incorporates deep learning into the Griffin-Lim algorithm of Non-Patent Document 1. Specifically, the statistical properties of the signal to be restored are incorporated through a deep neural network (DNN) trained on training data. Figure 8 shows the configuration of the estimation device disclosed in Non-Patent Document 2. The estimation device 100 includes M estimation units 110-m (m = 0, 1, 2, ..., M-1, where M is an integer equal to or greater than 1) and a phase assignment unit 120.

Figure 9 is a block diagram showing the configuration of the estimation unit 110-m. The estimation unit 110-m is composed of a phase assignment unit 111 corresponding to equation (2), a conversion unit 112 corresponding to equation (1), and a phase modification unit 113 that brings the phase of the complex spectrogram X closer to the phase of the desired acoustic signal, based on the statistical properties of training acoustic signals corresponding to the desired acoustic signal. The phase modification unit 113 is composed of a DNN 114 and a subtraction unit 115.

In the configuration shown in Figures 8 and 9, DNN processing is performed after each single iteration of the Griffin-Lim algorithm, which realizes consistent phase estimation that takes the statistical properties of the signal to be restored into account. The configuration of Figures 8 and 9 is equivalent to stacking the internal DNN M times, once per iteration. In other words, the scale of the DNN at processing time can be changed by controlling the number of iterations M of the estimation units 110-m. Reducing M is equivalent to using a shallow DNN: restoration performance decreases, but computation is fast. Increasing M is equivalent to using a deep DNN: processing is slower, but a high-quality output sound is obtained.

The DNN used here may be trained in any manner from training data of the signal to be restored, and may perform any processing that brings the phase of the output of the Griffin-Lim algorithm closer to that of the signal to be restored. As one example of how the DNN can be trained, residual learning is described below.

Y^[m] = P_B(X^[m]) ... (4)
Z^[m] = P_C(Y^[m]) ... (5)
X^[m+1] = E(X^[m]) ... (6)
= Z^[m] - F_θ(X^[m], Y^[m], Z^[m]) ... (7)

Y^[m] is the output signal of the phase assignment unit 111, Z^[m] is the output signal of the conversion unit 112, and X^[m+1] is the output signal of the phase modification unit 113. Here, F_θ is a DNN consisting of real-valued convolution layers and gated linear layers. In other words, in the configuration of Figure 9, a DNN trained on the statistical properties of the signal to be restored removes (subtracts) the distortion and estimation error produced by the Griffin-Lim algorithm. The DNN therefore does not estimate the signal to be restored directly; it estimates the components that do not belong to that signal. The DNN is trained, for example, to minimize the following objective function.
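The objective function itself is missing from this text. From the later description of the parameter update unit, which compares the difference between Z tilde and the true spectrogram with the DNN output, its general shape is presumably as follows. The choice of norm cannot be recovered from this text, so an unspecified norm is written here as an assumption.

```latex
\begin{equation}
\min_{\theta}\;
\bigl\| (\tilde{Z} - X^{*}) - F_{\theta}(\tilde{X}, \tilde{Y}, \tilde{Z}) \bigr\|
\end{equation}
```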

Here, X^* is the true complex spectrogram. X tilde, Y tilde, and Z tilde are given by the following equations.
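These defining equations are also missing. Based on the noise addition, phase assignment, and conversion steps described for the learning device, they can presumably be written as below, where P_B is taken with the amplitude of X^* (an inference from the remark that the amplitude of Y tilde matches that of X^*).

```latex
\begin{align}
\tilde{X} &= X^{*} + N \\
\tilde{Y} &= P_{B}(\tilde{X}) \\
\tilde{Z} &= P_{C}(\tilde{Y})
\end{align}
```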

N is complex Gaussian noise. Note that the Griffin-Lim algorithm restores only the phase spectrum. For this reason, the amplitude of Y tilde is made to match the amplitude of the true complex spectrogram X^*.

The training stage of the DNN disclosed in Non-Patent Document 2 is described next. Figure 10 is a block diagram showing the configuration of a conventional learning device. The learning device 200 shown in Figure 10 receives as input the training data of the signal to be restored (a clean acoustic signal X^(L)*, expressed as a complex spectrogram), the amplitude spectrogram A^(L) of the clean acoustic signal X^(L)*, noise N, and the parameters required for the various optimizations.

The learning device 200 is composed of a noise addition unit 209, a phase assignment unit 211, a conversion unit 212, a DNN 213, a subtraction unit 214, and a parameter update unit 215.

Figure 11 is a block diagram showing the configuration of the DNN 213. The DNN 213 is composed of a plurality of real-valued convolution layers 2130 to 2134 and a plurality of gated linear layers 2135 to 2138. In Figure 11, s denotes the convolution stride, c the number of channels, and k the kernel size.

Figure 12 is a flowchart explaining the operation of the learning device 200. For example, an initialization unit (not shown) initializes the parameters θ of the DNN 213 with random numbers (step S100 in Figure 12).
The noise addition unit 209 receives the clean acoustic signal X^(L)* and the noise N as input, adds the noise N to the clean acoustic signal X^(L)*, and outputs the complex spectrogram X tilde (step S101 in Figure 12).

The phase assignment unit 211 receives the complex spectrogram X tilde and the amplitude spectrogram A^(L) as input, assigns the phase of the complex spectrogram X tilde to the amplitude spectrogram A^(L) as shown in the following equation, and outputs the resulting signal Y tilde (step S102 in Figure 12).

Y tilde = A^(L) ◎ X tilde : |X tilde| ... (13)

As described above, ◎ denotes element-wise multiplication, : denotes element-wise division, and |·| denotes the element-wise absolute value. Equation (13) multiplies each element of the complex spectrogram X tilde by the corresponding element of the amplitude spectrogram A^(L) and divides the result by the amplitude spectrogram |X tilde| of the complex spectrogram X tilde. It can therefore be regarded as a process that rescales the amplitude of the complex spectrogram X tilde to that of the amplitude spectrogram A^(L).

The conversion unit 212 receives the signal Y tilde as input and, according to the following equation, converts the signal Y tilde into a time waveform by the inverse short-time Fourier transform G†, converts that time waveform into the frequency-domain signal Z tilde by the short-time Fourier transform G corresponding to G†, and outputs it (step S103 in Figure 12).

Z tilde = G G† Y tilde ... (14)

The DNN 213 receives the complex spectrogram X tilde, the signal Y tilde, and the signal Z tilde as input, estimates the distortion or estimation error produced by the Griffin-Lim algorithm, and outputs the estimate F_θ(X tilde, Y tilde, Z tilde) (step S104 in Figure 12).

Meanwhile, the subtraction unit 214 computes and outputs the difference Z tilde - X^(L)* between the signal Z tilde and the clean acoustic signal X^(L)* (step S105 in Figure 12).

The parameter update unit 215 receives the difference Z tilde - X^(L)* and the estimate F_θ(X tilde, Y tilde, Z tilde) as input, and uses these values to update the parameters θ of the DNN 213 so as to minimize the following objective function (step S106 in Figure 12).

As the training method, stochastic steepest descent or a similar method may be used. The learning rate may be set to, for example, about 10^-5.
The parameter update unit 215 determines whether a predetermined condition is satisfied (step S107 in Figure 12). If it is, the DNN 213 at that point is taken as the trained DNN.

If the predetermined condition is not satisfied, steps S101 to S106 are carried out again using a new clean acoustic signal X^(L)*, new noise N, and the updated parameters θ. For example, the condition is regarded as satisfied, and training of the DNN 213 ends, once steps S101 to S106 have been repeated 100,000 times.
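Steps S101 to S105 can be sketched in NumPy as shown below. This is only an illustration under stated assumptions: the STFT parameters, the noise level `sigma`, the function names, and the use of a mean absolute error as the objective are all choices made for the example, since the objective function is not reproduced in this text. The parameter update of step S106 would need an automatic differentiation framework and is not shown; `f_theta` is a hypothetical stand-in for the DNN 213.

```python
import numpy as np

N_FFT, HOP = 64, 16                  # assumed analysis parameters
WIN = np.hanning(N_FFT + 1)[:-1]     # periodic Hann window (assumed)

def stft(x):
    """G: frame the waveform, window it, and take a one-sided FFT."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X):
    """G†: least-squares inverse by windowed overlap-add."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WIN
    x = np.zeros(N_FFT + HOP * (len(X) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return x / np.maximum(norm, 1e-8)

def make_training_example(X_true, rng, sigma=0.5):
    """Build the DNN inputs and the residual target from a clean spectrogram."""
    A_L = np.abs(X_true)                                  # amplitude spectrogram A^(L)
    noise = sigma * (rng.standard_normal(X_true.shape)
                     + 1j * rng.standard_normal(X_true.shape)) / np.sqrt(2)
    X_t = X_true + noise                                  # S101: add complex Gaussian noise
    Y_t = A_L * np.exp(1j * np.angle(X_t))                # S102: phase of X tilde on A^(L)
    Z_t = stft(istft(Y_t))                                # S103: G G† Y tilde
    target = Z_t - X_true                                 # S105: residual to be predicted
    return X_t, Y_t, Z_t, target

def objective(f_theta, X_t, Y_t, Z_t, target):
    """Assumed objective: mean absolute error between target and DNN output."""
    return np.mean(np.abs(target - f_theta(X_t, Y_t, Z_t)))
```

A network that outputs zeros leaves the whole residual as loss, while a hypothetical oracle returning `Z_t - X_true` drives the objective to zero, which is the behavior the residual-learning setup aims for.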

Figure 13 is a flowchart explaining the operation of the estimation device 100. The estimation device 100 receives as input a complex spectrogram X^[0] whose phase and amplitude are inconsistent and the amplitude spectrogram A of the desired acoustic signal, and computes and outputs a complex spectrogram Y^[M] whose phase spectrogram is consistent with the amplitude spectrogram A. The amplitude of the complex spectrogram X^[0] is the amplitude spectrogram A.

Each of the M estimation units 110-m (m = 0, 1, 2, ..., M-1, where M is an integer equal to or greater than 1) receives as input a complex spectrogram X^[m] whose phase and amplitude are inconsistent and the amplitude spectrogram A of the desired acoustic signal, and computes and outputs a complex spectrogram X^[m+1] carrying the estimated phase spectrogram.

As described above, the estimation unit 110-m is composed of a phase assignment unit 111, a conversion unit 112, and a phase modification unit 113. The phase modification unit 113 is composed of a DNN 114 and a subtraction unit 115. The DNN trained by the learning device 200 of Figure 10 is set in the DNN 114.

The phase assignment unit 111 receives as input the complex spectrogram X^[m] whose phase and amplitude are inconsistent and the amplitude spectrogram A of the desired acoustic signal, assigns the phase of the complex spectrogram X^[m] to the amplitude spectrogram A as shown in the following equation, and outputs the resulting signal Y^[m] = P_B(X^[m]) (step S201 in Figure 13).

Y^[m] = A ◎ X^[m] : |X^[m]| ... (16)

Equation (16) multiplies each element of the complex spectrogram X^[m] by the corresponding element of the amplitude spectrogram A and divides the result by the amplitude spectrogram |X^[m]| of the complex spectrogram X^[m]. It can therefore be regarded as a process that rescales the amplitude of the complex spectrogram X^[m] to that of the amplitude spectrogram A.

The conversion unit 112 receives the signal Y^[m] as input and, according to the following equation, converts the signal Y^[m] into a time waveform by the inverse short-time Fourier transform G†, converts that time waveform into the frequency-domain signal Z^[m] = P_C(Y^[m]) by the short-time Fourier transform G corresponding to G†, and outputs it (step S202 in Figure 13).

Z^[m] = G G† Y^[m] ... (17)

The processing of step S202 corresponds to converting the complex spectrogram Y^[m], whose phase and amplitude are inconsistent, into a time waveform, and then converting that time waveform into a complex spectrogram Z^[m] whose phase and amplitude are consistent.

The phase modification unit 113 uses the complex spectrogram X^[m], the signal Y^[m], and the signal Z^[m] to bring the phase of the complex spectrogram X^[m] closer to the phase of the desired acoustic signal, based on the statistical properties of training acoustic signals corresponding to the desired acoustic signal.

The DNN 114 receives the complex spectrogram X^[m], the signal Y^[m], and the signal Z^[m] as input, estimates the distortion or estimation error (Z^[m] - X^[m]) produced by the Griffin-Lim algorithm, and outputs the estimate F_θ(X^[m], Y^[m], Z^[m]) (step S203 in Figure 13).

The subtraction unit 115 computes and outputs the difference X^[m+1] = Z^[m] - F_θ(X^[m], Y^[m], Z^[m]) between the signal Z^[m] and the estimate F_θ(X^[m], Y^[m], Z^[m]) (step S204 in Figure 13). This subtraction removes the distortion or estimation error produced by the Griffin-Lim algorithm, and thereby brings the phase spectrogram of the complex spectrogram X^[m] closer to that of the desired acoustic signal.

Steps S201 to S204 are repeated M times, once for each of the estimation units 110-m. When all M passes are complete (YES in step S206 in Figure 13), the final-stage estimation unit 110-(M-1) outputs the complex spectrogram X^[M]. The number of iterations M may be set to, for example, about 5.

The phase assignment unit 120 receives the complex spectrogram X^[M] and the amplitude spectrogram A as input, assigns the phase of the complex spectrogram X^[M] to the amplitude spectrogram A as shown in the following equation, and outputs the resulting signal Y^[M] = P_B(X^[M]) (step S207 in Figure 13).

Y^[M] = A ◎ X^[M] : |X^[M]| ... (18)

The processing of step S207 once again rescales the amplitude of the complex spectrogram X^[M] to that of the amplitude spectrogram A.
With the above configuration, the technique disclosed in Non-Patent Document 2 can restore a consistent phase spectrum from the amplitude spectrum alone by exploiting the statistical properties of the signal to be restored.
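The inference flow of steps S201 to S207 can be sketched as follows. This is a minimal illustration rather than the implementation of Non-Patent Document 2: the STFT parameters and function names are assumptions, and `f_theta` is a hypothetical callable standing in for the trained DNN 114.

```python
import numpy as np

N_FFT, HOP = 64, 16                  # assumed analysis parameters
WIN = np.hanning(N_FFT + 1)[:-1]     # periodic Hann window (assumed)

def stft(x):
    """G: frame the waveform, window it, and take a one-sided FFT."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X):
    """G†: least-squares inverse by windowed overlap-add."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WIN
    x = np.zeros(N_FFT + HOP * (len(X) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return x / np.maximum(norm, 1e-8)

def p_b(X, A):
    """Phase assignment: keep the phase of X, impose the amplitude A."""
    return A * np.exp(1j * np.angle(X))

def p_c(Y):
    """Conversion: inverse STFT followed by STFT."""
    return stft(istft(Y))

def estimate(X0, A, f_theta, M=5):
    """Run M estimation units, then the final phase assignment unit."""
    X = X0
    for _ in range(M):
        Y = p_b(X, A)                 # S201, equation (4)
        Z = p_c(Y)                    # S202, equation (5)
        X = Z - f_theta(X, Y, Z)      # S203 and S204, equation (7)
    return p_b(X, A)                  # S207
```

If `f_theta` returns zeros, the loop reduces to plain Griffin-Lim iterations, which makes the role of the DNN as a learned correction on top of each iteration easy to see.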

However, in the DNN structure used in Non-Patent Document 2, the real part and the imaginary part of the spectrogram are each treated as real numbers, concatenated, and input to real-valued convolution layers. Such a DNN cannot take into account the algebraic structure of the spectrogram as complex numbers. The relationship between the phase and the amplitude of a spectrogram arises from that complex structure. For this reason, a DNN structure built only from real-valued operations may be unable to estimate the complex spectrogram of the noise component adequately.
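One way to see what is lost, offered here as an illustration rather than as part of the cited documents: multiplying a complex input x = x_r + i x_i by a complex weight w = w_r + i w_i couples the real and imaginary parts through a constrained 2x2 matrix (a rotation combined with a scaling), whereas a real-valued layer acting on the concatenated pair (x_r, x_i) may use four unrelated weights. The symbols a, b, c, d, y_1, and y_2 are hypothetical.

```latex
\begin{align}
\begin{pmatrix} \operatorname{Re}(wx) \\ \operatorname{Im}(wx) \end{pmatrix}
&= \begin{pmatrix} w_r & -w_i \\ w_i & w_r \end{pmatrix}
   \begin{pmatrix} x_r \\ x_i \end{pmatrix}
   && \text{(complex weight)} \\
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
&= \begin{pmatrix} a & b \\ c & d \end{pmatrix}
   \begin{pmatrix} x_r \\ x_i \end{pmatrix}
   && \text{(real-valued layer on concatenated parts)}
\end{align}
```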

Non-Patent Document 1: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, Apr. 1984.
Non-Patent Document 2: Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, N. Harada, "Deep Griffin-Lim Iteration", International Conference on Acoustics, Speech, and Signal Processing, Oct. 2019.

The present invention has been made to solve the above problem, and its object is to provide an estimation method, an estimation program, a deep neural network device, and an estimation device capable of performing computations for acoustic signal restoration that take into account the algebraic structure of the spectrogram as complex numbers.

The estimation method of the present invention includes a plurality of gated complex convolution steps and a first complex convolution step that performs a convolution operation on the output of the final gated complex convolution step. The first gated complex convolution step includes a second complex convolution step that performs a convolution operation on a first complex spectrogram, a first amplitude gate operation step that receives the first complex spectrogram and the amplitude spectrogram of a desired acoustic signal as input and selects only the regions of the first complex spectrogram that require correction, and a first multiplication step that outputs the result of multiplying the output of the second complex convolution step by the output of the first amplitude gate operation step. Each gated complex convolution step other than the first includes a third complex convolution step that performs a convolution operation on the output of the immediately preceding gated complex convolution step, a second amplitude gate operation step that receives the output of the immediately preceding gated complex convolution step and the amplitude spectrogram as input and selects only the regions of that output that require correction, and a second multiplication step that outputs the result of multiplying the output of the third complex convolution step by the output of the second amplitude gate operation step. The method is characterized in that a complex spectrogram of the noise component of the desired acoustic signal is estimated by a deep neural network.

The estimation method of the present invention also includes a plurality of gated complex convolution steps and a first complex convolution step that performs a convolution operation on the output of the final gated complex convolution step. The first gated complex convolution step includes a second complex convolution step that performs a convolution operation on a first complex spectrogram, a first amplitude gate operation step that selects only the regions of the first complex spectrogram that require correction, and a first multiplication step that outputs the result of multiplying the output of the second complex convolution step by the output of the first amplitude gate operation step. Each gated complex convolution step other than the first includes a third complex convolution step that performs a convolution operation on the output of the immediately preceding gated complex convolution step, a second amplitude gate operation step that selects only the regions of the output of the immediately preceding gated complex convolution step that require correction, and a second multiplication step that outputs the result of multiplying the output of the third complex convolution step by the output of the second amplitude gate operation step. The first amplitude gate operation step includes a step of performing the amplitude gate operation by Sigmoid(|C|*W_R), where C is the first complex spectrogram and W_R is a real-valued weight parameter. The second amplitude gate operation step includes a step of performing the amplitude gate operation by Sigmoid(|C|*W_R), where C is the complex spectrogram output by the immediately preceding gated complex convolution step. The method is characterized in that a complex spectrogram of the noise component of a desired acoustic signal is estimated by a deep neural network.
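A single gated complex convolution step of this kind might be sketched as below. This is an illustration under several assumptions that the text above does not settle: the `*` in Sigmoid(|C|*W_R) is read as a convolution, a single channel is used, the "convolution" is implemented as the cross-correlation common in neural network layers with zero padding, and the complex kernel name `W_C` and all function names are hypothetical. The variant in which the amplitude gate also receives the amplitude spectrogram A is not shown.

```python
import numpy as np

def conv2d_same(x, k):
    """2-D cross-correlation with zero padding; output has the shape of x.

    Assumes odd kernel sizes. Works for real or complex inputs and kernels.
    """
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=np.result_type(x, k))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def complex_conv(C, W_C):
    """Complex convolution: complex input, complex kernel, complex arithmetic."""
    return conv2d_same(C, W_C)

def amplitude_gate(C, W_R):
    """Sigmoid(|C| * W_R): a real-valued gate in (0, 1) computed from the amplitude."""
    return sigmoid(conv2d_same(np.abs(C), W_R))

def gated_complex_conv(C, W_C, W_R):
    """Multiply the complex convolution output by the real amplitude gate."""
    return complex_conv(C, W_C) * amplitude_gate(C, W_R)
```

Because the gate is real and positive, it scales the magnitude of each time-frequency bin of the convolution output without rotating its phase, which is one reading of "selecting only the regions that require correction".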

The estimation method of the present invention further includes a phase assignment step that receives as input a second complex spectrogram whose phase and amplitude are inconsistent and the amplitude spectrogram of a desired acoustic signal, and assigns the phase of the second complex spectrogram to the amplitude spectrogram to obtain a first signal; a conversion step that converts the first signal into a time waveform by an inverse short-time Fourier transform and converts that time waveform into a second signal in the frequency domain by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and a phase modification step that brings the phase of the second complex spectrogram closer to the phase of the desired acoustic signal, based on the statistical properties of training acoustic signals corresponding to the desired acoustic signal. The phase modification step includes the plurality of gated complex convolution steps and the first complex convolution step, uses the second complex spectrogram, the first signal, and the second signal as the first complex spectrogram input to the deep neural network, and includes a step of outputting the difference between the output of the conversion step and the output of the deep neural network.
In one configuration example of the estimation method of the present invention, the deep neural network has been trained in advance using a complex spectrogram obtained from a training acoustic signal and its amplitude spectrogram, and outputs an estimate of the residual between the second signal and the second complex spectrogram.
The estimation program of the present invention causes a computer to execute each of the above steps.

The deep neural network device of the present invention includes a plurality of gated complex convolutional layers connected in cascade, and a first complex convolutional layer configured to perform a convolution operation on the output of the final-stage gated complex convolutional layer. The first-stage gated complex convolutional layer includes a second complex convolutional layer configured to perform a convolution operation on a first complex spectrogram, a first amplitude gate layer configured to receive the first complex spectrogram and an amplitude spectrogram of a desired acoustic signal as input and to select only the regions of the first complex spectrogram that require correction, and a first multiplication unit configured to output the result of multiplying the output of the second complex convolutional layer by the output of the first amplitude gate layer. Each gated complex convolutional layer other than the first stage includes a third complex convolutional layer configured to perform a convolution operation on the output of the preceding gated complex convolutional layer, a second amplitude gate layer configured to receive the output of the preceding gated complex convolutional layer and the amplitude spectrogram as input and to select only the regions of that output that require correction, and a second multiplication unit configured to output the result of multiplying the output of the third complex convolutional layer by the output of the second amplitude gate layer. The device estimates a complex spectrogram of a noise component of the desired acoustic signal.

In addition, the deep neural network device of the present invention includes a plurality of gated complex convolutional layers connected in cascade, and a first complex convolutional layer configured to perform a convolution operation on the output of the final-stage gated complex convolutional layer. The first-stage gated complex convolutional layer includes a second complex convolutional layer configured to perform a convolution operation on a first complex spectrogram, a first amplitude gate layer configured to select only the regions of the first complex spectrogram that require correction, and a first multiplication unit configured to output the result of multiplying the output of the second complex convolutional layer by the output of the first amplitude gate layer. Each gated complex convolutional layer other than the first stage includes a third complex convolutional layer configured to perform a convolution operation on the output of the preceding gated complex convolutional layer, a second amplitude gate layer configured to select only the regions of the output of the preceding gated complex convolutional layer that require correction, and a second multiplication unit configured to output the result of multiplying the output of the third complex convolutional layer by the output of the second amplitude gate layer. The first amplitude gate layer performs the amplitude gate operation Sigmoid(|C| * W_R), where C is the first complex spectrogram and W_R is a real-valued weight parameter. The second amplitude gate layer performs the amplitude gate operation Sigmoid(|C| * W_R), where C is the complex spectrogram output by the preceding gated complex convolutional layer. The device estimates a complex spectrogram of a noise component of a desired acoustic signal.

The estimation device of the present invention includes: a phase assigning unit configured to receive as input a second complex spectrogram whose phase and amplitude are inconsistent and an amplitude spectrogram of a desired acoustic signal, and to assign the phase of the second complex spectrogram to the amplitude spectrogram to obtain a first signal; a conversion unit configured to convert the first signal into a time waveform by an inverse short-time Fourier transform and to convert that time waveform into a second signal in the frequency domain by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and a phase changing unit configured to bring the phase of the second complex spectrogram closer to the phase of the desired acoustic signal based on statistical properties of a training acoustic signal corresponding to the desired acoustic signal. The phase changing unit includes the deep neural network device, uses the second complex spectrogram, the first signal, and the second signal as the first complex spectrogram input to the deep neural network device, and outputs the difference between the output of the conversion unit and the output of the deep neural network device.

本発明によれば、複素畳み込みステップの結果と振幅ゲート演算ステップの結果とを乗算するゲート付き複素畳み込みステップを実行することにより、スペクトログラムの複素数としての代数構造を考慮した音響信号復元のための演算を行うことが可能になる。その結果、本発明では、雑音成分の複素スペクトログラムを精度良く推定することができ、従来の技術と比較して、より高品質な出力音を得ることが可能になる。 According to the present invention, by executing a gated complex convolution step in which the result of the complex convolution step is multiplied by the result of the amplitude gate calculation step, it becomes possible to perform calculations for restoring an acoustic signal taking into account the algebraic structure of the spectrogram as a complex number. As a result, the present invention makes it possible to accurately estimate the complex spectrogram of the noise component, and to obtain an output sound of higher quality compared to conventional techniques.

FIG. 1 is a block diagram showing the configuration of a DNN according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the operation of a DNN according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of an estimation device according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of an estimation unit of an estimation device according to an embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a learning device according to an embodiment of the present invention.
FIG. 6 is a block diagram showing another configuration of a DNN according to an embodiment of the present invention.
FIG. 7 is a block diagram showing an example of the configuration of a computer that realizes an estimation device and a learning device according to an embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of a conventional estimation device.
FIG. 9 is a block diagram showing the configuration of an estimation unit of a conventional estimation device.
FIG. 10 is a block diagram showing the configuration of a conventional learning device.
FIG. 11 is a block diagram showing the configuration of a conventional DNN.
FIG. 12 is a flowchart illustrating the operation of a conventional learning device.
FIG. 13 is a flowchart illustrating the operation of a conventional estimation device.

以下、本発明の実施例について図面を参照して説明する。本実施例では、非特許文献２におけるＤＮＮの構造として、ＡＩ－ＧＣＮＮ（amplitude-informed gated complex convolutional neural network）を提案する。 Below, an embodiment of the present invention will be described with reference to the drawings. In this embodiment, we propose an amplitude-informed gated complex convolutional neural network (AI-GCNN) as the DNN structure in Non-Patent Document 2.

The complex convolution of a complex spectrogram C is defined as follows:
$$\mathrm{Conv}_{C}(C) = (W_{\mathrm{Re}} * C_{\mathrm{Re}} - W_{\mathrm{Im}} * C_{\mathrm{Im}}) + i\,(W_{\mathrm{Re}} * C_{\mathrm{Im}} + W_{\mathrm{Im}} * C_{\mathrm{Re}}) \tag{19}$$

Here $i$ is the imaginary unit, $C_{\mathrm{Re}}$ is the real part of the complex spectrogram C, $C_{\mathrm{Im}}$ is the imaginary part of the complex spectrogram C, $W_{\mathrm{Re}}$ is the weight parameter for the real part, and $W_{\mathrm{Im}}$ is the weight parameter for the imaginary part.
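Equation (19) can be checked numerically. The following is a minimal sketch in plain Python, not the implementation of the embodiment: it assumes a single channel, stride 1, no padding, and a CNN-style convolution without kernel flipping. The helper names `conv2d_valid` and `complex_conv` and the toy values are hypothetical. It builds the complex convolution from four real convolutions and compares it with one convolution by the complex kernel W_Re + i W_Im.

```python
def conv2d_valid(x, w):
    """CNN-style 2-D convolution (no kernel flip), stride 1, no padding."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [[sum(w[a][b] * x[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def complex_conv(c, w_re, w_im):
    """Equation (19): complex convolution built from four real convolutions."""
    c_re = [[z.real for z in row] for row in c]
    c_im = [[z.imag for z in row] for row in c]
    re_re = conv2d_valid(c_re, w_re)   # W_Re * C_Re
    im_im = conv2d_valid(c_im, w_im)   # W_Im * C_Im
    re_im = conv2d_valid(c_im, w_re)   # W_Re * C_Im
    im_re = conv2d_valid(c_re, w_im)   # W_Im * C_Re
    return [[complex(re_re[i][j] - im_im[i][j], re_im[i][j] + im_re[i][j])
             for j in range(len(re_re[0]))]
            for i in range(len(re_re))]


# Toy 3x4 complex "spectrogram" and 2x2 kernels (illustrative values only).
C = [[1 + 2j, 0 - 1j, 2 + 0j, 1 + 1j],
     [0 + 1j, 3 - 2j, -1 + 1j, 2 - 2j],
     [1 - 1j, -2 + 0j, 0 + 2j, 1 + 0j]]
W_re = [[0.5, -1.0], [0.25, 2.0]]
W_im = [[1.0, 0.0], [-0.5, 0.75]]

out = complex_conv(C, W_re, W_im)

# Reference: a single convolution with the complex kernel W_Re + i W_Im.
W = [[complex(a, b) for a, b in zip(row_re, row_im)]
     for row_re, row_im in zip(W_re, W_im)]
ref = conv2d_valid(C, W)
max_err = max(abs(o - r)
              for out_row, ref_row in zip(out, ref)
              for o, r in zip(out_row, ref_row))
```

The two results agree up to rounding, which is why equation (19) respects the algebraic structure of complex multiplication rather than treating the real and imaginary parts as unrelated channels.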

図１は本実施例に係るＤＮＮであるＡＩ－ＧＣＮＮの構成を示すブロック図である。ＡＩ－ＧＣＮＮは、縦続接続された複数のゲート付き複素畳み込み層１０００～１００２と、終段のゲート付き複素畳み込み層１００２の出力の畳み込み演算を行う複素畳み込み層１００３とから構成される。 Figure 1 is a block diagram showing the configuration of AI-GCNN, a DNN according to this embodiment. AI-GCNN is composed of multiple gated complex convolutional layers 1000-1002 connected in cascade, and a complex convolutional layer 1003 that performs a convolution operation on the output of the final gated complex convolutional layer 1002.

ゲート付き複素畳み込み層１０００～１００２の各々は、複素スペクトログラムの畳み込み演算を行う複素畳み込み層１００４と、複素スペクトログラムと所望の音響信号の振幅スペクトログラムとを入力として複素スペクトログラムの補正が必要な領域のみを選び出す振幅ゲート層１００５と、複素畳み込み層１００４の出力と振幅ゲート層１００５の出力とを要素毎に乗算した結果を出力する乗算部１００６とから構成される。 Each of the gated complex convolution layers 1000 to 1002 is composed of a complex convolution layer 1004 that performs a convolution operation on a complex spectrogram, an amplitude gate layer 1005 that receives as input the complex spectrogram and the amplitude spectrogram of the desired acoustic signal and selects only the areas of the complex spectrogram that require correction, and a multiplication unit 1006 that multiplies the output of the complex convolution layer 1004 and the output of the amplitude gate layer 1005 element by element and outputs the result.

図１において、ｃはチャネル数、ｋはカーネルサイズを表している。複素畳み込み層１００４と振幅ゲート層１００５のチャネル数は６４、カーネルサイズは５×３である。複素畳み込み層１００３のチャネル数は１、カーネルサイズは１×１である。各層の畳み込みのストライドは１とする。 In Figure 1, c represents the number of channels and k represents the kernel size. The complex convolution layer 1004 and the amplitude gate layer 1005 have 64 channels and a kernel size of 5 x 3. The complex convolution layer 1003 has 1 channel and a kernel size of 1 x 1. The convolution stride of each layer is 1.

AI-GCNN uses, as its nonlinear layer, an amplitude gate layer defined as follows:
$$\mathrm{AmpGate}_{W_R}(C) = \mathrm{Sigmoid}(|C| * W_R) \tag{20}$$

Here C is the input complex spectrogram and $W_R$ is a real-valued weight parameter. The amplitude gate layer acts on the complex spectrogram like a time-frequency mask and can be expected to select only the regions of the spectrogram that require correction. Applying this amplitude gate layer to the complex convolution layer $\mathrm{Conv}_{W_C}$ defines the AGC (amplitude-based gated complex convolution) layer:
$$\mathrm{AGC}_{W_C, W_R}(C) = \mathrm{Conv}_{W_C}(C) \mathbin{◎} \mathrm{AmpGate}_{W_R}(C) \tag{21}$$
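Equations (20) and (21) can be sketched as follows. This is an illustration under stated assumptions, not the embodiment itself: one channel, 3 x 3 toy kernels instead of 5 x 3, stride 1, and zero padding so that the gate and the convolution output have the same shape (the padding scheme is not stated in the text). The names `conv2d_same`, `amp_gate`, and `agc_layer` are hypothetical.

```python
import math


def conv2d_same(x, w):
    """CNN-style 2-D convolution, stride 1, zero-padded to keep the input shape."""
    kh, kw = len(w), len(w[0])
    ph, pw = kh // 2, kw // 2
    rows, cols = len(x), len(x[0])

    def at(i, j):
        # Zero padding outside the spectrogram (an assumption of this sketch).
        return x[i][j] if 0 <= i < rows and 0 <= j < cols else 0.0

    return [[sum(w[a][b] * at(i + a - ph, j + b - pw)
                 for a in range(kh) for b in range(kw))
             for j in range(cols)]
            for i in range(rows)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def amp_gate(c, w_r):
    """Equation (20): Sigmoid(|C| * W_R), a real-valued mask with values in (0, 1)."""
    magnitude = [[abs(z) for z in row] for row in c]
    return [[sigmoid(v) for v in row] for row in conv2d_same(magnitude, w_r)]


def agc_layer(c, w_c, w_r):
    """Equation (21): complex convolution times the amplitude gate, element by element."""
    conv = conv2d_same(c, w_c)   # complex kernel w_c plays the role of W_Re + i W_Im
    gate = amp_gate(c, w_r)
    return [[z * g for z, g in zip(conv_row, gate_row)]
            for conv_row, gate_row in zip(conv, gate)]


# Toy data (illustrative values only).
C = [[1 + 2j, 0 - 1j, 2 + 0j, 1 + 1j],
     [0 + 1j, 3 - 2j, -1 + 1j, 2 - 2j],
     [1 - 1j, -2 + 0j, 0 + 2j, 1 + 0j]]
W_C = [[0, 0.5j, 0], [0.25, 1 + 0j, -0.25], [0, -0.5j, 0]]
W_R = [[0.0, 0.1, 0.0], [0.1, -0.5, 0.1], [0.0, 0.1, 0.0]]

mask = amp_gate(C, W_R)
out = agc_layer(C, W_C, W_R)
```

Because the gate is real and lies strictly between 0 and 1, it only scales the magnitude of each time-frequency bin of the convolution output and leaves its phase unchanged, which matches the description of the gate as a time-frequency mask.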

As noted above, ◎ denotes element-wise multiplication. Furthermore, to incorporate more directly the amplitude information that is known to be useful for estimating time-frequency masks, the AGC layer is extended to the amplitude-informed gated complex convolution (AI-GC) layer defined by equations (22) and (23).

[A, C] in equation (23) denotes the channel-wise concatenation of the amplitude spectrogram A and the complex spectrogram C. With equations (22) and (23), the target amplitude spectrogram A can be compared with the amplitude of the complex spectrogram C and the residual extracted.
In this way, the gated complex convolutional layers 1000 to 1002 of this embodiment receive the amplitude spectrogram A as input in addition to the complex spectrogram C.

図２は本実施例に係るＤＮＮであるＡＩ－ＧＣＮＮの動作を説明するフローチャートである。ここでは、ゲート付き複素畳み込みステップの実行回数を数える変数をｎとする。 Figure 2 is a flowchart explaining the operation of AI-GCNN, which is a DNN according to this embodiment. Here, the variable that counts the number of times the gated complex convolution step is executed is n.

初回（ｎ＝０）のゲート付き複素畳み込みステップでは、ゲート付き複素畳み込み層１０００の複素畳み込み層１００４により複素スペクトログラムＣの畳み込み演算を行い（図２ステップＳ１１）、ゲート付き複素畳み込み層１０００の振幅ゲート層１００５により複素スペクトログラムＣの補正が必要な領域のみを選び出す振幅ゲート演算を行う（図２ステップＳ１２）。ゲート付き複素畳み込み層１０００の乗算部１００６は、ゲート付き複素畳み込み層１０００の複素畳み込み層１００４と振幅ゲート層１００５の出力を要素毎に乗算した結果を出力する（図２ステップＳ１３）。 In the first gated complex convolution step (n=0), the complex convolution layer 1004 of the gated complex convolution layer 1000 performs a convolution operation on the complex spectrogram C (step S11 in FIG. 2), and the amplitude gate layer 1005 of the gated complex convolution layer 1000 performs an amplitude gate operation to select only the area of the complex spectrogram C that needs to be corrected (step S12 in FIG. 2). The multiplication unit 1006 of the gated complex convolution layer 1000 outputs the result of multiplying the outputs of the complex convolution layer 1004 and the amplitude gate layer 1005 of the gated complex convolution layer 1000 element by element (step S13 in FIG. 2).

２回目（ｎ＝１）のゲート付き複素畳み込みステップでは、ゲート付き複素畳み込み層１００１の複素畳み込み層１００４により前段のゲート付き複素畳み込み層１０００から出力された複素スペクトログラムの畳み込み演算を行い（ステップＳ１１）、ゲート付き複素畳み込み層１００１の振幅ゲート層１００５により前段のゲート付き複素畳み込み層１０００から出力された複素スペクトログラムの補正が必要な領域のみを選び出す振幅ゲート演算を行う（ステップＳ１２）。ゲート付き複素畳み込み層１００１の乗算部１００６は、ゲート付き複素畳み込み層１００１の複素畳み込み層１００４と振幅ゲート層１００５の出力を要素毎に乗算した結果を出力する（ステップＳ１３）。 In the second gated complex convolution step (n=1), the complex convolution layer 1004 of the gated complex convolution layer 1001 performs a convolution operation on the complex spectrogram output from the previous gated complex convolution layer 1000 (step S11), and the amplitude gate layer 1005 of the gated complex convolution layer 1001 performs an amplitude gate operation to select only the region of the complex spectrogram output from the previous gated complex convolution layer 1000 that needs correction (step S12). The multiplication unit 1006 of the gated complex convolution layer 1001 outputs the result of multiplying the outputs of the complex convolution layer 1004 and the amplitude gate layer 1005 of the gated complex convolution layer 1001 element by element (step S13).

３回目（ｎ＝２）のゲート付き複素畳み込みステップでは、ゲート付き複素畳み込み層１００２の複素畳み込み層１００４により前段のゲート付き複素畳み込み層１００１から出力された複素スペクトログラムの畳み込み演算を行い（ステップＳ１１）、ゲート付き複素畳み込み層１００２の振幅ゲート層１００５により前段のゲート付き複素畳み込み層１００１から出力された複素スペクトログラムの補正が必要な領域のみを選び出す振幅ゲート演算を行う（ステップＳ１２）。ゲート付き複素畳み込み層１００２の乗算部１００６は、ゲート付き複素畳み込み層１００２の複素畳み込み層１００４と振幅ゲート層１００５の出力を要素毎に乗算した結果を出力する（ステップＳ１３）。 In the third gated complex convolution step (n=2), the complex convolution layer 1004 of the gated complex convolution layer 1002 performs a convolution operation on the complex spectrogram output from the preceding gated complex convolution layer 1001 (step S11), and the amplitude gate layer 1005 of the gated complex convolution layer 1002 performs an amplitude gate operation to select only the region of the complex spectrogram output from the preceding gated complex convolution layer 1001 that needs correction (step S12). The multiplication unit 1006 of the gated complex convolution layer 1002 outputs the result of multiplying the outputs of the complex convolution layer 1004 and the amplitude gate layer 1005 of the gated complex convolution layer 1002 by element (step S13).

After the three gated complex convolution steps are completed, the complex convolution layer 1003 performs a convolution operation on the complex spectrogram output from the preceding gated complex convolution layer 1002 (step S16 in FIG. 2).
Thus, in this embodiment, the DNN can extract the unwanted residual contained in the output (Z^[m]) of the Griffin-Lim algorithm.

In this embodiment, three gated complex convolutional layers 1000 to 1002 are used (that is, the gated complex convolution step is executed three times), but the number is not limited to three and may be three or more.
The parameters used in the complex convolution operation and the amplitude gate operation are obtained by training, and different parameters may be used for each n (that is, for each layer).
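The flow of FIG. 2 (steps S11 to S13 repeated for each n, followed by step S16) can be sketched structurally as below. To keep the example short it makes strong simplifying assumptions that are not in the embodiment: a single channel, 1 x 1 kernels so that each convolution reduces to a scalar multiplication, and the amplitude gate of equation (20) rather than the amplitude-informed gate. The function names and parameter values are hypothetical.

```python
import math


def gated_layer(c, w_c, w_r):
    """One gated complex convolution step (S11 to S13) with 1x1 kernels."""
    out = []
    for row in c:
        out_row = []
        for z in row:
            conv = w_c * z                                   # S11: complex convolution
            gate = 1.0 / (1.0 + math.exp(-w_r * abs(z)))     # S12: amplitude gate
            out_row.append(conv * gate)                      # S13: element-wise product
        out.append(out_row)
    return out


def network(c, gated_params, w_out):
    """Cascade the gated steps, then apply the final complex convolution (S16)."""
    for w_c, w_r in gated_params:    # n = 0, 1, 2, ... each with its own parameters
        c = gated_layer(c, w_c, w_r)
    return [[w_out * z for z in row] for row in c]


# Three gated layers, each with a different (complex weight, real gate weight) pair.
params = [(0.8 + 0.2j, 1.5), (1.1 - 0.3j, -0.7), (0.5 + 0.5j, 0.9)]
C = [[1 + 1j, 0.5 - 2j], [-1 + 0j, 0 + 0j]]
out = network(C, params, w_out=1.0 - 1.0j)
```

Adding more entries to `params` corresponds to using more than three gated layers, and giving each entry its own values corresponds to using different parameters for each n.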

FIG. 3 is a block diagram showing the configuration of the estimation device according to this embodiment. The estimation device 100a includes M estimation units 110a-m (m = 0, 1, 2, ..., M-1, where M is an integer of 1 or more), each of which receives a complex spectrogram X^[m] whose phase and amplitude are inconsistent and an amplitude spectrogram A of a desired acoustic signal and outputs a complex spectrogram X^[m+1] having an estimated phase spectrogram, and a phase assigning unit 120 that assigns to the amplitude spectrogram A the phase of the complex spectrogram X^[M] output from the final-stage estimation unit 110a-(M-1).
The spectrogram X^[0] input to the processing block with m = 0, that is, the first-stage estimation unit 110a-0, may be the amplitude spectrogram A.

FIG. 4 is a block diagram showing the configuration of the estimation unit 110a-m. The estimation unit 110a-m of this embodiment includes a phase assigning unit 111 that receives the complex spectrogram X^[m] and the amplitude spectrogram A and assigns the phase of X^[m] to A; a conversion unit 112 that converts the output signal of the phase assigning unit 111 into a time waveform by an inverse short-time Fourier transform and converts that time waveform into a frequency-domain signal by the corresponding short-time Fourier transform; and a phase changing unit 113a that brings the phase of the complex spectrogram X^[m] closer to the phase of the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal. The phase changing unit 113a includes a DNN 114a and a subtraction unit 115.

図５は本実施例に係る学習装置の構成を示すブロック図である。学習装置２００ａは、ノイズ加算部２０９と、位相付与部２１１と、変換部２１２と、ＤＮＮ２１３ａと、減算部２１４と、パラメータ更新部２１５とから構成される。ＤＮＮ１１４ａ，２１３ａは、図１に示した構成を有する。 Figure 5 is a block diagram showing the configuration of a learning device according to this embodiment. The learning device 200a is composed of a noise addition unit 209, a phase assignment unit 211, a conversion unit 212, a DNN 213a, a subtraction unit 214, and a parameter update unit 215. The DNNs 114a and 213a have the configuration shown in Figure 1.

次に、本実施例のＤＮＮ１１４ａ，２１３ａの学習段階について説明する。学習装置２００ａの動作の流れは非特許文献２に開示された学習装置２００と同様であるので、図１２を用いて学習装置２００ａの動作を説明する。 Next, the learning stage of the DNNs 114a and 213a in this embodiment will be described. The flow of operation of the learning device 200a is similar to that of the learning device 200 disclosed in Non-Patent Document 2, so the operation of the learning device 200a will be described using FIG. 12.

An initialization unit (not shown) of the learning device 200a initializes the parameters θ of the DNN 213a with random numbers (step S100 in FIG. 12).
The noise adding unit 209 receives the clean acoustic signal X^(L)* and noise N as input, adds the noise N to the clean acoustic signal X^(L)* as shown in equation (12), and outputs a complex spectrogram X tilde (step S101 in FIG. 12).

位相付与部２１１は、複素スペクトログラムＸチルダと振幅スペクトログラムＡ^(L)とを入力とし、式（１３）に示すように、振幅スペクトログラムＡ^(L)に複素スペクトログラムＸチルダの位相を付与して、付与後の信号Ｙチルダを出力する（図１２ステップＳ１０２）。式（１３）は、複素スペクトログラムＸチルダの各要素に対して振幅スペクトログラムＡ^(L)の各要素を乗算し、乗算結果を複素スペクトログラムＸチルダの振幅スペクトログラム｜Ｘチルダ｜で除算しているため、複素スペクトログラムＸチルダの振幅を振幅スペクトログラムＡ^(L)の大きさに変換する処理といってもよい。 The phase assigning unit 211 receives the complex spectrogram X tilde and the amplitude spectrogram A ^(L) , assigns the phase of the complex spectrogram X tilde to the amplitude spectrogram A ^(L) as shown in equation (13), and outputs the assigned signal Y tilde (step S102 in FIG. 12). Since equation (13) multiplies each element of the complex spectrogram X tilde by each element of the amplitude spectrogram A ^(L) and divides the multiplication result by the amplitude spectrogram |X tilde| of the complex spectrogram X tilde, it can be said that this is a process of converting the amplitude of the complex spectrogram X tilde into the magnitude of the amplitude spectrogram A ^(L) .
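The phase assignment described above (multiply each element of X tilde by the corresponding element of the amplitude spectrogram, then divide by |X tilde|) can be sketched in a few lines. The function name `assign_phase` and the small constant `eps`, which guards against division by zero for bins with zero magnitude, are assumptions of this sketch and are not taken from the text.

```python
import cmath


def assign_phase(x, a, eps=1e-12):
    """Y = A * X / |X| element by element: keep the phase of X, take the magnitude from A.

    eps avoids division by zero; a bin with x == 0 has no defined phase and maps to 0.
    """
    return [[a_ij * x_ij / max(abs(x_ij), eps) for x_ij, a_ij in zip(x_row, a_row)]
            for x_row, a_row in zip(x, a)]


# Toy values (illustrative only).
X = [[3 + 4j, -2j], [1e-3 + 0j, -1 - 1j]]
A = [[1.0, 2.0], [0.5, 3.0]]
Y = assign_phase(X, A)
# Y[0][0] is (3 + 4j) / 5 scaled to magnitude 1.0, i.e. approximately 0.6 + 0.8j.
```

Every element of Y has the magnitude given by A and the phase of the corresponding element of X, which is the sense in which the step converts the amplitude of X tilde to that of the amplitude spectrogram.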

変換部２１２は、信号Ｙチルダを入力とし、式（１４）に示すように、信号Ｙチルダを逆短時間フーリエ変換Ｇ†により時間波形に変換し、変換された時間波形を逆短時間フーリエ変換Ｇ†に対応する短時間フーリエ変換Ｇにより周波数領域の信号Ｚチルダに変換して出力する（図１２ステップＳ１０３）。 The conversion unit 212 receives the signal Y tilde as input, converts the signal Y tilde into a time waveform by an inverse short-time Fourier transform G† as shown in equation (14), and converts the converted time waveform into a frequency domain signal Z tilde by a short-time Fourier transform G corresponding to the inverse short-time Fourier transform G†, and outputs the signal Z tilde (step S103 in FIG. 12).
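The conversion step (inverse STFT followed by STFT) can be illustrated with a small, self-contained sketch in plain Python. The frame length, hop size, Hamming-type window, and the least-squares overlap-add inverse used for G† are assumptions chosen for illustration; the text does not specify them. The names `stft`, `istft`, and `project_consistent` are hypothetical.

```python
import cmath
import math

FRAME, HOP = 8, 2
# Hamming-type window: never zero, so the overlap-add normalisation is always defined.
WINDOW = [0.54 - 0.46 * math.cos(2 * math.pi * n / FRAME) for n in range(FRAME)]


def dft(v, sign):
    """Unnormalised DFT (sign=-1) or inverse-direction DFT (sign=+1)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def stft(x):
    """G: window each frame and take its DFT (frames x frequency bins)."""
    starts = range(0, len(x) - FRAME + 1, HOP)
    return [dft([WINDOW[n] * x[s + n] for n in range(FRAME)], -1) for s in starts]


def istft(spec):
    """G-dagger: least-squares inverse by windowed overlap-add, real signal assumed."""
    length = (len(spec) - 1) * HOP + FRAME
    num = [0.0] * length
    den = [0.0] * length
    for m, frame in enumerate(spec):
        y = [v.real / FRAME for v in dft(frame, +1)]   # inverse DFT, real part
        for n in range(FRAME):
            num[m * HOP + n] += WINDOW[n] * y[n]
            den[m * HOP + n] += WINDOW[n] ** 2
    return [a / b for a, b in zip(num, den)]


def project_consistent(spec):
    """Z = G G-dagger Y: back to the time domain, then to the frequency domain again."""
    return stft(istft(spec))


def max_diff(p, q):
    return max(abs(u - v) for pr, qr in zip(p, q) for u, v in zip(pr, qr))


x = [math.sin(0.9 * t) + 0.3 * math.cos(2.1 * t) for t in range(16)]
S = stft(x)
err_consistent = max_diff(project_consistent(S), S)   # a true STFT is left unchanged

Y = [[v * 1j for v in frame] for frame in S]          # an inconsistent spectrogram
P1 = project_consistent(Y)
err_idempotent = max_diff(project_consistent(P1), P1) # projecting twice changes nothing
```

A spectrogram that is the STFT of some real waveform passes through unchanged, whereas an inconsistent one is replaced by the nearest consistent spectrogram in the least-squares sense under these assumptions.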

ＤＮＮ２１３ａは、複素スペクトログラムＸチルダと、信号Ｙチルダと、信号Ｚチルダと、振幅スペクトログラムＡ^(L)とを入力とし、Griffin-Limアルゴリズムで生じた歪みまたは推定誤差を推定し、推定値Ｆ_θ（Ｘチルダ，Ｙチルダ，Ｚチルダ）を出力する（図１２ステップＳ１０４）。 DNN 213a receives as input complex spectrogram X tilde, signal Y tilde, signal Z tilde, and amplitude spectrogram A ^(L) , estimates the distortion or estimation error generated by the Griffin-Lim algorithm, and outputs an estimated value F _θ (X tilde, Y tilde, Z tilde) (step S104 in FIG. 12).

パラメータ更新部２１５は、差分Ｚチルダ－Ｘ^(L)*と、推定値Ｆ_θ（Ｘチルダ，Ｙチルダ，Ｚチルダ）とを入力とし、これらの値を用いて、式（１５）に示す目的関数を最小化するようにＤＮＮ２１３ａのパラメータθを更新する（図１２ステップＳ１０６）。 The parameter update unit 215 receives the difference Z tilde -X ^(L)* and the estimated value F _θ (X tilde, Y tilde, Z tilde) as input, and uses these values to update the parameter θ of the DNN 213a so as to minimize the objective function shown in equation (15) (step S106 in Figure 12).
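Steps S101 to S103 and the regression target used in step S106 can be summarised as a data-preparation sketch. Several things here are assumptions rather than statements from the text: the noise is drawn as complex Gaussian, the amplitude spectrogram A^(L) is taken to be the magnitude of the clean spectrogram, and the consistency projection is passed in as a callable (an identity stand-in is used in the demonstration). The objective function of equation (15) is deliberately not reproduced.

```python
import random


def make_training_example(x_clean, noise_std, project_consistent, rng):
    """Build the DNN inputs (X~, Y~, Z~, A) and the target Z~ - X* for one example."""
    # A^(L): taken here as |X*| (assumption of this sketch).
    a = [[abs(z) for z in row] for row in x_clean]
    # S101: X~ = X* + N, with complex Gaussian noise (assumed distribution).
    x_noisy = [[z + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
                for z in row] for row in x_clean]
    # S102: Y~ = A * X~ / |X~| (phase of X~, magnitude of A).
    y = [[a_ij * z / max(abs(z), 1e-12) for z, a_ij in zip(z_row, a_row)]
         for z_row, a_row in zip(x_noisy, a)]
    # S103: Z~ = G G-dagger Y~ (supplied by the caller).
    z = project_consistent(y)
    # Target for the DNN output F_theta(X~, Y~, Z~): the residual Z~ - X*.
    target = [[z_ij - c_ij for z_ij, c_ij in zip(z_row, c_row)]
              for z_row, c_row in zip(z, x_clean)]
    return (x_noisy, y, z, a), target


# Demonstration with an identity stand-in for the STFT consistency projection.
X_clean = [[1 + 1j, 2 - 1j], [0.5j, -3 + 0j]]
inputs, target = make_training_example(X_clean, 0.1, lambda s: s, random.Random(0))
x_noisy, y, z, a = inputs
```

A trained network that reproduces this target exactly would make the difference between Z tilde and the network output equal to the clean spectrogram, which is what the subtraction in the phase changing unit relies on.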

Stochastic gradient descent or a similar method may be used for training. The learning rate may be set to, for example, about 10^-5.
The parameter update unit 215 determines whether a predetermined condition is satisfied (step S107 in FIG. 12); if it is, the DNN 213a at that point is taken as the trained DNN.

所定の条件を満たさない場合には、新たなクリーン音響信号Ｘ^(L)*と新たなノイズＮと更新後のパラメータθとを用いて、ステップＳ１０１～Ｓ１０６の処理が再び実施される。例えばステップＳ１０１～Ｓ１０６の処理を１０万回繰り返したときに、所定の条件を満たしたとしてＤＮＮ２１３ａの学習が終了する。 If the predetermined condition is not satisfied, the process of steps S101 to S106 is carried out again using a new clean acoustic signal X ^(L)* , new noise N, and updated parameters θ. For example, when the process of steps S101 to S106 is repeated 100,000 times, it is determined that the predetermined condition is satisfied and the learning of the DNN 213a ends.

次に、推定装置１００ａの動作について説明する。推定装置１００ａの動作の流れは非特許文献２に開示された推定装置１００と同様であるので、図１３を用いて推定装置１００ａの動作を説明する。 Next, the operation of the estimation device 100a will be described. The flow of operation of the estimation device 100a is similar to that of the estimation device 100 disclosed in Non-Patent Document 2, so the operation of the estimation device 100a will be described using FIG. 13.

推定装置１００ａは、位相と振幅が矛盾する複素スペクトログラムＸ^[0]と、所望の音響信号の振幅スペクトログラムＡとを入力とし、振幅スペクトログラムＡに矛盾しない位相スペクトログラムを持つ複素スペクトログラムＹ^[M]を求めて出力する。 The estimation device 100a receives as input a complex spectrogram X ^[0] in which the phase and amplitude are inconsistent, and an amplitude spectrogram A of a desired acoustic signal, and calculates and outputs a complex spectrogram Y ^[M] having a phase spectrogram that is consistent with the amplitude spectrogram A.

Ｍ個の推定部１１０ａ－ｍ（ｍ＝０，１，２，・・・，Ｍ－１、Ｍは１以上の整数）は、位相と振幅が矛盾する複素スペクトログラムＸ^[m]と、所望の音響信号の振幅スペクトログラムＡとを入力とし、推定した位相スペクトログラムを持つ複素スペクトログラムＸ^[m+1]を求めて出力する。 Each of M estimation units 110a-m (m=0, 1, 2, ..., M-1, M is an integer equal to or greater than 1) receives as input a complex spectrogram X ^[m] having a conflicting phase and amplitude and an amplitude spectrogram A of a desired acoustic signal, and determines and outputs a complex spectrogram X ^[m+1] having an estimated phase spectrogram.

図４に示したように、本実施例の推定部１１０ａ－ｍは、位相付与部１１１と、変換部１１２と、位相変更部１１３ａとから構成される。位相変更部１１３ａは、ＤＮＮ１１４ａと、減算部１１５とから構成される。ＤＮＮ１１４ａには、学習装置２００ａで学習されたＤＮＮが設定されている。 As shown in FIG. 4, the estimation units 110a-m of this embodiment are composed of a phase assignment unit 111, a conversion unit 112, and a phase change unit 113a. The phase change unit 113a is composed of a DNN 114a and a subtraction unit 115. The DNN learned by the learning device 200a is set in the DNN 114a.

The phase assigning unit 111 receives as input the complex spectrogram X^[m], whose phase and amplitude are inconsistent, and the amplitude spectrogram A of the desired acoustic signal, assigns the phase of X^[m] to A as shown in equation (16), and outputs the resulting signal Y^[m] = P_B(X^[m]) (step S201 in FIG. 13).

変換部１１２は、信号Ｙ^[m]を入力とし、式（１７）に示すように、信号Ｙ^[m]を逆短時間フーリエ変換Ｇ†により時間波形に変換し、変換された時間波形を逆短時間フーリエ変換Ｇ†に対応する短時間フーリエ変換Ｇにより周波数領域の信号Ｚ^[m]＝Ｐ_C（Ｙ^[m]）に変換して出力する（図１３ステップＳ２０２）。 The conversion unit 112 receives the signal Y ^[m] as input, converts the signal Y ^[m] into a time waveform using the inverse short-time Fourier transform G† as shown in equation (17), and converts the converted time waveform into a frequency domain signal Z ^[m] = P _C (Y ^[m] ) using the short-time Fourier transform G corresponding to the inverse short-time Fourier transform G†, and outputs it (step S202 in Figure 13).

位相変更部１１３ａは、複素スペクトログラムＸ^[m]と信号Ｙ^[m]と信号Ｚ^[m]と振幅スペクトログラムＡとを用いて、所望の音響信号に対応する学習用の音響信号の統計的性質に基づき、複素スペクトログラムＸ^[m]の位相を所望の音響信号の位相に近づける。 The phase changing unit 113a uses the complex spectrogram X ^[m] , the signal Y ^[m] , the signal Z ^[m], and the amplitude spectrogram A to bring the phase of the complex spectrogram X ^[m] closer to the phase of the desired acoustic signal, based on the statistical properties of the training acoustic signal corresponding to the desired acoustic signal.

The DNN 114a receives the complex spectrogram X^[m], the signal Y^[m], the signal Z^[m], and the amplitude spectrogram A as input, estimates the distortion or estimation error (Z^[m] - X^[m]) produced by the Griffin-Lim algorithm, and outputs the estimate F_θ(X^[m], Y^[m], Z^[m]) (step S203 in FIG. 13).

ステップＳ２０１～Ｓ２０４の処理を推定部１１０ａ－ｍの個数Ｍ回分繰り返し、Ｍ回の処理が終わると（図１３ステップＳ２０６においてＹＥＳ）、終段の推定部１１０ａ－（Ｍ－１）から複素スペクトログラムＸ^[M]が出力される。繰り返し数Ｍは例えば５程度とすればよい。 The process of steps S201 to S204 is repeated M times for the number of estimation units 110a-m, and when the process is completed M times (YES in step S206 in FIG. 13), a complex spectrogram X ^[M] is output from the final stage estimation unit 110a-(M-1). The number of repetitions M may be set to about 5, for example.

The phase assigning unit 120 receives the complex spectrogram X^[M] and the amplitude spectrogram A, assigns the phase of X^[M] to A as shown in equation (18), and outputs the resulting signal Y^[M] = P_B(X^[M]) (step S207 in FIG. 13). Step S207 once again converts the amplitude of the complex spectrogram X^[M] to the magnitude of the amplitude spectrogram A.
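The overall inference procedure (M estimation units followed by the final phase assignment) can be sketched as a loop. The sketch assumes that each estimation unit outputs the difference between the conversion output Z^[m] and the DNN output, as stated for the phase changing unit. The consistency projection and the DNN are passed in as callables; the identity projection and zero-output DNN used in the demonstration are stand-ins, not the trained network. All function names are hypothetical.

```python
def assign_phase(x, a, eps=1e-12):
    """P_B: keep the phase of x, take the magnitude from a (eps guards zero bins)."""
    return [[a_ij * x_ij / max(abs(x_ij), eps) for x_ij, a_ij in zip(x_row, a_row)]
            for x_row, a_row in zip(x, a)]


def estimate(x0, a, project_consistent, dnn, num_blocks=5):
    """Run M = num_blocks estimation units, then the final phase assignment."""
    x = x0
    for _ in range(num_blocks):                # estimation units 110a-0 ... 110a-(M-1)
        y = assign_phase(x, a)                 # S201: Y = P_B(X)
        z = project_consistent(y)              # S202: Z = P_C(Y)
        residual = dnn(x, y, z, a)             # S203: estimate of Z - X
        x = [[z_ij - r_ij for z_ij, r_ij in zip(z_row, r_row)]
             for z_row, r_row in zip(z, residual)]   # subtraction unit 115
    return assign_phase(x, a)                  # S207: Y^[M] = P_B(X^[M])


def zero_dnn(x, y, z, a):
    """Stand-in DNN that predicts no residual (an assumption for the demo)."""
    return [[0j for _ in row] for row in z]


X0 = [[1 + 1j, 2 - 1j], [0.5j, -3 + 0j]]
A = [[1.0, 2.0], [0.5, 1.5]]
result = estimate(X0, A, project_consistent=lambda s: s, dnn=zero_dnn, num_blocks=5)
```

With a DNN that always outputs zero, each block reduces to one Griffin-Lim iteration (a consistency projection applied after the amplitude projection), so the trained DNN can be read as a learned correction on top of that iteration.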

本実施例では、図１に示したＤＮＮの構造によって複素数の代数構造を考慮した音響信号復元のための演算を行うことが可能になる。その結果、本実施例では、雑音成分の複素スペクトログラムを精度良く推定することができ、非特許文献２に開示された技術と比較して、より高品質な出力音を得ることが可能になる。 In this embodiment, the DNN structure shown in FIG. 1 makes it possible to perform calculations for restoring an acoustic signal taking into account the algebraic structure of complex numbers. As a result, in this embodiment, the complex spectrogram of the noise component can be estimated with high accuracy, and it is possible to obtain output sound of higher quality compared to the technology disclosed in Non-Patent Document 2.

なお、図１に示したＤＮＮは振幅スペクトログラムＡを利用するため、少ない反復回数（上記の繰り返し数Ｍ）で高い精度が得られる。ただし、真の振幅情報が利用できない場合には、図１に示したＤＮＮを使用することはできない。 The DNN shown in Figure 1 uses the amplitude spectrogram A, so high accuracy can be achieved with a small number of iterations (the number of iterations M mentioned above). However, if true amplitude information is not available, the DNN shown in Figure 1 cannot be used.

By contrast, the DNN using the AGC layer shown in equation (21) is less accurate than the DNN of FIG. 1 at a small number of iterations (number of repetitions M), but it can be used even when the true amplitude information is unavailable. It is therefore desirable to choose between the DNN of FIG. 1 and the DNN using the AGC layer for the DNNs 114a and 213a according to the purpose.

When a DNN using the AGC layer is used as the DNNs 114a and 213a, the amplitude gate layer 1005 of each of the gated complex convolution layers 1000 to 1002 in the configuration of FIG. 1 is replaced with the amplitude gate layer shown in equation (20). FIG. 6 shows the configuration of the DNN in this case.

The AGCNN, which is the DNN shown in FIG. 6, consists of a plurality of cascaded gated complex convolution layers 1000a to 1002a and a complex convolution layer 1003 that performs a convolution operation on the output of the final-stage gated complex convolution layer 1002a.

Each of the gated complex convolution layers 1000a to 1002a consists of a complex convolution layer 1004, an amplitude gate layer 1005a corresponding to equation (20), and a multiplication unit 1006 that outputs the element-wise product of the output of the complex convolution layer 1004 and the output of the amplitude gate layer 1005a. The only input to the amplitude gate layer 1005a is the complex spectrogram C. In the learning device 200a, the complex spectrogram C is given by X-tilde, Y-tilde, and Z-tilde; in the estimation device 100a, it is given by X^[m], Y^[m], and Z^[m].
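The layer structure described above can be sketched in Python with NumPy as follows. This is a hedged illustration, not the implementation of this publication: it assumes that the `*` in Sigmoid(|C|*W_R) denotes a convolution of the amplitude |C| with real weights W_R, and the helper names (`conv2d_same`, `gated_complex_conv`, `agcnn`), the channel counts, the 3x3 kernels, and the random weights are all hypothetical. The three input channels stand for the three spectrograms that make up C.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same'-size multi-channel 2-D cross-correlation.
    x: (C_in, F, T); w: (C_out, C_in, kh, kw) with odd kh and kw."""
    c_out, _, kh, kw = w.shape
    _, n_f, n_t = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, n_f, n_t), dtype=np.result_type(x, w))
    for i in range(kh):
        for j in range(kw):
            # Mix input channels for one kernel offset and accumulate.
            out += np.einsum("oc,cft->oft", w[:, :, i, j],
                             xp[:, i:i + n_f, j:j + n_t])
    return out

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_complex_conv(c, w_c, w_r):
    """One gated complex convolution layer (1000a to 1002a)."""
    conv_out = conv2d_same(c, w_c)               # complex convolution layer 1004
    gate = sigmoid(conv2d_same(np.abs(c), w_r))  # amplitude gate layer 1005a
    return conv_out * gate                       # multiplication unit 1006

def agcnn(c, gated_weights, w_final):
    """Cascade of gated layers followed by complex convolution layer 1003."""
    h = c
    for w_c, w_r in gated_weights:
        h = gated_complex_conv(h, w_c, w_r)
    return conv2d_same(h, w_final)

rng = np.random.default_rng(0)

def complex_weights(shape):
    """Random complex weights standing in for trained parameters (assumption)."""
    return 0.1 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

# Three complex input channels on an 8 x 6 time-frequency grid.
C = rng.standard_normal((3, 8, 6)) + 1j * rng.standard_normal((3, 8, 6))
gated_weights = [
    (complex_weights((4, c_in, 3, 3)), 0.1 * rng.standard_normal((4, c_in, 3, 3)))
    for c_in in (3, 4, 4)
]
w_final = complex_weights((1, 4, 3, 3))
out = agcnn(C, gated_weights, w_final)           # shape (1, 8, 6), complex
```

Because the gate depends only on |C| and lies between 0 and 1, it scales each element of the complex convolution output without changing its phase, and it needs no amplitude spectrogram A as a second input.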

The estimation device 100a and the learning device 200a described in this embodiment can be implemented by a computer having a CPU (Central Processing Unit), a storage device, and an interface, together with a program that controls these hardware resources. FIG. 7 shows a configuration example of this computer. The computer includes a CPU 300, a storage device 301, and an interface device (I/F) 302.

A network or the like, for example, is connected to the I/F 302. The estimation program for implementing the estimation method of the present invention is stored in the storage device 301. The program may also be provided over a network. The CPU 300 executes the processing described in this embodiment in accordance with the program stored in the storage device 301.

Although the estimation device 100a and the learning device 200a are configured here by executing a predetermined program on a computer, at least part of the processing of the estimation device and the learning device may instead be implemented in hardware.

The present invention can be applied to acoustic signal restoration techniques that restore the phase spectrum from the amplitude spectrum alone.

100a...Estimation device, 110a-0 to 110a-(M-1)...Estimation unit, 111, 120, 211...Phase assignment unit, 112, 212...Conversion unit, 113a...Phase change unit, 114a, 213a...DNN, 115, 214...Subtraction unit, 200a...Learning device, 209...Noise addition unit, 215...Parameter update unit, 1000, 1000a, 1001, 1001a, 1002, 1002a...Gated complex convolution layer, 1003, 1004...Complex convolution layer, 1005, 1005a...Amplitude gate layer, 1006...Multiplication unit.

Claims

1. An estimation method for estimating a complex spectrogram of a noise component of a desired acoustic signal using a deep neural network, the method comprising:
a gated complex convolution step performed multiple times; and
a first complex convolution step of performing a convolution operation on an output of the final gated complex convolution step,
wherein the first gated complex convolution step includes:
a second complex convolution step of performing a convolution operation on a first complex spectrogram;
a first amplitude gate operation step of receiving the first complex spectrogram and an amplitude spectrogram of the desired acoustic signal as input and selecting only a region of the first complex spectrogram that needs to be corrected; and
a first multiplication step of outputting a result of multiplying an output of the second complex convolution step by an output of the first amplitude gate operation step,
and wherein each gated complex convolution step other than the first includes:
a third complex convolution step of performing a convolution operation on an output of the immediately preceding gated complex convolution step;
a second amplitude gate operation step of receiving the output of the immediately preceding gated complex convolution step and the amplitude spectrogram as input and selecting only a region of that output that needs to be corrected; and
a second multiplication step of outputting a result of multiplying an output of the third complex convolution step by an output of the second amplitude gate operation step.

2. An estimation method for estimating a complex spectrogram of a noise component of a desired acoustic signal using a deep neural network, the method comprising:
a gated complex convolution step performed multiple times; and
a first complex convolution step of performing a convolution operation on an output of the final gated complex convolution step,
wherein the first gated complex convolution step includes:
a second complex convolution step of performing a convolution operation on a first complex spectrogram;
a first amplitude gate operation step of selecting only a region of the first complex spectrogram that needs to be corrected; and
a first multiplication step of outputting a result of multiplying an output of the second complex convolution step by an output of the first amplitude gate operation step,
wherein each gated complex convolution step other than the first includes:
a third complex convolution step of performing a convolution operation on an output of the immediately preceding gated complex convolution step;
a second amplitude gate operation step of selecting only a region of the output of the immediately preceding gated complex convolution step that needs to be corrected; and
a second multiplication step of outputting a result of multiplying an output of the third complex convolution step by an output of the second amplitude gate operation step,
wherein the first amplitude gate operation step includes a step of performing an amplitude gate operation by Sigmoid(|C|*W_R), where C is the first complex spectrogram and W_R is a real-valued weight parameter, and
wherein the second amplitude gate operation step includes a step of performing an amplitude gate operation by Sigmoid(|C|*W_R), where C is the complex spectrogram output by the immediately preceding gated complex convolution step.

3. The estimation method according to claim 1, further comprising:
a phase assignment step of receiving, as input, a second complex spectrogram whose phase and amplitude are mutually inconsistent and an amplitude spectrogram of a desired acoustic signal, assigning the phase of the second complex spectrogram to the amplitude spectrogram, and obtaining a first signal after the phase assignment;
a transform step of transforming the first signal into a time waveform by an inverse short-time Fourier transform, and transforming the resulting time waveform into a second signal in the frequency domain by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and
a phase changing step of changing the phase of the second complex spectrogram so as to approach the phase of the desired acoustic signal, based on a statistical property of a training acoustic signal corresponding to the desired acoustic signal,
wherein the phase changing step includes the gated complex convolution step performed multiple times and the first complex convolution step, and includes a step of inputting the second complex spectrogram, the first signal, and the second signal to the deep neural network as the first complex spectrogram and outputting a difference between an output of the transform step and an output of the deep neural network.

4. The estimation method according to claim 3,
wherein the deep neural network is trained in advance using a complex spectrogram obtained from a training acoustic signal and its amplitude spectrogram,
and outputs an estimate of a residual between the second signal and the second complex spectrogram.

5. An estimation program that causes a computer to execute each step recited in any one of claims 1 to 4.

6. A deep neural network device that estimates a complex spectrogram of a noise component of a desired acoustic signal, the device comprising:
a plurality of cascaded gated complex convolution layers; and
a first complex convolution layer configured to perform a convolution operation on an output of the final-stage gated complex convolution layer,
wherein the first-stage gated complex convolution layer includes:
a second complex convolution layer configured to perform a convolution operation on a first complex spectrogram;
a first amplitude gate layer configured to receive the first complex spectrogram and an amplitude spectrogram of the desired acoustic signal as input and select only a region of the first complex spectrogram that needs to be corrected; and
a first multiplication unit configured to output a result of multiplying an output of the second complex convolution layer by an output of the first amplitude gate layer,
and wherein each gated complex convolution layer other than the first stage includes:
a third complex convolution layer configured to perform a convolution operation on an output of the gated complex convolution layer of the preceding stage;
a second amplitude gate layer configured to receive the output of the gated complex convolution layer of the preceding stage and the amplitude spectrogram as input and select only a region of that output that needs to be corrected; and
a second multiplication unit configured to output a result of multiplying an output of the third complex convolution layer by an output of the second amplitude gate layer.

7. A deep neural network device that estimates a complex spectrogram of a noise component of a desired acoustic signal, the device comprising:
a plurality of cascaded gated complex convolution layers; and
a first complex convolution layer configured to perform a convolution operation on an output of the final-stage gated complex convolution layer,
wherein the first-stage gated complex convolution layer includes:
a second complex convolution layer configured to perform a convolution operation on a first complex spectrogram;
a first amplitude gate layer configured to select only a region of the first complex spectrogram that needs to be corrected; and
a first multiplication unit configured to output a result of multiplying an output of the second complex convolution layer by an output of the first amplitude gate layer,
wherein each gated complex convolution layer other than the first stage includes:
a third complex convolution layer configured to perform a convolution operation on an output of the gated complex convolution layer of the preceding stage;
a second amplitude gate layer configured to select only a region of the output of the gated complex convolution layer of the preceding stage that needs to be corrected; and
a second multiplication unit configured to output a result of multiplying an output of the third complex convolution layer by an output of the second amplitude gate layer,
wherein the first amplitude gate layer performs an amplitude gate operation by Sigmoid(|C|*W_R), where C is the first complex spectrogram and W_R is a real-valued weight parameter, and
wherein the second amplitude gate layer performs an amplitude gate operation by Sigmoid(|C|*W_R), where C is the complex spectrogram output by the gated complex convolution layer of the preceding stage.

8. An estimation device comprising:
a phase assignment unit configured to receive, as input, a second complex spectrogram whose phase and amplitude are mutually inconsistent and an amplitude spectrogram of a desired acoustic signal, assign the phase of the second complex spectrogram to the amplitude spectrogram, and obtain a first signal after the phase assignment;
a transform unit configured to transform the first signal into a time waveform by an inverse short-time Fourier transform, and to transform the transformed time waveform into a second signal in a frequency domain by a short-time Fourier transform corresponding to the inverse short-time Fourier transform;
a phase changing unit configured to change a phase of the second complex spectrogram to be closer to a phase of a desired acoustic signal based on a statistical property of a training acoustic signal corresponding to the desired acoustic signal,
wherein the phase changing unit includes the deep neural network device according to claim 6 or 7, uses the second complex spectrogram, the first signal, and the second signal as a first complex spectrogram to be input to the deep neural network device, and outputs a difference between an output of the transform unit and an output of the deep neural network device.