JP2020122855A

JP2020122855A - Estimation device, method thereof and program

Info

Publication number: JP2020122855A
Application number: JP2019014052A
Authority: JP
Inventors: Yuma Koizumi; Yoshiki Masuyama; Kohei Yatabe
Original assignee: Waseda University; Nippon Telegraph and Telephone Corp
Current assignee: Waseda University; Nippon Telegraph and Telephone Corp
Priority date: 2019-01-30
Filing date: 2019-01-30
Publication date: 2020-08-13
Anticipated expiration: 2039-01-30
Also published as: JP7120573B2

Abstract

PROBLEM TO BE SOLVED: To provide an estimation device and the like that restore a consistent phase spectrum from an amplitude spectrum alone by exploiting the statistical properties of the signal to be restored. SOLUTION: The estimation device has an estimation unit that estimates a phase spectrogram that brings the amplitude spectrogram A closer to a desired acoustic signal by combining (i) a process of converting a complex spectrogram whose phase and amplitude are inconsistent into a time waveform and converting that time waveform into a complex spectrogram whose phase and amplitude are consistent, (ii) a process of converting the amplitude to the magnitude of the amplitude spectrogram A of the desired acoustic signal, and (iii) a process of bringing the phase spectrogram closer to the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal. SELECTED DRAWING: FIG. 1

Description

The present invention relates to an estimation device that estimates and restores a phase spectrum from an amplitude spectrum alone, a method thereof, and a program.

An STFT (short-time Fourier transform) spectrum is complex-valued, so recovering a time signal from an STFT spectrogram requires both (1) an amplitude spectrogram and (2) a phase spectrogram. However, because the phase spectrum is difficult to handle, speech synthesis and speech enhancement often estimate or control only the amplitude spectrum, substitute the minimum phase or the observed phase for the phase spectrum, and then convert back to a time signal. The amplitude spectrogram and the phase spectrogram are not independent variables, so when one is controlled, the other must take correspondingly matching values. Consequently, in speech synthesis and speech enhancement, inconsistency between the amplitude and the phase can degrade the quality of the output sound.

Non-Patent Document 1 is known as a technique for estimating, from an amplitude spectrogram, a phase spectrogram that is consistent with it. The technique of Non-Patent Document 1 (called the Griffin-Lim algorithm) estimates a consistent phase spectrogram from an amplitude spectrogram A by repeating the following procedure.

Here, X is a complex spectrogram whose amplitude is A, G and G^† are the short-time Fourier transform (STFT) and the inverse STFT,

and |·| denotes the element-wise absolute value. This method is equivalent to solving the following optimization problem.

Here, ||·||²_Fro denotes the squared Frobenius norm, and B is the set of spectrograms whose amplitude is A. As described above, the phase spectrum is substituted with the minimum phase or the observed phase, so applying the STFT and inverse STFT of equation (1) to the complex spectrogram X does not return the original complex spectrogram X. Therefore, equation (2) fixes the amplitude to the given amplitude spectrogram A, and equation (3) determines the phase so that the result is a valid short-time Fourier transform representation.
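The procedure and the optimization problem referred to in this passage are not reproduced in the text. A reconstruction that agrees with the surrounding definitions and with the descriptions of equations (1) to (3) is sketched below. The symbols $\odot$ and $\oslash$ (element-wise multiplication and division), the iteration index $m$, and the unnumbered first line are notational assumptions.

```latex
\begin{align}
X^{[m+1]} &= P_C\bigl(P_B(X^{[m]})\bigr) \notag \\
P_C(X) &= G G^{\dagger} X \tag{1} \\
P_B(X) &= A \odot X \oslash |X| \tag{2} \\
\min_{X} \; \bigl\| X - P_C(X) \bigr\|_{\mathrm{Fro}}^{2}
  \quad &\text{s.t.} \quad X \in B \tag{3}
\end{align}
```

Read this way, $P_C$ maps any complex spectrogram to one that is the STFT of some time waveform, and $P_B$ resets the amplitude to $A$ while keeping the current phase.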

Non-Patent Document 1: D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform", IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.

However, while the method of Non-Patent Document 1 is applicable to any acoustic signal, it requires a very large number of iterations. This is because the optimization framework makes no assumption whatsoever about the statistical properties of the signal to be restored (hereinafter also referred to as the desired acoustic signal).

An object of the present invention is to provide an estimation device, a method thereof, and a program that restore a consistent phase spectrum from an amplitude spectrum alone by exploiting the statistical properties of the signal to be restored.

In order to solve the above problem, according to one aspect of the present invention, an estimation device has an estimation unit that estimates a phase spectrogram that brings the amplitude spectrogram A closer to a desired acoustic signal by combining (i) a process of converting a complex spectrogram whose phase and amplitude are inconsistent into a time waveform and converting that time waveform into a complex spectrogram whose phase and amplitude are consistent, (ii) a process of converting the amplitude to the magnitude of the amplitude spectrogram A of the desired acoustic signal, and (iii) a process of bringing the phase spectrogram closer to the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal.

In order to solve the above problem, according to another aspect of the present invention, an estimation device includes: a phase imparting unit that imparts the phase of a complex spectrogram X to the amplitude spectrogram A of a desired acoustic signal and obtains the resulting signal Y; a conversion unit that converts the signal Y into a time waveform by an inverse short-time Fourier transform and converts that time waveform into a frequency-domain signal Z by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and a phase changing unit that, using the complex spectrogram X, the signal Y, and the signal Z, brings the phase of the complex spectrogram X closer to the phase of the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal.

According to the present invention, a consistent phase spectrum can be restored from an amplitude spectrum alone with less computation than the conventional technique, by exploiting the statistical properties of the signal to be restored.

FIG. 1 is a functional block diagram of the estimation device according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the estimation device according to the first embodiment.
FIG. 3 is a functional block diagram of the estimation unit according to the first embodiment.
FIG. 4 is a functional block diagram of the learning device according to the first embodiment.
FIG. 5 is a diagram showing an example of the processing flow of the learning device according to the first embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the immediately following character, but owing to the limitations of text notation they are written immediately before that character. In equations, these symbols are written in their proper positions. Unless otherwise noted, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix.

<Points of the first embodiment>
In this embodiment, deep learning is incorporated into the method of Non-Patent Document 1. Phase restoration using deep learning has also been proposed elsewhere, for example in Reference 1.
(Reference 1) K. Oyamada, H. Kameoka, K. Tanaka, T. Kaneko, N. Hojo, and H. Ando, "Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms", in Eur. Signal Process. Conf. (EUSIPCO), Sept. 2018.

The difference between such methods and this embodiment is that Reference 1 uses a large-scale neural network to restore the phase in an end-to-end manner, whereas this embodiment uses a DNN (deep neural network) for part of the iterative optimization of Non-Patent Document 1, thereby reducing the number of parameters that must be learned.

In addition, because the number of iterations corresponds directly to the stacking (deepening) of the neural network, the network shape does not need to match between training and testing, unlike conventional neural networks. Another feature is scalability with respect to the trade-off between processing time and restoration accuracy, which can be adjusted to the computing power and accuracy requirements at the time of use.

As described above, this embodiment incorporates deep learning into the Griffin-Lim algorithm. For example, a DNN trained on training data is used to incorporate the statistical properties of the signal to be restored into the Griffin-Lim algorithm. FIG. 1 is a functional block diagram of the estimation device 100 according to the first embodiment, and FIG. 2 shows an example of its processing flow. The estimation device 100 includes M estimation units 110-m (m = 0, 1, 2, ..., M-1, where M is an integer of 1 or more). FIG. 3 is a functional block diagram of the estimation unit 110-m. The estimation unit 110-m includes a phase imparting unit 111 corresponding to equation (2) and a conversion unit 112 corresponding to equation (1), and further includes a phase changing unit 113 that brings the phase of the complex spectrogram X closer to the phase of the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal.

With the configuration of FIGS. 1 and 3, DNN processing is performed after each single iteration of the Griffin-Lim algorithm, which realizes consistent phase estimation that takes the statistical properties of the signal to be restored into account. This is equivalent to stacking the internal DNN as many times as the number of iterations (M). In other words, the scale of the DNN at processing time can be changed by controlling the number of repetitions (M) of this processing block. For example, when the DNN in the DNN unit 113-1 has 3 layers, the blocks as a whole function as a DNN of 3, 6, 9, ... layers for M = 1, 2, 3, ..., respectively. Reducing the number of iterations is equivalent to using a shallow DNN: performance decreases, but computation is fast. Conversely, increasing the number of iterations is equivalent to using a deep DNN: processing is slower, but a high-quality output sound can be obtained.

The only condition on the DNN used here is that it performs processing that brings the phase of the output sound of the Griffin-Lim algorithm closer to the signal to be restored, based on the statistical properties of that signal (it suffices that the DNN is trained in some manner from training data of the signal to be restored). As one example, the following residual learning is shown as an embodiment.
Y^[m] = P_B(X^[m])  (4)
Z^[m] = P_C(Y^[m])  (5)
X^[m+1] = E(X^[m])  (6)
= Z^[m] - F_θ(X^[m], Y^[m], Z^[m])  (7)
Here, F_θ is a DNN implemented in some form. In other words, the DNN, trained on the statistical properties of the signal to be restored, removes (subtracts) the distortion and estimation error produced by the Griffin-Lim algorithm. The DNN therefore does not estimate the signal to be restored directly; it estimates the component that is not the signal to be restored. The DNN can be trained, for example, so as to minimize the following objective function.
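Equations (4) to (7) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the STFT settings (`N_FFT`, `HOP`, the sine window), the helper names `stft`, `istft`, `p_b`, `p_c`, and `estimation_block`, and the two stand-in functions `f_zero` and `f_oracle` used in place of a trained DNN are all hypothetical.

```python
import numpy as np

N_FFT, HOP = 64, 16
# Sine window: non-zero at every sample, so the overlap-add normaliser below is never zero.
WINDOW = np.sin(np.pi * (np.arange(N_FFT) + 0.5) / N_FFT)

def stft(x):
    """G: real waveform -> complex spectrogram of shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[t * HOP:t * HOP + N_FFT] * WINDOW for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """G^dagger: least-squares inverse STFT (windowed overlap-add, then normalisation)."""
    length = N_FFT + HOP * (spec.shape[0] - 1)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    x, norm = np.zeros(length), np.zeros(length)
    for t in range(spec.shape[0]):
        x[t * HOP:t * HOP + N_FFT] += frames[t]
        norm[t * HOP:t * HOP + N_FFT] += WINDOW ** 2
    return x / norm

def p_b(amplitude, x):
    """Keep the phase of x and replace its magnitude by A."""
    return amplitude * np.exp(1j * np.angle(x))

def p_c(y):
    """Inverse STFT followed by STFT, giving a consistent spectrogram."""
    return stft(istft(y))

def estimation_block(x, amplitude, f_theta):
    y = p_b(amplitude, x)            # eq. (4)
    z = p_c(y)                       # eq. (5)
    return z - f_theta(x, y, z)      # eqs. (6)-(7)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(N_FFT + HOP * 19)      # toy "desired" signal
x_true = stft(waveform)                               # X^*
amplitude = np.abs(x_true)                            # A
x0 = amplitude * np.exp(1j * rng.uniform(-np.pi, np.pi, amplitude.shape))

f_zero = lambda x, y, z: np.zeros_like(z)   # no correction: a plain Griffin-Lim step
f_oracle = lambda x, y, z: z - x_true       # ideal residual; needs X^*, illustration only

x_gl = estimation_block(x0, amplitude, f_zero)
x_oracle = estimation_block(x0, amplitude, f_oracle)
```

With `f_zero` the block reduces to one Griffin-Lim iteration, and with `f_oracle` it returns the true spectrogram in a single step, which is the residual a trained F_θ is meant to approximate without access to X^*.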

Here, X^* is the true complex spectrogram, ~X = X^* + N, N is complex Gaussian noise, ~Y = P_B(~X), and ~Z = P_C(~Y). Because the Griffin-Lim algorithm restores only the phase spectrum, the amplitude of ~Y is made to match the amplitude of X^*.
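The objective function itself is not reproduced in the text. Judging from the inputs of the parameter updating unit 215 (the difference ~Z - X^* and the estimate F_θ(~X, ~Y, ~Z)), one plausible form is sketched below. The choice of norm and the expectation over training pairs are assumptions.

```latex
\min_{\theta} \;
\mathbb{E}_{X^{*},\,N}
\Bigl[ \bigl\| (\tilde{Z} - X^{*}) - F_{\theta}(\tilde{X}, \tilde{Y}, \tilde{Z}) \bigr\| \Bigr],
\qquad
\tilde{X} = X^{*} + N, \quad
\tilde{Y} = P_B(\tilde{X}), \quad
\tilde{Z} = P_C(\tilde{Y})
```

Under this reading, F_θ is trained to predict the residual that remains after one amplitude projection and one consistency projection, which matches the subtraction in equation (7).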

This embodiment consists of a DNN learning stage and a phase spectrum estimation stage. The learning stage is described first.
<Learning device according to the first embodiment>
FIG. 4 is a functional block diagram of the learning device 200 of this embodiment, and FIG. 5 shows an example of its processing flow.

The learning device 200 receives as input the training data of the signal to be restored (a clean acoustic signal X^(L)*, represented as a complex spectrogram), the amplitude spectrogram A^(L) corresponding to the clean acoustic signal X^(L)*, noise N, and the parameters required for the optimization, and outputs a trained DNN.

The learning device 200 includes a noise addition unit 209, a phase imparting unit 211, a conversion unit 212, a DNN unit 213, a subtraction unit 214, and a parameter updating unit 215.

For example, in an initialization unit (not shown), the learning device 200 initializes the parameter θ of the DNN used in the DNN unit 213 with suitable random numbers (S208).

<Noise addition unit 209>
The noise addition unit 209 receives the clean acoustic signal X^(L)* and the noise N as input, adds the noise N to the clean acoustic signal X^(L)* (S209), and obtains and outputs the complex spectrogram ~X (= X^(L)* + N).

<Phase imparting unit 211>
The phase imparting unit 211 receives as input the complex spectrogram ~X and the amplitude spectrogram A^(L) corresponding to the clean acoustic signal X^(L)*, imparts the phase of the complex spectrogram ~X to the amplitude spectrogram A^(L) as shown in the following equation (S211), and obtains and outputs the resulting signal ~Y = P_B(~X).

~Y = P_B(~X) = A^(L) ⊙ ~X ⊘ |~X|  (12)

Here, ⊙ and ⊘ denote element-wise multiplication and division. The term ~X ⊘ |~X| corresponds to the process of extracting the phase of the complex spectrogram ~X, and equation (12) corresponds to the process of imparting the extracted phase of the complex spectrogram ~X to the amplitude spectrogram A^(L). Because equation (12) multiplies each element of the complex spectrogram ~X by the corresponding element of the amplitude spectrogram A^(L) and divides the product by the amplitude spectrogram |~X| of the complex spectrogram ~X, it can also be regarded as a process of converting the amplitude of the complex spectrogram ~X to the magnitude of the amplitude spectrogram A^(L).
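Steps S209 and S211 can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: the function names `add_complex_noise` and `impart_phase`, the noise scale `sigma`, the random stand-in for the clean spectrogram, and the handling of zero-magnitude bins are assumptions.

```python
import numpy as np

def add_complex_noise(x_clean, sigma, rng):
    """S209: add complex Gaussian noise N to the clean spectrogram (X~ = X* + N)."""
    noise = sigma * (rng.standard_normal(x_clean.shape)
                     + 1j * rng.standard_normal(x_clean.shape))
    return x_clean + noise

def impart_phase(amplitude, x):
    """S211: element-wise A * X / |X|, i.e. amplitude A combined with the phase of X."""
    # exp(j * angle(X)) equals X / |X| and stays finite (phase 0) where |X| == 0.
    return amplitude * np.exp(1j * np.angle(x))

rng = np.random.default_rng(0)
x_clean = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))  # stand-in for X^(L)*
amplitude = np.abs(x_clean)                                               # A^(L)
x_noisy = add_complex_noise(x_clean, sigma=0.1, rng=rng)                  # X~
y_noisy = impart_phase(amplitude, x_noisy)                                # Y~ = P_B(X~)
```

Using `np.angle` instead of an explicit division is a design choice that avoids dividing by zero in silent bins. The result has the magnitude of the clean signal and the phase of the noisy one.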

<Conversion unit 212>
The conversion unit 212 receives the signal ~Y as input and, according to the following equation, converts the signal ~Y into a time waveform by the inverse short-time Fourier transform G^†, converts that time waveform into the frequency-domain signal ~Z = P_C(~Y) by the short-time Fourier transform G corresponding to the inverse short-time Fourier transform G^† (S212), and outputs it.

~Z = P_C(~Y) = G G^† ~Y
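The conversion in S212 can be sketched as follows. The STFT parameters (`N_FFT`, `HOP`, the sine window) and the names `stft`, `istft`, and `project_consistent` are assumptions chosen for illustration. The inverse used here is the windowed overlap-add with normalisation, which is one common way to realise G^†.

```python
import numpy as np

N_FFT, HOP = 64, 16
# Sine window: non-zero at every sample, so the overlap-add normaliser below is never zero.
WINDOW = np.sin(np.pi * (np.arange(N_FFT) + 0.5) / N_FFT)

def stft(x):
    """G: real waveform -> complex spectrogram of shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[t * HOP:t * HOP + N_FFT] * WINDOW for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """G^dagger: least-squares inverse STFT (windowed overlap-add, then normalisation)."""
    length = N_FFT + HOP * (spec.shape[0] - 1)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    x, norm = np.zeros(length), np.zeros(length)
    for t in range(spec.shape[0]):
        x[t * HOP:t * HOP + N_FFT] += frames[t]
        norm[t * HOP:t * HOP + N_FFT] += WINDOW ** 2
    return x / norm

def project_consistent(spec):
    """S212: P_C(Y) = G G^dagger Y."""
    return stft(istft(spec))

rng = np.random.default_rng(0)
shape = (20, N_FFT // 2 + 1)
y = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)  # arbitrary, inconsistent
z = project_consistent(y)                                         # consistent
inconsistency_before = np.linalg.norm(y - project_consistent(y))
inconsistency_after = np.linalg.norm(z - project_consistent(z))
```

Applying the projection a second time leaves `z` unchanged up to rounding, which is what "phase and amplitude are consistent" means operationally: `z` is exactly the STFT of some time waveform.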

<DNN unit 213>
The DNN unit 213 receives as input the initial value of the parameter θ or the parameter θ updated by the parameter updating unit 215 described later, the complex spectrogram ~X, the signal ~Y, and the signal ~Z, estimates by the DNN the distortion or estimation error produced by the Griffin-Lim algorithm (S213), and outputs the estimate F_θ(~X, ~Y, ~Z).

<Subtraction unit 214>
The subtraction unit 214 receives the signal ~Z and the clean acoustic signal X^(L)* as input, computes their difference (S214), and outputs the difference (the complex spectrogram ~Z - X^(L)*).

<Parameter updating unit 215>
The parameter updating unit 215 receives as input the difference (the complex spectrogram ~Z - X^(L)*) and the estimate F_θ(~X, ~Y, ~Z), and, using these values, updates the parameter θ of the DNN so that the estimate F_θ(~X, ~Y, ~Z) approaches the difference ~Z - X^(L)* (S215-1). A stochastic steepest descent (stochastic gradient descent) method or the like may be used for training, with the learning rate set to about 10^-5. The parameter updating unit 215 then determines whether a predetermined condition is satisfied (S215-2); if so, it outputs the DNN at that point as the trained DNN. If not, it outputs the updated parameter θ to the DNN unit 213, and S209 to S215-1 are repeated using a new clean acoustic signal X^(L)*, new noise N, and the updated parameter θ. The predetermined condition may be, for example, whether training has been repeated a fixed number of times (for example, 100,000 times).
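The loop S209 to S215 can be illustrated with a deliberately small stand-in for the DNN. Everything specific here is an assumption made to keep the sketch runnable without a deep learning framework: F_θ is replaced by a three-weight linear map F_θ(X, Y, Z) = θ0 X + θ1 Y + θ2 Z, the loss is the squared error, the update is a normalised gradient step rather than stochastic steepest descent with a learning rate of about 10^-5, θ starts at zero rather than at random values, and the clean signals are random waveforms.

```python
import numpy as np

N_FFT, HOP = 64, 16
WINDOW = np.sin(np.pi * (np.arange(N_FFT) + 0.5) / N_FFT)  # non-zero everywhere

def stft(x):
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[t * HOP:t * HOP + N_FFT] * WINDOW for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    length = N_FFT + HOP * (spec.shape[0] - 1)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    x, norm = np.zeros(length), np.zeros(length)
    for t in range(spec.shape[0]):
        x[t * HOP:t * HOP + N_FFT] += frames[t]
        norm[t * HOP:t * HOP + N_FFT] += WINDOW ** 2
    return x / norm

def make_example(rng, sigma=0.5):
    """Draw a new clean signal and new noise, then run S209 to S214."""
    x_clean = stft(rng.standard_normal(N_FFT + HOP * 19))     # X^(L)*, toy stand-in
    amplitude = np.abs(x_clean)                                # A^(L)
    noise = sigma * (rng.standard_normal(x_clean.shape)
                     + 1j * rng.standard_normal(x_clean.shape))
    x = x_clean + noise                                        # S209
    y = amplitude * np.exp(1j * np.angle(x))                   # S211: P_B
    z = stft(istft(y))                                         # S212: P_C
    features = np.stack([x.ravel(), y.ravel(), z.ravel()], axis=1)
    target = (z - x_clean).ravel()                             # S214
    return features, target

def loss(theta, features, target):
    """Squared error between the target residual and the linear stand-in for F_theta."""
    return float(np.sum(np.abs(target - features @ theta) ** 2))

def update(theta, features, target, step=0.5):
    """S215-1: one normalised gradient step on the squared error."""
    residual = target - features @ theta
    direction = features.conj().T @ residual
    return theta + step * direction / np.sum(np.abs(features) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3, dtype=complex)   # S208 (zeros here; the text uses random initial values)
for _ in range(200):                 # S215-2: stop after a fixed number of updates
    features, target = make_example(rng)
    theta = update(theta, features, target)
```

A real implementation would replace the linear map with a neural network and the hand-written step with the optimiser of the chosen framework. The data flow (new clean signal, new noise, P_B, P_C, residual target) is the part this sketch is meant to show.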

The above processing realizes the DNN learning stage. The phase spectrum estimation stage is described next.
<Estimation device 100>
As described above, FIG. 1 is a functional block diagram of the estimation device 100 of this embodiment, and FIG. 2 shows an example of its processing flow.

The estimation device 100 receives as input the amplitude spectrogram A and a complex spectrogram X^[0] whose phase and amplitude are inconsistent, and obtains and outputs a complex spectrogram Y^[M] whose phase spectrogram is consistent with the amplitude spectrogram A. Here, the amplitude of the complex spectrogram X^[0] is the amplitude spectrogram A.

The estimation device 100 includes M estimation units 110-m and a phase imparting unit 120 (see FIG. 1).

<Estimation unit 110-m>
The estimation unit 110-m receives as input the amplitude spectrogram A of the desired acoustic signal and a complex spectrogram X^[m] whose phase and amplitude are inconsistent, and obtains and outputs a complex spectrogram X^[m+1] having the estimated phase spectrogram. For example, the estimation unit 110-m estimates a phase spectrogram that brings the amplitude spectrogram A closer to the desired acoustic signal (S110) by combining (i) a process of converting a complex spectrogram whose phase and amplitude are inconsistent into a time waveform and converting that time waveform into a complex spectrogram whose phase and amplitude are consistent, (ii) a process of converting the amplitude to the magnitude of the amplitude spectrogram A of the desired acoustic signal, and (iii) a process of bringing the phase spectrogram closer to the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal.

FIG. 3 is a functional block diagram of the estimation unit 110-m. The estimation unit 110-m includes a phase imparting unit 111, a conversion unit 112, and a phase changing unit 113, and the phase changing unit 113 in turn includes a DNN unit 113-1 and a subtraction unit 113-2.

The DNN trained by the learning device 200 is set in the DNN unit 113-1 of the phase changing unit 113 of each estimation unit 110-m. As described above, because the number of iterations corresponds directly to the stacking (deepening) of the neural network, the network shape does not need to match between training and testing, unlike conventional neural networks, and at training time it suffices to train one DNN rather than M DNNs. At estimation time, the number of iterations (M) can be controlled according to the computing power and accuracy requirements, giving scalability with respect to the trade-off between processing time and restoration accuracy. For example, about M = 5 iterations may be executed.

<Phase imparting unit 111>
The phase imparting unit 111 receives as input the amplitude spectrogram A of the desired acoustic signal and a complex spectrogram X^[m] whose phase and amplitude are inconsistent, imparts the phase of the complex spectrogram X^[m] to the amplitude spectrogram A as shown in the following equation (S111), and obtains and outputs the resulting signal Y^[m] = P_B(X^[m]).

Y^[m] = P_B(X^[m]) = A ⊙ X^[m] ⊘ |X^[m]|  (21)

Here, ⊙ and ⊘ denote element-wise multiplication and division. The term X^[m] ⊘ |X^[m]| corresponds to the process of extracting the phase of the complex spectrogram X^[m], and equation (21) corresponds to the process of imparting the extracted phase of the complex spectrogram X^[m] to the amplitude spectrogram A. Because equation (21) multiplies each element of the complex spectrogram X^[m] by the corresponding element of the amplitude spectrogram A and divides the product by the amplitude spectrogram |X^[m]| of the complex spectrogram X^[m], it can also be regarded as a process of converting the amplitude of the complex spectrogram X^[m] to the magnitude of the amplitude spectrogram A.

<Conversion unit 112>
The conversion unit 112 receives the signal Y^[m] as input and, according to the following equation, converts the signal Y^[m] into a time waveform by the inverse short-time Fourier transform G^†, converts that time waveform into the frequency-domain signal Z^[m] = P_C(Y^[m]) by the short-time Fourier transform G corresponding to the inverse short-time Fourier transform G^† (S112), and outputs it.

Z^[m] = P_C(Y^[m]) = G G^† Y^[m]

This process corresponds to converting the complex spectrogram Y^[m], whose phase and amplitude are inconsistent, into a time waveform and converting that time waveform into the complex spectrogram Z^[m], whose phase and amplitude are consistent.

<Phase changing unit 113>
The phase changing unit 113 uses the complex spectrogram X^[m], the signal Y^[m], and the signal Z^[m] to bring the phase of the complex spectrogram X^[m] closer to the phase of the desired acoustic signal based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal (S113), and outputs the resulting signal X^[m+1]. For example, the phase changing unit 113 realizes this processing with the DNN unit 113-1 and the subtraction unit 113-2 described below.

<DNN unit 113-1>
The DNN unit 113-1 receives as input the complex spectrogram X^[m], the signal Y^[m], and the signal Z^[m], estimates the distortion or estimation error (Z^[m] - X^[m]) produced by the Griffin-Lim algorithm using a DNN based on the statistical properties of a training acoustic signal corresponding to the desired acoustic signal (S113-1), and outputs the estimate F_θ(X^[m], Y^[m], Z^[m]). The estimate F_θ(X^[m], Y^[m], Z^[m]) is a complex spectrogram, and its phase spectrogram can be obtained directly from it.

Therefore, the process of obtaining the complex spectrogram F_θ(X^[m], Y^[m], Z^[m]) and the process of obtaining its phase spectrogram can be regarded as equivalent.

<Subtraction unit 113-2>
The subtraction unit 113-2 receives the signal Z^[m] and the estimate F_θ(X^[m], Y^[m], Z^[m]) as input, computes their difference (S113-2), and outputs the difference (the complex spectrogram X^[m+1] = Z^[m] - F_θ(X^[m], Y^[m], Z^[m])). This subtraction corresponds to the process of removing the distortion or estimation error produced by the Griffin-Lim algorithm, and also to the process of bringing the phase spectrogram of the signal Z^[m] (which may also be regarded as the corresponding complex spectrogram X^[m]) closer to the desired acoustic signal.

As a whole, the estimation unit 110-m brings the amplitude spectrogram A closer to the desired acoustic signal, which is equivalent to estimating a phase spectrogram that brings the amplitude spectrogram A closer to the desired acoustic signal.

The processes S111 to S113-2 described above are repeated M times, once for each of the M estimation units 110-m, and the estimation unit 110-(M-1) obtains and outputs the complex spectrogram X^[M].

<Phase imparting unit 120>
The phase imparting unit 120 receives the complex spectrogram X^[M] as input, imparts the phase of X^[M] to the amplitude spectrogram A (S120), and outputs the resulting signal Y^[M] = P_B(X^[M]).

This process once again converts the amplitude of the complex spectrogram X^[M] to the magnitude of the amplitude spectrogram A.
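The phase imparting step can be illustrated with a short sketch. It assumes that P_B keeps the phase of each bin of X and replaces its magnitude with the corresponding entry of A; the function name `impart_phase` and the flattened list representation are illustrative choices, not taken from the embodiment.

```python
import cmath
from typing import List, Sequence


def impart_phase(A: Sequence[float], X: Sequence[complex]) -> List[complex]:
    """Return Y with |Y| = A and the phase of X, bin by bin."""
    Y = []
    for a, x in zip(A, X):
        # cmath.phase(0) is 0.0, so bins where X is zero receive zero phase.
        # The text does not specify this case; it is an assumption here.
        Y.append(cmath.rect(a, cmath.phase(x)))
    return Y
```

For example, `impart_phase([2.0], [1j])` yields a bin of magnitude 2 with phase π/2.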

<Effect>
With the above configuration, the statistical properties of the signal to be restored are used to restore a consistent phase spectrum from the amplitude spectrum alone, with less computation than the conventional technique.

<Modification>
In the present embodiment, the complex spectrogram X^[0], whose phase and amplitude are inconsistent, is given as input. Alternatively, only the amplitude spectrogram A may be given as input; in that case, a suitable phase spectrogram (initial value) is chosen at random for the amplitude spectrogram A, and the initial complex spectrogram X^[0] is created from the two.
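A random-phase initialization of this kind might look like the sketch below. The text says only that the initial phase is chosen at random; drawing each phase uniformly from [-π, π] and the name `random_phase_init` are assumptions made for illustration.

```python
import cmath
import math
import random
from typing import List, Optional, Sequence


def random_phase_init(A: Sequence[float], seed: Optional[int] = None) -> List[complex]:
    """Build X^[0] from amplitudes A and independently drawn random phases."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    # Assumed distribution: uniform over [-pi, pi] for every bin.
    return [cmath.rect(a, rng.uniform(-math.pi, math.pi)) for a in A]
```

The resulting X^[0] has exactly the magnitudes in A, so the first amplitude replacement leaves it unchanged and only the consistency step and the DNN act on it.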

In the present embodiment, the noise addition unit 209 is provided in order to build a DNN that is robust to noise. However, the noise addition unit 209 may be omitted, in which case the clean acoustic signal X^(L)* is used as is as the complex spectrogram ~X (= X^(L)*).
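The choice between the two configurations can be sketched as a single switch. The passage here does not state what kind of noise the noise addition unit 209 adds, so the additive complex Gaussian noise below, along with the names `make_training_input` and `noise_std`, is purely an assumption for illustration.

```python
import random
from typing import List, Optional, Sequence


def make_training_input(
    X_clean: Sequence[complex],
    noise_std: Optional[float] = None,
    seed: Optional[int] = None,
) -> List[complex]:
    """Return ~X: the clean spectrogram itself, or a noisy copy of it."""
    if noise_std is None:
        # Variant without the noise addition unit: ~X = X^(L)*
        return list(X_clean)
    rng = random.Random(seed)
    # Assumed noise model: independent Gaussian noise on real and imaginary parts.
    return [
        x + complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        for x in X_clean
    ]
```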

Although the present embodiment shows an example of residual learning, it suffices to include a process that, based on the statistical properties of the signal to be restored, brings the phase of the output signal of the Griffin-Lim algorithm closer to the signal to be restored.

<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Hardware configuration>
The learning device 200 and the estimation device 100 are, for example, special devices configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The learning device 200 and the estimation device 100 execute each process under the control of the central processing unit, for example. Data input to the learning device 200 and the estimation device 100, and data obtained in each process, are stored in the main storage device, for example, and are read out to the central processing unit as necessary and used for other processing. At least a part of each processing unit of the learning device 200 and the estimation device 100 may be configured by hardware such as an integrated circuit. Each storage unit of the learning device 200 and the estimation device 100 can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, the storage units need not be provided inside the learning device 200 and the estimation device 100. They may instead be configured by an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as a flash memory, and provided outside the learning device 200 and the estimation device 100.

<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to it. Further, each time a program is transferred from the server computer to this computer, the computer may execute the process according to the received program in sequence. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least a part of this processing content may instead be realized in hardware.

Claims

An estimation device comprising an estimation unit that estimates a phase spectrogram that brings an amplitude spectrogram A closer to a desired acoustic signal by associating (i) a process of converting a complex spectrogram whose phase and amplitude are inconsistent into a time waveform and converting the converted time waveform into a complex spectrogram whose phase and amplitude are consistent, (ii) a process of converting the amplitude to the magnitude of the amplitude spectrogram A of the desired acoustic signal, and (iii) a process of bringing the phase spectrogram closer to the desired acoustic signal based on the statistical properties of a learning acoustic signal corresponding to the desired acoustic signal.

An estimation device comprising:
a phase imparting unit that imparts the phase of a complex spectrogram X to the amplitude spectrogram A of a desired acoustic signal and obtains the resulting signal Y;
a conversion unit that converts the signal Y into a time waveform by an inverse short-time Fourier transform and converts the converted time waveform into a frequency-domain signal Z by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and
a phase changing unit that, using the complex spectrogram X, the signal Y, and the signal Z, brings the phase of the complex spectrogram X closer to the phase of the desired acoustic signal based on the statistical properties of a learning acoustic signal corresponding to the desired acoustic signal.

An estimation device, being the acoustic signal restoration device according to claim 2, wherein
the statistical properties of the learning acoustic signal are expressed by a deep neural network, and
the deep neural network is trained using a complex spectrogram X^(L)* obtained from the learning acoustic signal and its amplitude spectrogram A^(L), receives the complex spectrogram X, the signal Y, and the signal Z as input, and outputs an estimated value of the residual between the signal Z and the complex spectrogram X.

An estimation method comprising an estimation step of estimating a phase spectrogram that brings an amplitude spectrogram A closer to a desired acoustic signal by associating (i) a process of converting a complex spectrogram whose phase and amplitude are inconsistent into a time waveform and converting the converted time waveform into a complex spectrogram whose phase and amplitude are consistent, (ii) a process of converting the amplitude to the magnitude of the amplitude spectrogram A of the desired acoustic signal, and (iii) a process of bringing the phase spectrogram closer to the desired acoustic signal based on the statistical properties of a learning acoustic signal corresponding to the desired acoustic signal.

An estimation method comprising:
a phase imparting step of imparting the phase of a complex spectrogram X to the amplitude spectrogram A of a desired acoustic signal and obtaining the resulting signal Y;
a conversion step of converting the signal Y into a time waveform by an inverse short-time Fourier transform and converting the converted time waveform into a frequency-domain signal Z by the short-time Fourier transform corresponding to the inverse short-time Fourier transform; and
a phase changing step of, using the complex spectrogram X, the signal Y, and the signal Z, bringing the phase of the complex spectrogram X closer to the phase of the desired acoustic signal based on the statistical properties of a learning acoustic signal corresponding to the desired acoustic signal.

An estimation method, being the acoustic signal restoration method according to claim 5, wherein
the statistical properties of the learning acoustic signal are expressed by a deep neural network, and
the deep neural network is trained using a complex spectrogram X^(L)* obtained from the learning acoustic signal and its amplitude spectrogram A^(L), receives the complex spectrogram X, the signal Y, and the signal Z as input, and outputs an estimated value of the residual between the signal Z and the complex spectrogram X.

A program for causing a computer to function as the estimation device according to any one of claims 1 to 3.