JP6721165B2

JP6721165B2 - Input sound mask processing learning device, input data processing function learning device, input sound mask processing learning method, input data processing function learning method, program

Info

Publication number: JP6721165B2
Application number: JP2017157322A
Authority: JP
Inventors: 悠馬小泉; 健太丹羽; 和則小林; 陽一羽田
Original assignee: The University of Electro-Communications; Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: The University of Electro-Communications; Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2017-08-17
Filing date: 2017-08-17
Publication date: 2020-07-08
Anticipated expiration: 2037-08-17
Also published as: JP2019035862A

Description

The present invention relates to a learning technique that can be used to generate a mask for masking an input sound or a processing function for processing input data.

Sound source enhancement is a technique for emphasizing a desired target sound in an observed signal buried in noise. It has been studied for many years because of its wide range of applications, such as preprocessing for speech recognition, sound pickup for highly realistic audio, and hearing assistance. One way to implement it is processing based on a time-frequency mask, such as Wiener filtering.

To formulate sound source enhancement, the observed signal is first modeled. The observed signal of the m-th microphone is cut into segments several tens of milliseconds long, and a short-time Fourier transform (STFT) is applied to each time frame. The resulting signal X_{ω,τ} ∈ C^{Ω×Τ} is described as the superposition of the desired source signal S_{ω,τ} ∈ C^{Ω×Τ} and the noise N_{ω,τ} ∈ C^{Ω×Τ}, as follows.

Here, ω ∈ {1,…,Ω} and τ ∈ {1,…,Τ} are the frequency and time indices, respectively.

Non-linear filtering is processing based on a time-frequency mask that adjusts the gain of each time-frequency component. In sound source enhancement based on a time-frequency mask, the observed signal X_{ω,τ} is multiplied by a time-frequency mask G_{ω,τ} ∈ [0,1], which takes values from 0 to 1, to obtain a signal Ŝ_{ω,τ} ∈ C^{Ω×Τ} in which the source signal S_{ω,τ} is emphasized (see FIG. 1).
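The masking step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `apply_mask`, the nested-list layout (one row per frequency bin, one column per time frame) and the toy values are assumptions made for the example.

```python
def apply_mask(X, G):
    """Element-wise masking: S_hat[w][t] = G[w][t] * X[w][t]."""
    if len(X) != len(G) or any(len(x) != len(g) for x, g in zip(X, G)):
        raise ValueError("X and G must have the same shape")
    for row in G:
        for g in row:
            if not 0.0 <= g <= 1.0:
                raise ValueError("mask gains must lie in [0, 1]")
    return [[g * x for x, g in zip(x_row, g_row)]
            for x_row, g_row in zip(X, G)]


# Toy STFT with 2 frequency bins and 2 time frames (complex coefficients).
X = [[1 + 1j, 2 + 0j],
     [0 - 2j, 4 + 4j]]
# Hypothetical mask gains in [0, 1].
G = [[1.0, 0.5],
     [0.0, 0.25]]
S_hat = apply_mask(X, G)
```

A gain of 1 passes a bin unchanged and a gain of 0 removes it, which is how the mask suppresses noise-dominated components while keeping the phase of the observed signal.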

A typical way to compute the time-frequency mask G_{ω,τ} is the Wiener mask. The Wiener mask minimizes the mean squared error (MSE) between S_{ω,τ} and Ŝ_{ω,τ} when the source signal and all noise components are mutually uncorrelated and stationary. However, because source signals and noise are often non-stationary, the following time-varying Wiener mask G^WF_{ω,τ} is often used in practice.

Computing the Wiener mask requires estimating both the amplitude spectrum of the source signal |S_{ω,τ}| and that of the noise |N_{ω,τ}|. In practice, to reduce the computational cost and the number of values to be estimated, it is often assumed that the additivity of the source signal and the noise also holds in the power spectral domain, as follows,

and the Wiener mask is computed approximately by estimating only one of the source signal and the noise. For example, when the amplitude spectrum |S_{ω,τ}| of the source signal is estimated, the Wiener mask can be computed as follows.
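As a hedged sketch of the two computations just described, the code below assumes the standard textbook forms: the time-varying Wiener gain |S|² / (|S|² + |N|²), and, under the power-domain additivity |X|² ≈ |S|² + |N|², the approximation |Ŝ|² / |X|² that needs only an estimate of |S|. These forms, the function names and the clipping to [0, 1] are assumptions for illustration, not text taken from the patent.

```python
def wiener_mask(s_amp, n_amp):
    """Time-varying Wiener gain |S|^2 / (|S|^2 + |N|^2) for one time-frequency bin."""
    s_pow, n_pow = s_amp ** 2, n_amp ** 2
    denom = s_pow + n_pow
    return s_pow / denom if denom > 0.0 else 0.0


def approx_wiener_mask(s_amp_est, x_amp):
    """Approximate gain from an estimate of |S| only, assuming |X|^2 ~ |S|^2 + |N|^2."""
    x_pow = x_amp ** 2
    if x_pow == 0.0:
        return 0.0
    # Clipping is an illustrative safeguard: an over-estimated |S| must not
    # push the gain above 1.
    return min(1.0, s_amp_est ** 2 / x_pow)


exact = wiener_mask(3.0, 4.0)            # |S| = 3, |N| = 4
approx = approx_wiener_mask(3.0, 5.0)    # |X| = 5 when power additivity holds exactly
```

When the additivity assumption holds exactly, as in this toy case, both functions return the same gain; in real signals the cross term between source and noise makes the second one an approximation.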

In recent years, deep neural networks (DNNs) have been applied to time-frequency mask estimation as a projection function that non-linearly maps the observed signal to the parameters of the time-frequency mask (Non-Patent Document 1). Let x_τ be a vector in which the time-frequency elements of the observed signal X_{ω,τ} are arranged, and let y_τ be a vector in which the parameters for computing the time-frequency mask are arranged. The vector ŷ_τ is then estimated by the following equations (see FIG. 2). For example, the vector x_τ in FIG. 2 is a frame-concatenated amplitude spectrum or MFCCs (Mel-Frequency Cepstrum Coefficients), and the vector ŷ_τ is the amplitude spectrum of the source signal.

Here, L is the number of layers of the neural network, and W^(j) and b^(j) are the weight matrix and the bias vector of the j-th layer, respectively. That is, the DNN parameters Θ_Μ are Θ_Μ = {W^(j), b^(j) | j = 2,…,L}. σ_θ is a non-linear function called the activation function; a sigmoid function or a ramp function is used. Note that z_τ^(1) = x_τ. To take both the frequency information and the time information of the observed signal into account, the input vector x_τ of the DNN is, for example, a vector in which the time-frequency elements of the observed signal X_{ω,τ} ∈ C^{Ω×Τ} are arranged as follows.

Here, the superscript t to the right of the parentheses in Equation (9) denotes transposition. P_b and P_f are the numbers of preceding and following time frames taken into account, and are called the context window.
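The context-window construction can be illustrated as follows. This is a sketch under stated assumptions: frames are stored as lists of per-bin amplitude values with 0-based frame indices, the helper name `context_vector` is hypothetical, and frames that would fall outside the signal are filled by repeating the first or last frame, a boundary rule the text does not specify.

```python
def context_vector(frames, tau, p_b, p_f):
    """Concatenate frames tau - p_b .. tau + p_f into a single input vector."""
    last = len(frames) - 1
    stacked = []
    for t in range(tau - p_b, tau + p_f + 1):
        t_clamped = min(max(t, 0), last)  # assumption: repeat edge frames
        stacked.extend(frames[t_clamped])
    return stacked


# Three frames with two frequency bins each (toy amplitude values).
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
x_mid = context_vector(frames, tau=1, p_b=1, p_f=1)
x_first = context_vector(frames, tau=0, p_b=1, p_f=1)
```

The resulting vector has Ω × (P_b + P_f + 1) elements, so the DNN sees the current frame together with its temporal neighbourhood.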

When the time-frequency mask is computed from the amplitude spectrum |S_{ω,τ}| of the source signal, the output vector y_τ of the DNN is, for example, as follows.

The DNN parameters Θ_Μ are generated by supervised learning with error backpropagation. A large amount of data pairing observed signals with label data (parameters of the time-frequency mask), in this example pairs of vectors x_τ and y_τ, is prepared, and the parameters are trained to minimize a differentiable evaluation value such as the squared error.
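The conventional mapping and its training criterion can be sketched as below. The layer recursion z^(j) = σ(W^(j) z^(j-1) + b^(j)) with z^(1) = x follows the description of W^(j), b^(j) and the activation function; the linear output layer, the function names and the toy weights are assumptions, since the equations themselves are not reproduced in this text.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, z):
    """Compute W z + b for a list-of-rows matrix W."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]


def forward(x, layers):
    """layers = [(W2, b2), ..., (WL, bL)]; the input x plays the role of z^(1)."""
    z = x
    for W, b in layers[:-1]:
        z = [sigmoid(a) for a in affine(W, b, z)]
    W_out, b_out = layers[-1]
    return affine(W_out, b_out, z)  # assumption: linear output layer


def squared_error(y_hat, y):
    """Differentiable criterion minimized by backpropagation in the conventional scheme."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))


# Toy network: 2 inputs, 1 hidden unit, 1 output.
layers = [([[0.0, 0.0]], [0.0]), ([[2.0]], [1.0])]
y_hat = forward([1.0, -1.0], layers)
loss = squared_error(y_hat, [1.5])
```

Because every operation here is differentiable in the weights, the loss can be minimized by backpropagation; the difficulty addressed later in the document arises when the criterion is replaced by a measure such as PESQ that has no such gradient.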

To limit the dimensionality of the input and output vectors, the source signal S_{ω,τ} and the noise N_{ω,τ} may also be compressed with a mel filter bank of about 64 dimensions. In that case, the mel filter bank compression is regarded as a matrix operation, and its inverse matrix or the like is used to map the DNN output back to the original frequency domain before the time-frequency mask is designed.

Y. Xu, J. Du, L. R. Dai and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks”, IEEE/ACM Trans. Audio, Speech and Language Processing, Vol.23, No.1, pp.7-19, 2015.

Conventionally, the evaluation values that can be used for error backpropagation have been limited to differentiable ones such as the squared error. However, depending on the application, the performance of sound source enhancement is evaluated not only with differentiable measures such as the squared error but also with non-differentiable ones such as PESQ (perceptual evaluation of speech quality) and STOI (short-time objective intelligibility measure) (Reference Non-Patent Document 1, Reference Non-Patent Document 2).
(Reference Non-Patent Document 1: ITU-T Recommendation P.862, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, 2001.)
(Reference Non-Patent Document 2: C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech”, IEEE Transactions on Audio, Speech and Language Processing, Vol.19, pp.2125-2136, 2011.)

Therefore, to train a DNN appropriately for a given application, a DNN training framework is needed that optimizes the DNN parameters using non-differentiable evaluation values such as PESQ.

Accordingly, an object of the present invention is to provide a learning technique applicable to generating a mask for masking an input sound, or a processing function for processing input data, using a variety of evaluation values including non-differentiable ones.

One aspect of the present invention includes: a mask generation unit that generates N masks G_N from input vectors x_N based on N input sounds (N is an integer from 1 to τ inclusive), on the basis of a posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}) that models the probability that a mask G_τ (τ ∈ {1,…,Τ}) is generated when an input vector x_τ (τ ∈ {1,…,Τ}) based on an input sound is given as input; a mask processing unit that uses the masks G_N to generate, from the N input sounds, N output sounds obtained by masking the N input sounds; a reward coefficient acquisition unit that obtains a reward coefficient of the masks G_N for the N output sounds; and an update unit that updates the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}) using the reward coefficient and the generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}), that the masks G_N are generated when the input vectors x_N are given as input. The reward coefficient is determined from an evaluation value of the output sounds and a confidence, which is the likelihood of the masks G_N generated when the input sounds were input.

Another aspect of the present invention includes: a processing function generation unit that generates N processing functions G_N from input vectors x_N based on N pieces of input data (N is an integer from 1 to τ inclusive), on the basis of a posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}) that models the probability that a processing function G_τ (τ ∈ {1,…,Τ}) is generated when an input vector x_τ (τ ∈ {1,…,Τ}) based on input data is given as input; a processing function application unit that uses the processing functions G_N to generate, from the N pieces of input data, N pieces of output data obtained by processing the input data with the processing functions; a reward coefficient acquisition unit that obtains a reward coefficient of the processing functions G_N for the N pieces of output data; and an update unit that updates the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}) using the reward coefficient and the generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1,…,Τ}), that the processing functions G_N are generated when the input vectors x_N are given as input. The reward coefficient is determined from an evaluation value of the output data and a confidence, which is the likelihood of the processing functions G_N generated when the input data were input.

According to the present invention, updating the posterior probability distribution using a variety of evaluation values, including non-differentiable ones, makes it possible to learn a posterior probability distribution for generating a mask for masking an input sound or a processing function for processing input data.

Block diagram showing an example of the configuration of the sound source enhancement device 900.
Block diagram showing an example of the configuration of the time-frequency mask generation unit 910 using a DNN.
Block diagram showing an example of the configuration of the sound source enhancement learning device 100.
Flowchart showing an example of the operation of the sound source enhancement learning device 100.
Block diagram showing an example of the configuration of the DNN parameter initial value generation unit 110.
Flowchart showing an example of the operation of the DNN parameter initial value generation unit 110.
Block diagram showing an example of the configuration of the DNN-RL parameter generation unit 120.
Flowchart showing an example of the operation of the DNN-RL parameter generation unit 120.
Block diagram showing an example of the configuration of the DNN-RL time-domain output signal generation unit 124.
Flowchart showing an example of the operation of the DNN-RL time-domain output signal generation unit 124.
Block diagram showing an example of the configuration of the sound source enhancement device 200.
Flowchart showing an example of the operation of the sound source enhancement device 200.
Block diagram showing an example of the configuration of the input sound mask processing learning device 300.
Flowchart showing an example of the operation of the input sound mask processing learning device 300.
Block diagram showing an example of the configuration of the input sound mask processing learning device 301.
Flowchart showing an example of the operation of the input sound mask processing learning device 301.
Block diagram showing an example of the configuration of the input data processing function learning device 400.
Flowchart showing an example of the operation of the input data processing function learning device 400.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and redundant description is omitted.

<Technical background>
Evaluation values such as PESQ and STOI cannot be differentiated in the way that the error between the estimate of the time-frequency mask (or its parameters) and the label data can (see Equation (12)). Therefore, instead of the conventional non-linear projection approach that directly estimates the time-frequency mask (or its parameters), the approach here estimates the posterior probability distribution (or its parameters) of the time-frequency mask that maximizes the evaluation value given the observed signal. The property that this posterior probability distribution should satisfy is written as an objective function, and the DNN (the DNN parameters Θ_Μ) is trained using this objective function.

Conventionally, the DNN outputs the time-frequency mask itself or its parameters, and is trained to minimize a differentiable evaluation value such as the squared error between label data prepared in advance and the DNN output (see FIG. 2 and Equation (12)). Here, in contrast, the DNN outputs the probability density function (or its parameters) of the time-frequency mask that maximizes the evaluation value given the observed signal. The DNN is then trained using a new objective function T_ar that maximizes the evaluation function R, which outputs a non-differentiable evaluation value. In other words, whereas the DNN was conventionally trained using Equation (12), here it is trained using Equation (26), described later.

The details are described below.
《Derivation of the objective function T_ar》
The evaluation values to be maximized in the embodiments of the present invention include values that can be computed from the speech enhancement output signal Ŝ_{ω,τ}, such as PESQ and STOI. They may also be values obtained from the output signal Ŝ_{ω,τ} by means other than computation, such as the result of a subjective evaluation like a MOS value, or a binary good/bad judgment. Furthermore, if sound source enhancement is to be optimized for speech recognition, for example, a binary value indicating whether the speech recognition result is correct may be used as the evaluation value.

In the embodiments of the present invention, sound source enhancement is performed by time-frequency masking, so the evaluation value can be regarded as a function of the time-frequency mask sequence G. That is,

holds.

Here, let R be the evaluation function that outputs the evaluation value, and let R(G) be the evaluation value to be maximized. The problem then reduces to finding the DNN parameters Θ_Μ that output the time-frequency mask maximizing the evaluation value R(G).

Conventionally, the DNN output the time-frequency mask itself or its parameters, like the vector ŷ_τ in Equation (6). Here, Μ(x_τ|Θ_Μ) is instead defined as the posterior probability that the time-frequency mask G_τ maximizes the evaluation value, as in the following equation.

The time-frequency mask G_τ is then obtained by the following maximum a posteriori estimation.

Since the posterior probability p(G_τ|x_τ,Θ_Μ) is a continuous probability distribution over the time-frequency mask G_τ, Equation (16) can be regarded as directly finding the G_τ that maximizes p(G_τ|x_τ,Θ_Μ), that is, as generating the time-frequency mask G_τ.

To realize sound source enhancement that maximizes the evaluation value R(G), the objective function T_ar is designed as the expected value of the evaluation value R(G), as follows.

where

and p(X) is the probability of obtaining the observed signal sequence X, and p(G|X,Θ_Μ) is the probability density function that, given the observed signal sequence X, the time-frequency mask sequence G maximizes the evaluation value.

Furthermore, in sound source enhancement, time-frequency masking does not affect the observed signal at the next time, and the time-frequency mask at time τ is designed independently of other times. Taking this into account, the probability density function p(G|X,Θ_Μ) in sound source enhancement can be written in the following simple form.
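Under the independence across time frames stated above, a factorized form consistent with the text is given below. It is a reconstruction from the surrounding definitions rather than a copy of the patent's equation; the later remark that p(G|X,Θ_Μ) "is a product of probabilities (see Equation (20))" suggests this is the equation cited there. In the LaTeX, \Theta_M and T stand for the document's Θ_Μ and Τ.

```latex
p(G \mid X, \Theta_M) = \prod_{\tau=1}^{T} p(G_\tau \mid x_\tau, \Theta_M)
\qquad\Longleftrightarrow\qquad
\ln p(G \mid X, \Theta_M) = \sum_{\tau=1}^{T} \ln p(G_\tau \mid x_\tau, \Theta_M)
```

The logarithmic form on the right is the one that appears in the gradient and in the weighted log-likelihood discussed afterwards.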

Therefore, the objective function T_ar can be written as follows.

To examine the properties of this objective function T_ar, the gradient with respect to Θ_Μ of the objective function T_ar appearing in Equation (21) is derived. The gradient of T_ar with respect to Θ_Μ can be computed as follows.

Here, the expectation in Equation (23) is replaced by the arithmetic mean over I episodes (I is an integer of 1 or more). Equation (23) can then be rewritten as follows.

Here, R_ew(i) = R(G⁽ⁱ⁾) p(G⁽ⁱ⁾|X⁽ⁱ⁾,Θ_Μ), and a superscript or subscript i or (i) indicates a variable of the i-th episode. Hereinafter, R_ew is called the reward coefficient.

Consider the meaning of the reward coefficient R_ew(i) qualitatively. The first term R(G⁽ⁱ⁾) relates to the evaluation value: it is positive if the generated time-frequency mask is evaluated well and negative if it is evaluated poorly, and thus represents the "evaluation of the time-frequency mask" that the system itself generated. The second term p(G⁽ⁱ⁾|X⁽ⁱ⁾,Θ_Μ) relates to the generation probability of the time-frequency mask: it represents the "confidence in the time-frequency mask", that is, how confidently the mask was generated under the current DNN parameters Θ_Μ. Because the reward coefficient R_ew(i) is the product of these two terms, it greatly increases the generation probability ln p(G⁽ⁱ⁾_τ|X⁽ⁱ⁾_τ,Θ_Μ) when a confidently generated time-frequency mask improved the evaluation value, and greatly decreases it when a confidently generated time-frequency mask lowered the evaluation value. When a time-frequency mask generated without confidence raises or lowers the evaluation value, the result may have occurred by chance, so the reward coefficient keeps the increase or decrease of ln p(G⁽ⁱ⁾_τ|X⁽ⁱ⁾_τ,Θ_Μ) small.

In summary, the objective function T_ar for generating a time-frequency mask that maximizes a non-differentiable evaluation value such as PESQ, STOI, or a MOS value is the arithmetic mean of the log-likelihood ln p(G⁽ⁱ⁾_τ|X⁽ⁱ⁾_τ,Θ_Μ) of the generated time-frequency masks, weighted by the evaluation value R(G⁽ⁱ⁾) and the confidence p(G⁽ⁱ⁾|X⁽ⁱ⁾,Θ_Μ).
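The summary above can be turned into a small numerical sketch. It assumes that the sample estimate has the form (1/I) Σ_i R_ew(i) Σ_τ ln p(G⁽ⁱ⁾_τ|x⁽ⁱ⁾_τ,Θ_Μ), which matches the verbal description but is not copied from Equation (26); the function names and the toy numbers are hypothetical.

```python
import math


def reward_coefficient(evaluation, frame_log_probs):
    """R_ew(i) = R(G) * p(G|X), where p(G|X) is the product of per-frame densities."""
    confidence = math.exp(sum(frame_log_probs))
    return evaluation * confidence


def objective_estimate(episodes):
    """episodes: list of (R(G), [ln p(G_tau | x_tau) for each frame]) pairs."""
    total = 0.0
    for evaluation, frame_log_probs in episodes:
        # In a gradient step this weight is held fixed: only the
        # log-likelihood is differentiated with respect to the parameters.
        weight = reward_coefficient(evaluation, frame_log_probs)
        total += weight * sum(frame_log_probs)
    return total / len(episodes)


confident = [math.log(0.5), math.log(0.5)]     # p(G|X) = 0.25
unconfident = [math.log(0.1), math.log(0.1)]   # p(G|X) = 0.01
w_good_confident = reward_coefficient(+1.0, confident)
w_bad_confident = reward_coefficient(-1.0, confident)
w_good_unconfident = reward_coefficient(+1.0, unconfident)
t_ar = objective_estimate([(+1.0, confident), (-1.0, confident)])
```

The weights reproduce the qualitative behaviour described in the text: a confidently generated mask receives a weight of large magnitude whose sign follows the evaluation, while an unconfident one is updated only slightly.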

Note that the second term p(G⁽ⁱ⁾_τ|X⁽ⁱ⁾_τ,Θ_Μ) of the reward coefficient R_ew(i) is not differentiated with respect to Θ_Μ.

The derivation of the objective function T_ar defined by Equation (26) has been discussed for non-differentiable evaluation values, but the discussion is not limited to them. That is, the objective function T_ar defined by Equation (26) and the reward coefficient R_ew(i) defined by Equation (26') can also be applied to differentiable evaluation values.

《Learning algorithm for the DNN parameters Θ_Μ》
The following describes an algorithm that uses the objective function T_ar of Equation (26) to learn the parameters Θ_Μ of a DNN whose output is the distribution parameters of the posterior probability p(G_τ|x_τ,Θ_Μ) that the time-frequency mask G_τ maximizes the evaluation value.

(Design of the distribution parameters of the DNN output p(G_τ|x_τ,Θ_Μ))
First, p(G_τ|x_τ,Θ_Μ) is expressed as a distribution that is differentiable with respect to the DNN parameters Θ_Μ, and the distribution parameters of p(G_τ|x_τ,Θ_Μ) are estimated and output by the neural network.

To this end, p(G_τ|x_τ,Θ_Μ) is modeled as a complex Gaussian distribution, which is easy to differentiate with respect to the DNN parameters Θ_Μ and numerically easy to handle.

Here, the small circle on the right-hand side of Equation (27) denotes the Hadamard product, and R and I on the right-hand side of Equation (29) denote the real and imaginary parts of a complex number, respectively.

The mean vector μ(x_τ) and the variance vector σ(x_τ), which are the distribution parameters of the complex Gaussian distribution p(G_τ|x_τ,Θ_Μ), are then taken as the output of the DNN.

Here, a sigmoid function is used as the activation function so that the mean vector μ(x_τ) can serve as an estimate of the time-frequency mask G_τ ∈ [0,1].
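A minimal sketch of such a distribution follows. It assumes an independent, circularly symmetric complex Gaussian per frequency bin, with density exp(-|g - μ|² / v) / (π v); the exact form of Equations (27) to (29), which refer to real and imaginary parts separately, is not reproduced here and may differ. The function names and values are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def complex_gaussian_log_density(G, mu, var):
    """ln p(G) for independent circular complex Gaussians, one per frequency bin."""
    total = 0.0
    for g, m, v in zip(G, mu, var):
        total += -math.log(math.pi * v) - abs(g - m) ** 2 / v
    return total


# Hypothetical pre-activation outputs of the network for two frequency bins.
pre_activation = [0.0, 2.0]
mu = [sigmoid(a) for a in pre_activation]   # means constrained to (0, 1)
var = [1.0, 1.0]                            # variances must be positive

log_p_at_mean = complex_gaussian_log_density(mu, mu, var)
log_p_shifted = complex_gaussian_log_density([mu[0] + 1j, mu[1]], mu, var)
```

For a Gaussian the density peaks at the mean, so the maximum a posteriori mask described earlier coincides with μ(x_τ), while the variance controls how sharply the log-likelihood penalizes masks away from it.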

Here, the posterior probability p(G_τ|x_τ,Θ_Μ) is modeled by a complex Gaussian distribution so that the DNN outputs the mean vector μ(x_τ) and the variance vector σ(x_τ); alternatively, the posterior probability p(G_τ|x_τ,Θ_Μ) itself may be the output of the DNN (see FIG. 2).

(Design of the evaluation function R)
The values of PESQ and STOI, which are typical evaluation values, vary not only with the performance of sound source enhancement but also with the SNR of the observed signal and the type of noise. Therefore, an evaluation value (hereinafter called the comparative reward) is computed by comparing two evaluation values: that of the output sound enhanced using the time-frequency mask obtained from the parameters Θ_Μ learned by the DNN described above (the DNN using the objective function T_ar of Equation (26)), and that of the output sound enhanced using the time-frequency mask obtained from the parameters Θ_Μ learned by a conventional DNN using the MMSE (minimum mean squared error) criterion (Non-Patent Document 1).

Hereinafter, for simplicity, the DNN trained using the objective function T_ar of Equation (26) is called DNN-RL, and the DNN trained using the MMSE criterion is called DNN-MMSE. Likewise, the output sound enhanced using the time-frequency mask obtained by DNN-RL is called the output sound obtained by DNN-RL, and the output sound enhanced using the time-frequency mask obtained by DNN-MMSE is called the output sound obtained by DNN-MMSE.

Training a DNN using the objective function T_ar of Equation (26) is called DNN-RL learning, and training a DNN using the MMSE criterion is called DNN-MMSE learning.

Let Z^RL be the evaluation value of the output sound obtained by DNN-RL, and let Z^MMSE be the evaluation value of the output sound obtained by DNN-MMSE. The comparative reward R(G), an evaluation value that compares these two, is obtained as follows.

Here, α (> 0) is the scaling coefficient of the comparative reward, and tanh is the hyperbolic tangent function used to clip the comparative reward.

This comparative reward R(G) is inspired by winning and losing a game. If Z^{RL} is greater than Z^{MMSE}, the evaluation value of the output sound obtained by DNN-RL is higher than that of the output sound obtained by DNN-MMSE, so the sound source enhancement performed to obtain Z^{RL} can be judged to have been correct (in this case, R(G) > 0). Conversely, if Z^{RL} is smaller than Z^{MMSE}, the evaluation value of the output sound obtained by DNN-RL is lower than that of the output sound obtained by DNN-MMSE, so the sound source enhancement performed to obtain Z^{RL} can be judged to have been wrong (in this case, R(G) < 0). By providing DNN-MMSE as a sound source enhancement means against which DNN-RL is compared, the influence on the evaluation value of factors other than sound source enhancement performance can be reduced. In addition, DNN parameters can be learned for sound source enhancement that achieves a higher evaluation value than sound source enhancement based on the MMSE criterion.
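The sign behaviour of the comparative reward can be illustrated with a minimal Python sketch. Equation (36) itself is not reproduced in this text, so the form R(G) = tanh(α(Z^{RL} − Z^{MMSE})) used here is an assumption inferred from the description of α, tanh, and the sign of R(G). The function name `comparative_reward` is hypothetical.

```python
import math


def comparative_reward(z_rl: float, z_mmse: float, alpha: float = 1.0) -> float:
    """Assumed form of Eq. (36): R(G) = tanh(alpha * (Z^RL - Z^MMSE)).

    z_rl   -- evaluation value (e.g. PESQ) of the DNN-RL output sound
    z_mmse -- evaluation value of the DNN-MMSE output sound
    alpha  -- positive scaling factor of the comparative reward
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # tanh clips the scaled score difference to the range [-1, 1].
    return math.tanh(alpha * (z_rl - z_mmse))


win = comparative_reward(2.5, 2.0)   # DNN-RL scored higher: positive reward
loss = comparative_reward(2.0, 2.5)  # DNN-RL scored lower: negative reward
```

Because the reward depends only on the difference between the two scores, a factor that shifts both scores equally (for example, a harder noise condition) cancels out in this sketch.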

The second term p(G|X, Θ_M) of the reward coefficient R_ew is a product of probabilities and therefore takes a very small value (see Equation (20)). To avoid underflow, the reward coefficient R_ew is calculated by the following equations.

Here, β and γ are coefficients for avoiding underflow of p(G|X).

When the evaluation value of the output sound obtained by DNN-RL is lower than that of the output sound obtained by DNN-MMSE (R(G) < 0), the time-frequency mask of DNN-MMSE is considered to yield a higher evaluation value than the time-frequency mask of DNN-RL. Therefore, the following processing is performed to increase the generation probability of the MMSE-based time-frequency mask.

<First embodiment>
Here, a sound source enhancement learning device configured based on the contents described in <Technical background> will be described.

Hereinafter, the sound source enhancement learning device 100 will be described with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the sound source enhancement learning device 100. FIG. 4 is a flowchart showing the operation of the sound source enhancement learning device 100. As shown in FIG. 3, the sound source enhancement learning device 100 includes a frequency domain signal generation unit 105, a DNN parameter initial value generation unit 110, a DNN-RL parameter generation unit 120, and a recording unit 190. The recording unit 190 is a component that records, as appropriate, information necessary for the processing of the sound source enhancement learning device 100.

The sound source enhancement learning device 100 is connected to a target sound learning data recording unit 910 and a noise learning data recording unit 920. The target sound learning data recording unit 910 and the noise learning data recording unit 920 hold target sounds and noise collected in advance as learning data. The target sound learning data and the noise learning data are time domain signals. For example, when speech is the target sound, the target sound learning data is utterance data recorded in an anechoic room or the like. It is desirable to collect about 5000 or more utterances, each about 8 seconds long. The noise learning data is noise recorded in the environment in which the device is expected to be used.

The various parameters used in each component of the sound source enhancement learning device 100 (for example, the parameters used for DNN-MMSE learning and DNN-RL learning) may be input from outside, like the target sound learning data and the noise learning data, or may be set in each component in advance.

The operation of the sound source enhancement learning device 100 will be described with reference to FIG. 4. The frequency domain signal generation unit 105 generates, from the target sound learning data and the noise learning data, a frequency domain target sound signal S_{ω,τ}, a frequency domain noise signal N_{ω,τ}, and a frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}, where Ω and Τ are integers of 1 or more determined by the target sound learning data and the noise learning data) (S105). Specifically, one item of target sound learning data (in the example above, utterance data of about 8 seconds) is first selected at random, and one item of noise learning data of the same length as the target sound learning data is selected at random. A time domain observation signal is then generated by superimposing the target sound learning data and the noise learning data at a random SNR (signal-to-noise ratio). The SNR range may be set to, for example, about -6 dB to 12 dB. Next, the frequency domain target sound signal S_{ω,τ}, the frequency domain noise signal N_{ω,τ}, and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) are generated from the target sound learning data, the noise learning data, and the time domain observation signal. A short-time Fourier transform or the like may be used to generate these frequency domain signals.
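The mixing step of S105 can be sketched as follows. This is a pure-Python illustration under stated assumptions: SNR is taken as the ratio of mean signal powers, the SNR is drawn uniformly from the example range, and a synthetic sine wave and Gaussian noise stand in for the recorded learning data. The helper name `mix_at_snr` is hypothetical, and the subsequent short-time Fourier transform is omitted.

```python
import math
import random


def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise power ratio equals `snr_db`, then add.

    Returns the time domain observation signal and the gain applied to the noise.
    """
    p_target = sum(s * s for s in target) / len(target)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Solve p_target / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_target / (p_noise * 10.0 ** (snr_db / 10.0)))
    observed = [s + gain * n for s, n in zip(target, noise)]
    return observed, gain


rng = random.Random(0)
target = [math.sin(0.05 * t) for t in range(1600)]   # stand-in for an utterance
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]   # stand-in for recorded noise
snr_db = rng.uniform(-6.0, 12.0)                     # assumed uniform draw
observed, gain = mix_at_snr(target, noise, snr_db)
```

Returning the gain lets the scaled noise be transformed separately, so that the frequency domain noise signal stays consistent with the observation signal.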

The DNN parameter initial value generation unit 110 generates an initial value Θ^{MMSE}_{ini} of the DNN-MMSE parameter and an initial value Θ^{RL}_{ini} of the DNN-RL parameter from the frequency domain target sound signal S_{ω,τ}, the frequency domain noise signal N_{ω,τ}, and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S110). The DNN-RL parameter generation unit 120 uses the DNN-MMSE parameter initial value Θ^{MMSE}_{ini} and the DNN-RL parameter initial value Θ^{RL}_{ini} generated in S110 to generate the DNN-RL parameter Θ^{RL} from the frequency domain target sound signal S_{ω,τ} and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S120).

The processing of S105 is executed as many times as needed for the processing of S110 and S120 (DNN-MMSE learning and DNN-RL learning). The executions of S105 needed for S120 may therefore take place between S110 and S120 in FIG. 4.

Hereinafter, the DNN parameter initial value generation unit 110 will be described with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the DNN parameter initial value generation unit 110. FIG. 6 is a flowchart showing the operation of the DNN parameter initial value generation unit 110. As shown in FIG. 5, the DNN parameter initial value generation unit 110 includes a DNN-MMSE parameter initial value generation unit 111 and a DNN-RL parameter initial value generation unit 112.

The operation of the DNN parameter initial value generation unit 110 will be described with reference to FIG. 6. The DNN-MMSE parameter initial value generation unit 111 generates the initial value Θ^{MMSE}_{ini} of the DNN-MMSE parameter from the frequency domain target sound signal S_{ω,τ}, the frequency domain noise signal N_{ω,τ}, and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S111). The method of Non-Patent Document 1, for example, can be used to generate the initial value Θ^{MMSE}_{ini}. Specifically, the input vector x_τ (τ∈{1,…,Τ}) of DNN-MMSE is first generated from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) by Equation (9). In addition, the time-frequency mask G^{IRM}_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) is generated from the frequency domain target sound signal S_{ω,τ} and the frequency domain noise signal N_{ω,τ} by the following equation.

This time-frequency mask G^{IRM}_{ω,τ} serves as the label data.
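The label computation can be illustrated with a short Python sketch. The equation defining G^{IRM}_{ω,τ} is not reproduced in this text, so the magnitude-ratio form |S|/(|S|+|N|) below is an assumption: it is one common definition of an ideal ratio mask and may differ from the equation in the original. The function name `ideal_ratio_mask` and the small constant `eps` are hypothetical.

```python
def ideal_ratio_mask(S, N, eps=1e-12):
    """Assumed ideal ratio mask: |S| / (|S| + |N|) per time-frequency bin.

    S, N -- nested lists of complex STFT coefficients with identical shape
            (frequency bins by time frames).
    eps  -- small constant that avoids division by zero in silent bins.
    """
    return [
        [abs(s) / (abs(s) + abs(n) + eps) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(S, N)
    ]


# One frequency bin, three frames: target only, noise only, equal magnitudes.
S = [[3 + 4j, 0j, 1j]]
N = [[0j, 1 + 0j, 1 + 0j]]
G_irm = ideal_ratio_mask(S, N)
```

Under this assumed definition the mask lies in [0, 1], approaching 1 where the target dominates and 0 where the noise dominates.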

Next, DNN-MMSE is trained using Equations (42) to (44).

Specifically, μ(x_τ) (τ∈{1,…,Τ}), the output of DNN-MMSE, is first generated for the input vector x_τ of DNN-MMSE. Next, with G_τ = μ(x_τ), the DNN-MMSE parameter Θ_M is learned by error backpropagation so as to minimize the squared error between the label data G^{IRM}_{ω,τ} and G_{ω,τ} (= μ_{ω,τ}). Equations (42) to (44), which define the structure of DNN-MMSE, are equal to Equations (32) to (35), which define the structure of DNN-RL, with the estimation of the variance vector in Equation (33) removed.

An initialization method such as discriminative pre-training (Reference Non-Patent Document 3) can be used for this learning. An algorithm such as Adam (Reference Non-Patent Document 4) can be used to implement error backpropagation.
(Reference Non-Patent Document 3: F. Seide, G. Li, X. Chen and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription”, In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 24-29, 2011.)
(Reference Non-Patent Document 4: D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, In Proc. of the 3rd International Conference for Learning Representations (ICLR), pp. 1-15, 2015.)

The DNN-MMSE parameter Θ_M at the end of learning is output as the DNN-MMSE parameter initial value Θ^{MMSE}_{ini}. Because the DNN-MMSE parameter initial value Θ^{MMSE}_{ini} is used in the processing of the DNN-RL parameter generation unit 120, it is recorded in the recording unit 190.

The DNN-RL parameter initial value generation unit 112 generates the initial value Θ^{RL}_{ini} of the DNN-RL parameter from the frequency domain target sound signal S_{ω,τ}, the frequency domain noise signal N_{ω,τ}, and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S112). Specifically, the input vector x_τ (τ∈{1,…,Τ}) of DNN-RL is first generated by Equation (9) from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105.

Next, DNN-RL is trained using Equations (32) to (35). Specifically, the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}), the outputs of DNN-RL, are first generated for the input vector x_τ of DNN-RL. The DNN-RL parameter Θ_M is then learned by error backpropagation so as to maximize the likelihood function, as in Equation (45).

Here, the following definition applies.

As before, Adam can be used to implement error backpropagation.

The DNN-RL parameter Θ_M at the end of learning is output as the DNN-RL parameter initial value Θ^{RL}_{ini}. Because the DNN-RL parameter initial value Θ^{RL}_{ini} is used in the processing of the DNN-RL parameter generation unit 120, it is recorded in the recording unit 190.

Hereinafter, the DNN-RL parameter generation unit 120 will be described with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the DNN-RL parameter generation unit 120. FIG. 8 is a flowchart showing the operation of the DNN-RL parameter generation unit 120. As shown in FIG. 7, the DNN-RL parameter generation unit 120 includes a DNN-RL time domain output signal generation unit 124, a DNN-MMSE time domain output signal generation unit 125, a reward coefficient calculation unit 126, a DNN-RL parameter optimization unit 127, and a convergence condition determination unit 128.

The operation of the DNN-RL parameter generation unit 120 will be described with reference to FIG. 8. The DNN-RL time domain output signal generation unit 124 generates a DNN-RL time domain output signal from the frequency domain target sound signal S_{ω,τ} and the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S124).

Hereinafter, the DNN-RL time domain output signal generation unit 124 will be described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the DNN-RL time domain output signal generation unit 124. FIG. 10 is a flowchart showing the operation of the DNN-RL time domain output signal generation unit 124. As shown in FIG. 9, the DNN-RL time domain output signal generation unit 124 includes a posterior probability distribution parameter generation unit 121, a time-frequency mask generation unit 122, and a time-frequency mask processing unit 123.

The posterior probability distribution parameter generation unit 121, the time-frequency mask generation unit 122, and the time-frequency mask processing unit 123 correspond, respectively, to the non-linear mapping unit 912, the mask calculation unit 913, and the filtering unit 920 of the conventional technique.

The operation of the DNN-RL time domain output signal generation unit 124 will be described with reference to FIG. 10. The posterior probability distribution parameter generation unit 121 generates the posterior probability distribution parameters, namely the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}), from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S121). Specifically, the input vector x_τ (τ∈{1,…,Τ}) of DNN-RL is first generated from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) by Equation (9).

Next, using the current DNN-RL parameter Θ_M, the posterior probability distribution parameters, namely the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}), are generated from the input vector x_τ (τ∈{1,…,Τ}) by Equations (32) to (35). The DNN-RL parameter used the first time the posterior probability distribution parameter generation unit 121 executes is the DNN-RL parameter initial value Θ^{RL}_{ini}.

The time-frequency mask generation unit 122 generates the time-frequency mask G_τ (τ∈{1,…,Τ}) from the posterior probability distribution parameters generated in S121, namely the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}) (S122). Specifically, the time-frequency mask G_τ (τ∈{1,…,Τ}) is generated using the following ε-greedy algorithm.

Here, the symbol ~ in Equation (50) denotes drawing a random number from the probability distribution on the right-hand side. The probability ε (0 < ε < 1) may be set to, for example, about 0.05.
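The ε-greedy selection can be sketched in Python as follows. The equations of the algorithm are not reproduced in this text, so the details are assumptions: with probability 1 − ε the mean vector is used as the mask, and with probability ε the mask is drawn from a Gaussian around the mean; the decision is made once per frame; and σ is treated here as a per-bin standard deviation. The function name `epsilon_greedy_mask` is hypothetical.

```python
import random


def epsilon_greedy_mask(mu, sigma, epsilon=0.05, rng=random):
    """Return a mask for one time frame by epsilon-greedy selection.

    mu      -- mean vector mu(x_tau), one value per frequency bin
    sigma   -- spread per bin (assumed here to be a standard deviation)
    epsilon -- exploration probability
    rng     -- random source exposing random() and gauss()
    """
    if rng.random() < epsilon:
        # Explore: sample each bin from a Gaussian centred on the mean.
        return [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    # Exploit: use the mean vector itself as the mask.
    return list(mu)


mu = [0.2, 0.8, 0.5]
sigma = [0.1, 0.1, 0.1]
mask = epsilon_greedy_mask(mu, sigma, epsilon=0.05, rng=random.Random(0))
```

Setting `epsilon=0.0` in this sketch always returns the mean vector, which corresponds to using G_τ = μ(x_τ) directly.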

Alternatively, G_τ = μ(x_τ) (τ∈{1,…,Τ}) may simply be used.

The time-frequency mask processing unit 123 uses the time-frequency mask G_τ (τ∈{1,…,Τ}) generated in S122 to generate the DNN-RL time domain output signal from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) (S123). Specifically, using the time-frequency mask G_τ, the DNN-RL frequency domain output signal Ŝ_τ = (G_{1,τ}X_{1,τ}, …, G_{Ω,τ}X_{Ω,τ}) (τ∈{1,…,Τ}) is generated from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) by Equation (2), and is converted into a time domain waveform using an inverse Fourier transform or the like to produce the DNN-RL time domain output signal.

The DNN-MMSE time domain output signal generation unit 125 generates a DNN-MMSE time domain output signal from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S105 (S125). Specifically, the input vector x_τ (τ∈{1,…,Τ}) of DNN-MMSE is first generated from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) by Equation (9), and the mean vector μ(x_τ) (τ∈{1,…,Τ}), the output of DNN-MMSE, is generated by Equations (42) to (44) using the DNN-MMSE parameter initial value Θ^{MMSE}_{ini}. Next, with the time-frequency mask set to G_τ = μ(x_τ), the DNN-MMSE frequency domain output signal Ŝ_τ = (G_{1,τ}X_{1,τ}, …, G_{Ω,τ}X_{Ω,τ}) is generated from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) by Equation (2), and is converted into a time domain waveform using an inverse Fourier transform or the like to produce the DNN-MMSE time domain output signal.

The reward coefficient calculation unit 126 calculates the reward coefficient of the time-frequency mask G_τ (τ∈{1,…,Τ}) generated in S122 from the DNN-RL time domain output signal generated in S124 and the DNN-MMSE time domain output signal generated in S125 (S126). Specifically, it calculates the evaluation value Z^{RL} of the DNN-RL time domain output signal and the evaluation value Z^{MMSE} of the DNN-MMSE time domain output signal, calculates the comparative reward using Equation (36), and calculates the reward coefficient using Equations (37) and (38). The parameters used to calculate the reward coefficient are preferably tuned according to the evaluation value used to calculate the comparative reward. For example, when PESQ is used as the evaluation value, they can be set to about α = 1.0, β = 10.0, and γ = 0.01.

The reward coefficient is calculated for I pairs of target sound learning data and noise learning data; that is, the processing of S124 to S126 is repeated I times. I may be set to about 5.

The DNN-RL parameter optimization unit 127 updates the DNN-RL parameter Θ_M so as to maximize the value of the objective function T_ar of Equation (26) (S127). The value of the objective function T_ar of Equation (26) can be obtained, using Equations (27) to (31), from the input vector x_τ generated in the course of S121, the mean vector μ(x_τ) and the variance vector σ(x_τ) generated in S121, and the reward coefficient calculated in S126. The symbols (i) and i that appear in the objective function T_ar of Equation (26) are indices denoting the repetition number. As in the DNN-RL parameter initial value generation unit 112, the DNN-RL parameter Θ_M is updated by error backpropagation so as to optimize it. Adam may be used for error backpropagation.

The convergence condition determination unit 128 evaluates a convergence condition set in advance as the condition for ending learning. If the convergence condition is satisfied, the processing ends; if it is not satisfied, the processing of S124 to S127 is repeated (S128). The DNN-RL parameter Θ_M at the end of learning is output as the DNN-RL parameter Θ^{RL}. As the convergence condition, for example, whether the number of executions of S124 to S127 has reached a predetermined number can be used. In that case, the predetermined number can be set to about 100,000.
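The control flow of S124 to S128 can be summarized with a Python skeleton. This is only a sketch of the loop structure: every callable passed in (`sample_batch`, `enhance_rl`, `enhance_mmse`, `evaluate`, `reward_coefficient`, `update`) is a hypothetical placeholder for the corresponding processing step, and the stubs at the bottom exist only so the skeleton runs. It also assumes that the evaluation function takes the enhanced signal together with the target sound.

```python
def train_dnn_rl(theta_init, sample_batch, enhance_rl, enhance_mmse,
                 evaluate, reward_coefficient, update,
                 max_iterations=100_000, episodes_per_update=5):
    """Loop skeleton: repeat S124 to S127 until the iteration budget is used up."""
    theta = theta_init
    for _ in range(max_iterations):            # S128: fixed-count convergence condition
        episodes = []
        for _ in range(episodes_per_update):   # I pairs of target sound and noise data
            x, s = sample_batch()              # S105: observation and target signals
            y_rl, mask = enhance_rl(theta, x)  # S124: DNN-RL output and the mask used
            y_mmse = enhance_mmse(x)           # S125: DNN-MMSE output (baseline)
            z_rl, z_mmse = evaluate(y_rl, s), evaluate(y_mmse, s)
            episodes.append((x, mask, reward_coefficient(z_rl, z_mmse)))  # S126
        theta = update(theta, episodes)        # S127: one optimizer step
    return theta


# Stub callables used only to exercise the control flow.
theta = train_dnn_rl(
    0,
    sample_batch=lambda: ("x", "s"),
    enhance_rl=lambda theta, x: ("y_rl", "mask"),
    enhance_mmse=lambda x: "y_mmse",
    evaluate=lambda y, s: 1.0,
    reward_coefficient=lambda z_rl, z_mmse: z_rl - z_mmse,
    update=lambda theta, episodes: theta + len(episodes),
    max_iterations=3,
    episodes_per_update=5,
)
```

With the stub `update`, the returned value simply counts the processed episodes, which makes the nesting of the two loops easy to check.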

According to the invention of this embodiment, a DNN for generating a time-frequency mask for sound source enhancement of an input sound can be trained by optimizing the DNN parameters using a variety of evaluation values, including non-differentiable ones. For example, to optimize sound source enhancement for speech recognition, the objective function can be constructed with a binary evaluation value indicating whether the speech recognition result is correct, which allows the DNN parameters to be optimized in a form suited to sound source enhancement for speech recognition.

<Second embodiment>
Here, a sound source enhancement device that uses the DNN parameters generated by the sound source enhancement learning device of the first embodiment will be described.

Hereinafter, the sound source enhancement device 200 will be described with reference to FIGS. 11 and 12. FIG. 11 is a block diagram showing the configuration of the sound source enhancement device 200. FIG. 12 is a flowchart showing the operation of the sound source enhancement device 200. As shown in FIG. 11, the sound source enhancement device 200 includes a frequency domain observation signal generation unit 210, a posterior probability distribution parameter generation unit 121, a time-frequency mask generation unit 122, a time-frequency mask processing unit 123, and a recording unit 290. The recording unit 290 is a component that records, as appropriate, information necessary for the processing of the sound source enhancement device 200. For example, it records the DNN-RL parameter Θ^{RL} generated by the sound source enhancement learning device 100.

The operation of the sound source enhancement device 200 will be described with reference to FIG. 12. The frequency domain observation signal generation unit 210 generates a frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}, where Ω and Τ are integers of 1 or more determined by the time domain observation signal) from a time domain observation signal (S210). For example, a time domain observation signal picked up by a microphone is converted into the frequency domain using a short-time Fourier transform to generate the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}). The posterior probability distribution parameter generation unit 121 generates, as the output of DNN-RL, the posterior probability distribution parameters, namely the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}), from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) generated in S210 (S121). The DNN-RL parameter Θ^{RL} is used for this. The time-frequency mask generation unit 122 generates the time-frequency mask G_τ (τ∈{1,…,Τ}) from the posterior probability distribution parameters generated in S121, namely the mean vector μ(x_τ) and the variance vector σ(x_τ) (τ∈{1,…,Τ}) (S122). The time-frequency mask processing unit 123 uses the time-frequency mask G_τ (τ∈{1,…,Τ}) generated in S122 to generate a time domain output signal from the frequency domain observation signal X_{ω,τ} (ω∈{1,…,Ω}, τ∈{1,…,Τ}) (S123).

According to the invention of this embodiment, sound source enhancement can be performed with a time-frequency mask generated by a DNN whose parameters have been optimized using a variety of evaluation values, including non-differentiable ones. For example, sound source enhancement becomes possible using DNN parameters optimized in a form suited to sound source enhancement for speech recognition. In addition, by adopting PESQ, which correlates highly with subjective sound quality evaluation, as the evaluation value, sound source enhancement becomes possible using DNN parameters generated under a criterion (objective function) suited to sound information processing technology aimed at sound quality evaluation.

<Third embodiment>
The first embodiment described DNN-RL learning for sound source enhancement. However, the framework described in <Technical background>, that is, the framework that formulates the learning (optimization) of the DNN-RL parameter Θ_M with a DNN-RL whose output is the posterior probability distribution p(G_τ|x_τ, Θ_M) as in Equation (15), can also be applied to mask processing (filtering) of sound in general.

Furthermore, the learning treated in the first embodiment is not limited to DNNs and can also be applied to more general neural networks.

Therefore, this section describes an embodiment of learning with a general neural network that is not limited to sound source enhancement. In the following, a neural network is denoted NN.

The input sound mask processing learning device 300 will be described below with reference to FIGS. 13 and 14. FIG. 13 is a block diagram showing the configuration of the input sound mask processing learning device 300. FIG. 14 is a flowchart showing the operation of the input sound mask processing learning device 300. As shown in FIG. 13, the input sound mask processing learning device 300 includes an input vector generation unit 305, a posterior probability distribution generation unit 310, a mask generation unit 320, a mask processing unit 330, a reward coefficient calculation unit 360, a parameter optimization unit 370, a convergence condition determination unit 380, and a recording unit 390. The recording unit 390 is a component that records, as appropriate, information needed for the processing of the input sound mask processing learning device 300.

The input sound mask processing learning device 300 is connected to the input sound learning data recording unit 930. The input sound learning data recording unit 930 holds, as learning data, previously collected input sounds that are to be masked.

The various parameters used by the components of the input sound mask processing learning device 300 (for example, parameters used for NN learning) may be input from outside in the same way as the input sound, or may be set in each component in advance.

It is also assumed that mask processing is independent for each input sound and does not affect the processing of other input sounds, and that the mask for each input sound is designed independently of the masks for other input sounds.

The operation of the input sound mask processing learning device 300 will be described with reference to FIG. 14. The input vector generation unit 305 generates, from the input sound, the input vectors x_τ for the NN (τ∈{1,…,Τ}, where Τ is an integer of 1 or more determined by the input sound) (S305). Using the NN parameter Θ_Μ, the posterior probability distribution generation unit 310 generates, from the input vectors x_τ (τ∈{1,…,Τ}) generated in S305, the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}), which is the output of the NN and gives the probability that the mask G_τ is generated when the input vector x_τ is input (S310). Here, the posterior probability distribution p(G_τ|x_τ,Θ_Μ) is expressed as in Equation (15).
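For illustration, the logarithm of such a posterior can be evaluated as in the following Python sketch. Equation (15) is not reproduced in this section, so the sketch assumes, in line with the mean vector μ(x_τ) and variance vector σ(x_τ) used in the earlier embodiments, that p(G_τ|x_τ,Θ_Μ) is a Gaussian with independent components; the function name `gaussian_log_prob` is hypothetical.

```python
import math


def gaussian_log_prob(g, mean, var):
    """Log-density of a mask vector under a diagonal Gaussian (assumed form).

    g:    candidate mask G_tau, one value per frequency bin.
    mean: mean vector output by the NN for this frame.
    var:  variance vector output by the NN for this frame (all values > 0).
    """
    if not (len(g) == len(mean) == len(var)):
        raise ValueError("g, mean and var must have the same length")
    total = 0.0
    for g_w, m_w, v_w in zip(g, mean, var):
        if v_w <= 0.0:
            raise ValueError("variances must be positive")
        # Per-bin Gaussian log-density; bins are treated as independent.
        total += -0.5 * (math.log(2.0 * math.pi * v_w) + (g_w - m_w) ** 2 / v_w)
    return total
```

Working with log-densities rather than densities keeps later products over τ numerically manageable, since they become sums.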

The NN parameter Θ_Μ used the first time the posterior probability distribution generation unit 310 performs its processing is assumed to be given in advance, for example by being recorded in the recording unit 390.

The mask generation unit 320 generates, from the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}) generated in S310, the mask G_τ (τ∈{1,…,Τ}) used for mask processing of the input vector x_τ (S320). Specifically, the mask G_τ is obtained by Equation (16).

The mask processing unit 330 uses the mask G_τ (τ∈{1,…,Τ}) generated in S320 to generate an output sound from the input vectors x_τ (τ∈{1,…,Τ}) (S330). Specifically, processing corresponding to the content of the mask G_τ generated in S320 is applied to the input vector x_τ, and the output sound is generated.

The posterior probability distribution generation unit 310, the mask generation unit 320, and the mask processing unit 330 are collectively referred to as the output sound generation unit 340. The output sound generation unit 340 corresponds to the DNN-RL time-domain output signal generation unit 124 of the first embodiment and generates an output sound from the input vectors x_τ (τ∈{1,…,Τ}).

The reward coefficient calculation unit 360 calculates the reward coefficient of the mask G_τ (τ∈{1,…,Τ}) from the output sound generated in S330 (S360). Specifically, under the assumptions on mask processing and input sounds, the reward coefficient R_ew is calculated by the following equation (see Equations (26') and (20)).

R(G) is the evaluation value of the output sound generated in S330. Π_τ p(G_τ|x_τ,Θ_Μ) is the product of the generation probabilities p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}), each of which is the probability that the mask G_τ is generated when the input vector x_τ is input. It therefore represents the certainty factor, that is, how certain the masks G_τ (τ∈{1,…,Τ}) generated for the input sound are.
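The relationship between the evaluation value and the certainty factor can be sketched in Python as follows. The equation for R_ew is not reproduced in this section, so the sketch assumes the form stated in the claims, namely that the reward coefficient is the product of the evaluation value R(G) and the certainty factor Π_τ p(G_τ|x_τ,Θ_Μ). The function name `reward_coefficient` and the use of per-frame log-probabilities as input are hypothetical choices for this example.

```python
import math


def reward_coefficient(evaluation_value, log_probs):
    """Reward coefficient as evaluation value times certainty factor (assumed form).

    evaluation_value: R(G), the score of the output sound; it may come from
                      any scoring procedure, differentiable or not.
    log_probs:        ln p(G_tau | x_tau, Theta) for each frame tau.
    """
    # The certainty factor is the product over frames, computed as exp of a sum
    # of logarithms to avoid multiplying many small numbers directly.
    certainty = math.exp(sum(log_probs))
    return evaluation_value * certainty
```

Under this assumption a well-scoring output produced by confidently generated masks yields a large positive coefficient, while a poorly scoring output (negative R(G)) yields a negative one.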

The evaluation value R(G) need not be differentiable with respect to the NN parameter Θ_Μ.

The reward coefficient is calculated for I input sounds (I is an integer of 1 or more). That is, the processing from S305 to S360 is repeated I times.

The parameter optimization unit 370 updates the NN parameter Θ_Μ so as to maximize the value of the objective function T_ar of Equation (26) (S370).

Here, the symbols i and (i) are variables denoting the i-th episode and serve as an index of the repetition count.

The objective function T_ar of Equation (26) is a function of the parameter Θ_Μ defined using the reward coefficient and the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}). Specifically, it is the product of the reward coefficient and an expression written in terms of the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}) (here, specifically, Σ_τ ln p(G_τ|x_τ,Θ_Μ)).
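A minimal Python sketch of such an objective is given below. Equation (26) is not reproduced in this section; the sketch follows the description above for a single episode (reward coefficient times Σ_τ ln p(G_τ|x_τ,Θ_Μ)) and assumes, purely for illustration, that the I episodes are combined by averaging. The function name `objective` is hypothetical.

```python
def objective(reward_coefficients, log_probs_per_episode):
    """Average over episodes of (reward coefficient) x (sum of log-probabilities).

    reward_coefficients:   one coefficient per episode i.
    log_probs_per_episode: for each episode i, the list of
                           ln p(G_tau | x_tau, Theta) over frames tau.
    Averaging over episodes is an assumption made for this sketch.
    """
    if len(reward_coefficients) != len(log_probs_per_episode):
        raise ValueError("one list of log-probabilities is needed per episode")
    if not reward_coefficients:
        raise ValueError("at least one episode is required")
    total = 0.0
    for coefficient, log_probs in zip(reward_coefficients, log_probs_per_episode):
        # Each episode contributes its coefficient times its summed log-probability.
        total += coefficient * sum(log_probs)
    return total / len(reward_coefficients)
```

In an actual learner the log-probabilities would be functions of Θ_Μ and the parameter update would follow the gradient of this quantity; the evaluation values themselves enter only through the coefficients.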

The expression written in terms of the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}) is chosen to have the following properties: when the evaluation value R(G) of the output sound is positive, it varies so that its value increases; when the evaluation value R(G) of the output sound is negative, it varies so that its value decreases; and its variation when the certainty factor is relatively low is smaller than its variation when the certainty factor is relatively high.

The convergence condition determination unit 380 checks a convergence condition set in advance as the condition for ending learning. If the convergence condition is satisfied, processing ends; if it is not satisfied, the processing of S305 to S370 is repeated (S380). The NN parameter Θ_Μ at the end of learning is output as the NN parameter Θ^NN. As the convergence condition, for example, whether the number of executions of the processing of S305 to S370 has reached a predetermined number can be used.
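The overall control flow of S305 to S380 can be sketched in Python as follows, using the iteration-count convergence condition mentioned above. The names `train`, `run_episode`, and `update` are hypothetical placeholders: `run_episode` stands for S305 to S360 applied to one input sound, and `update` stands for the parameter update of S370. Neither is specified by this sketch.

```python
def train(theta, run_episode, update, num_episodes, max_iterations):
    """Skeleton of the learning loop with a fixed iteration budget.

    theta:          initial NN parameter (any object).
    run_episode:    callable(theta) -> episode result (stands for S305-S360).
    update:         callable(theta, episodes) -> new theta (stands for S370).
    num_episodes:   I, the number of input sounds processed per update.
    max_iterations: the predetermined count used as the convergence condition.
    """
    if num_episodes < 1:
        raise ValueError("num_episodes must be at least 1")
    for _ in range(max_iterations):
        # Collect the results for I input sounds before each update.
        episodes = [run_episode(theta) for _ in range(num_episodes)]
        theta = update(theta, episodes)
    # The parameter held when the loop ends is the learned parameter.
    return theta
```

Other convergence conditions, such as a threshold on the change of the objective, would replace the fixed `range(max_iterations)` loop with a conditional exit.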

According to the invention of this embodiment, optimizing the NN parameters using a variety of evaluation values, including non-differentiable ones, makes it possible to train an NN that generates a mask for masking an input sound.

<Fourth Embodiment>
In the first embodiment, the reward coefficient was calculated based on a comparative reward that also uses the evaluation value of the DNN-MMSE time-domain output signal obtained by time-frequency mask processing with the DNN-MMSE parameter Θ^MMSE_ini.

Therefore, this section describes an embodiment in which the reward coefficient is calculated using a comparative reward.

The input sound mask processing learning device 301 will be described below with reference to FIGS. 15 and 16. FIG. 15 is a block diagram showing the configuration of the input sound mask processing learning device 301. FIG. 16 is a flowchart showing the operation of the input sound mask processing learning device 301. As can be seen from FIG. 15, the input sound mask processing learning device 301 differs from the input sound mask processing learning device 300 only in that it further includes a comparative output sound generation unit 350 and includes a reward coefficient calculation unit 361 in place of the reward coefficient calculation unit 360. As can be seen from FIG. 16, the operation of the input sound mask processing learning device 301 differs from that of the input sound mask processing learning device 300 only in that S350 and S361 are performed in place of S360.

The processing of S350 and S361 is described below. The comparative output sound generation unit 350 generates a comparative output sound from the input vectors x_τ (τ∈{1,…,Τ}) generated in S305 (S350). Specifically, for the input vector x_τ, the mask G_τ is first generated as the NN output y^_τ using equations corresponding to Equations (6) to (8) for the DNN case (that is, the equations for computing the output of the neural network). The NN parameters used the first time the comparative output sound generation unit 350 performs its processing are assumed to be given in advance, for example by being recorded in the recording unit 390.

Next, the comparative output sound is generated from the input vectors x_τ (τ∈{1,…,Τ}) using the masks G_τ (τ∈{1,…,Τ}). Specifically, processing corresponding to the content of the generated mask G_τ is applied to the input vector x_τ, and the comparative output sound is generated.

The reward coefficient calculation unit 361 calculates the reward coefficient of the mask G_τ (τ∈{1,…,Τ}) generated in S320 from the output sound generated in S330 and the comparative output sound generated in S350 (S361). Specifically, the evaluation value of the output sound and the evaluation value of the comparative output sound are calculated, the comparative reward is calculated using Equation (36), and the reward coefficient is calculated using Equation (26″).
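The idea of S361 can be sketched in Python as follows. Equations (36) and (26″) are not reproduced in this section, so the sketch assumes the simplest form consistent with the claims: the comparative reward is the difference between the two evaluation values, and the reward coefficient is that difference multiplied by the certainty factor. The function names `comparative_reward` and `comparative_reward_coefficient` are hypothetical.

```python
import math


def comparative_reward(eval_output, eval_comparison):
    """Difference between the output sound's score and the comparison's score."""
    return eval_output - eval_comparison


def comparative_reward_coefficient(eval_output, eval_comparison, log_probs):
    """Comparative reward scaled by the certainty factor (assumed form).

    log_probs: ln p(G_tau | x_tau, Theta) for each frame tau of the episode.
    """
    certainty = math.exp(sum(log_probs))
    return comparative_reward(eval_output, eval_comparison) * certainty
```

With a comparison sound as the baseline, the sign of the coefficient indicates whether the sampled masks did better or worse than the comparison processing, even when every raw evaluation value is positive.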

(Modification)
The third and fourth embodiments described NN learning for processing of an input sound with a mask (filter). This section describes an example in which the framework described in <Technical Background> is applied, more generally, to NN learning for processing of input data with a predetermined processing function.

The input data processing function learning device 400 will be described below with reference to FIGS. 17 and 18. FIG. 17 is a block diagram showing the configuration of the input data processing function learning device 400. FIG. 18 is a flowchart showing the operation of the input data processing function learning device 400. As shown in FIG. 17, the input data processing function learning device 400 includes an input vector generation unit 405, a posterior probability distribution generation unit 410, a processing function generation unit 420, a processing function application unit 430, a reward coefficient calculation unit 460, a parameter optimization unit 470, a convergence condition determination unit 480, and a recording unit 490.

The input data processing function learning device 400 is connected to the input data recording unit 940. The input data recording unit 940 holds the input data to be processed by a predetermined processing function.

The various parameters used by the components of the input data processing function learning device 400 (for example, parameters used for NN learning) may be input from outside in the same way as the input data, or may be set in each component in advance.

It is also assumed that processing by the processing function is independent for each item of input data and does not affect the processing of other input data, and that the processing function for each item of input data is designed independently of those for other input data.

The operation of the input data processing function learning device 400 will be described with reference to FIG. 18. The input vector generation unit 405 generates, from the input data, the input vectors x_τ for the NN (τ∈{1,…,Τ}, where Τ is an integer of 1 or more determined by the input data) (S405). The posterior probability distribution generation unit 410 generates, from the input vectors x_τ (τ∈{1,…,Τ}) generated in S405, the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}), which is the output of the NN and gives the probability that the processing function G_τ is generated when the input vector x_τ is input (S410). The processing function generation unit 420 generates, from the posterior probability distribution p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}) generated in S410, the processing function G_τ (τ∈{1,…,Τ}) used to process the input vector x_τ (S420). The processing function application unit 430 uses the processing function G_τ (τ∈{1,…,Τ}) generated in S420 to generate output data from the input vectors x_τ (τ∈{1,…,Τ}) (S430). The reward coefficient calculation unit 460 calculates the reward coefficient of the processing function G_τ (τ∈{1,…,Τ}) from the output data generated in S430 (S460). The reward coefficient is calculated for I items of input data (I is an integer of 1 or more); that is, the processing from S405 to S460 is repeated I times. The parameter optimization unit 470 updates the NN parameter Θ_Μ so as to maximize the value of the objective function T_ar of Equation (26) (S470). The convergence condition determination unit 480 checks a convergence condition set in advance as the condition for ending learning. If the convergence condition is satisfied, processing ends; if it is not satisfied, the processing of S405 to S470 is repeated (S480). The NN parameter Θ_Μ at the end of learning is output as the NN parameter Θ^NN.

In other words, the processing of S405 to S480 may be the same as the processing of S305 to S380.

According to the invention of this embodiment, optimizing the NN parameters using a variety of evaluation values, including non-differentiable ones, makes it possible to train an NN that generates a processing function for processing input data.

<Supplementary Notes>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit; it may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the programs needed to realize the functions described above and the data needed for the processing of those programs (storage is not limited to the external storage device; for example, the programs may be stored in ROM, which is a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data needed for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processing described in the above embodiments need not be executed in time series in the order described; it may be executed in parallel or individually, depending on the processing capability of the device that executes it or as needed.

As already described, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred from the server computer to the computer, it may sequentially execute processing according to the program received. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

An input sound mask processing learning device comprising:
a mask generation unit that generates N masks G_N from input vectors x_N based on N input sounds (N is an integer between 1 and τ), on the basis of a posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) that models the generation probability that a mask G_τ (τ∈{1,…,Τ}) is generated when an input vector x_τ (τ∈{1,…,Τ}) based on an input sound is input;
a mask processing unit that uses the masks G_N to generate, from the N input sounds, N output sounds obtained by masking the N input sounds;
a reward coefficient acquisition unit that obtains a reward coefficient of the masks G_N for the N output sounds; and
an updating unit that updates the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) using the reward coefficient and a generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}), that the masks G_N are generated when the input vectors x_N are input,
wherein the reward coefficient is determined from an evaluation value of the output sound and a certainty factor that is the certainty of the masks G_N generated when the input sound is input.

The input sound mask processing learning device according to claim 1, wherein
the reward coefficient is the product of the evaluation value of the output sound and the certainty factor, and
the updating unit updates the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) using the product of the reward coefficient and the generation probability q(G_N|x_N).

The input sound mask processing learning device according to claim 1 or 2, wherein
the generation probability q(G_N|x_N)
varies so that its value increases when the evaluation value of the output sound is positive,
varies so that its value decreases when the evaluation value of the output sound is negative, and
varies less when the certainty factor is relatively low than when the certainty factor is relatively high.

The input sound mask processing learning device according to any one of claims 1 to 3, wherein
the generation probability q(G_N|x_N) is the sum of the logarithms of the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}).

The input sound mask processing learning device according to claim 1, wherein
the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) is expressed, using a parameter Θ_Μ, as p(G_τ|x_τ,Θ_Μ) (τ∈{1,…,Τ}), and
the evaluation value is not differentiable with respect to the parameter Θ_Μ.

The input sound mask processing learning device according to any one of claims 1 to 5,
further comprising
a comparative output sound generation unit that generates N comparative output sounds from the N input sounds,
wherein the reward coefficient is determined from the certainty factor and the difference between the evaluation value of the output sound and the evaluation value of the comparative output sound.

An input data processing function learning device comprising:
a processing function generation unit that generates N processing functions G_N from input vectors x_N based on N items of input data (N is an integer between 1 and τ), on the basis of a posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) that models the generation probability that a processing function G_τ (τ∈{1,…,Τ}) is generated when an input vector x_τ (τ∈{1,…,Τ}) based on input data is input;
a processing function application unit that uses the processing functions G_N to generate, from the N items of input data, N items of output data obtained by processing the N items of input data with the processing functions;
a reward coefficient acquisition unit that obtains a reward coefficient of the processing functions G_N for the N items of output data; and
an updating unit that updates the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) using the reward coefficient and a generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}), that the processing functions G_N are generated when the input vectors x_N are input,
wherein the reward coefficient is determined from an evaluation value of the output data and a certainty factor that is the certainty of the processing functions G_N generated when the input data is input.

An input sound mask processing learning method comprising:
a mask generation step in which an input sound mask processing learning device generates N masks G_N from input vectors x_N based on N input sounds (N is an integer between 1 and τ inclusive), on the basis of a posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) that models the generation probability that a mask G_τ (τ∈{1,…,Τ}) is generated when an input vector x_τ (τ∈{1,…,Τ}) based on an input sound is input;
a mask processing step in which the input sound mask processing learning device uses the masks G_N to generate, from the N input sounds, N output sounds obtained by masking the N input sounds;
a reward coefficient acquisition step in which the input sound mask processing learning device obtains a reward coefficient of the masks G_N for the N output sounds; and
an updating step in which the input sound mask processing learning device updates the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}) using the reward coefficient and a generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ∈{1,…,Τ}), that the masks G_N are generated when the input vectors x_N are input,
wherein the reward coefficient is determined from an evaluation value of the output sound and a certainty factor that is the certainty of the masks G_N generated when the input sound is input.

An input data processing function learning method comprising:
a processing function generation step in which an input data processing function learning device, based on a posterior probability distribution p(G_τ|x_τ) (τ ∈ {1, …, Τ}) modeling the generation probability that a processing function G_τ (τ ∈ {1, …, Τ}) is generated when an input vector x_τ (τ ∈ {1, …, Τ}) based on input data is input, generates a processing function G_N from an input vector x_N based on N input data (N is an integer from 1 to Τ inclusive);
a processing function application step in which the input data processing function learning device uses the processing function G_N to generate, from the N input data, N output data obtained by processing the N input data with the processing function;
a reward coefficient acquisition step in which the input data processing function learning device obtains, for the N output data, a reward coefficient of the processing function G_N; and
an updating step in which the input data processing function learning device updates the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1, …, Τ}) using the reward coefficient and a generation probability q(G_N|x_N), based on the posterior probability distribution p(G_τ|x_τ) (τ ∈ {1, …, Τ}), that the processing function G_N is generated when the input vector x_N is input,
wherein the reward coefficient is determined from an evaluation value of the output data and a certainty factor, the certainty factor being the certainty of the processing function G_N generated when the input data is input.

A program for causing a computer to function as the input sound mask processing learning device according to any one of claims 1 to 6 or the input data processing function learning device according to claim 7.
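
For illustration only, the four claimed steps (generation, application, reward coefficient acquisition, updating) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The toy Gaussian posterior around `sigmoid(w * x)`, the negative mean squared error used as the evaluation value, the geometric-mean density used as the certainty factor, the product form of the reward coefficient, and the reward-weighted log-likelihood gradient step are all assumptions. Every function and parameter name is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def generate_mask(w, x, sigma, rng):
    """Generation step: sample a mask G from a toy posterior p(G|x),
    here a Gaussian centred on sigmoid(w_i * x_i) (assumption)."""
    means = [sigmoid(wi * xi) for wi, xi in zip(w, x)]
    mask = [rng.gauss(m, sigma) for m in means]
    return mask, means


def apply_mask(mask, x):
    """Mask processing step: element-wise product of mask and input."""
    return [g * xi for g, xi in zip(mask, x)]


def log_q(mask, means, sigma):
    """Log generation probability log q(G|x) of the sampled mask."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((g - m) / sigma) ** 2 - norm
               for g, m in zip(mask, means))


def evaluation_value(output, target):
    """Stand-in evaluation value: negative mean squared error (assumption)."""
    return -sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def reward_coefficient(evaluation, certainty):
    """Assumed combination: the claims only state that both are used."""
    return evaluation * certainty


def update(w, x, mask, means, sigma, reward, lr):
    """Updating step: one reward-weighted gradient step on log q(G|x)."""
    new_w = []
    for wi, xi, g, m in zip(w, x, mask, means):
        # d log q / d w_i for the Gaussian-around-sigmoid toy posterior
        grad = (g - m) / sigma ** 2 * m * (1.0 - m) * xi
        new_w.append(wi + lr * reward * grad)
    return new_w


def learning_step(w, x, target, sigma=0.1, lr=0.01, rng=None):
    """Run the four steps once and return the updated parameters."""
    rng = rng or random.Random(0)
    mask, means = generate_mask(w, x, sigma, rng)
    output = apply_mask(mask, x)
    # Hypothetical certainty factor: geometric-mean density of the sample.
    certainty = math.exp(log_q(mask, means, sigma) / len(mask))
    reward = reward_coefficient(evaluation_value(output, target), certainty)
    return update(w, x, mask, means, sigma, reward, lr)
```

Calling `learning_step([0.0, 0.0], [1.0, 2.0], [0.5, 1.0])` performs one pass over a two-element input and returns the updated parameter list. An actual system would replace the per-element weights with a neural network and the stand-in evaluation value with whatever quality measure is being optimized.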