Background
In real life, the transmission of voice information is almost inevitably corrupted by external noise, which degrades speech quality and impairs both voice communication and speech recognition. Speech enhancement is the technique of extracting the useful speech signal from noise-corrupted speech while suppressing and reducing the noise, that is, recovering speech as close to the original as possible from noisy speech; it is widely applied in speech communication, speech recognition, and related fields.
Existing speech enhancement algorithms fall into two categories according to the number of microphones used. The first is single-microphone speech enhancement, such as spectral subtraction, Wiener filtering, MMSE estimation, and Kalman filtering. These algorithms receive the speech signal with a single microphone, so the hardware is compact and simple, but their noise reduction capability is limited: they mostly handle only stationary noise and give unsatisfactory enhancement for non-stationary noise. The second category is speech enhancement based on a microphone array, in which multiple microphones in the acquisition system receive sound from different spatial directions; spatial filtering amplifies the signal from the speaker's direction and suppresses noise and interference from other directions. Compared with single-microphone methods, array methods offer higher signal gain and stronger interference suppression and can address a variety of acoustic estimation problems, such as sound source localization, dereverberation, speech enhancement, and blind source separation, at the cost of larger size and higher algorithmic complexity. Existing microphone-array speech enhancement techniques can be roughly divided into three types: fixed beamforming, adaptive beamforming, and adaptive post-filtering. Adaptive beamforming adjusts and optimizes the array weights with an adaptive algorithm under a chosen optimality criterion; because it adapts well to environmental changes, it is the most widely used in practice.
The Generalized Sidelobe Canceller (GSC) is a common structure for realizing adaptive beamforming and consists mainly of two branches: the first branch uses a fixed beamformer to enhance signals from the look direction; the second branch uses a blocking matrix to prevent signals from the look direction from passing through, then filters the blocking matrix output with an adaptive filter to estimate the residual noise in the first branch's output, which is cancelled by subtraction. The GSC transforms the constrained Linearly Constrained Minimum Variance (LCMV) optimization problem into an unconstrained one, and is therefore computationally efficient and simpler to implement than other adaptive beamforming algorithms. The conventional GSC nevertheless has shortcomings: it is weak at suppressing spatially incoherent noise, it makes no use of prior knowledge of the speech signal, and it is not optimized for the characteristics of speech.
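To make the two-branch structure concrete, the following is a minimal NumPy sketch of one narrowband GSC step. It is illustrative only: w_q, B, and the step size are placeholders, and the NLMS adaptation rule is a common choice rather than anything prescribed here.

```python
import numpy as np

def gsc_step(x, w_q, B, w_a, mu=0.1, eps=1e-8):
    """One narrowband GSC step on an M-channel snapshot x (complex).

    w_q: fixed beamformer weights; B: blocking matrix; w_a: adaptive weights.
    """
    d = w_q.conj() @ x                 # branch one: look-direction enhancement
    u = B @ x                          # branch two: blocked, noise-only reference
    y = d - w_a.conj() @ u             # cancel residual noise by subtraction
    # NLMS update drives the residual noise in y toward zero
    w_a = w_a + mu * u * np.conj(y) / (eps + np.real(u.conj() @ u))
    return y, w_a
```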
To address these problems, Chinese patent 201711201341.5 provides a microphone-array speech enhancement method based on statistical models: a clean-speech model and a noise model estimated from the output of GSC branch two are used to construct an optimal speech filter that enhances the output of GSC branch one, effectively improving the system's suppression of incoherent noise and, by exploiting prior knowledge of the speech signal, making the output speech better match human auditory characteristics. That method still has the following drawbacks: (1) it adjusts the update rate of the incoherent noise using the ratio between the adaptive filter's output signal energy and the sum of the energies of its M-1 input channels, so when coherent and incoherent noise are present simultaneously the incoherent noise is difficult to estimate and track accurately, which degrades noise suppression; (2) it enhances the fixed-beamforming output with a linear filter, which distorts the speech signal while removing noise, greatly limiting the enhancement; (3) successive speech frames are processed independently of each other, so the temporal correlation of the speech signal cannot be exploited.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the prior art by providing a microphone array speech enhancement method based on a deep neural network, together with an apparatus implementing it. The method differs from the prior art in two respects: (1) a third branch for estimating incoherent noise is added to the conventional GSC, so the residual noise in the output of branch one can be estimated more accurately; (2) a deep neural network is trained with noisy speech and noise as input and clean speech as output, and the output of branch one is enhanced with this network, which better exploits the nonlinear characteristics and temporal correlation of the speech signal and maps the branch-one output to clean speech more accurately. The invention can be widely applied to voice communication with noisy backgrounds, such as video conferencing, in-vehicle communication, meeting venues, and multimedia classrooms.
The first object of the invention can be achieved by the following technical solution:
A microphone array speech enhancement method based on a deep neural network is disclosed, which enhances the input speech signal through the following steps:
S1. Train a deep neural network that maps noisy speech and noise to clean speech, using a clean speech library and a noise library.
S2. Using the microphone array, estimate the speaker's direction of arrival θ_0, the number of interference sources J, and the interference sources' directions of arrival θ_j, 1 ≤ j ≤ J.
S3. Split the signal received by the microphone array into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, yielding the branch-one output speech spectrum S^(f)(ω,t), where t is the frame index. Branch two uses a blocking matrix B_1 to suppress the signal from the speaker's direction and passes the blocking matrix output through an adaptive filter, yielding the branch-two output noise-component spectrum Ŝ^(n1)(ω,t). Branch three uses a blocking matrix B_2 to suppress the signals from the speaker's direction and from all interference-source directions, yielding the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)(ω,t).
S4. Use Ŝ^(n1)(ω,t) and Ŝ^(n2)(ω,t) to estimate the noise spectrum Ŝ^(n)(ω,t) contained in S^(f)(ω,t).
S5. Feed S^(f)(ω,t) and Ŝ^(n)(ω,t) into the deep neural network trained in step S1 to obtain the enhanced speech.
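Read together, steps S1 to S5 form a simple processing chain. The sketch below is purely illustrative; every function name is hypothetical and stands in for the corresponding step:

```python
def enhance_block(mic_frames, dnn, theta_0, thetas_j):
    """Hypothetical top-level pass of steps S3-S5 for one block of frames."""
    S_f = fixed_beamformer(mic_frames, theta_0)                  # branch one (S3)
    S_n1 = blocked_adaptive_branch(mic_frames, theta_0)          # branch two (S3)
    S_n2 = incoherent_branch(mic_frames, theta_0, thetas_j)      # branch three (S3)
    S_n = combine_noise_estimates(S_n1, S_n2)                    # step S4
    return dnn(concat_spectra(S_f, S_n))                         # step S5
```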
Further, in step S1, the deep neural network training includes the following steps:
S1.1. Superimpose speech from the clean speech library with noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as target output, yielding the training data set (a sketch of this construction follows step S1.3 below).
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:
Φ = (1/T) Σ_{t=1}^{T} ||f(Y(ω,t)) - X(ω,t)||²
where X(ω,t) is the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum Ŝ^(n)(ω,t), f(Y(ω,t)) is the output of the neural network, and T is the number of speech frames used for training.
S1.3. Train the deep neural network until the change in the cost function Φ is smaller than a preset value.
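The data construction in step S1.1 can be sketched as follows. The sampling rate, STFT parameters, SNR mixing, and the use of magnitude spectra are all assumptions here; the text specifies none of them:

```python
import numpy as np
from scipy.signal import stft

def make_training_pair(clean, noise, snr_db, fs=16000, nperseg=512):
    """Mix clean speech and noise at snr_db; return the network input
    (noisy-speech and noise spectra stacked) and the clean-speech target."""
    noise = noise[: len(clean)]
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    scaled = gain * noise

    def spec(x):
        return np.abs(stft(x, fs=fs, nperseg=nperseg)[2])   # magnitude STFT

    Y = np.concatenate([spec(clean + scaled), spec(scaled)], axis=0)  # input Y(w, t)
    X = spec(clean)                                                   # target X(w, t)
    return Y, X
```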
In steps S3 and S4 above, the input signal is first decomposed into K subbands, the signal of each subband is processed by the three branches, and the results are then synthesized into the full-band S^(f)(ω,t) and Ŝ^(n)(ω,t).
In step S3, for the i-th subband, the branch-one weight vector w_{q,i} is computed as
w_{q,i} = C_{1i}(C_{1i}^H C_{1i})^{-1} f
where C_{1i} = d(ω_i, θ_0) is the constraint matrix, d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, e^{-jω_i τ_{0,1}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, and f is the response vector.
In step S3 above, for the i-th subband, the branch-two blocking matrix B_{1i} is computed as follows. Perform the singular value decomposition of the matrix C_{1i} = d(ω_i, θ_0):
C_{1i} = U_{1i} Σ_{1i} V_{1i}^H
where Σ_{1ir} is the r_1 × r_1 diagonal block of Σ_{1i} holding the nonzero singular values and r_1 is the rank of C_{1i}. Partition U_{1i}^H into its first r_1 rows U_{1ir} and its remaining rows Ū_{1i}; then B_{1i} = Ū_{1i}. A NumPy sketch of this construction follows.
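A minimal sketch of the construction just described; it applies equally to B_{2i} below, with C_{2i} in place of C_{1i}. The rank tolerance is an assumption:

```python
import numpy as np

def blocking_matrix(C):
    """Blocking matrix from the SVD of a constraint matrix C (M x p):
    the rows of U^H beyond rank(C), so that blocking_matrix(C) @ C = 0."""
    U, s, _ = np.linalg.svd(C)           # C = U @ diag(s) @ Vh
    r = int(np.sum(s > s[0] * 1e-10))    # numerical rank r of C
    return U[:, r:].conj().T             # remaining rows of U^H

# Usage (illustrative names): B1 = blocking_matrix(d_steering.reshape(-1, 1))
```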
In step S3 above, for the i-th subband, the branch-three blocking matrix B_{2i} is computed as follows. Perform the singular value decomposition of the matrix C_{2i} = [d(ω_i, θ_0), d(ω_i, θ_1), …, d(ω_i, θ_J)]:
C_{2i} = U_{2i} Σ_{2i} V_{2i}^H
where d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, d(ω_i, θ_j) = [e^{-jω_i τ_{j,0}}, …, e^{-jω_i τ_{j,M-1}}]^T, 1 ≤ j ≤ J, J is the number of interference sources, θ_j is the direction of arrival of the j-th interference source, τ_{j,m}, 0 ≤ m ≤ M-1, is the delay difference between the j-th interference source's sound arriving at the m-th array element and at the 0-th array element, Σ_{2ir} is the r_2 × r_2 diagonal block of Σ_{2i} holding the nonzero singular values, and r_2 is the rank of C_{2i}. Partition U_{2i}^H into its first r_2 rows U_{2ir} and its remaining rows Ū_{2i}; then B_{2i} = Ū_{2i}.
In step S4, for the i-th subband, the noise spectrum contained in the branch-one output speech spectrum S^(f)_i(ω,t) is computed as
Ŝ^(n)_i(ω,t) = Ŝ^(n1)_i(ω,t) + (w_{q,i} - B_{1i}^H w_{a,i})^H Ŝ^(n2)_i(ω,t)
where w_{q,i} and w_{a,i} are the weight vectors of the branch-one fixed beamformer and the branch-two adaptive filter respectively, B_{1i} is the branch-two blocking matrix, Ŝ^(n2)_i(ω,t) is the spectral vector of the spatially incoherent noise output by branch three in the i-th subband, and Ŝ^(n1)_i(ω,t) is the spectrum of the noise component output by branch two in the i-th subband.
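Under the reconstruction above, both formulas are a few lines of NumPy. Shapes are assumptions; in particular Ŝ^(n2)_i is taken here as an M-dimensional vector so the equivalent weight w_{q,i} - B_{1i}^H w_{a,i} can act on it:

```python
import numpy as np

def quiescent_weights(C1, f):
    """w_q = C1 (C1^H C1)^{-1} f, with C1 of shape (M, p) and f of shape (p,)."""
    return C1 @ np.linalg.solve(C1.conj().T @ C1, f)

def branch_one_noise(S_n1, S_n2, w_q, w_a, B1):
    """Coherent noise estimate from branch two plus the incoherent noise
    seen through the equivalent GSC weight vector w_q - B1^H w_a."""
    w_eq = w_q - B1.conj().T @ w_a
    return S_n1 + w_eq.conj() @ S_n2
```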
The other object of the invention can be achieved by the following technical solution:
An apparatus implementing the deep-neural-network-based microphone array speech enhancement method comprises a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signals and to divide them into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to filter it with the trained network; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands.
The microphone array receiving module adopts a linear array structure comprising 8 microphones uniformly spaced along a line, each array element being isotropic. The subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal and sends it to the deep neural network for enhancement.
Further, the i-th (i = 1, 2, …, 24) improved subband GSC module comprises 3 branches: branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the speaker-direction signal and passes the blocking matrix output through the adaptive filter w_{a,i} to obtain the noise-component spectrum Ŝ^(n1)_i(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Compared with the prior art, the invention has the following advantages and effects:
1. The invention suppresses the signals from the speaker and interference-source directions through branch three to obtain the spatially incoherent noise spectral vector; compared with Chinese patent 201711201341.5, the spatially incoherent noise can be estimated and tracked more accurately.
2. The invention uses a deep neural network to map noisy speech and noise to clean speech; compared with the direct subtraction of the conventional GSC, or the linear filter built from statistical models such as GMMs and HMMs in Chinese patent 201711201341.5, it effectively exploits the nonlinear characteristics and temporal correlation of the speech signal, so the estimate is more accurate and closer to human auditory characteristics.
3. The deep neural network used by the invention takes both the noisy speech and the noise as input, which gives a better enhancement effect than conventional deep-neural-network speech enhancement that uses only the noisy speech as input.
4. The invention combines microphone-array speech enhancement with deep neural networks, and its performance is superior to both conventional microphone-array speech enhancement and single-microphone deep-neural-network speech enhancement.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in Fig. 1, the system implementing the microphone array speech enhancement method disclosed in this embodiment consists of a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signals and to divide them into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to filter it with the trained network; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands. In this embodiment, the microphone array receiving module adopts a linear array of 8 microphones uniformly spaced along a line, each array element being isotropic. The subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal, which is sent to the deep neural network for enhancement.
The i-th improved subband GSC module, shown in Fig. 2, comprises 3 branches. Branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the speaker-direction signal and passes the blocking matrix output through the adaptive filter w_{a,i} to obtain the noise-component spectrum Ŝ^(n1)_i(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Example two
This embodiment discloses a microphone array speech enhancement method based on a Deep Neural Network (DNN), implemented with the microphone array speech enhancement system disclosed in Example one; the process of enhancing the input speech is shown in Fig. 3:
Step S1: train a deep neural network that maps noisy speech and noise to clean speech, using a clean speech library and a noise library.
In step S1, the deep neural network is trained as follows:
S1.1. Superimpose speech from the clean speech library with noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as target output, yielding the training data set.
In this embodiment, the noise library contains different kinds of noise at different signal-to-noise ratios.
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:
Φ = (1/T) Σ_{t=1}^{T} ||f(Y(ω,t)) - X(ω,t)||²
where X(ω,t) is the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum Ŝ^(n)(ω,t), f(Y(ω,t)) is the output of the neural network, and T is the number of speech frames used for training.
In this embodiment, the deep neural network structure, shown in Fig. 4, comprises 1 dimensionality-reduction layer, 10 fully-connected layers, and 3 Dropout layers. After the input vector is reduced in dimension by the dimensionality-reduction layer, it passes through a hidden stack of 9 fully-connected layers and 3 Dropout layers; each fully-connected layer has 2048 nodes and uses ReLU as the activation function, one Dropout layer follows every 3 fully-connected layers, and the dropout rates of the 3 Dropout layers are 0.1, 0.2, and 0.2, respectively. The output layer of the deep neural network is a fully-connected layer with ReLU activation whose number of nodes equals the dimension of the input Y(ω,t).
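A sketch of this network in PyTorch; the width of the dimensionality-reduction layer is an assumption, as the embodiment does not state it:

```python
import torch.nn as nn

class EnhancementDNN(nn.Module):
    """Sketch of the described network: reduction layer, 9 hidden FC layers of
    2048 with ReLU, one Dropout after every 3 FC layers, FC + ReLU output."""
    def __init__(self, in_dim: int, reduced_dim: int = 512):
        super().__init__()
        layers = [nn.Linear(in_dim, reduced_dim)]   # dimensionality-reduction layer
        width = reduced_dim
        for p in (0.1, 0.2, 0.2):                   # dropout rates from the text
            for _ in range(3):
                layers += [nn.Linear(width, 2048), nn.ReLU()]
                width = 2048
            layers.append(nn.Dropout(p))
        # output layer: width equal to the input dimension, per the text
        layers += [nn.Linear(width, in_dim), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        return self.net(y)
```

Training (step S1.3) then reduces to minimizing the cost Φ, i.e. a standard MSE loss, with a gradient-descent optimizer until the change in the loss falls below the preset threshold.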
S1.3. Train the deep neural network with a gradient descent method until the change in the cost function Φ is smaller than a preset value.
Step S2: estimate the speaker's direction of arrival, the number of interference sources, and the interference sources' directions of arrival using the microphone array.
In this embodiment, these quantities are estimated as follows:
S2.1. Determine the number of sources by eigenvalue decomposition. When J independent far-field wideband signals in space are incident at angles θ_j, 1 ≤ j ≤ J, on a uniform linear array of M elements, the array received signal is
X(t)=AS(t)+N(t)
where X(t) is the array received-signal vector, S(t) is the vector of the J far-field source signals, A is the array manifold matrix, and N(t) is the additive background-noise vector. The covariance of the array received-signal vector is
R = E[X(t)X(t)^H]
where E denotes expectation. Perform an eigenvalue decomposition of the covariance R:
R = UΣU^H
where Σ is an M-dimensional diagonal matrix whose M diagonal elements λ_n, n = 1 to M, are the eigenvalues of R, U is the corresponding eigenvector matrix, and M is the number of array elements. Sort the M eigenvalues in descending order:
λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ λ_{n+1} ≥ … ≥ λ_M.
In this embodiment, the number of signal sources is determined by a criterion computed from the ordered eigenvalues, in which K is the number of observation signal samples. In another embodiment, the number of signal sources is determined by an alternative criterion of the same form.
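Both criteria operate on the ordered eigenvalues together with the snapshot count K. As one standard criterion of this family (an assumption; the patent's exact formula is not reproduced here), the MDL estimator of Wax and Kailath looks like this:

```python
import numpy as np

def estimate_num_sources(eigvals, K):
    """MDL source-count estimate from eigenvalues sorted in descending order,
    given K observation snapshots."""
    M = len(eigvals)
    mdl = []
    for n in range(M):
        tail = eigvals[n:]
        geo = np.exp(np.mean(np.log(tail)))   # geometric mean of the M - n smallest
        ari = np.mean(tail)                   # arithmetic mean
        mdl.append(-K * (M - n) * np.log(geo / ari)
                   + 0.5 * n * (2 * M - n) * np.log(K))
    return int(np.argmin(mdl))
```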
S2.2. Estimate the sound-source azimuths with the MUSIC algorithm as follows: perform the eigenvalue decomposition of R, form the matrix G from the eigenvectors corresponding to the M - J smallest eigenvalues, and compute the MUSIC spectrum P_MUSIC(θ); the maxima of the MUSIC spectrum give the incoming-wave directions. The MUSIC spectrum is computed as
P_MUSIC(θ) = 1 / (a^H(θ) G G^H a(θ))
where a(θ) = [1, e^{-jφ}, …, e^{-j(M-1)φ}]^T is the steering vector with φ = 2πd sinθ/λ, d is the element spacing, and λ is the wavelength; the peaks occur at φ_k = 2πd sinθ_k/λ, k = 1, 2, …, J.
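A compact NumPy sketch of this search over candidate angles; the element spacing d and wavelength lam are carried over from the formula as inputs:

```python
import numpy as np

def music_spectrum(R, J, d, lam, thetas):
    """MUSIC pseudo-spectrum for a uniform linear array with covariance R
    and J sources; peaks of the returned array mark the directions of arrival."""
    M = R.shape[0]
    w, U = np.linalg.eigh(R)              # eigenvalues in ascending order
    G = U[:, : M - J]                     # eigenvectors of the M - J smallest
    m = np.arange(M)
    P = np.empty(len(thetas))
    for idx, th in enumerate(thetas):
        a = np.exp(1j * m * 2 * np.pi * d * np.sin(th) / lam)   # steering vector
        P[idx] = 1.0 / np.real(a.conj() @ G @ G.conj().T @ a)
    return P
```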
Step S3: the subband decomposition module decomposes the signals received by the microphone array into 24 subbands, and the signal of each subband i is split into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, yielding the branch-one output speech spectrum S^(f)_i(ω,t), where t is the frame index. Branch two uses the blocking matrix B_{1i} to suppress the signal from the speaker's direction and passes the blocking matrix output through an adaptive filter, yielding the branch-two output noise-component spectrum Ŝ^(n1)_i(ω,t). Branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker's direction and all interference directions, yielding the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
In this embodiment, subband decomposition and synthesis are realized with a cosine-modulated filter bank, whose analysis and synthesis filters are obtained by modulating a low-pass prototype filter of bandwidth π/(2K), where K = 24 is the number of subbands. The coefficients of the analysis filter bank used for subband decomposition are computed as follows: with the low-pass filter of coefficients h_0(l) as the prototype filter, the analysis filter coefficients are
h_k(l) = 2h_0(l) cos((2k+1)(π/2K)(l - (L-1)/2) + θ_k)
where h_k(l) are the coefficients of the k-th filter in the analysis filter bank, L is the order of the prototype filter, l = 0 to L-1, k = 0 to K-1, and θ_k = (-1)^k π/4.
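A sketch of the analysis-bank generation under the reconstruction above; the design of the prototype filter itself (e.g. a windowed low-pass FIR) is left as an assumption:

```python
import numpy as np

def analysis_filters(h0, K=24):
    """h_k(l) = 2 h0(l) cos((2k+1) pi/(2K) (l - (L-1)/2) + theta_k),
    theta_k = (-1)^k pi/4, for k = 0..K-1."""
    L = len(h0)
    l = np.arange(L)
    H = np.empty((K, L))
    for k in range(K):
        theta_k = ((-1) ** k) * np.pi / 4
        H[k] = 2 * h0 * np.cos((2 * k + 1) * np.pi / (2 * K) * (l - (L - 1) / 2) + theta_k)
    return H
```

The synthesis bank used later in step S4 differs only in the sign of θ_k.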
In this embodiment, for the i-th subband, the branch-one weight vector w_{q,i} is computed as
w_{q,i} = C_{1i}(C_{1i}^H C_{1i})^{-1} f
where C_{1i} = d(ω_i, θ_0) is the constraint matrix, d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, and f is the response vector. Weighting the i-th subband signal output by the microphone array receiving module with w_{q,i} yields the branch-one output speech spectrum S^(f)_i(ω,t).
For the i-th subband, the branch-two blocking matrix B_{1i} is computed as follows: perform the singular value decomposition of the matrix C_{1i} = d(ω_i, θ_0),
C_{1i} = U_{1i} Σ_{1i} V_{1i}^H
where Σ_{1ir} is the r_1 × r_1 diagonal block of Σ_{1i} holding the nonzero singular values and r_1 is the rank of C_{1i}. Partition U_{1i}^H into its first r_1 rows U_{1ir} and its remaining rows Ū_{1i}; then B_{1i} = Ū_{1i}. The i-th subband signal output by the microphone array receiving module is weighted by the blocking matrix B_{1i} and then passed through the adaptive filter w_{a,i}, yielding the branch-two output noise-component spectrum Ŝ^(n1)_i(ω,t).
For the i-th subband, the branch-three blocking matrix B_{2i} is computed as follows: perform the singular value decomposition of the matrix C_{2i} = [d(ω_i, θ_0), d(ω_i, θ_1), …, d(ω_i, θ_J)],
C_{2i} = U_{2i} Σ_{2i} V_{2i}^H
where d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, d(ω_i, θ_j) = [e^{-jω_i τ_{j,0}}, …, e^{-jω_i τ_{j,M-1}}]^T, 1 ≤ j ≤ J, J is the number of interference sources, θ_j is the direction of arrival of the j-th interference source, τ_{j,m}, 0 ≤ m ≤ M-1, is the delay difference between the j-th interference source's sound arriving at the m-th array element and at the 0-th array element, Σ_{2ir} is the r_2 × r_2 diagonal block of Σ_{2i} holding the nonzero singular values, and r_2 is the rank of C_{2i}. Partition U_{2i}^H into its first r_2 rows U_{2ir} and its remaining rows Ū_{2i}; then B_{2i} = Ū_{2i}. Weighting the i-th subband signal output by the microphone array receiving module with the blocking matrix B_{2i} yields the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Step S4: use Ŝ^(n1)_i(ω,t) and Ŝ^(n2)_i(ω,t) to estimate the noise spectrum Ŝ^(n)(ω,t) contained in S^(f)(ω,t). In this embodiment, for the i-th subband, the noise spectrum contained in the branch-one output speech spectrum S^(f)_i(ω,t) is computed as
Ŝ^(n)_i(ω,t) = Ŝ^(n1)_i(ω,t) + (w_{q,i} - B_{1i}^H w_{a,i})^H Ŝ^(n2)_i(ω,t)
where w_{q,i} and w_{a,i} are the weight vectors of the branch-one fixed beamformer and the branch-two adaptive filter respectively, B_{1i} is the branch-two blocking matrix, Ŝ^(n2)_i(ω,t) is the spectral vector of the spatially incoherent noise output by branch three in the i-th subband, and Ŝ^(n1)_i(ω,t) is the spectrum of the noise component output by branch two in the i-th subband.
In this embodiment, the per-subband S^(f)_i(ω,t) and Ŝ^(n)_i(ω,t) of all subbands are combined into the full-band S^(f)(ω,t) and Ŝ^(n)(ω,t) by cosine-modulated filtering. The coefficients of the synthesis filter bank used for subband synthesis are computed as follows: using the same low-pass prototype filter h_0(l) as the analysis filter bank, the synthesis filter coefficients are
g_k(l) = 2h_0(l) cos((2k+1)(π/2K)(l - (L-1)/2) - θ_k)
where g_k(l) are the coefficients of the k-th filter in the synthesis filter bank, L is the order of the prototype filter, l = 0 to L-1, k = 0 to K-1, and θ_k = (-1)^k π/4.
Step S5: feed S^(f)(ω,t) and Ŝ^(n)(ω,t) into the deep neural network trained in step S1 to obtain the enhanced speech.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.