Background
In real life, the transmission of voice information is almost inevitably corrupted by external noise, which degrades speech quality and impairs both voice communication and speech recognition. Speech enhancement is the technique of extracting the useful speech signal from noise-corrupted speech while suppressing and reducing the noise, that is, recovering speech as close to the original as possible from noisy speech; it is widely applied in speech communication, speech recognition, and related fields.
Existing speech enhancement algorithms fall into two categories according to the number of microphones used. The first is single-microphone speech enhancement, such as spectral subtraction, Wiener filtering, MMSE estimation, and Kalman filtering. These algorithms receive the speech signal with a single microphone, so the hardware is compact and simple, but their noise reduction capability is limited: they mostly handle only stationary noise and give unsatisfactory enhancement for non-stationary noise. The second category is speech enhancement based on a microphone array, in which multiple microphones in the acquisition system receive sound from different spatial directions; spatial filtering amplifies the signal from the speaker's direction and suppresses noise and interference from other directions. Compared with single-microphone methods, array methods offer higher signal gain and stronger interference suppression and can address a variety of acoustic estimation problems, such as sound source localization, dereverberation, speech enhancement, and blind source separation, at the cost of larger size and higher algorithmic complexity. Existing microphone-array speech enhancement techniques can be roughly divided into three types: fixed beamforming, adaptive beamforming, and adaptive post-filtering. Adaptive beamforming adjusts and optimizes the array weights with an adaptive algorithm under a chosen optimality criterion; because it adapts well to environmental changes, it is the most widely used in practice.
The Generalized Sidelobe Canceller (GSC) is a common structure for realizing adaptive beamforming and consists mainly of two branches: the first branch uses a fixed beamformer to enhance signals from the look direction; the second branch uses a blocking matrix to prevent signals from the look direction from passing through, then filters the blocking matrix output with an adaptive filter to estimate the residual noise in the first branch's output, which is cancelled by subtraction. The GSC transforms the constrained Linearly Constrained Minimum Variance (LCMV) optimization problem into an unconstrained one, and is therefore computationally efficient and simpler to implement than other adaptive beamforming algorithms. The conventional GSC nevertheless has shortcomings: it is weak at suppressing spatially incoherent noise, it makes no use of prior knowledge of the speech signal, and it is not optimized for the characteristics of speech.
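To make the two-branch structure concrete, the following is a minimal NumPy sketch of one narrowband GSC step. It is illustrative only: w_q, B, and the step size are placeholders, and the NLMS adaptation rule is a common choice rather than anything prescribed here.

```python
import numpy as np

def gsc_step(x, w_q, B, w_a, mu=0.1, eps=1e-8):
    """One narrowband GSC step on an M-channel snapshot x (complex).

    w_q: fixed beamformer weights; B: blocking matrix; w_a: adaptive weights.
    """
    d = w_q.conj() @ x                 # branch one: look-direction enhancement
    u = B @ x                          # branch two: blocked, noise-only reference
    y = d - w_a.conj() @ u             # cancel residual noise by subtraction
    # NLMS update drives the residual noise in y toward zero
    w_a = w_a + mu * u * np.conj(y) / (eps + np.real(u.conj() @ u))
    return y, w_a
```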
To address these problems, Chinese patent 201711201341.5 provides a microphone-array speech enhancement method based on statistical models: a clean-speech model and a noise model estimated from the output of GSC branch two are used to construct an optimal speech filter that enhances the output of GSC branch one, effectively improving the system's suppression of incoherent noise and, by exploiting prior knowledge of the speech signal, making the output speech better match human auditory characteristics. That method still has the following drawbacks: (1) it adjusts the update rate of the incoherent noise using the ratio between the adaptive filter's output signal energy and the sum of the energies of its M-1 input channels, so when coherent and incoherent noise are present simultaneously the incoherent noise is difficult to estimate and track accurately, which degrades noise suppression; (2) it enhances the fixed-beamforming output with a linear filter, which distorts the speech signal while removing noise, greatly limiting the enhancement; (3) successive speech frames are processed independently of each other, so the temporal correlation of the speech signal cannot be exploited.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the prior art by providing a microphone array speech enhancement method based on a deep neural network, together with an apparatus implementing it. The method differs from the prior art in two respects: (1) a third branch for estimating incoherent noise is added to the conventional GSC, so the residual noise in the output of branch one can be estimated more accurately; (2) a deep neural network is trained with noisy speech and noise as input and clean speech as output, and the output of branch one is enhanced with this network, which better exploits the nonlinear characteristics and temporal correlation of the speech signal and maps the branch-one output to clean speech more accurately. The invention can be widely applied to voice communication with noisy backgrounds, such as video conferencing, in-vehicle communication, meeting venues, and multimedia classrooms.
The first object of the invention can be achieved by the following technical solution:
A microphone array speech enhancement method based on a deep neural network is disclosed, which enhances the input speech signal through the following steps:
S1. Train a deep neural network that maps noisy speech and noise to clean speech, using a clean speech library and a noise library.
S2. Using the microphone array, estimate the speaker's direction of arrival θ_0, the number of interference sources J, and the interference sources' directions of arrival θ_j, 1 ≤ j ≤ J.
S3. Split the signal received by the microphone array into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, yielding the branch-one output speech spectrum S^(f)(ω,t), where t is the frame index. Branch two uses a blocking matrix B_1 to suppress the signal from the speaker's direction and passes the blocking matrix output through an adaptive filter, yielding the branch-two output noise-component spectrum Ŝ^(n1)(ω,t). Branch three uses a blocking matrix B_2 to suppress the signals from the speaker's direction and from all interference-source directions, yielding the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)(ω,t).
S4. Use Ŝ^(n1)(ω,t) and Ŝ^(n2)(ω,t) to estimate the noise spectrum Ŝ^(n)(ω,t) contained in S^(f)(ω,t).
S5. Feed S^(f)(ω,t) and Ŝ^(n)(ω,t) into the deep neural network trained in step S1 to obtain the enhanced speech.
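Read together, steps S1 to S5 form a simple processing chain. The sketch below is purely illustrative; every function name is hypothetical and stands in for the corresponding step:

```python
def enhance_block(mic_frames, dnn, theta_0, thetas_j):
    """Hypothetical top-level pass of steps S3-S5 for one block of frames."""
    S_f = fixed_beamformer(mic_frames, theta_0)                  # branch one (S3)
    S_n1 = blocked_adaptive_branch(mic_frames, theta_0)          # branch two (S3)
    S_n2 = incoherent_branch(mic_frames, theta_0, thetas_j)      # branch three (S3)
    S_n = combine_noise_estimates(S_n1, S_n2)                    # step S4
    return dnn(concat_spectra(S_f, S_n))                         # step S5
```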
Further, in step S1, the deep neural network training includes the following steps:
S1.1. Superimpose speech from the clean speech library with noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as target output, yielding the training data set (a sketch of this construction follows step S1.3 below).
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:
Φ = (1/T) Σ_{t=1}^{T} ||f(Y(ω,t)) - X(ω,t)||²
where X(ω,t) is the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum Ŝ^(n)(ω,t), f(Y(ω,t)) is the output of the neural network, and T is the number of speech frames used for training.
S1.3. Train the deep neural network until the change in the cost function Φ is smaller than a preset value.
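The data construction in step S1.1 can be sketched as follows. The sampling rate, STFT parameters, SNR mixing, and the use of magnitude spectra are all assumptions here; the text specifies none of them:

```python
import numpy as np
from scipy.signal import stft

def make_training_pair(clean, noise, snr_db, fs=16000, nperseg=512):
    """Mix clean speech and noise at snr_db; return the network input
    (noisy-speech and noise spectra stacked) and the clean-speech target."""
    noise = noise[: len(clean)]
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    scaled = gain * noise

    def spec(x):
        return np.abs(stft(x, fs=fs, nperseg=nperseg)[2])   # magnitude STFT

    Y = np.concatenate([spec(clean + scaled), spec(scaled)], axis=0)  # input Y(w, t)
    X = spec(clean)                                                   # target X(w, t)
    return Y, X
```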
In steps S3 and S4 above, the input signal is first decomposed into K subbands, the signal of each subband is processed by the three branches, and the results are then synthesized into the full-band S^(f)(ω,t) and Ŝ^(n)(ω,t).
In step S3, for the i-th subband, the branch-one weight vector w_{q,i} is computed as
w_{q,i} = C_{1i}(C_{1i}^H C_{1i})^{-1} f
where C_{1i} = d(ω_i, θ_0) is the constraint matrix, d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, e^{-jω_i τ_{0,1}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, and f is the response vector.
In step S3 above, for the i-th subband, the branch-two blocking matrix B_{1i} is computed as follows. Perform the singular value decomposition of the matrix C_{1i} = d(ω_i, θ_0):
C_{1i} = U_{1i} Σ_{1i} V_{1i}^H
where Σ_{1ir} is the r_1 × r_1 diagonal block of Σ_{1i} holding the nonzero singular values and r_1 is the rank of C_{1i}. Partition U_{1i}^H into its first r_1 rows U_{1ir} and its remaining rows Ū_{1i}; then B_{1i} = Ū_{1i}. A NumPy sketch of this construction follows.
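A minimal sketch of the construction just described; it applies equally to B_{2i} below, with C_{2i} in place of C_{1i}. The rank tolerance is an assumption:

```python
import numpy as np

def blocking_matrix(C):
    """Blocking matrix from the SVD of a constraint matrix C (M x p):
    the rows of U^H beyond rank(C), so that blocking_matrix(C) @ C = 0."""
    U, s, _ = np.linalg.svd(C)           # C = U @ diag(s) @ Vh
    r = int(np.sum(s > s[0] * 1e-10))    # numerical rank r of C
    return U[:, r:].conj().T             # remaining rows of U^H

# Usage (illustrative names): B1 = blocking_matrix(d_steering.reshape(-1, 1))
```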
In step S3 above, for the i-th subband, the branch-three blocking matrix B_{2i} is computed as follows. Perform the singular value decomposition of the matrix C_{2i} = [d(ω_i, θ_0), d(ω_i, θ_1), …, d(ω_i, θ_J)]:
C_{2i} = U_{2i} Σ_{2i} V_{2i}^H
where d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, d(ω_i, θ_j) = [e^{-jω_i τ_{j,0}}, …, e^{-jω_i τ_{j,M-1}}]^T, 1 ≤ j ≤ J, J is the number of interference sources, θ_j is the direction of arrival of the j-th interference source, τ_{j,m}, 0 ≤ m ≤ M-1, is the delay difference between the j-th interference source's sound arriving at the m-th array element and at the 0-th array element, Σ_{2ir} is the r_2 × r_2 diagonal block of Σ_{2i} holding the nonzero singular values, and r_2 is the rank of C_{2i}. Partition U_{2i}^H into its first r_2 rows U_{2ir} and its remaining rows Ū_{2i}; then B_{2i} = Ū_{2i}.
In step S4, for the i-th subband, the noise spectrum contained in the branch-one output speech spectrum S^(f)_i(ω,t) is computed as
Ŝ^(n)_i(ω,t) = Ŝ^(n1)_i(ω,t) + (w_{q,i} - B_{1i}^H w_{a,i})^H Ŝ^(n2)_i(ω,t)
where w_{q,i} and w_{a,i} are the weight vectors of the branch-one fixed beamformer and the branch-two adaptive filter respectively, B_{1i} is the branch-two blocking matrix, Ŝ^(n2)_i(ω,t) is the spectral vector of the spatially incoherent noise output by branch three in the i-th subband, and Ŝ^(n1)_i(ω,t) is the spectrum of the noise component output by branch two in the i-th subband.
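Under the reconstruction above, both formulas are a few lines of NumPy. Shapes are assumptions; in particular Ŝ^(n2)_i is taken here as an M-dimensional vector so the equivalent weight w_{q,i} - B_{1i}^H w_{a,i} can act on it:

```python
import numpy as np

def quiescent_weights(C1, f):
    """w_q = C1 (C1^H C1)^{-1} f, with C1 of shape (M, p) and f of shape (p,)."""
    return C1 @ np.linalg.solve(C1.conj().T @ C1, f)

def branch_one_noise(S_n1, S_n2, w_q, w_a, B1):
    """Coherent noise estimate from branch two plus the incoherent noise
    seen through the equivalent GSC weight vector w_q - B1^H w_a."""
    w_eq = w_q - B1.conj().T @ w_a
    return S_n1 + w_eq.conj() @ S_n2
```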
The other object of the invention can be achieved by the following technical solution:
An apparatus implementing the deep-neural-network-based microphone array speech enhancement method comprises a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signals and to divide them into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to filter it with the trained network; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands.
The microphone array receiving module adopts a linear array structure comprising 8 microphones uniformly spaced along a line, each array element being isotropic. The subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal and sends it to the deep neural network for enhancement.
Further, the i-th (i = 1, 2, …, 24) improved subband GSC module comprises 3 branches: branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the speaker-direction signal and passes the blocking matrix output through the adaptive filter w_{a,i} to obtain the noise-component spectrum Ŝ^(n1)_i(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Compared with the prior art, the invention has the following advantages and effects:
1. The invention suppresses the signals from the speaker and interference-source directions through branch three to obtain the spatially incoherent noise spectral vector; compared with Chinese patent 201711201341.5, the spatially incoherent noise can be estimated and tracked more accurately.
2. The invention uses a deep neural network to map noisy speech and noise to clean speech; compared with the direct subtraction of the conventional GSC, or the linear filter built from statistical models such as GMMs and HMMs in Chinese patent 201711201341.5, it effectively exploits the nonlinear characteristics and temporal correlation of the speech signal, so the estimate is more accurate and closer to human auditory characteristics.
3. The deep neural network used by the invention takes both the noisy speech and the noise as input, which gives a better enhancement effect than conventional deep-neural-network speech enhancement that uses only the noisy speech as input.
4. The invention combines microphone-array speech enhancement with deep neural networks, and its performance is superior to both conventional microphone-array speech enhancement and single-microphone deep-neural-network speech enhancement.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
As shown in Fig. 1, the system implementing the microphone array speech enhancement method disclosed in this embodiment consists of a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used, respectively, to receive the multi-channel audio signals and to divide them into subbands; the subband synthesis module and the deep neural network are connected in sequence and are used, respectively, to synthesize the full-band signal and to filter it with the trained network; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands. In this embodiment, the microphone array receiving module adopts a linear array of 8 microphones uniformly spaced along a line, each array element being isotropic. The subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal, which is sent to the deep neural network for enhancement.
The i-th improved subband GSC module, shown in Fig. 2, comprises 3 branches. Branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the speaker-direction signal and passes the blocking matrix output through the adaptive filter w_{a,i} to obtain the noise-component spectrum Ŝ^(n1)_i(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Example two
This embodiment discloses a microphone array speech enhancement method based on a Deep Neural Network (DNN), implemented with the microphone array speech enhancement system disclosed in Example one; the process of enhancing the input speech is shown in Fig. 3:
Step S1: train a deep neural network that maps noisy speech and noise to clean speech, using a clean speech library and a noise library.
In step S1, the deep neural network is trained as follows:
S1.1. Superimpose speech from the clean speech library with noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as target output, yielding the training data set.
In this embodiment, the noise library contains different kinds of noise at different signal-to-noise ratios.
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:
Φ = (1/T) Σ_{t=1}^{T} ||f(Y(ω,t)) - X(ω,t)||²
where X(ω,t) is the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum Ŝ^(n)(ω,t), f(Y(ω,t)) is the output of the neural network, and T is the number of speech frames used for training.
In this embodiment, the deep neural network structure, shown in Fig. 4, comprises 1 dimensionality-reduction layer, 10 fully-connected layers, and 3 Dropout layers. After the input vector is reduced in dimension by the dimensionality-reduction layer, it passes through a hidden stack of 9 fully-connected layers and 3 Dropout layers; each fully-connected layer has 2048 nodes and uses ReLU as the activation function, one Dropout layer follows every 3 fully-connected layers, and the dropout rates of the 3 Dropout layers are 0.1, 0.2, and 0.2, respectively. The output layer of the deep neural network is a fully-connected layer with ReLU activation whose number of nodes equals the dimension of the input Y(ω,t).
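A sketch of this network in PyTorch; the width of the dimensionality-reduction layer is an assumption, as the embodiment does not state it:

```python
import torch.nn as nn

class EnhancementDNN(nn.Module):
    """Sketch of the described network: reduction layer, 9 hidden FC layers of
    2048 with ReLU, one Dropout after every 3 FC layers, FC + ReLU output."""
    def __init__(self, in_dim: int, reduced_dim: int = 512):
        super().__init__()
        layers = [nn.Linear(in_dim, reduced_dim)]   # dimensionality-reduction layer
        width = reduced_dim
        for p in (0.1, 0.2, 0.2):                   # dropout rates from the text
            for _ in range(3):
                layers += [nn.Linear(width, 2048), nn.ReLU()]
                width = 2048
            layers.append(nn.Dropout(p))
        # output layer: width equal to the input dimension, per the text
        layers += [nn.Linear(width, in_dim), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        return self.net(y)
```

Training (step S1.3) then reduces to minimizing the cost Φ, i.e. a standard MSE loss, with a gradient-descent optimizer until the change in the loss falls below the preset threshold.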
S1.3. Train the deep neural network with a gradient descent method until the change in the cost function Φ is smaller than a preset value.
Step S2: estimate the speaker's direction of arrival, the number of interference sources, and the interference sources' directions of arrival using the microphone array.
In this embodiment, these quantities are estimated as follows:
S2.1. Determine the number of sources by eigenvalue decomposition. When J independent far-field wideband signals in space are incident at angles θ_j, 1 ≤ j ≤ J, on a uniform linear array of M elements, the array received signal is
X(t)=AS(t)+N(t)
where X(t) is the array received-signal vector, S(t) is the vector of the J far-field source signals, A is the array manifold matrix, and N(t) is the additive background-noise vector. The covariance of the array received-signal vector is
R = E[X(t)X(t)^H]
where E denotes expectation. Perform an eigenvalue decomposition of the covariance R:
R = UΣU^H
where Σ is an M-dimensional diagonal matrix whose M diagonal elements λ_n, n = 1 to M, are the eigenvalues of R, U is the corresponding eigenvector matrix, and M is the number of array elements. Sort the M eigenvalues in descending order:
λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ λ_{n+1} ≥ … ≥ λ_M.
In this embodiment, the number of signal sources is determined by a criterion computed from the ordered eigenvalues, in which K is the number of observation signal samples. In another embodiment, the number of signal sources is determined by an alternative criterion of the same form.
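Both criteria operate on the ordered eigenvalues together with the snapshot count K. As one standard criterion of this family (an assumption; the patent's exact formula is not reproduced here), the MDL estimator of Wax and Kailath looks like this:

```python
import numpy as np

def estimate_num_sources(eigvals, K):
    """MDL source-count estimate from eigenvalues sorted in descending order,
    given K observation snapshots."""
    M = len(eigvals)
    mdl = []
    for n in range(M):
        tail = eigvals[n:]
        geo = np.exp(np.mean(np.log(tail)))   # geometric mean of the M - n smallest
        ari = np.mean(tail)                   # arithmetic mean
        mdl.append(-K * (M - n) * np.log(geo / ari)
                   + 0.5 * n * (2 * M - n) * np.log(K))
    return int(np.argmin(mdl))
```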
S2.2. Estimate the sound-source azimuths with the MUSIC algorithm as follows: perform the eigenvalue decomposition of R, form the matrix G from the eigenvectors corresponding to the M - J smallest eigenvalues, and compute the MUSIC spectrum P_MUSIC(θ); the maxima of the MUSIC spectrum give the incoming-wave directions. The MUSIC spectrum is computed as
P_MUSIC(θ) = 1 / (a^H(θ) G G^H a(θ))
where a(θ) = [1, e^{-jφ}, …, e^{-j(M-1)φ}]^T is the steering vector with φ = 2πd sinθ/λ, d is the element spacing, and λ is the wavelength; the peaks occur at φ_k = 2πd sinθ_k/λ, k = 1, 2, …, J.
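A compact NumPy sketch of this search over candidate angles; the element spacing d and wavelength lam are carried over from the formula as inputs:

```python
import numpy as np

def music_spectrum(R, J, d, lam, thetas):
    """MUSIC pseudo-spectrum for a uniform linear array with covariance R
    and J sources; peaks of the returned array mark the directions of arrival."""
    M = R.shape[0]
    w, U = np.linalg.eigh(R)              # eigenvalues in ascending order
    G = U[:, : M - J]                     # eigenvectors of the M - J smallest
    m = np.arange(M)
    P = np.empty(len(thetas))
    for idx, th in enumerate(thetas):
        a = np.exp(1j * m * 2 * np.pi * d * np.sin(th) / lam)   # steering vector
        P[idx] = 1.0 / np.real(a.conj() @ G @ G.conj().T @ a)
    return P
```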
Step S3: the subband decomposition module decomposes the signals received by the microphone array into 24 subbands, and the signal of each subband i is split into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, yielding the branch-one output speech spectrum S^(f)_i(ω,t), where t is the frame index. Branch two uses the blocking matrix B_{1i} to suppress the signal from the speaker's direction and passes the blocking matrix output through an adaptive filter, yielding the branch-two output noise-component spectrum Ŝ^(n1)_i(ω,t). Branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker's direction and all interference directions, yielding the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
In this embodiment, subband decomposition and synthesis are realized with a cosine-modulated filter bank, whose analysis and synthesis filters are obtained by modulating a low-pass prototype filter of bandwidth π/(2K), where K = 24 is the number of subbands. The coefficients of the analysis filter bank used for subband decomposition are computed as follows: with the low-pass filter of coefficients h_0(l) as the prototype filter, the analysis filter coefficients are
h_k(l) = 2h_0(l) cos((2k+1)(π/2K)(l - (L-1)/2) + θ_k)
where h_k(l) are the coefficients of the k-th filter in the analysis filter bank, L is the order of the prototype filter, l = 0 to L-1, k = 0 to K-1, and θ_k = (-1)^k π/4.
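A sketch of the analysis-bank generation under the reconstruction above; the design of the prototype filter itself (e.g. a windowed low-pass FIR) is left as an assumption:

```python
import numpy as np

def analysis_filters(h0, K=24):
    """h_k(l) = 2 h0(l) cos((2k+1) pi/(2K) (l - (L-1)/2) + theta_k),
    theta_k = (-1)^k pi/4, for k = 0..K-1."""
    L = len(h0)
    l = np.arange(L)
    H = np.empty((K, L))
    for k in range(K):
        theta_k = ((-1) ** k) * np.pi / 4
        H[k] = 2 * h0 * np.cos((2 * k + 1) * np.pi / (2 * K) * (l - (L - 1) / 2) + theta_k)
    return H
```

The synthesis bank used later in step S4 differs only in the sign of θ_k.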
In this embodiment, for the i-th subband, the branch-one weight vector w_{q,i} is computed as
w_{q,i} = C_{1i}(C_{1i}^H C_{1i})^{-1} f
where C_{1i} = d(ω_i, θ_0) is the constraint matrix, d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, and f is the response vector. Weighting the i-th subband signal output by the microphone array receiving module with w_{q,i} yields the branch-one output speech spectrum S^(f)_i(ω,t).
For the i-th subband, the branch-two blocking matrix B_{1i} is computed as follows: perform the singular value decomposition of the matrix C_{1i} = d(ω_i, θ_0),
C_{1i} = U_{1i} Σ_{1i} V_{1i}^H
where Σ_{1ir} is the r_1 × r_1 diagonal block of Σ_{1i} holding the nonzero singular values and r_1 is the rank of C_{1i}. Partition U_{1i}^H into its first r_1 rows U_{1ir} and its remaining rows Ū_{1i}; then B_{1i} = Ū_{1i}. The i-th subband signal output by the microphone array receiving module is weighted by the blocking matrix B_{1i} and then passed through the adaptive filter w_{a,i}, yielding the branch-two output noise-component spectrum Ŝ^(n1)_i(ω,t).
For the i-th subband, the branch-three blocking matrix B_{2i} is computed as follows: perform the singular value decomposition of the matrix C_{2i} = [d(ω_i, θ_0), d(ω_i, θ_1), …, d(ω_i, θ_J)],
C_{2i} = U_{2i} Σ_{2i} V_{2i}^H
where d(ω_i, θ_0) = [e^{-jω_i τ_{0,0}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's speech arriving at the m-th array element and at the 0-th array element, d(ω_i, θ_j) = [e^{-jω_i τ_{j,0}}, …, e^{-jω_i τ_{j,M-1}}]^T, 1 ≤ j ≤ J, J is the number of interference sources, θ_j is the direction of arrival of the j-th interference source, τ_{j,m}, 0 ≤ m ≤ M-1, is the delay difference between the j-th interference source's sound arriving at the m-th array element and at the 0-th array element, Σ_{2ir} is the r_2 × r_2 diagonal block of Σ_{2i} holding the nonzero singular values, and r_2 is the rank of C_{2i}. Partition U_{2i}^H into its first r_2 rows U_{2ir} and its remaining rows Ū_{2i}; then B_{2i} = Ū_{2i}. Weighting the i-th subband signal output by the microphone array receiving module with the blocking matrix B_{2i} yields the branch-three output spatially incoherent noise spectral vector Ŝ^(n2)_i(ω,t).
Step S4: use Ŝ^(n1)_i(ω,t) and Ŝ^(n2)_i(ω,t) to estimate the noise spectrum Ŝ^(n)(ω,t) contained in S^(f)(ω,t). In this embodiment, for the i-th subband, the noise spectrum contained in the branch-one output speech spectrum S^(f)_i(ω,t) is computed as
Ŝ^(n)_i(ω,t) = Ŝ^(n1)_i(ω,t) + (w_{q,i} - B_{1i}^H w_{a,i})^H Ŝ^(n2)_i(ω,t)
where w_{q,i} and w_{a,i} are the weight vectors of the branch-one fixed beamformer and the branch-two adaptive filter respectively, B_{1i} is the branch-two blocking matrix, Ŝ^(n2)_i(ω,t) is the spectral vector of the spatially incoherent noise output by branch three in the i-th subband, and Ŝ^(n1)_i(ω,t) is the spectrum of the noise component output by branch two in the i-th subband.
In this embodiment, the per-subband S^(f)_i(ω,t) and Ŝ^(n)_i(ω,t) of all subbands are combined into the full-band S^(f)(ω,t) and Ŝ^(n)(ω,t) by cosine-modulated filtering. The coefficients of the synthesis filter bank used for subband synthesis are computed as follows: using the same low-pass prototype filter h_0(l) as the analysis filter bank, the synthesis filter coefficients are
g_k(l) = 2h_0(l) cos((2k+1)(π/2K)(l - (L-1)/2) - θ_k)
where g_k(l) are the coefficients of the k-th filter in the synthesis filter bank, L is the order of the prototype filter, l = 0 to L-1, k = 0 to K-1, and θ_k = (-1)^k π/4.
Step S5: feed S^(f)(ω,t) and Ŝ^(n)(ω,t) into the deep neural network trained in step S1 to obtain the enhanced speech.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.