
CN110517701B - Microphone array speech enhancement method and implementation device - Google Patents


Info

Publication number
CN110517701B
CN110517701B (application CN201910677433.3A)
Authority
CN
China
Prior art keywords
sub
band
noise
branch
neural network
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910677433.3A
Other languages
Chinese (zh)
Other versions
CN110517701A (en)
Inventor
张军
梁晟
宁更新
冯义志
余华
季飞
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910677433.3A
Publication of CN110517701A
Application granted
Publication of CN110517701B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract


The invention discloses a microphone array speech enhancement method and an implementation device. A third branch is used to suppress the signals from the directions of the speaker and the interference sources, yielding the spatial incoherent noise spectrum vector. A deep neural network maps the noisy speech and the estimated noise to clean speech, which effectively exploits the nonlinear characteristics and temporal correlation of speech signals and makes the estimate more accurate and closer to human auditory characteristics. Because the deep neural network takes both noisy speech and noise as input, it achieves a better enhancement effect than conventional deep-neural-network speech enhancement that takes only noisy speech as input. The invention combines microphone-array and deep-neural-network speech enhancement, and its performance is superior to both conventional microphone array speech enhancement and single-microphone deep neural network speech enhancement. It can be widely used in voice communication applications with noisy backgrounds, such as video conferencing, vehicle-mounted communication, conference venues, and multimedia classrooms.


Description

Microphone array speech enhancement method and implementation device
Technical Field
The invention relates to the technical field of speech signal processing, and in particular to a microphone array speech enhancement method based on a deep neural network (DNN) and an implementation device.
Background
In real life, the transmission of voice information is often inevitably subject to interference from external noise, which degrades speech quality and impairs voice communication and recognition. Speech enhancement is the technique of extracting the useful speech signal from noise-corrupted speech while suppressing and reducing the noise, i.e. recovering speech as pure as possible from noisy speech; it is widely applied in voice communication, speech recognition, and related fields.
Existing speech enhancement algorithms fall into two categories according to the number of microphones used. The first is single-microphone speech enhancement, such as spectral subtraction, Wiener filtering, MMSE estimation, and Kalman filtering. These algorithms receive the speech signal with a single microphone and are compact and structurally simple, but their noise reduction capability is limited: they mostly handle only stationary noise, and their enhancement of speech corrupted by non-stationary noise is unsatisfactory. The second category is microphone-array speech enhancement, in which multiple microphones in the acquisition system receive sound from different spatial directions; spatial filtering amplifies the signal from the speaker's direction and suppresses noise and interference from other directions. Compared with single-microphone methods, arrays provide higher signal gain and stronger interference suppression and can address various acoustic estimation problems, such as sound source localization, dereverberation, speech enhancement, and blind source separation, at the cost of larger size and higher algorithmic complexity. Existing microphone-array speech enhancement techniques can be roughly divided into three types: fixed beamforming, adaptive beamforming, and adaptive post-filtering. Adaptive beamforming adjusts and optimizes the array weights through an adaptive algorithm under a chosen optimality criterion and adapts well to environmental changes, so it is the most widely applied in practice.
The Generalized Sidelobe Canceller (GSC) is a common structure for implementing adaptive beamforming and consists mainly of two branches: branch one uses a fixed beamformer to enhance the signal from the look direction, while branch two uses a blocking matrix to prevent the look-direction signal from passing and then applies an adaptive filter to the blocking matrix's output, estimating the residual noise in branch one's output and cancelling it by subtraction. The GSC transforms the constrained Linearly Constrained Minimum Variance (LCMV) optimization problem into an unconstrained one, so it is computationally efficient and simpler to implement than other adaptive beamforming algorithms. However, the conventional GSC has several shortcomings: its ability to suppress spatially incoherent noise is weak, it makes no use of prior knowledge of the speech signal, and it is not optimized for the characteristics of the speech signal.
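As a toy illustration of the two-branch GSC structure described above, the following NumPy sketch steers a fixed beamformer at a simulated speaker, blocks the speaker direction to obtain noise references, and cancels one directional interferer. The array geometry, frequencies, and the closed-form Wiener solve standing in for the iterative adaptive filter are all illustrative assumptions, not the patent's method.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 2000                       # array elements, snapshots (assumed)
omega = 2 * np.pi * 1000.0           # narrowband frequency (rad/s, assumed)
d, c = 0.05, 343.0                   # element spacing (m), speed of sound (m/s)

def steer(theta):
    """Steering vector of a uniform linear array toward angle theta."""
    return np.exp(-1j * omega * np.arange(M) * d * np.sin(theta) / c)

a_s, a_i = steer(np.deg2rad(0.0)), steer(np.deg2rad(40.0))   # speaker, interferer
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)     # speaker envelope
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)     # interferer envelope
X = np.outer(a_s, s) + np.outer(a_i, v)                      # M x N array snapshots

w_q = a_s / M                                    # branch one: delay-and-sum beamformer
U = np.linalg.svd(a_s.reshape(M, 1), full_matrices=True)[0]
B = U[:, 1:].conj().T                            # branch two: blocking matrix, B @ a_s = 0

d_out = w_q.conj() @ X                           # branch one output: speech + residual noise
Z = B @ X                                        # noise references (speaker blocked)
# adaptive filter, here solved in closed form (Wiener solution) rather than iteratively
w_a = np.linalg.pinv(Z @ Z.conj().T) @ (Z @ d_out.conj())
y = d_out - w_a.conj() @ Z                       # GSC output: residual noise cancelled

err_before = np.mean(np.abs(d_out - s) ** 2)     # interference power in branch one alone
err_after = np.mean(np.abs(y - s) ** 2)          # after sidelobe cancellation
assert err_after < 0.1 * err_before
```

Because branch two's references contain no speaker signal, subtracting their filtered version removes interference leakage without cancelling the speech; spatially incoherent noise, however, also leaks into the references, which is the weakness the patent's third branch addresses.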
To address these problems, Chinese patent 201711201341.5 provides a microphone array speech enhancement method based on statistical models, which uses a clean-speech model and a noise model estimated from the output of GSC branch two to construct an optimal speech filter that enhances the output signal of GSC branch one. This effectively improves the system's suppression of incoherent noise and, by exploiting prior knowledge of the speech signal, makes the output speech better match human auditory characteristics. However, that method has the following disadvantages: (1) it adjusts the update rate of the incoherent noise estimate using the ratio between the output signal energy of the adaptive filter and the energy of its M-1 input channels, so when coherent and incoherent noise are present simultaneously the incoherent noise is difficult to estimate and track accurately, degrading noise suppression; (2) it enhances the output of the fixed beamforming part with a linear filter, which introduces speech distortion while removing noise and thus greatly limits the enhancement effect; (3) during enhancement, consecutive speech frames are processed independently of one another, so the temporal correlation of the speech signal cannot be exploited.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the prior art by providing a microphone array speech enhancement method and implementation device based on a deep neural network. The method differs from the prior art in that: (1) a third branch dedicated to estimating incoherent noise is added to the conventional GSC, so the residual noise in branch one's output can be estimated more accurately; (2) a deep neural network is trained with noisy speech and noise as input and clean speech as output, and branch one's output is enhanced with this network, which better exploits the nonlinear characteristics and temporal correlation of the speech signal and maps branch one's output to clean speech more accurately. The invention can be widely applied in voice communication applications with noisy backgrounds, such as video conferencing, vehicle-mounted communication, conference venues, and multimedia classrooms.
The first purpose of the invention can be achieved by adopting the following technical scheme:
A microphone array speech enhancement method based on a deep neural network is disclosed, which enhances the input speech signal through the following steps:
S1. Using a clean speech library and a noise library, train a deep neural network that maps noisy speech and noise to clean speech.
S2. Using the microphone array, estimate the speaker's direction of arrival θ_0, the number J of interference sources, and the interference source directions of arrival θ_j, 1 ≤ j ≤ J.
S3. Divide the signal received by the microphone array into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, giving the speech spectrum S^(f)(ω,t) output by branch one, where t is the frame number. Branch two uses a blocking matrix B_1 to suppress the signal from the speaker's direction and passes the blocking matrix's output through an adaptive filter to obtain the noise component spectrum N̂_2(ω,t) output by branch two. Branch three uses a blocking matrix B_2 to suppress the signals from the directions of the speaker and all interference sources, giving the spatial incoherent noise spectrum vector N̂_3(ω,t) output by branch three.
S4. Use N̂_2(ω,t) and N̂_3(ω,t) to estimate the noise spectrum N̂^(f)(ω,t) contained in S^(f)(ω,t).
S5. Feed S^(f)(ω,t) and N̂^(f)(ω,t) into the deep neural network trained in step S1 to obtain the enhanced speech.
Further, in step S1, training the deep neural network comprises the following steps:
S1.1. Superimpose speech from the clean speech library and noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as the target output, to form the training data set.
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:
Φ = Σ_{t=1}^{T} ‖X(ω,t) - f(Y(ω,t))‖²

where X(ω,t) represents the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum N̂^(f)(ω,t), f(Y(ω,t)) represents the output of the neural network, and T is the number of speech frames used for training.
S1.3. Train the deep neural network until the change in the cost function Φ is smaller than a preset value.
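The cost function of step S1.2 can be sketched as follows. The network here is a stand-in single ReLU layer, not the architecture of the invention, the array shapes are assumptions, and averaging over frames (rather than summing) is an assumed normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_in, dim_out = 100, 514, 257   # assumed: input = [noisy | noise] spectra, output = clean

Y = rng.standard_normal((T, dim_in))        # input samples Y(omega, t)
X = rng.standard_normal((T, dim_out))       # target clean spectra X(omega, t)
W = rng.standard_normal((dim_in, dim_out)) * 0.01   # stand-in one-layer "network"

def f(Y):
    """Toy stand-in for the deep neural network f(Y(omega, t)): one ReLU layer."""
    return np.maximum(Y @ W, 0.0)

def cost(X, Y):
    """Phi: frame-averaged squared error between clean spectra and network output."""
    err = X - f(Y)
    return float(np.mean(np.sum(err ** 2, axis=1)))

phi = cost(X, Y)
assert phi >= 0.0
```

Training (step S1.3) would repeatedly lower this value by gradient descent on the network parameters until its change falls below the preset threshold.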
In the above steps S3 and S4, the input signal is first decomposed into K subbands, the signal of each subband is processed by the three branches, and the full-band S^(f)(ω,t) and N̂^(f)(ω,t) are then synthesized.
In step S3, the weight vector w_{q,i} of branch one for the i-th subband is computed as:

w_{q,i} = C_{1i}(C_{1i}^H C_{1i})^{-1} f

where C_{1i} = d(ω_i, θ_0) is the constraint matrix,

d(ω_i, θ_0) = [1, e^{-jω_i τ_{0,1}}, …, e^{-jω_i τ_{0,M-1}}]^T,

M is the number of array elements of the microphone array, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's voice arriving at the m-th array element and at the 0-th array element, and f is the response vector.
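The constrained weight computation can be sketched with NumPy as below. The element spacing, subband center frequency, and speaker angle are illustrative assumptions, and the response vector f is taken as the scalar 1 (unit gain toward the speaker).

```python
import numpy as np

def steering_vector(omega, theta, M, d=0.05, c=343.0):
    """d(omega, theta): phases from delays tau_m = m*d*sin(theta)/c on a uniform linear array."""
    tau = np.arange(M) * d * np.sin(theta) / c
    return np.exp(-1j * omega * tau)

M = 8                                  # array elements, as in the embodiment
omega_i = 2 * np.pi * 1000.0           # assumed subband center frequency (rad/s)
theta_0 = np.deg2rad(20.0)             # assumed speaker direction

C1 = steering_vector(omega_i, theta_0, M).reshape(M, 1)   # constraint matrix C_{1i}
f = np.array([1.0 + 0j])               # assumed unit response toward the speaker

# w_q = C (C^H C)^{-1} f: minimum-norm weights satisfying the constraint C^H w = f
w_q = C1 @ np.linalg.inv(C1.conj().T @ C1) @ f

assert np.allclose(C1.conj().T @ w_q, f)   # speaker direction passed with desired response
```

With a single constraint column this reduces to a scaled delay-and-sum beamformer; additional constraint columns in C would be handled by the same formula.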
In the above step S3, for the i-th subband, the blocking matrix B_{1i} of branch two is computed as follows: perform the singular value decomposition of the matrix C_{1i} = d(ω_i, θ_0):

C_{1i} = U_{1i} Σ_{1i} V_{1i}^H

where the leading block Σ_{1ir} of Σ_{1i} is an r_1 × r_1 diagonal matrix and r_1 is the rank of C_{1i}. Let

U_{1i} = [U_{1ir}^T, Ū_{1ir}^T]^T

where U_{1ir} consists of the first r_1 rows of U_{1i} and Ū_{1ir} of the remaining rows; then

B_{1i} = Ū_{1ir}
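An SVD-based blocking matrix can be sketched as follows. The sketch forms the blocking matrix from the conjugate-transposed left singular vectors outside the range of the constraint matrix, which guarantees B_{1i} C_{1i} = 0; this indexing convention, and the array parameters, are assumptions layered on the translated text.

```python
import numpy as np

def blocking_matrix(C):
    """Blocking matrix from the SVD C = U Sigma V^H: conjugate-transposed left singular
    vectors outside the range of C, so that B @ C = 0."""
    U = np.linalg.svd(C, full_matrices=True)[0]
    r = np.linalg.matrix_rank(C)
    return U[:, r:].conj().T               # (M - r) x M

M = 8
omega_i = 2 * np.pi * 1000.0               # assumed subband center frequency (rad/s)
tau = np.arange(M) * 0.05 * np.sin(np.deg2rad(20.0)) / 343.0
C1 = np.exp(-1j * omega_i * tau).reshape(M, 1)   # C_{1i} = d(omega_i, theta_0)

B1 = blocking_matrix(C1)
assert B1.shape == (M - 1, M)
assert np.allclose(B1 @ C1, 0)             # signal from the speaker direction is blocked
```

The same routine applied to the multi-column constraint matrix C_{2i} of branch three yields B_{2i}, which blocks the speaker and all interference directions at once.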
In the above step S3, for the i-th subband, the blocking matrix B_{2i} of branch three is computed as follows: perform the singular value decomposition of the matrix C_{2i} = [d(ω_i, θ_0), d(ω_i, θ_1), …, d(ω_i, θ_J)]:

C_{2i} = U_{2i} Σ_{2i} V_{2i}^H

where d(ω_i, θ_0) = [1, e^{-jω_i τ_{0,1}}, …, e^{-jω_i τ_{0,M-1}}]^T, M is the number of array elements of the microphone array, ω_i is the center frequency of the i-th subband, θ_0 is the speaker's direction of arrival, and τ_{0,m}, 0 ≤ m ≤ M-1, is the delay difference between the speaker's voice arriving at the m-th array element and at the 0-th array element; d(ω_i, θ_j) = [1, e^{-jω_i τ_{j,1}}, …, e^{-jω_i τ_{j,M-1}}]^T, 1 ≤ j ≤ J, J is the number of interference sources, θ_j is the direction of arrival of the j-th interference source, and τ_{j,m}, 0 ≤ m ≤ M-1, is the delay difference between the j-th interference source's sound arriving at the m-th array element and at the 0-th array element. The leading block Σ_{2ir} of Σ_{2i} is an r_2 × r_2 diagonal matrix, where r_2 is the rank of C_{2i}. Let

U_{2i} = [U_{2ir}^T, Ū_{2ir}^T]^T

where U_{2ir} consists of the first r_2 rows of U_{2i} and Ū_{2ir} of the remaining rows; then

B_{2i} = Ū_{2ir}
In step S4, for the i-th subband, the noise spectrum N̂_i^(f)(ω,t) contained in the speech spectrum S_i^(f)(ω,t) output by branch one is computed from the branch outputs, where w_{q,i} and w_{a,i} are the weight vectors of the fixed beamformer of branch one and the adaptive filter of branch two respectively, B_{1i} is the blocking matrix of branch two, N̂_{3,i}(ω,t) is the spectral vector of the spatial incoherent noise output by branch three in the i-th subband, and N̂_{2,i}(ω,t) is the spectrum of the noise component output by branch two in the i-th subband.
The other purpose of the invention can be achieved by adopting the following technical scheme:
An implementation device for the deep-neural-network-based microphone array speech enhancement method comprises a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used to receive the multi-channel audio signals and to divide them into subbands, respectively; the subband synthesis module and the deep neural network are connected in sequence and are used to synthesize the full-band signal and to enhance it with the trained network, respectively; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands;
The microphone array receiving module adopts a linear array structure comprising 8 microphones uniformly spaced on a straight line, with each array element isotropic; the subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing; the subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal and sends it to the deep neural network for enhancement.
Further, the i-th (i = 1, 2, …, 24) improved subband GSC module comprises 3 branches: branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the signal from the speaker's direction and passes the blocking matrix's output through the adaptive filter w_{a,i} to obtain the noise component spectrum N̂_{2,i}(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatial incoherent noise spectrum vector N̂_{3,i}(ω,t).
Compared with the prior art, the invention has the following advantages and effects:
1. The invention suppresses the signals from the directions of the speaker and the interference sources through branch three to obtain the spatial incoherent noise spectrum vector; compared with Chinese patent 201711201341.5, the spatial incoherent noise can be estimated and tracked more accurately.
2. The invention uses a deep neural network to map noisy speech and noise to clean speech; compared with the direct subtraction of the conventional GSC, or the linear filters built from statistical models such as GMMs and HMMs in Chinese patent 201711201341.5, it effectively exploits the nonlinear characteristics and temporal correlation of the speech signal, making the estimate more accurate and closer to human auditory characteristics.
3. The deep neural network used by the invention takes both noisy speech and noise as input, giving a better enhancement effect than conventional deep-neural-network speech enhancement that takes only noisy speech as input.
4. The invention combines microphone-array and deep-neural-network speech enhancement, and its performance is superior to both conventional microphone array speech enhancement and single-microphone deep neural network speech enhancement.
Drawings
FIG. 1 is a block diagram of a system for implementing a microphone array speech enhancement method according to an embodiment of the present invention;
fig. 2 is a block diagram of an ith modified sub-band GSC block in an embodiment of the present invention;
FIG. 3 is a flow chart of a microphone array speech enhancement method in an embodiment of the invention;
fig. 4 is a block diagram of a deep neural network architecture used in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The system implementing the microphone array speech enhancement method disclosed in this embodiment is shown in fig. 1 and consists of a microphone array receiving module, a subband decomposition module, a subband synthesis module, 24 improved subband GSC modules, and a deep neural network. The microphone array receiving module and the subband decomposition module are connected in sequence and are used to receive the multi-channel audio signals and to divide them into subbands, respectively; the subband synthesis module and the deep neural network are connected in sequence and are used to synthesize the full-band signal and to enhance it with the trained network, respectively; the 24 improved subband GSC modules are connected to the subband decomposition module and the subband synthesis module and perform GSC filtering on the signal subbands. In this embodiment, the microphone array receiving module adopts a linear array structure comprising 8 microphones uniformly spaced on a straight line, with each array element isotropic. The subband decomposition module decomposes the audio signal collected by each microphone element into 24 subbands, which are sent to the corresponding improved subband GSC modules for processing. The subband synthesis module synthesizes the outputs of the 24 improved subband GSC modules into a full-band signal and sends it to the deep neural network for enhancement.
The i-th improved subband GSC module is shown in fig. 2 and comprises 3 branches. Branch one uses the fixed beamformer w_{q,i} to enhance the signal from the speaker's direction; branch two uses the blocking matrix B_{1i} to suppress the signal from the speaker's direction and passes the blocking matrix's output through the adaptive filter w_{a,i} to obtain the noise component spectrum N̂_{2,i}(ω,t); branch three uses the blocking matrix B_{2i} to suppress the signals from the speaker and all interference sources to obtain the spatial incoherent noise spectrum vector N̂_{3,i}(ω,t).
Example two
The embodiment discloses a microphone array speech enhancement method based on a Deep Neural Network (DNN), which is implemented by using the microphone array speech enhancement system disclosed in the first embodiment, and the process of enhancing the input speech is shown in fig. 3:
Step S1: using a clean speech library and a noise library, train a deep neural network that maps noisy speech and noise to clean speech.
in step S1, the deep neural network training includes the following steps:
S1.1. Superimpose speech from the clean speech library and noise from the noise library to obtain noisy speech; take the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as the target output, to form the training data set.
In this embodiment, the noise in the noise bank includes different kinds of noise with different signal-to-noise ratios.
S1.2. Set the structural parameters of the deep neural network and adopt the following cost function:

Φ = Σ_{t=1}^{T} ‖X(ω,t) - f(Y(ω,t))‖²

where X(ω,t) represents the short-time spectrum of the t-th frame of clean speech, Y(ω,t) is the input sample constructed from the t-th frame's noisy-speech short-time spectrum S^(f)(ω,t) and noise short-time spectrum N̂^(f)(ω,t), f(Y(ω,t)) represents the output of the neural network, and T is the number of speech frames used for training.
In this embodiment, the deep neural network structure is shown in fig. 4 and comprises 1 dimensionality-reduction layer, 10 fully connected layers, and 3 Dropout layers. After the input vector is reduced in dimension by the dimensionality-reduction layer, it passes through a hidden part formed by 9 fully connected layers and 3 Dropout layers; each fully connected layer has 2048 nodes and uses ReLU as the activation function, one Dropout layer follows every 3 fully connected layers, and the drop rates of the 3 Dropout layers are 0.1, 0.2, and 0.2, respectively. The output layer of the deep neural network is a fully connected layer with ReLU activation whose number of nodes equals the dimension of the input Y(ω,t).
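The described architecture can be sketched as a NumPy forward pass in inference mode, where the Dropout layers act as identity. The input/output dimensions and the size of the dimensionality-reduction layer are assumptions, and the weights are random stand-ins rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# assumed sizes; hidden layers have 2048 nodes as stated in the embodiment
dim_in, dim_red, dim_hidden, dim_out = 514, 256, 2048, 514

def make_layer(n_in, n_out):
    """Random stand-in weights (He-style scale) and zero biases."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)

layers = [make_layer(dim_in, dim_red)]                            # dimensionality-reduction layer
layers += [make_layer(dim_red, dim_hidden)]                       # first hidden FC layer
layers += [make_layer(dim_hidden, dim_hidden) for _ in range(8)]  # 9 hidden FC layers in total
layers += [make_layer(dim_hidden, dim_out)]                       # ReLU output layer

def forward(y):
    """Forward pass; the 3 Dropout layers (rates 0.1, 0.2, 0.2) are identity at inference."""
    h = y
    for W, b in layers:
        h = relu(h @ W + b)
    return h

y = rng.standard_normal(dim_in)        # stand-in input Y(omega, t)
out = forward(y)
assert out.shape == (dim_out,)
```

During training, the Dropout masks would be sampled per batch and the weights fitted by gradient descent on the cost function Φ of step S1.2.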
S1.3. Train the deep neural network using gradient descent until the change in the cost function Φ is smaller than a preset value.
Step S2: estimate the speaker's direction of arrival, the number of interference sources, and their directions of arrival using the microphone array.
In this embodiment, the method for estimating the direction of the incoming wave of the speaker, the number of the interference sources, and the direction of the incoming wave is as follows:
S2.1. Determine the number of sources by eigenvalue decomposition. When there are J independent far-field wideband signals in space, incident at angles θ_j, j = 1 to J, on a uniform linear array of M elements, the array received signal is

X(t) = A S(t) + N(t)

where X(t) is the array received signal vector, S(t) is the vector of the J far-field source signals, A is the array manifold matrix, and N(t) is the additive background noise vector. The covariance of the array received signal vector is

R = E[X(t) X(t)^H]

where E denotes expectation. Perform the eigenvalue decomposition of the covariance R:

R = U Σ U^H

where Σ is an M-dimensional diagonal matrix whose M diagonal elements λ_n, n = 1 to M, are the eigenvalues of R, U is the corresponding eigenvector matrix, and M is the number of array elements. Sort the M eigenvalues in descending order:

λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ λ_{n+1} ≥ … ≥ λ_M
In this embodiment, the number of sources is taken as the value of k minimizing the AIC criterion:

J = arg min_k AIC(k),  AIC(k) = -2K(M-k) ln Λ(k) + 2k(2M-k)

where

Λ(k) = (∏_{n=k+1}^{M} λ_n)^{1/(M-k)} / ((1/(M-k)) Σ_{n=k+1}^{M} λ_n)

and K is the number of observation signal samples.
In another embodiment, the number of sources is taken as the value of k minimizing the MDL criterion:

J = arg min_k MDL(k),  MDL(k) = -K(M-k) ln Λ(k) + (1/2)k(2M-k) ln K

where Λ(k) = (∏_{n=k+1}^{M} λ_n)^{1/(M-k)} / ((1/(M-k)) Σ_{n=k+1}^{M} λ_n) and K is the number of observation signal samples.
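Eigenvalue-based source counting can be sketched as below using the standard Wax-Kailath MDL form; the patent's exact expressions were rendered as figures and may differ in detail, and the eigenvalue list here is a made-up example for an 8-element array with two strong sources above a unit noise floor.

```python
import numpy as np

def lam_ratio(eigs, k):
    """Lambda(k): geometric mean / arithmetic mean of the M-k smallest eigenvalues."""
    tail = eigs[k:]
    return np.exp(np.mean(np.log(tail))) / np.mean(tail)

def mdl_source_count(eigs, K):
    """Estimated source count: argmin over k of the MDL criterion."""
    M = len(eigs)
    vals = [-K * (M - k) * np.log(lam_ratio(eigs, k)) + 0.5 * k * (2 * M - k) * np.log(K)
            for k in range(M)]
    return int(np.argmin(vals))

# assumed example: eigenvalues of R sorted descending, two sources, unit noise floor
eigs = np.array([10.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
K = 1000   # number of observation samples
assert mdl_source_count(eigs, K) == 2
```

The AIC variant differs only in its penalty term (2k(2M-k) instead of (1/2)k(2M-k) ln K), and tends to overestimate the source count at large K.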
S2.2. Estimate the sound source azimuths with the MUSIC algorithm as follows: perform the eigenvalue decomposition of R, form the matrix G from the eigenvectors corresponding to the M-J smallest eigenvalues, and compute the MUSIC spectrum P_MUSIC(θ); the peaks of the MUSIC spectrum give the directions of arrival. The MUSIC spectrum is computed as:

P_MUSIC(θ) = 1 / (a^H(θ) G G^H a(θ))

where a(θ_k) = [1, e^{jφ_k}, …, e^{j(M-1)φ_k}]^T and φ_k = 2πd sin θ_k / λ, k = 1, 2, …, J, with d the array element spacing and λ the signal wavelength.
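A minimal MUSIC sketch under idealized narrowband assumptions is shown below; the covariance is built analytically from one unit-power source plus white noise, and the geometry and angle are made up for illustration.

```python
import numpy as np

M, d_over_lambda = 8, 0.5          # assumed: 8 elements at half-wavelength spacing
theta_true = np.deg2rad(25.0)      # assumed true source direction

def a(theta):
    """Steering vector a(theta) with phase phi = 2*pi*d*sin(theta)/lambda."""
    phi = 2 * np.pi * d_over_lambda * np.sin(theta)
    return np.exp(1j * phi * np.arange(M))

# idealized covariance: one unit-power source plus white noise of power 0.01
R = np.outer(a(theta_true), a(theta_true).conj()) + 0.01 * np.eye(M)

# noise subspace G: eigenvectors of the M - J smallest eigenvalues (J = 1 source)
w, U = np.linalg.eigh(R)           # eigh returns eigenvalues in ascending order
G = U[:, : M - 1]

def p_music(theta):
    """P_MUSIC(theta) = 1 / (a^H G G^H a)."""
    v = a(theta)
    return 1.0 / np.real(v.conj().T @ G @ G.conj().T @ v)

grid = np.deg2rad(np.arange(-90.0, 90.5, 0.5))
est = grid[np.argmax([p_music(t) for t in grid])]
assert abs(np.rad2deg(est) - 25.0) <= 1.0
```

With a sample covariance from finite snapshots the spectrum peak broadens, but the search over the angular grid proceeds identically.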
Step S3: the subband decomposition module decomposes the signals received by the microphone array into 24 subbands, and the signal of each subband i is divided into three branches. Branch one uses a fixed beamformer to enhance the signal from the speaker's direction, giving the speech spectrum S_i^(f)(ω,t) output by branch one, where t is the frame number. Branch two uses the blocking matrix B_{1i} to suppress the signal from the speaker's direction and passes the blocking matrix's output through the adaptive filter to obtain the noise component spectrum N̂_{2,i}(ω,t) output by branch two. Branch three uses the blocking matrix B_{2i} to suppress the signals from the directions of the speaker and all interference sources, giving the spatial incoherent noise spectrum vector N̂_{3,i}(ω,t) output by branch three.
In this embodiment, the decomposition and synthesis of the sub-band signals are implemented with a cosine-modulated filter bank; the analysis and synthesis filters of the filter bank are obtained by modulating a low-pass prototype filter of bandwidth π/(2K), wherein K = 24 is the number of sub-bands. The coefficients of the analysis filter bank used for sub-band decomposition are calculated as follows: taking the low-pass filter with coefficients h0(l) as the prototype filter, the analysis filter coefficients are

hk(l) = 2h0(l) cos((2k + 1)π/(2K) · (l − L/2) + θk)

wherein hk(l) are the coefficients of the kth filter in the analysis filter bank, L is the order of the prototype filter, l = 0, 1, …, L−1, k = 0, 1, …, K−1, and θk = (−1)^k · π/4.
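A minimal NumPy sketch of these analysis-filter coefficients is given below; the windowed-sinc prototype h0 used in the accompanying check is an assumed stand-in, since the embodiment does not fix a particular prototype design:

```python
import numpy as np

def cmfb_analysis(h0, K=24):
    """Analysis filters h_k(l) = 2 h0(l) cos((2k+1)pi/(2K) (l - L/2) + theta_k)."""
    L = len(h0)
    l = np.arange(L)
    H = np.zeros((K, L))
    for k in range(K):
        theta_k = (-1) ** k * np.pi / 4
        H[k] = 2 * h0 * np.cos((2 * k + 1) * np.pi / (2 * K) * (l - L / 2) + theta_k)
    return H
```

Each modulated filter is band-pass with passband centred near (2k + 1)π/(2K), so the K filters tile the spectrum.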
In this embodiment, for the ith sub-band, the weight vector wq,i of branch one is calculated as

wq,i = C1i(C1i^H C1i)^(−1) f

wherein C1i = d(ωi, θ0) is the constraint matrix, d(ωi, θ0) = [1, e^(−jωiτ0,1), …, e^(−jωiτ0,M−1)]^T, M is the number of array elements of the microphone array, ωi is the center frequency of the ith sub-band, θ0 is the incoming wave direction of the speaker, τ0,m, 0 ≤ m ≤ M−1, is the time-delay difference between the speaker's voice arriving at the mth array element and arriving at the 0th array element, and f is the response vector. The ith sub-band signal output by the microphone array receiving module is weighted by wq,i to obtain the speech spectrum Si(f)(ω, t) output by branch one.
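The constrained weight computation wq,i = C1i(C1i^H C1i)^(−1) f can be sketched as below; the helper names steering_vector and fixed_beamformer_weights, and the array/frequency values in the check, are illustrative assumptions rather than parameters of the embodiment:

```python
import numpy as np

def steering_vector(omega, tau):
    """d(omega, theta): relative phase of the source at each array element."""
    return np.exp(-1j * omega * np.asarray(tau))

def fixed_beamformer_weights(C, f):
    """Minimum-norm solution of the constraint C^H w = f: w = C (C^H C)^(-1) f."""
    C = np.atleast_2d(C.T).T              # ensure shape (M, P)
    f = np.atleast_1d(f)
    return C @ np.linalg.solve(C.conj().T @ C, f)
```

For a single distortionless constraint (f = 1) this reduces to the delay-and-sum weights d / M.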
For the ith sub-band, the blocking matrix B1i of branch two is calculated as follows: perform a singular value decomposition of the matrix C1i = d(ωi, θ0),

C1i = U1i [Σ1ir; 0] V1i^H

wherein Σ1ir is an r1 × r1 diagonal matrix and r1 is the rank of C1i. Partition U1i = [U1ir, Ū1i], wherein U1ir consists of the first r1 columns of U1i and Ū1i of the remaining columns; then

B1i = Ū1i^H.

The ith sub-band signal output by the microphone array receiving module is weighted by the blocking matrix B1i and then passed through the adaptive filter wa,i to obtain the noise-component spectrum N1i(ω, t) output by branch two.
For the ith sub-band, the blocking matrix B2i of branch three is calculated as follows: perform a singular value decomposition of the matrix C2i = [d(ωi, θ0), d(ωi, θ1), …, d(ωi, θJ)],

C2i = U2i [Σ2ir; 0] V2i^H

wherein d(ωi, θ0) = [1, e^(−jωiτ0,1), …, e^(−jωiτ0,M−1)]^T, M is the number of array elements of the microphone array, ωi is the center frequency of the ith sub-band, θ0 is the incoming wave direction of the speaker, τ0,m, 0 ≤ m ≤ M−1, is the time-delay difference between the speaker's voice arriving at the mth array element and arriving at the 0th array element, d(ωi, θj) = [1, e^(−jωiτj,1), …, e^(−jωiτj,M−1)]^T, 1 ≤ j ≤ J, J is the number of interference sources, θj is the incoming wave direction of the jth interference source, τj,m, 0 ≤ m ≤ M−1, is the time-delay difference between the jth interference source's sound arriving at the mth array element and arriving at the 0th array element, Σ2ir is an r2 × r2 diagonal matrix, and r2 is the rank of C2i. Partition U2i = [U2ir, Ū2i], wherein U2ir consists of the first r2 columns of U2i and Ū2i of the remaining columns; then

B2i = Ū2i^H.

The ith sub-band signal output by the microphone array receiving module is weighted by the blocking matrix B2i to obtain the spectrum vector N2i(ω, t) of the spatially incoherent noise output by branch three.
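Both blocking matrices B1i and B2i follow the same SVD recipe: the rows of B span the subspace orthogonal to the constraint directions, so that B annihilates every constrained steering vector. A sketch of that construction is below; the function name blocking_matrix and the rank tolerance are assumptions of the sketch:

```python
import numpy as np

def blocking_matrix(C, tol=1e-10):
    """Blocking matrix B with B @ C = 0, via the SVD of the constraint matrix C."""
    U, s, Vh = np.linalg.svd(C)
    r = int(np.sum(s > tol * s[0]))       # numerical rank of C
    U0 = U[:, r:]                          # left singular vectors spanning the null part
    return U0.conj().T                     # shape (M - r, M)
```

With C = d(ωi, θ0) this gives B1i; with C = [d(ωi, θ0), …, d(ωi, θJ)] it gives B2i.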
Step S4, using the branch-two noise-component spectra N1i(ω, t) and the branch-three incoherent-noise spectrum vectors N2i(ω, t) to estimate the noise spectrum N(f)(ω, t) contained in S(f)(ω, t).
In this embodiment, for the ith sub-band, the noise spectrum Ni(f)(ω, t) contained in the speech spectrum Si(f)(ω, t) output by branch one is calculated from the noise-component spectrum N1i(ω, t) output by branch two in the ith sub-band and the spectrum vector N2i(ω, t) of the spatially incoherent noise output by branch three in the ith sub-band, wherein wq,i and wa,i are the weight vectors of the fixed beamformer of branch one and the adaptive filter of branch two respectively, and B1i is the blocking matrix of branch two.
In this embodiment, the Si(f)(ω, t) and Ni(f)(ω, t) of all sub-bands are combined into the full-band S(f)(ω, t) and N(f)(ω, t) by cosine-modulated filtering. The coefficients of the synthesis filter bank used for sub-band synthesis are calculated as follows: using the same low-pass prototype filter h0(l) as the analysis filter bank, the synthesis filter coefficients are

gk(l) = 2h0(l) cos((2k + 1)π/(2K) · (l − L/2) − θk)

wherein gk(l) are the coefficients of the kth filter in the synthesis filter bank, L is the order of the prototype filter, l = 0, 1, …, L−1, k = 0, 1, …, K−1, and θk = (−1)^k · π/4.
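The synthesis-filter coefficients differ from the analysis ones only in the sign of the phase offset θk; a matching NumPy sketch (with the same assumed windowed-sinc prototype as used earlier, since the embodiment fixes no particular prototype) is:

```python
import numpy as np

def cmfb_synthesis(h0, K=24):
    """Synthesis filters g_k(l) = 2 h0(l) cos((2k+1)pi/(2K) (l - L/2) - theta_k)."""
    L = len(h0)
    l = np.arange(L)
    G = np.zeros((K, L))
    for k in range(K):
        theta_k = (-1) ** k * np.pi / 4
        G[k] = 2 * h0 * np.cos((2 * k + 1) * np.pi / (2 * K) * (l - L / 2) - theta_k)
    return G
```

Like its analysis counterpart, synthesis filter k is band-pass around (2k + 1)π/(2K), so summing the filtered sub-band signals restores the full band.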
Step S5, inputting S(f)(ω, t) and N(f)(ω, t) into the deep neural network trained in step S1 to obtain the enhanced speech.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A microphone array speech enhancement method based on a deep neural network is characterized in that the following steps are adopted to enhance an input speech signal:
S1, training a deep neural network for mapping noisy speech and noise to clean speech, using a clean speech library and a noise library;
S2, estimating the incoming wave direction θ0 of the speaker, the number J of interference sources, and the incoming wave directions θj, 1 ≤ j ≤ J, of the interference sources with the microphone array;
S3, dividing the signal received by the microphone array into three branches, wherein branch one adopts a fixed beamformer to enhance the signal in the speaker direction, obtaining the branch-one speech spectrum S(f)(ω, t), where t is the frame number; branch two uses the blocking matrix B1 to suppress the signal from the speaker direction and passes the output of the blocking matrix through an adaptive filter, obtaining the branch-two noise-component spectrum N1(ω, t); branch three adopts the blocking matrix B2 to suppress the signals from the speaker direction and all interference-source directions, obtaining the branch-three spectrum vector N2(ω, t) of spatially incoherent noise;
in step S3, for the ith sub-band, i = 1, 2, …, 24, the weight vector wq,i of branch one is calculated as

wq,i = C1i(C1i^H C1i)^(−1) f

wherein C1i = d(ωi, θ0) is the constraint matrix, d(ωi, θ0) = [1, e^(−jωiτ0,1), …, e^(−jωiτ0,M−1)]^T, M is the number of array elements of the microphone array, ωi is the center frequency of the ith sub-band, θ0 is the incoming wave direction of the speaker, τ0,m, 0 ≤ m ≤ M−1, is the time-delay difference between the speaker's voice arriving at the mth array element and arriving at the 0th array element, and f is the response vector;
in step S3, for the ith sub-band, i = 1, 2, …, 24, the blocking matrix B2i of branch three is calculated as follows:
perform a singular value decomposition of the matrix C2i = [d(ωi, θ0), d(ωi, θ1), …, d(ωi, θJ)],

C2i = U2i [Σ2ir; 0] V2i^H

wherein d(ωi, θ0) = [1, e^(−jωiτ0,1), …, e^(−jωiτ0,M−1)]^T, M is the number of array elements of the microphone array, ωi is the center frequency of the ith sub-band, θ0 is the incoming wave direction of the speaker, τ0,m, 0 ≤ m ≤ M−1, is the time-delay difference between the speaker's voice arriving at the mth array element and arriving at the 0th array element, d(ωi, θj) = [1, e^(−jωiτj,1), …, e^(−jωiτj,M−1)]^T, 1 ≤ j ≤ J, J is the number of interference sources, θj is the incoming wave direction of the jth interference source, τj,m, 0 ≤ m ≤ M−1, is the time-delay difference between the jth interference source's sound arriving at the mth array element and arriving at the 0th array element, Σ2ir is an r2 × r2 diagonal matrix, and r2 is the rank of C2i; partition U2i = [U2ir, Ū2i], wherein U2ir consists of the first r2 columns of U2i and Ū2i of the remaining columns; then B2i = Ū2i^H;
S4, using the branch-two noise-component spectrum N1(ω, t) and the branch-three incoherent-noise spectrum vector N2(ω, t) to estimate the noise spectrum N(f)(ω, t) contained in S(f)(ω, t);
S5, inputting S(f)(ω, t) and N(f)(ω, t) into the deep neural network trained in step S1 to obtain the enhanced speech.
2. The microphone array speech enhancement method of claim 1, wherein the deep neural network training of step S1 comprises the steps of:
S1.1, superposing speech from the clean speech library and noise from the noise library to obtain noisy speech; taking the short-time spectrum of the noisy speech and the short-time spectrum of the corresponding noise as input, and the short-time spectrum of the corresponding clean speech as target output, a training data set is obtained;
S1.2, setting the structural parameters of the deep neural network and adopting the following cost function:

Φ = (1/T) Σ_{t=1}^{T} ‖f(Y(ω, t)) − X(ω, t)‖²

wherein X(ω, t) represents the short-time spectrum of the tth frame of clean speech, Y(ω, t) represents the input sample formed from the short-time spectrum S(f)(ω, t) of the tth frame of noisy speech and the noise short-time spectrum N(f)(ω, t), f(Y(ω, t)) represents the output of the neural network, and T is the number of speech frames used for training;
S1.3, training the deep neural network until the change of the cost function Φ is smaller than a preset value.
3. The microphone array speech enhancement method of claim 1, wherein in steps S3 and S4 the input signal is first decomposed into K sub-bands, the signal of each sub-band is processed by the three branches, and the full-band S(f)(ω, t) and N(f)(ω, t) are then synthesized.
4. microphone array according to claim 1The tone enhancement method is characterized in that in step S3, for the ith sub-band, i is 1,2, …,24, the blocking matrix B of branch two1iThe following method is adopted for calculation:
will matrix C1i=d(ωi0) Performing singular value decomposition
Figure FDA0003147951450000032
Wherein, C1i=d(ωi0) To constrain the matrix, sigma1irIs r1×r1Diagonal matrix of r1Is C1iRank of (1), order
Figure FDA0003147951450000033
Wherein U is1irIs U1iFront r of1The rows of the image data are, in turn,
Figure FDA0003147951450000034
is U1iThe remaining rows of then
Figure FDA0003147951450000035
5. The microphone array speech enhancement method of claim 1, wherein in step S4, for the ith sub-band, i = 1, 2, …, 24, the noise spectrum Ni(f)(ω, t) contained in the speech spectrum Si(f)(ω, t) output by branch one is calculated from the noise-component spectrum N1i(ω, t) output by branch two in the ith sub-band and the spectrum vector N2i(ω, t) of the spatially incoherent noise output by branch three in the ith sub-band, wherein wq,i and wa,i are the weight vectors of the fixed beamformer of branch one and the adaptive filter of branch two respectively, and B1i is the blocking matrix of branch two.
6. An implementation device of the deep-neural-network-based microphone array speech enhancement method, characterized by comprising a microphone array receiving module, a sub-band decomposition module, a sub-band synthesis module, 24 improved sub-band GSC modules, and a deep neural network, wherein GSC is short for generalized sidelobe canceller; the microphone array receiving module and the sub-band decomposition module are connected in sequence and are respectively used for receiving multi-channel audio signals and dividing sub-bands; the sub-band synthesis module and the deep neural network are connected in sequence and are respectively used for synthesizing full-band signals and for neural-network filtering; the 24 improved sub-band GSC modules are respectively connected with the sub-band decomposition module and the sub-band synthesis module and are used for GSC filtering of the sub-band signals;
the microphone array receiving module adopts a linear array structure and comprises 8 microphones which are uniformly distributed on a straight line, and each array element is isotropic; the sub-band decomposition module decomposes the audio signals collected by each microphone element into 24 sub-bands, and the sub-bands are respectively sent to the corresponding improved sub-band GSC module for processing; the sub-band synthesis module synthesizes the output of the 24 improved sub-band GSC modules into full-band signals and sends the full-band signals to a deep neural network for enhancement;
wherein the ith improved sub-band GSC module, i = 1, 2, …, 24, comprises 3 branches: branch one adopts a fixed beamformer wq,i to enhance the signal in the speaker direction; branch two adopts a blocking matrix B1i to suppress the signal from the speaker direction and passes the output of the blocking matrix through an adaptive filter wa,i to obtain the noise-component spectrum N1i(ω, t); branch three adopts a blocking matrix B2i to suppress the signals from the speaker and all interference sources to obtain the spectrum vector N2i(ω, t) of spatially incoherent noise.
CN201910677433.3A 2019-07-25 2019-07-25 Microphone array speech enhancement method and implementation device Expired - Fee Related CN110517701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910677433.3A CN110517701B (en) 2019-07-25 2019-07-25 Microphone array speech enhancement method and implementation device

Publications (2)

Publication Number Publication Date
CN110517701A CN110517701A (en) 2019-11-29
CN110517701B true CN110517701B (en) 2021-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210921