Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It is noted that when an element is referred to as being "mounted to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "secured to" another element, it can be directly secured to the other element or intervening elements may also be present.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, fig. 1 shows a data flow diagram of a dual-branch network-based mono speech enhancement method; the diagram also reflects the flow of the method itself.
As shown in fig. 1, the mono speech enhancement method based on the dual-branch network includes the following steps:
Step one, obtaining noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
The noisy speech may be represented as Voice(t, f), comprising a clean speech component and a noise component, each a function of (t, f), where t denotes the time-frame index and f denotes the frequency index.
Typically, noisy speech is sampled at 16kHz and framed using a 20ms hamming window with 50% overlap between frames.
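The front-end described above can be sketched as follows. This is an illustrative numpy sketch only, not part of the claimed network; the `stft` helper and all shapes are assumptions consistent with the stated 16 kHz sampling rate, 20 ms Hamming window, and 50% overlap (a 320-sample window yields a 161-bin one-sided spectrum).

```python
import numpy as np

# Sketch of the analysis front-end: 16 kHz sampling, 20 ms Hamming
# window (320 samples), 50% overlap (160-sample hop).
sr = 16000
win_len = int(0.020 * sr)      # 320 samples per frame
hop = win_len // 2             # 50% overlap -> 160-sample hop
window = np.hamming(win_len)

def stft(signal):
    # Frame the signal, window each frame, and take the one-sided FFT.
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)   # 161 frequency bins

noisy = np.random.randn(sr)               # 1 s of synthetic "noisy speech"
Z = stft(noisy)                           # original speech spectrum
X0 = np.stack([Z.real, Z.imag], axis=-1)  # original complex spectrum (RI)
Y0 = np.abs(Z)                            # original amplitude spectrum
```

The decoupling into X0 and Y0 in the last two lines corresponds to the real/imaginary and magnitude views of Z used by the two branches.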
And step two, inputting X0 and Y0 into a trained voice enhancement network (BSDB-Net for short) for processing to obtain an enhanced complex spectrum Z'.
The speech enhancement network includes 2 band dividing units, 2 feature extracting units, 1 sequence modeling unit, 1 complex spectrum feature modeling unit, 1 amplitude spectrum feature modeling unit, and 1 band synthesizing unit.
As shown in fig. 1:
The 1st frequency band division part is used for carrying out frequency band division processing on X0 to obtain an intermediate complex spectrum characteristic A;
the 2nd frequency band division part is used for carrying out frequency band division processing on Y0 to obtain an intermediate amplitude spectrum characteristic E;
the 1 st feature extraction part is used for extracting an intermediate feature B based on A, E;
The 2 nd feature extraction part is used for extracting an intermediate feature F based on A, E;
the sequence modeling part is constructed based on Mamba and is used for performing sequence modeling on B, F to obtain an intermediate feature C and an intermediate feature H;
the complex spectrum feature modeling part is used for carrying out feature modeling on the C to obtain a final complex spectrum feature D;
The amplitude spectrum feature modeling part is used for carrying out feature modeling on the H to obtain a final amplitude spectrum feature L;
the band synthesis unit is configured to perform synthesis processing on D, L to obtain Z'.
Wherein:
the 1st frequency band dividing part, the 1st feature extracting part, the sequence modeling part and the complex spectrum feature modeling part form a CEN branch, and the CEN branch focuses on processing X0;
the 2nd band dividing unit, the 2nd feature extracting unit, the sequence modeling unit, and the amplitude spectrum feature modeling unit constitute a MEN branch, and the MEN branch focuses on processing Y0.
The following description of the various parts is provided:
(I) The 2 band division parts aim to reduce the dimension of X0 and Y0, respectively, and thereby the complexity of the data.
The band division part may employ the design of fig. 2, which includes 1 spectrum segmentation layer, 1 normalization layer, 1 fully-connected (FC) layer, and 1 dimension merging layer.
101, In the 1 st band division unit:
the spectrum segmentation layer is used for segmenting X0 according to a preset specification to obtain M original complex bands X1;
the normalization layer is used for normalizing X1 to obtain M normalized complex spectrum features X2;
the FC layer is used for unifying the frequency dimension of X2 to obtain M unified complex spectrum features X3;
the dimension merging layer is used for merging X3 in the frequency dimension to obtain A;
wherein X1 = [x1_1, x1_2, …, x1_M], and x1_m represents the m-th original complex band in X1, m ∈ [1, M];
X2 = [x2_1, x2_2, …, x2_M], and x2_m represents the m-th normalized complex spectrum feature in X2;
X3 = [x3_1, x3_2, …, x3_M], and x3_m represents the m-th unified complex spectrum feature in X3.
That is, the processing procedure of the 1 st band division section can be formulated as:
X1 = Split(X0);
X2 = LN(X1);
X3 = FC(X2);
A = Concat(X3);
where Split(·), LN(·), FC(·), and Concat(·) represent the processing of the spectrum segmentation layer, the normalization layer, the FC layer, and the dimension merging layer, respectively.
102, In the 2 nd band dividing section:
the spectrum segmentation layer is used for segmenting Y0 according to a preset specification to obtain M original amplitude bands Y1;
the normalization layer is used for normalizing Y1 to obtain M normalized amplitude spectrum features Y2;
the FC layer is used for unifying the frequency dimension of Y2 to obtain M unified amplitude spectrum features Y3;
the dimension merging layer is used for merging Y3 in the frequency dimension to obtain E;
wherein Y1 = [y1_1, y1_2, …, y1_M], and y1_m represents the m-th original amplitude band in Y1;
Y2 = [y2_1, y2_2, …, y2_M], and y2_m represents the m-th normalized amplitude spectrum feature in Y2;
Y3 = [y3_1, y3_2, …, y3_M], and y3_m represents the m-th unified amplitude spectrum feature in Y3.
That is, the processing procedure of the 2 nd band dividing section can be formulated as:
Y1 = Split(Y0);
Y2 = LN(Y1);
Y3 = FC(Y2);
E = Concat(Y3);
where Split(·), LN(·), FC(·), and Concat(·) represent the processing of the spectrum segmentation layer, the normalization layer, the FC layer, and the dimension merging layer, respectively.
In this Embodiment 1, M is 20 and the preset specification is [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17]. Through the processing of the 1st band division part, the band dimension of the complex spectrum is reduced from 161 to 20; likewise, the 2nd band division part reduces the band dimension of the amplitude spectrum from 161 to 20. The characteristics of different frequency bands can thus be captured better while the data complexity is greatly reduced.
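The band division can be sketched as follows. This is an illustrative numpy sketch; the per-band FC projection width N and the random weights are assumptions, but the band widths are the preset specification above (note they sum to the 161 frequency bins).

```python
import numpy as np

# Sketch of band division: 161 bins split into M = 20 sub-bands per the
# preset specification, each band projected by an (illustrative) FC
# layer to a common width, then stacked -> band dimension 161 -> 20.
spec = [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17]
assert sum(spec) == 161        # the band widths cover every frequency bin

T = 50                                     # time frames (arbitrary)
Y0 = np.random.rand(T, 161)                # amplitude spectrum
edges = np.cumsum([0] + spec)
bands = [Y0[:, edges[m]:edges[m+1]] for m in range(len(spec))]   # Split
N = 16                                     # assumed common feature width
feats = [b @ np.random.randn(b.shape[1], N) for b in bands]      # FC
E = np.stack(feats, axis=1)                # Concat -> (T, M, N)
```

After stacking, the frequency axis of E has length M = 20 rather than 161, which is the dimension reduction described above.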
(II) The 2 feature extraction parts aim to further extract the features of the complex spectrum and the amplitude spectrum, respectively, for better modeling.
The feature extraction part may employ the design of fig. 3, which includes 1 two-dimensional convolution layer, 1 normalization layer, 1 PRelu activation function, and 1 information interaction layer.
201, In the 1 st feature extraction section:
the two-dimensional convolution layer is used for extracting features of the A to obtain complex frequency spectrum information a1;
The normalization layer is used for normalizing the a1 to obtain complex frequency spectrum information a2;
PRelu the activation function is used to introduce nonlinear features into a2 to obtain complex spectral information a3;
The information interaction layer is used for carrying out interaction processing on the a3 and the E to obtain the B, so that information loss under double branches can be prevented.
That is, the processing procedure of the 1 st feature extraction section can be formulated as:
a3 = PRelu(LN(Conv2d(A)));
B = Inter(a3, E);
where Conv2d(·) represents the processing of the two-dimensional convolution layer, LN(·) the normalization layer, PRelu(·) the PRelu activation function, and Inter(·) the information interaction layer.
202, In the 2 nd feature extraction section:
the two-dimensional convolution layer is used for extracting features of the E to obtain amplitude spectrum information E1;
the normalization layer is used for normalizing the e1 to obtain amplitude spectrum information e2;
PRelu the activation function is used to introduce a nonlinear feature into e2 to obtain amplitude spectrum information e3;
the information interaction layer is used for carrying out interaction processing on e3 and A to obtain F, so that information loss under double branches can be prevented.
That is, the processing procedure of the 2 nd feature extraction section can be formulated as:
e3 = PRelu(LN(Conv2d(E)));
F = Inter(e3, A);
where Conv2d(·) represents the processing of the two-dimensional convolution layer, LN(·) the normalization layer, PRelu(·) the PRelu activation function, and Inter(·) the information interaction layer.
(III) The sequence modeling part aims to model the features in the time and frequency dimensions, respectively, with linear complexity.
The sequence modeling portion may employ the design of fig. 4, which includes N Mamba model portions that are identical in structure.
Taking the nth Mamba model part as an example, the model part comprises 1 information interaction layer, 2 superimposed layers and 2 FT-Mamba parts.
In the nth Mamba model section:
The information interaction layer is used for carrying out interaction processing on the intermediate spectrum characteristics Wn-1 and wn-1 to obtain an intermediate spectrum characteristic Un;
the 1st superimposing layer is used for superimposing Un and Wn-1 to obtain an intermediate spectrum characteristic U_Bn;
the 2nd superimposing layer is used for superimposing Un and wn-1 to obtain an intermediate spectrum characteristic U_Fn;
the 1st FT_Mamba part is used to perform sequence modeling on U_Bn based on Mamba to obtain an intermediate spectrum characteristic Wn;
the 2nd FT_Mamba part is used to perform sequence modeling on U_Fn based on Mamba to obtain an intermediate spectrum characteristic wn, n ∈ [1, N].
Wherein B is taken as the intermediate spectrum characteristic W0 and F is taken as the intermediate spectrum characteristic w0;
the intermediate spectrum characteristic WN is C and the intermediate spectrum characteristic wN is H.
That is, the process of the nth Mamba model portion can be formulated as:
Un = Inter(Wn-1, wn-1);
Wn = FT_Mamba(Un + Wn-1);
wn = FT_Mamba(Un + wn-1);
where Inter(·) represents the processing of the information interaction layer, and FT_Mamba(·) represents the processing of the FT_Mamba part.
Note that N is suggested to be 6 to achieve a balance between network performance and complexity.
301, The information interaction layer appears in both of the 2 feature extraction parts and in the sequence modeling part. Its structure is additionally described here:
The information interaction layer may employ the design of fig. 5, which includes 1 joint layer, 1 sub two-dimensional convolution layer, 1 sub normalization layer, and 1 sub PRelu activation function. The information interaction layer is dual-input and single-output.
In either information interaction layer:
The joint layer is used for connecting the double inputs of the information interaction layer together to obtain intermediate data G1;
The sub two-dimensional convolution layer is used for extracting the characteristics of the G1 to obtain intermediate data G2;
The sub normalization layer is used for carrying out normalization processing on the G2 to obtain intermediate data G3;
The sub PRelu activation function is used to introduce a nonlinear feature into G3 to get a single output of the information interaction layer.
That is, the above procedure can be formulated as:
G3 = LN(Conv2d(Cat(Input1, Input2)));
Output = PRelu(G3);
where Input1 and Input2 represent the dual inputs of the information interaction layer, Output represents its single output, Cat(·) represents the processing of the joint layer, Conv2d(·) the sub two-dimensional convolution layer, LN(·) the sub normalization layer, and PRelu(·) the sub PRelu activation function.
Then, the double inputs of the information interaction layer in the 1 st feature extraction part are a3 and E, and the single output is B;
the double inputs of the information interaction layer in the 2 nd feature extraction part are e3 and A, and the single output is F;
the dual input of the information interaction layer in the nth Mamba model portion is Wn-1 and wn-1, and the single output is Un.
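The dual-input, single-output interaction described above can be sketched as follows. This is an illustrative numpy sketch: the Conv2d is simplified to a 1×1 channel projection, the normalization is a plain standardization, and all shapes and weights are assumptions, not the original network's.

```python
import numpy as np

# Sketch of the information interaction layer: concatenate the two
# inputs on the channel axis (joint layer), apply a 1x1 "convolution"
# (channel projection), normalize, then PReLU -> single output.
def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def interaction(in1, in2, w):
    g1 = np.concatenate([in1, in2], axis=-1)          # joint layer: Cat
    g2 = g1 @ w                                       # 1x1 Conv2d over channels
    g3 = (g2 - g2.mean(-1, keepdims=True)) / (g2.std(-1, keepdims=True) + 1e-5)
    return prelu(g3)                                  # single output

T, F_bands, C = 10, 20, 8                 # assumed time/band/channel sizes
a3 = np.random.randn(T, F_bands, C)       # e.g. complex-branch features
E  = np.random.randn(T, F_bands, C)       # e.g. amplitude-branch features
W  = np.random.randn(2 * C, C)            # projection back to C channels
B  = interaction(a3, E, W)
```

Because the projection maps 2C channels back to C, the output keeps the shape of either input, which is what lets each branch absorb the other branch's information without growing.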
302, The FT_Mamba part comprises 5 sub-parts.
The 5 sub-parts may be arranged in two ways:
the first is that the 1st sub-part is a normalization layer, the 2nd sub-part is an F_Mamba part, the 3rd sub-part is a superimposing layer, the 4th sub-part is a T_Mamba part, and the 5th sub-part is a superimposing layer;
the second is that the 1st sub-part is a normalization layer, the 2nd sub-part is a T_Mamba part, the 3rd sub-part is a superimposing layer, the 4th sub-part is an F_Mamba part, and the 5th sub-part is a superimposing layer.
In any FT_Mamba part:
the 1st sub-part is used for carrying out normalization processing on the input of the FT_Mamba part to obtain intermediate data J1;
the 2nd sub-part is used for performing sequence modeling on J1 in the frequency dimension to obtain intermediate data J2;
the 3rd sub-part is configured to superimpose the input of the FT_Mamba part with J2 to obtain intermediate data J3;
the 4th sub-part is used for performing sequence modeling on J3 in the time dimension to obtain intermediate data J4;
the 5th sub-part is for superimposing J3 and J4 to obtain the output of the FT_Mamba part.
Referring to fig. 6, which shows the structure of the first arrangement of the FT_Mamba part, the processing procedure can be formulated as follows:
J3 = In0 + F_Mamba(LN(In0));
Out0 = J3 + T_Mamba(J3);
where In0 denotes the input of the FT_Mamba part, Out0 denotes its output, LN(·) denotes the processing of the normalization layer, F_Mamba(·) the F_Mamba part, and T_Mamba(·) the T_Mamba part.
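The normalize/model/superimpose wiring above can be sketched as follows. The real F_Mamba and T_Mamba parts are Mamba state-space models; here they are stand-in linear maps (an assumption) so that only the residual ordering of the five sub-parts is shown.

```python
import numpy as np

# Sketch of the FT_Mamba residual ordering (first arrangement):
# LN -> frequency-dim model -> residual add -> time-dim model -> residual add.
def ln(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def ft_mamba(in0, f_model, t_model):
    j1 = ln(in0)                 # 1st sub-part: normalization
    j2 = f_model(j1)             # 2nd sub-part: frequency-dim modeling
    j3 = in0 + j2                # 3rd sub-part: superimpose with input
    j4 = t_model(j3)             # 4th sub-part: time-dim modeling
    return j3 + j4               # 5th sub-part: superimpose J3 and J4

x = np.random.randn(10, 20, 8)                     # assumed feature tensor
w_f, w_t = np.random.randn(8, 8), np.random.randn(8, 8)
out0 = ft_mamba(x, lambda v: v @ w_f, lambda v: v @ w_t)
```

The second arrangement simply swaps the roles of the two stand-in models.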
The following describes the f_ Mamba and t_ Mamba, respectively:
3021, The F_Mamba part may employ the design of fig. 7, which includes 2 Transpose layers, 1 Unfold layer, 1 F_Mamba layer, and 1 one-dimensional transposed convolution layer.
In any of the F_Mamba parts:
the 1st Transpose layer is used for tensor-converting the input of the F_Mamba part to obtain intermediate data R1;
the Unfold layer is used for upsampling R1 to obtain intermediate data R2;
the F_Mamba layer is used for carrying out sequence modeling on R2 through the bidirectional Mamba to obtain intermediate data R3;
the one-dimensional transposed convolution layer is used for carrying out one-dimensional transposed convolution on R3 to obtain intermediate data R4;
the 2nd Transpose layer is used to tensor-convert R4 to obtain the output of the F_Mamba part.
That is, the processing of the F_Mamba part can be formulated as:
R4 = Deconv(F_Ma(Unfold(Tran(In1))));
Out1 = Tran(R4);
where In1 denotes the input of the F_Mamba part, Out1 denotes its output, Tran(·) denotes the processing of the Transpose layer, Unfold(·) the Unfold layer, F_Ma(·) the F_Mamba layer, and Deconv(·) the one-dimensional transposed convolution layer.
Referring to fig. 7, the F_Mamba layer includes 1 flip layer, 2 Mamba layers, 2 Norm layers, 2 superimposing layers, 1 Constant layer, and 1 FC layer.
In any of the F_Mamba layers:
the 1st Mamba layer is used to perform sequence modeling on the input of the F_Mamba layer to obtain intermediate data P1;
the 1st Norm layer is used for carrying out normalization on P1 to obtain intermediate data P2;
the 1st superimposing layer is used for superimposing the input of the F_Mamba layer with P2 to obtain intermediate data P3;
the flip layer is used for reversing the input of the F_Mamba layer to obtain intermediate data P4;
the 2nd Mamba layer is used to model P4 to obtain intermediate data P5;
the 2nd Norm layer is used for normalizing P5 to obtain intermediate data P6;
the 2nd superimposing layer is used for superimposing P4 and P6 to obtain intermediate data P7;
the Constant layer is used for performing tensor adjustment on P3 and P7 to obtain intermediate data P8;
the FC layer is used to linearly map P8 to obtain the output of the F_Mamba layer.
That is, the processing of the F_Mamba layer can be formulated as:
P3 = in1 + LN(Ma(in1));
P7 = Flip(in1) + LN(Ma(Flip(in1)));
out1 = FC(Constant(P3, P7));
where in1 represents the input of the F_Mamba layer, out1 represents its output, Ma(·) represents the processing of the Mamba layer, LN(·) the Norm layer, Flip(·) the flip layer, Constant(·) the Constant layer, and FC(·) the FC layer.
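The bidirectional structure of the F_Mamba layer can be sketched as follows. This is an illustrative numpy sketch: the Mamba layers are stand-in linear maps, and the Constant layer's tensor adjustment is assumed here to be a concatenation of the forward and (re-aligned) backward branches; both are labeled assumptions, not the original implementation.

```python
import numpy as np

# Sketch of the bidirectional wiring: a forward branch and a flipped
# (reversed-sequence) branch, each with model + Norm + residual, then
# combined and linearly mapped back to the channel width.
def ln(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

rng = np.random.default_rng(1)
C = 8
x = rng.standard_normal((10, 16, C))          # (batch, sequence, channels)
w_ma = rng.standard_normal((C, C))            # stand-in "Mamba" weights
w_fc = rng.standard_normal((2 * C, C))        # FC layer weights

p3 = x + ln(x @ w_ma)                         # forward branch + residual
x_rev = x[:, ::-1, :]                         # flip layer: reverse sequence
p7 = x_rev + ln(x_rev @ w_ma)                 # backward branch + residual
p8 = np.concatenate([p3, p7[:, ::-1, :]], -1) # Constant layer (assumed concat)
out1 = p8 @ w_fc                              # FC layer: linear map
```

Flipping the backward branch's output again before combining keeps the two branches aligned along the sequence axis.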
3022, The T_Mamba part may employ the design of fig. 8, which includes 2 Transpose layers, 1 one-dimensional padding layer, 1 Unfold layer, 1 T_Mamba layer, and 1 one-dimensional transposed convolution layer.
In any of the T_Mamba parts:
the 1st Transpose layer is used for tensor-converting the input of the T_Mamba part to obtain intermediate data T1;
the one-dimensional padding layer is used for padding T1 with data to obtain intermediate data T2;
the Unfold layer is used for extracting abstract features of T2 to obtain intermediate data T3;
the T_Mamba layer is used for performing sequence modeling on T3 through the unidirectional Mamba to obtain intermediate data T4;
the one-dimensional transposed convolution layer is used for carrying out one-dimensional transposed convolution on T4 to obtain intermediate data T5;
the 2nd Transpose layer is used to tensor-convert T5 to obtain the output of the T_Mamba part.
That is, the process of the T_Mamba part can be formulated as:
T5 = Deconv(T_Ma(Unfold(Pad(Tran(In2)))));
Out2 = Tran(T5);
where In2 denotes the input of the T_Mamba part, Out2 denotes its output, Tran(·) denotes the processing of the Transpose layer, Pad(·) the one-dimensional padding layer, Unfold(·) the Unfold layer, T_Ma(·) the T_Mamba layer, and Deconv(·) the one-dimensional transposed convolution layer.
Wherein, referring to fig. 8, the T_Mamba layer includes 1 Mamba layer, 1 Norm layer, 1 FC layer, and 1 superimposing layer.
In any of the T_Mamba layers:
the Mamba layer is used for modeling the input of the T_Mamba layer to obtain intermediate data Q1;
the Norm layer is used for carrying out normalization processing on Q1 to obtain intermediate data Q2;
the FC layer is used for carrying out linear mapping on Q2 to obtain intermediate data Q3;
the superimposing layer is used for superimposing the input of the T_Mamba layer with Q3 to obtain the output of the T_Mamba layer.
That is, the process of the T_Mamba layer can be formulated as:
out2 = in2 + FC(LN(Ma(in2)));
where in2 denotes the input of the T_Mamba layer, out2 denotes its output, and Ma(·), LN(·), and FC(·) denote the processing of the Mamba layer, the Norm layer, and the FC layer, respectively.
(IV) The complex spectrum feature modeling part aims at restoring the complex spectrum features to the original dimension for subsequent synthesis.
The complex spectrum feature modeling part may employ the design of fig. 9, which includes 1 F1E-Real layer, 1 F2E-Imag layer, and 1 superimposing layer.
In the complex spectrum feature modeling part:
the F1E-Real layer is used for carrying out feature modeling on C to obtain a real part intermediate feature c1;
the F2E-Imag layer is used for carrying out feature modeling on C to obtain an imaginary part intermediate feature c2;
the superimposing layer is used to superimpose c1 and c2 to obtain D.
That is, the processing procedure of the complex spectral feature modeling section can be formulated as:
D = F1E_Real(C) + F2E_Imag(C);
where F1E_Real(·) represents the processing of the F1E-Real layer, and F2E_Imag(·) represents the processing of the F2E-Imag layer.
401, The F1E-Real layer may employ the design of fig. 10, which includes 3 two-dimensional convolution layers, 1 Norm layer, 2 PRelu activation functions, and 1 Sigmoid activation function.
In the F1E-Real layer:
the 1st two-dimensional convolution layer is used for upsampling C to obtain intermediate data u1;
the Norm layer is used for carrying out normalization processing on u1 to obtain intermediate data u2;
the 1st PRelu activation function is used to introduce a nonlinear feature into u2 to obtain intermediate data u3;
the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on u3 to obtain intermediate data u4;
the Sigmoid activation function is used for introducing a nonlinear feature into u4 to obtain intermediate data u5;
the 3rd two-dimensional convolution layer is used for extracting the features of u5 to obtain intermediate data u6;
the 2nd PRelu activation function is used to introduce a nonlinear feature into u6 to obtain c1.
That is, the processing of the F1E-Real layer can be formulated as:
u5 = Sigmoid(Conv2d(PRelu(LN(Conv2d(C)))));
c1 = PRelu(Conv2d(u5));
where PRelu(·) represents the processing of the PRelu activation function, Conv2d(·) the two-dimensional convolution layer, LN(·) the Norm layer, and Sigmoid(·) the Sigmoid activation function.
402, The F2E-Imag layer may employ the design of fig. 10, which includes 4 two-dimensional convolution layers, 1 Norm layer, 2 PRelu activation functions, 1 Sigmoid activation function, 1 Tanh activation function, and 1 superimposing layer.
In the F2E-Imag layer:
the 1st two-dimensional convolution layer is used for upsampling C to obtain intermediate data U1;
the Norm layer is used for carrying out normalization processing on U1 to obtain intermediate data U2;
the 1st PRelu activation function is used to introduce a nonlinear feature into U2 to obtain intermediate data U3;
the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on U3 to obtain intermediate data U4;
the Sigmoid activation function is used for introducing a nonlinear feature into U4 to obtain intermediate data U5;
the 3rd two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on U3 to obtain intermediate data U6;
the Tanh activation function is used for introducing a nonlinear feature into U6 to obtain intermediate data U7;
the superimposing layer is used for superimposing U5 and U7 to obtain intermediate data U8;
the 4th two-dimensional convolution layer is used for extracting the features of U8 to obtain intermediate data U9;
the 2nd PRelu activation function is used to introduce a nonlinear feature into U9 to obtain c2.
That is, the processing of the F2E-Imag layer can be formulated as:
U3 = PRelu(LN(Conv2d(C)));
U5 = Sigmoid(Conv2d(U3));
U7 = Tanh(Conv2d(U3));
c2 = PRelu(Conv2d(U5 + U7));
where PRelu(·) represents the processing of the PRelu activation function, Conv2d(·) the two-dimensional convolution layer, LN(·) the Norm layer, Sigmoid(·) the Sigmoid activation function, and Tanh(·) the Tanh activation function.
(V) The amplitude spectrum feature modeling part aims to restore the amplitude spectrum features to the original dimension for subsequent synthesis.
The amplitude spectrum feature modeling part can adopt the design of fig. 11, and comprises 1F 2E-Mask layer and 1 matrix multiplication layer.
In the amplitude spectrum feature modeling section:
The F2E-Mask layer is used for modeling H to obtain an amplitude spectrum feature mask H';
the matrix multiplication layer is used for carrying out matrix multiplication on H' and E to obtain L.
That is, the processing procedure of the amplitude spectrum feature modeling section can be formulated as:
L = F2E_Mask(H) × E;
where F2E_Mask(·) represents the processing of the F2E-Mask layer, and × denotes the matrix multiplication of the mask H' with E.
The F2E-Mask layer may employ the design of FIG. 12, which includes 4 two-dimensional convolutional layers, 1 Norm layer, 2 PRelu activation functions, 1 Sigmoid activation function, 1 Tanh activation function, 1 overlay layer.
In the F2E-Mask layer:
the 1st two-dimensional convolution layer is used for upsampling H to obtain intermediate data h1;
the Norm layer is used for carrying out normalization processing on h1 to obtain intermediate data h2;
the 1st PRelu activation function is used to introduce a nonlinear feature into h2 to obtain intermediate data h3;
the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on h3 to obtain intermediate data h4;
the Sigmoid activation function is used for introducing a nonlinear feature into h4 to obtain intermediate data h5;
the 3rd two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on h3 to obtain intermediate data h6;
the Tanh activation function is used for introducing a nonlinear feature into h6 to obtain intermediate data h7;
the superimposing layer is used for superimposing h5 and h7 to obtain intermediate data h8;
the 4th two-dimensional convolution layer is used for carrying out feature extraction on h8 to obtain intermediate data h9;
the 2nd PRelu activation function is used to introduce a nonlinear feature into h9 to obtain H'.
That is, the processing of the F2E-Mask layer can be formulated as:
h3 = PRelu(LN(Conv2d(H)));
h5 = Sigmoid(Conv2d(h3));
h7 = Tanh(Conv2d(h3));
H' = PRelu(Conv2d(h5 + h7));
where PRelu(·) represents the processing of the PRelu activation function, Conv2d(·) the two-dimensional convolution layer, LN(·) the Norm layer, Sigmoid(·) the Sigmoid activation function, and Tanh(·) the Tanh activation function.
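The two-branch gating of the F2E-Mask layer can be sketched as follows. This is an illustrative numpy sketch: the convolutions are stand-in 1×1 channel projections and all shapes/weights are assumptions; only the Sigmoid-branch/Tanh-branch/superimpose/project wiring follows the description above.

```python
import numpy as np

# Sketch of F2E-Mask: shared stem (conv -> Norm -> PReLU), then a
# Sigmoid branch and a Tanh branch from the same h3, superimposed and
# passed through a final conv + PReLU to produce the mask H'.
def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

def ln(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

rng = np.random.default_rng(0)
C = 8
H = rng.standard_normal((10, 20, C))                  # amplitude-branch features
w1, w2, w3, w4 = (rng.standard_normal((C, C)) for _ in range(4))

h3 = prelu(ln(H @ w1))                     # shared stem: Conv2d -> LN -> PReLU
h5 = 1.0 / (1.0 + np.exp(-(h3 @ w2)))      # Sigmoid branch
h7 = np.tanh(h3 @ w3)                      # Tanh branch
H_mask = prelu((h5 + h7) @ w4)             # superimpose, Conv2d, PReLU -> H'
```

The F2E-Imag layer described earlier follows the same two-branch pattern, differing only in its input and output roles.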
(VI) The band synthesis part aims to recombine the divided bands into a complex spectrum.
The band synthesis part may employ the design of fig. 13, which includes 1 superimposing layer, 1 block disassembly layer, 1 normalization layer, 1 FC layer, 1 Tanh activation function, 1 GLU layer, and 1 band merging layer.
In the band synthesis section:
the superimposing layer is used for superimposing D and L to obtain a complex spectrum feature block V0;
the block disassembly layer is used for carrying out block disassembly on V0 to obtain M original complex spectrum tensors V1;
the normalization layer is used for carrying out normalization processing on V1 to obtain normalized complex spectrum tensors V2;
the FC layer is used to linearly transform V2 to obtain doubled complex spectrum tensors V3;
the Tanh activation function is used for introducing a nonlinear feature into V3 to obtain activated complex spectrum tensors V4;
the GLU layer is used for carrying out dimension reduction on V4 to obtain final split-band masks V5;
the band merging layer is used for merging V5 in the frequency dimension to obtain Z';
wherein V1 = [v1_1, v1_2, …, v1_M], and v1_m represents the m-th original complex spectrum tensor in V1;
V2 = [v2_1, v2_2, …, v2_M], and v2_m represents the m-th normalized complex spectrum tensor in V2;
V3 = [v3_1, v3_2, …, v3_M], and v3_m represents the m-th doubled complex spectrum tensor in V3;
V4 = [v4_1, v4_2, …, v4_M], and v4_m represents the m-th activated complex spectrum tensor in V4;
V5 = [v5_1, v5_2, …, v5_M], and v5_m represents the m-th final split-band mask in V5.
That is, the band synthesis section may be formulated as:
V0 = D + L;
V1 = Exp(V0);
V2 = LN(V1);
V3 = FC(V2);
V4 = Tanh(V3);
V5 = GLU(V4);
Z' = Merge(V5);
where Exp(·) represents the processing of the block disassembly layer, LN(·) the normalization layer, FC(·) the FC layer, Tanh(·) the Tanh activation function, GLU(·) the GLU layer, and Merge(·) the band merging layer.
In this Example 1, M is 20, and V5 is reduced to split-band masks whose widths [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17] match the preset specification.
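The final band merging can be sketched as follows: the M = 20 per-band masks, whose widths follow the preset specification, are concatenated along the frequency axis, restoring the original 161-bin resolution. The mask tensors here are random stand-ins.

```python
import numpy as np

# Sketch of the band merging layer: concatenate the M split-band masks
# along frequency to recover the full 161-bin complex-spectrum shape.
spec = [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17]
T = 50                                        # time frames (arbitrary)
V5 = [np.random.rand(T, w) for w in spec]     # final split-band masks
Z_hat = np.concatenate(V5, axis=1)            # band merging layer -> Z'
```

This is the inverse of the band division sketched earlier: division slices the 161 bins into the listed widths, and merging concatenates them back.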
It should be noted that the present method requires the use of a trained speech enhancement network whose network parameters are optimal. The voice enhancement network is typically trained based on existing known voice datasets to ensure the network training effect.
The loss function Loss of "RI+Mag" is employed in training the speech enhancement network so as to supervise the optimization of both the phase and magnitude components. The "RI+Mag" loss function Loss is expressed as:
Loss = β · L_RI + (1 − β) · L_Mag;
where L_RI denotes the minimum mean square error loss function of the complex spectrum, L_Mag denotes the minimum mean square error loss function of the magnitude spectrum, and β denotes the weight coefficient, which is typically 0.5.
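The "RI+Mag" loss can be sketched as follows. This is an illustrative numpy sketch assuming the common convex combination of the two MSE terms with β = 0.5 as stated above; the reduction (mean over all time-frequency bins) is an assumption.

```python
import numpy as np

# Sketch of the "RI+Mag" loss: MSE on the complex spectrum (covering
# real and imaginary parts, hence phase) plus MSE on the magnitudes,
# weighted by beta.
def ri_mag_loss(est, ref, beta=0.5):
    l_ri = np.mean(np.abs(est - ref) ** 2)              # complex-spectrum MSE
    l_mag = np.mean((np.abs(est) - np.abs(ref)) ** 2)   # magnitude MSE
    return beta * l_ri + (1 - beta) * l_mag

ref = np.random.randn(10, 161) + 1j * np.random.randn(10, 161)
loss_zero = ri_mag_loss(ref, ref)        # perfect estimate -> zero loss
loss_noisy = ri_mag_loss(ref + 0.1, ref) # any error -> positive loss
```

Supervising both terms is what lets the CEN branch's complex estimate and the MEN branch's magnitude estimate be optimized jointly.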
Simulation verification
Existing datasets WSJ0-SI84, DNS-Challenge, VoiceBank, and Demand are employed.
1. An ablation experiment was performed on BSDB-Net based on WSJ0-SI84 + DNS-Challenge to evaluate the effectiveness of the dual-branch structural design.
The CEN branch of BSDB-Net is removed and the MEN branch is retained and trained alone, corresponding to BSDB-MEN; likewise, the MEN branch of BSDB-Net is removed and the CEN branch is retained and trained alone, corresponding to BSDB-CEN.
The performance of BSDB-MEN, BSDB-CEN, BSDB-Net was compared and the results are shown in Table 1:
Table 1 comparison of ablation experiments
Here, PESQ is a narrow-band and wide-band perceptual evaluation of speech quality index, the larger the better; ESTOI is an extended version of the short-time objective intelligibility metric STOI for measuring speech intelligibility, the larger the better; and the remaining index is used to evaluate the degree of speech distortion, the larger the better.
As shown in table 1, the dual path structure is superior to the single path structure in all indexes. This means that the coordinated efforts of the CEN branch and the MEN branch can improve the quality of the target speech-the CEN branch filters out the main noise for rough estimation, while the MEN branch continuously supplements the speech information, thereby improving the overall performance of the system.
2. A replacement experiment was performed on BSDB-Net based on WSJ0-SI84 + DNS-Challenge to evaluate the effectiveness of the sequence modeling part design.
The sequence modeling part of BSDB-Net was replaced with an LSTM, corresponding to BSDB-LSTM, and with a Transformer, corresponding to BSDB-Transformer.
The performance of BSDB-LSTM, BSDB-Transformer, and BSDB-Net was compared, and the results are shown in Table 2:
table 2 comparative results of substitution experiments
Here, PESQ is a narrow-band and wide-band perceptual evaluation of speech quality index, the larger the better; MACs represent the network complexity; and Parameters represents the number of network parameters.
As shown in Table 2, BSDB-Net greatly reduces the complexity of the network and the number of network parameters while maintaining optimal performance. It should be noted that although the performance improvement of BSDB-Net relative to BSDB-Transformer is not obvious, the network complexity and the number of network parameters are significantly reduced.
3. The complexity of BSDB-Net was compared with that of other models (including ConvTasNet, DPRNN, DDAEC, LSTM, CRN, GCRN, DCCRN, FullSubNet, CTSNet and GaGNet) based on WSJ0-SI84 + DNS-Challenge, and the results are shown in Table 3.
Table 3 complexity comparison results
Here, PESQ is the narrowband and wideband perceptual speech quality evaluation index (larger is better), and MACs denotes the computational complexity of the network.
As shown in Table 3, the computational complexity of BSDB-Net is on average about 8 times lower than that of the other models, while its performance remains good.
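Complexity figures such as the MACs and parameter counts reported in Tables 2 and 3 are typically obtained by summing per-layer contributions. The sketch below is a hedged illustration with an invented convolution layer size, not BSDB-Net's actual dimensions; `conv2d_params` and `conv2d_macs` are hypothetical helper names.

```python
# Hedged sketch: per-layer parameter and MACs (multiply-accumulate) counting
# for a 2-D convolution layer. Layer sizes are invented for illustration.

def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Learnable parameters of a Conv2d layer: one kernel per output channel."""
    return c_out * (c_in * k_h * k_w + (1 if bias else 0))

def conv2d_macs(c_in, c_out, k_h, k_w, out_h, out_w):
    """One multiply-accumulate per kernel tap per output element."""
    return c_out * out_h * out_w * c_in * k_h * k_w

# Toy encoder layer: 2 -> 64 channels, 1x3 kernel, 100x161 output map.
params = conv2d_params(2, 64, 1, 3)           # 448 parameters
macs = conv2d_macs(2, 64, 1, 3, 100, 161)     # 6,182,400 MACs (~6.18 MMACs)
print(params, macs)
```

Summing these quantities over all layers (convolutions, recurrent cells, attention blocks) gives the network-level MACs and Parameters columns.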
4. BSDB-Net was compared with 14 other models (including SEGAN, MMSEGAN, MetricGAN, SRTNET, Wavenet, PHASEN, MHSA-SPK, DCCRN, TSTNN, S4DSE, FDFnet, CTS-Net, GaGNet and CompNet) based on VoiceBank+Demand, and the results are shown in Table 4.
Table 4 results of performance comparisons
Here, PESQ is the narrowband and wideband perceptual speech quality evaluation index, STOI is the short-time objective intelligibility of speech, and CSIG, CBAK and COVL are composite indices for evaluating speech quality; for all of these, a larger value is better.
As shown in Table 4, BSDB-Net achieves average improvements of 0.32, 1.5%, 0.31, 0.17 and 0.35 over the other models in PESQ, STOI, CSIG, CBAK and COVL, respectively, which also illustrates the general applicability of the present method.
Example 2
Embodiment 2 provides a mono speech enhancement device based on a dual-branch network, which uses the dual-branch network-based mono speech enhancement method disclosed in Embodiment 1.
The mono speech enhancement device based on the dual-branch network comprises a preprocessing module, a speech enhancement module and a post-processing module.
The preprocessing module is used for acquiring noisy speech, performing a Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
The speech enhancement module is used for inputting X0 and Y0 into a trained speech enhancement network for processing to obtain an enhanced complex spectrum Z'.
The post-processing module is used for first performing mask processing on Z and Z', and then performing an inverse Fourier transform to obtain the enhanced speech.
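A minimal sketch of the pre- and post-processing around the enhancement network is given below. The function names, the toy three-bin spectrum, and the clamped complex-ratio masking rule are all illustrative assumptions, since this excerpt does not fix a specific masking formula; a real implementation would operate on full STFT frames before the inverse transform.

```python
# Hedged sketch: decoupling a spectrum Z into a complex spectrum X0 and an
# amplitude spectrum Y0, then applying the enhanced spectrum as a mask on Z.
# The masking rule (complex ratio with a magnitude clamp) is illustrative only.
import cmath

def decouple(Z):
    X0 = list(Z)                   # complex spectrum (real/imaginary parts)
    Y0 = [abs(z) for z in Z]       # amplitude spectrum
    return X0, Y0

def apply_mask(Z, Z_enh, max_gain=2.0):
    out = []
    for z, z_e in zip(Z, Z_enh):
        if abs(z) == 0.0:
            out.append(z_e)        # nothing to mask; pass the enhanced bin through
            continue
        mask = z_e / z             # per-bin complex ratio mask
        gain = min(abs(mask), max_gain)      # clamp to avoid blowing up quiet bins
        phase = cmath.phase(mask)
        out.append(z * gain * cmath.exp(1j * phase))
    return out

Z = [1 + 1j, 2 + 0j, 0.5 - 0.5j]             # toy three-bin spectrum
X0, Y0 = decouple(Z)
enhanced = apply_mask(Z, [0.8 + 0.9j, 1.5 + 0j, 0.1 - 0.1j])
```

An inverse STFT of the masked spectrum would then yield the enhanced time-domain waveform.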
Since the present device uses the dual-branch network-based mono speech enhancement method of Embodiment 1, it achieves the same effects, which are not repeated here.
Example 3
Embodiment 3 discloses a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the dual-branch network-based mono speech enhancement method disclosed in Embodiment 1.
Embodiment 3 also discloses a readable storage medium storing computer program instructions which, when read and executed by a processor, perform the steps of the dual-branch network-based mono speech enhancement method disclosed in Embodiment 1.
Embodiment 3 also discloses a computer program product comprising a computer program. The computer program, when executed by a processor, implements the steps of the dual-branch network-based mono speech enhancement method disclosed in Embodiment 1.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.