
CN119049489B - A method and device for monophonic speech enhancement based on dual-branch network - Google Patents


Info

Publication number: CN119049489B
Application number: CN202411533326.0A
Authority: CN (China)
Prior art keywords: layer, mamba, intermediate data, carrying, spectrum
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN119049489A
Inventors: 范存航, 刘恩睿, 吕钊
Assignee: Anhui University
Application filed by Anhui University; priority to CN202411533326.0A
Publication of application CN119049489A; application granted; publication of grant CN119049489B


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract


The present invention relates to the field of speech enhancement technology, and in particular to a monophonic speech enhancement method and device based on a dual-branch network. The disclosed method first converts the noisy speech into an original speech spectrum and introduces a decomposition strategy to decouple it into an original amplitude spectrum and an original complex spectrum. A speech enhancement network with a dual-branch structure then extracts features from the original amplitude spectrum and the original complex spectrum in parallel and obtains an enhanced complex spectrum through information interaction processing. Finally, inverse processing based on the enhanced complex spectrum and the original speech spectrum yields the enhanced speech. Simulation comparisons on existing public datasets show that the method maintains comparable performance while further compressing computational cost at the G/s (giga floating-point operations per second) level, achieving an average 8.3-fold reduction in computational complexity. The present invention solves the problem of high computational complexity in existing SE solutions.

Description

Single-channel voice enhancement method and device based on dual-branch network
Technical Field
The invention relates to the technical field of voice enhancement, and in particular to a single-channel voice enhancement method based on a dual-branch network and a corresponding single-channel voice enhancement device.
Background
Speech Enhancement (SE) is a technique to recover clean speech from noisy environments. The degradation of speech quality caused by background noise is not only perceptually disturbing but also significantly compromises the performance of Automatic Speech Recognition (ASR). SE is likewise indispensable in smart devices, in-vehicle systems, and home automation, and with the rapid popularity of online conferences the need for real-time SE solutions has grown sharply, emphasizing the need for technologies that are both effective and computationally efficient.
Existing SE solutions can be broadly divided into two categories, operating in the time domain or in the time-frequency (T-F) domain. While existing SE solutions perform well, their high computational complexity may prevent practical application of SE. Specifically: (1) as a front-end task, the influence of SE on downstream tasks such as ASR must be considered; (2) effective deployment in edge or resource-constrained environments, such as online conferencing and real-time communication, requires a low-complexity SE solution; and (3) the computational cost of sequence modeling can be substantial; owing to the quadratic attention mechanism, some Transformer-based models reach hundreds of giga floating-point operations per second (G/s), which hinders practical application.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a method and apparatus for enhancing mono speech based on a dual-branch network, aiming at the problem of high computational complexity in the existing SE solution.
The invention is realized by adopting the following technical scheme:
In a first aspect, the present invention discloses a method for enhancing mono speech based on a dual-branch network, comprising:
Step one, obtaining noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
Step two, inputting X0 and Y0 into a trained voice enhancement network for processing to obtain an enhanced complex spectrum Z'.
The voice enhancement network comprises 2 frequency band dividing parts, 2 feature extracting parts, 1 sequence modeling part, 1 complex spectrum feature modeling part, 1 amplitude spectrum feature modeling part and 1 frequency band synthesizing part.
The 1st frequency band division part is used for carrying out frequency band division processing on X0 to obtain an intermediate complex spectrum characteristic A;
the 2nd frequency band division part is used for carrying out frequency band division processing on Y0 to obtain an intermediate amplitude spectrum characteristic E;
the 1 st feature extraction part is used for extracting an intermediate feature B based on A, E;
The 2 nd feature extraction part is used for extracting an intermediate feature F based on A, E;
the sequence modeling part is constructed based on Mamba and is used for performing sequence modeling on B, F to obtain an intermediate feature C and an intermediate feature H;
the complex spectrum feature modeling part is used for carrying out feature modeling on the C to obtain a final complex spectrum feature D;
The amplitude spectrum feature modeling part is used for carrying out feature modeling on the H to obtain a final amplitude spectrum feature L;
the band synthesis unit is configured to perform synthesis processing on D, L to obtain Z'.
And thirdly, performing mask processing on Z and Z', and performing inverse Fourier transform to obtain enhanced voice.
Such a dual-branch network-based mono speech enhancement method implements a method or process according to embodiments of the present disclosure.
In a second aspect, the present invention discloses a mono speech enhancement device based on a dual-branch network, which uses the mono speech enhancement method based on the dual-branch network disclosed in the first aspect.
The single-channel voice enhancement device based on the dual-branch network comprises a preprocessing module, a voice enhancement module and a post-processing module.
The preprocessing module is used for acquiring noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
The voice enhancement module is used for inputting X0 and Y0 into a trained voice enhancement network for processing to obtain an enhanced complex spectrum Z'.
The post-processing module is used for carrying out mask processing on Z and Z' first and then carrying out inverse Fourier transform to obtain enhanced voice.
Such a dual-branch network-based mono speech enhancement apparatus implements a method or process according to embodiments of the present disclosure.
Compared with the prior art, the invention has the following beneficial effects:
1. The method first converts the noisy speech into an original speech spectrum and introduces a decomposition strategy to decouple it into an original amplitude spectrum and an original complex spectrum; a speech enhancement network with a dual-branch structure then extracts features from the two spectra in parallel and obtains an enhanced complex spectrum through information interaction processing; inverse processing based on the enhanced complex spectrum and the original speech spectrum finally yields the enhanced speech. Simulation comparisons with existing SE solutions on public datasets show that the method maintains equivalent performance while further compressing computational cost at the G/s level, achieving an average 8.3-fold reduction in computational complexity.
2. The speech enhancement network designed by the invention adopts two band division parts to split the amplitude spectrum and the complex spectrum respectively, compressing the band dimension and thus markedly reducing the amount of data to process; it introduces an information interaction layer so that the intermediate information of the two branches supplements each other; and it introduces a sequence modeling part constructed on Mamba, whose linear complexity lowers the modeling cost while maintaining the modeling effect.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from these drawings without inventive effort by a person skilled in the art.
Fig. 1 is a data flow chart of a mono speech enhancement method based on a dual-branch network according to embodiment 1 of the present invention;
Fig. 2 is a data flow chart of the 1 st band division unit and the 2 nd band division unit in fig. 1;
Fig. 3 is a data flow chart of the 1st feature extraction unit and the 2nd feature extraction unit in fig. 1;
FIG. 4 is a data flow diagram of the sequence modeling portion of FIG. 1;
FIG. 5 is a data flow diagram of the information interaction layer of FIGS. 3 and 4;
FIG. 6 is a data flow diagram of the FT-Mamba part of FIG. 4;
FIG. 7 is a data flow diagram of the F-Mamba part of FIG. 6;
FIG. 8 is a data flow diagram of the T-Mamba part of FIG. 6;
FIG. 9 is a data flow diagram of the complex spectral feature modeling unit of FIG. 1;
FIG. 10 is a data flow diagram of the F1E-Real layer and the F2E-Imag layer of FIG. 9;
FIG. 11 is a data flow diagram of the magnitude spectrum feature modeling portion of FIG. 1;
FIG. 12 is a data flow diagram of the F2E-Mask layer of FIG. 11;
Fig. 13 is a data flow chart of the band combining section in fig. 1.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that when an element is referred to as being "mounted to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "secured to" another element, it can be directly secured to the other element or intervening elements may also be present.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "or/and" as used herein includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, fig. 1 shows a data flow diagram of a dual-branch network-based mono speech enhancement method, which in fact also shows the flow of the dual-branch network-based mono speech enhancement method.
As shown in fig. 1, the mono speech enhancement method based on the dual-branch network includes the following steps:
Step one, obtaining noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
The noisy speech may be represented as Voice(t, f) = Speech(t, f) + Noise(t, f), where Speech(t, f) denotes the clean speech component, Noise(t, f) the noise component, t the time-frame index, and f the frequency index.
Typically, the noisy speech is sampled at 16 kHz and framed using a 20 ms Hamming window with 50% overlap between frames. A minimal sketch of this front-end follows.
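As an illustration, the following minimal PyTorch sketch mirrors this front-end under the stated assumptions (16 kHz sampling, 20 ms Hamming window, 50% overlap); the function name and tensor layout are illustrative, not the patent's implementation.

```python
import torch

def decompose(noisy: torch.Tensor):
    """Step one as a sketch: STFT the waveform, then decouple the speech
    spectrum Z into the complex spectrum X0 and the amplitude spectrum Y0."""
    n_fft, hop = 320, 160                      # 20 ms window, 50% overlap at 16 kHz
    window = torch.hamming_window(n_fft)
    Z = torch.stft(noisy, n_fft=n_fft, hop_length=hop, window=window,
                   return_complex=True)        # (161, T): 161 frequency bins
    X0 = torch.stack([Z.real, Z.imag], dim=0)  # original complex spectrum, (2, 161, T)
    Y0 = Z.abs()                               # original amplitude spectrum, (161, T)
    return Z, X0, Y0
```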
Step two, inputting X0 and Y0 into a trained voice enhancement network (BSDB-Net for short) for processing to obtain an enhanced complex spectrum Z'.
The speech enhancement network includes 2 band dividing units, 2 feature extracting units, 1 sequence modeling unit, 1 complex spectrum feature modeling unit, 1 amplitude spectrum feature modeling unit, and 1 band synthesizing unit.
As shown in fig. 1:
The 1st frequency band division part is used for carrying out frequency band division processing on X0 to obtain an intermediate complex spectrum characteristic A;
the 2nd frequency band division part is used for carrying out frequency band division processing on Y0 to obtain an intermediate amplitude spectrum characteristic E;
the 1 st feature extraction part is used for extracting an intermediate feature B based on A, E;
The 2 nd feature extraction part is used for extracting an intermediate feature F based on A, E;
the sequence modeling part is constructed based on Mamba and is used for performing sequence modeling on B, F to obtain an intermediate feature C and an intermediate feature H;
the complex spectrum feature modeling part is used for carrying out feature modeling on the C to obtain a final complex spectrum feature D;
The amplitude spectrum feature modeling part is used for carrying out feature modeling on the H to obtain a final amplitude spectrum feature L;
the band synthesis unit is configured to perform synthesis processing on D, L to obtain Z'.
Wherein:
the 1st frequency band division part, the 1st feature extraction part, the sequence modeling part, and the complex spectrum feature modeling part form the CEN branch, which focuses on processing X0;
the 2nd frequency band division part, the 2nd feature extraction part, the sequence modeling part, and the amplitude spectrum feature modeling part form the MEN branch, which focuses on processing Y0.
The following description of the various parts is provided:
(I) The 2 band division parts aim, respectively, to reduce the dimension of X0 and Y0 and thereby the data complexity.
The band division part may adopt the design of fig. 2, which includes 1 spectrum segmentation layer, 1 normalization layer, 1 FC layer, and 1 dimension merging layer.
101. In the 1st band division part:
the spectrum segmentation layer segments X0 according to a preset specification to obtain M original complex bands X1;
the normalization layer normalizes X1 to obtain M normalized complex spectral features X2;
the FC layer unifies the frequency dimension of X2 to obtain M unified complex spectral features X3;
the dimension merging layer merges X3 along the frequency dimension to obtain A;
where X1 = [x1_1, x1_2, …, x1_M], and x1_m denotes the m-th original complex band in X1, m ∈ [1, M];
X2 = [x2_1, x2_2, …, x2_M], and x2_m denotes the m-th normalized complex spectral feature in X2;
X3 = [x3_1, x3_2, …, x3_M], and x3_m denotes the m-th unified complex spectral feature in X3.
That is, the processing of the 1st band division part can be formulated as:

X1 = Split(X0);
X2 = LN(X1);
X3 = FC(X2);
A = Concat(X3);

where Split(·), LN(·), FC(·), and Concat(·) denote the processing of the spectrum segmentation layer, the normalization layer, the FC layer, and the dimension merging layer, respectively.
102. In the 2nd band division part:
the spectrum segmentation layer segments Y0 according to the preset specification to obtain M original amplitude bands Y1;
the normalization layer normalizes Y1 to obtain M normalized amplitude spectral features Y2;
the FC layer unifies Y2 in the frequency dimension to obtain M unified amplitude spectral features Y3;
the dimension merging layer merges Y3 along the frequency dimension to obtain E;
where Y1 = [y1_1, y1_2, …, y1_M], and y1_m denotes the m-th original amplitude band in Y1;
Y2 = [y2_1, y2_2, …, y2_M], and y2_m denotes the m-th normalized amplitude spectral feature in Y2;
Y3 = [y3_1, y3_2, …, y3_M], and y3_m denotes the m-th unified amplitude spectral feature in Y3.
That is, the processing of the 2nd band division part can be formulated as:

Y1 = Split(Y0);
Y2 = LN(Y1);
Y3 = FC(Y2);
E = Concat(Y3);

where Split(·), LN(·), FC(·), and Concat(·) denote the processing of the spectrum segmentation layer, the normalization layer, the FC layer, and the dimension merging layer, respectively.
In this embodiment 1, M is 20 and the preset specification is [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17] (the band widths sum to 161). Through the processing of the 1st band division part, the band dimension of the complex spectrum is reduced from 161 to 20; likewise, the 2nd band division part reduces the band dimension of the amplitude spectrum from 161 to 20. The characteristics of different frequency bands are thus captured better while the data complexity is greatly reduced. A sketch of such a band-split module follows.
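The following PyTorch sketch shows one plausible band-split module under these settings; the feature size and the per-band Linear/LayerNorm design are assumptions, not the patent's exact layers.

```python
import torch
import torch.nn as nn

BAND_WIDTHS = [2, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 12, 16, 16, 16, 17]  # sums to 161

class BandSplit(nn.Module):
    """Split 161 frequency bins into M=20 sub-bands, normalize each band,
    and project every band to a common feature size so the bands stack."""
    def __init__(self, in_ch: int, feat: int = 64):   # feat=64 is an assumption
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(w * in_ch) for w in BAND_WIDTHS)
        self.fcs = nn.ModuleList(nn.Linear(w * in_ch, feat) for w in BAND_WIDTHS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F=161, T); C=2 for the complex spectrum, C=1 for the amplitude
        bands, lo = [], 0
        for norm, fc, w in zip(self.norms, self.fcs, BAND_WIDTHS):
            band = x[:, :, lo:lo + w]                  # (B, C, w, T)
            band = band.flatten(1, 2).transpose(1, 2)  # (B, T, C*w)
            bands.append(fc(norm(band)))               # (B, T, feat)
            lo += w
        return torch.stack(bands, dim=1)               # (B, M=20, T, feat)
```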
(II) The 2 feature extraction parts aim to further extract features from the complex spectrum and the amplitude spectrum, respectively, for better modeling.
The feature extraction part may adopt the design of fig. 3, which includes 1 two-dimensional convolution layer, 1 normalization layer, 1 PReLU activation function, and 1 information interaction layer.
201. In the 1st feature extraction part:
the two-dimensional convolution layer performs feature extraction on A to obtain complex spectrum information a1;
the normalization layer normalizes a1 to obtain complex spectrum information a2;
the PReLU activation function introduces a nonlinearity into a2 to obtain complex spectrum information a3;
the information interaction layer performs interaction processing on a3 and E to obtain B, preventing information loss across the two branches.
That is, the processing of the 1st feature extraction part can be formulated as:

a1 = Conv2d(A);  a2 = LN(a1);  a3 = PReLU(a2);
B = Inter(a3, E);

where Conv2d(·) denotes the processing of the two-dimensional convolution layer, LN(·) the normalization layer, PReLU(·) the PReLU activation function, and Inter(·) the information interaction layer.
202. In the 2nd feature extraction part:
the two-dimensional convolution layer performs feature extraction on E to obtain amplitude spectrum information e1;
the normalization layer normalizes e1 to obtain amplitude spectrum information e2;
the PReLU activation function introduces a nonlinearity into e2 to obtain amplitude spectrum information e3;
the information interaction layer performs interaction processing on e3 and A to obtain F, preventing information loss across the two branches.
That is, the processing of the 2nd feature extraction part can be formulated as:

e1 = Conv2d(E);  e2 = LN(e1);  e3 = PReLU(e2);
F = Inter(e3, A);

where Conv2d(·) denotes the processing of the two-dimensional convolution layer, LN(·) the normalization layer, PReLU(·) the PReLU activation function, and Inter(·) the information interaction layer.
(III) The sequence modeling part aims to model along the time and frequency dimensions with linear complexity.
The sequence modeling portion may employ the design of fig. 4, which includes N Mamba model portions that are identical in structure.
Taking the nth Mamba model part as an example, the model part comprises 1 information interaction layer, 2 superimposed layers and 2 FT-Mamba parts.
In the nth Mamba model section:
The information interaction layer performs interaction processing on the intermediate spectral features W_{n-1} and w_{n-1} to obtain an intermediate spectral feature U_n;
the 1st superimposing layer superimposes U_n and W_{n-1} to obtain an intermediate spectral feature U_B_n;
the 2nd superimposing layer superimposes U_n and w_{n-1} to obtain an intermediate spectral feature U_F_n;
the 1st FT-Mamba part performs sequence modeling on U_B_n based on Mamba to obtain the intermediate spectral feature W_n;
the 2nd FT-Mamba part performs sequence modeling on U_F_n based on Mamba to obtain the intermediate spectral feature w_n, n ∈ [1, N].
Here B serves as the intermediate spectral feature W_0 and F as the intermediate spectral feature w_0;
the intermediate spectral feature W_N is C and the intermediate spectral feature w_N is H.
That is, the processing of the n-th Mamba model part can be formulated as:

U_n = Inter(W_{n-1}, w_{n-1});
U_B_n = U_n + W_{n-1};
U_F_n = U_n + w_{n-1};
W_n = FT_Mamba(U_B_n);
w_n = FT_Mamba(U_F_n);

where Inter(·) denotes the processing of the information interaction layer and FT_Mamba(·) the processing of the FT-Mamba part.
Note that N is suggested to be 6 to achieve a balance between network performance and complexity. A sketch of this dual-branch stack follows.
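The stacking of the N blocks can be sketched as follows; InteractionLayer and FTMamba stand for the modules sketched in the subsections below, and all names are illustrative assumptions.

```python
import torch.nn as nn

class DualBranchStack(nn.Module):
    """Sequence modeling part as a sketch: N blocks, each exchanging
    information between the complex branch (W) and the amplitude branch (w)
    and refining both with FT-Mamba units (see the later sketches)."""
    def __init__(self, feat: int = 64, n_blocks: int = 6):
        super().__init__()
        self.inters = nn.ModuleList(InteractionLayer(feat) for _ in range(n_blocks))
        self.ft_c = nn.ModuleList(FTMamba(feat) for _ in range(n_blocks))  # complex branch
        self.ft_m = nn.ModuleList(FTMamba(feat) for _ in range(n_blocks))  # amplitude branch

    def forward(self, W, w):          # W = B and w = F from the feature extraction parts
        for inter, ft_c, ft_m in zip(self.inters, self.ft_c, self.ft_m):
            U = inter(W, w)                    # U_n = Inter(W_{n-1}, w_{n-1})
            W, w = ft_c(U + W), ft_m(U + w)    # W_n and w_n from U_B_n and U_F_n
        return W, w                   # the final features C and H
```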
301. The information interaction layer appears in both the 2 feature extraction parts and the sequence modeling part; its structure is described here.
The information interaction layer may adopt the design of fig. 5, which includes 1 joint layer, 1 sub two-dimensional convolution layer, 1 sub normalization layer, and 1 sub PReLU activation function. The information interaction layer has two inputs and a single output.
In either information interaction layer:
The joint layer is used for connecting the double inputs of the information interaction layer together to obtain intermediate data G1;
The sub two-dimensional convolution layer is used for extracting the characteristics of the G1 to obtain intermediate data G2;
The sub normalization layer is used for carrying out normalization processing on the G2 to obtain intermediate data G3;
The sub PRelu activation function is used to introduce a nonlinear feature into G3 to get a single output of the information interaction layer.
That is, the above procedure can be formulated as:

G1 = Cat(Input1, Input2);
G2 = Conv2d(G1);  G3 = LN(G2);
Output = PReLU(G3);

where Input1 and Input2 denote the dual inputs of the information interaction layer, Output denotes its single output, Cat(·) the processing of the joint layer, Conv2d(·) the sub two-dimensional convolution layer, LN(·) the sub normalization layer, and PReLU(·) the sub PReLU activation function.
Then, the dual inputs of the information interaction layer in the 1st feature extraction part are a3 and E, and the single output is B;
the dual inputs of the information interaction layer in the 2nd feature extraction part are e3 and A, and the single output is F;
the dual inputs of the information interaction layer in the n-th Mamba model part are W_{n-1} and w_{n-1}, and the single output is U_n. A minimal sketch follows.
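A minimal sketch of this two-input, single-output layer (the channel counts and the 1x1 kernel size are assumptions):

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """Information interaction layer sketch: joint (concatenate), then
    sub Conv2d -> sub LayerNorm -> sub PReLU, as described above."""
    def __init__(self, feat: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(2 * feat, feat, kernel_size=1)
        self.norm = nn.LayerNorm(feat)
        self.act = nn.PReLU()

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, M, T, feat) band-level features from the two branches
        g1 = torch.cat([a, b], dim=-1)            # joint layer -> G1
        g2 = self.conv(g1.permute(0, 3, 1, 2))    # sub Conv2d over (M, T) -> G2
        g3 = self.norm(g2.permute(0, 2, 3, 1))    # sub normalization -> G3
        return self.act(g3)                       # single output, (B, M, T, feat)
```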
302. The FT_Mamba part comprises 5 sub-parts, which can be arranged in two ways:
the first: the 1st sub-part is a normalization layer, the 2nd an F_Mamba part, the 3rd a superimposing layer, the 4th a T_Mamba part, and the 5th a superimposing layer;
the second: the 1st sub-part is a normalization layer, the 2nd a T_Mamba part, the 3rd a superimposing layer, the 4th an F_Mamba part, and the 5th a superimposing layer.
In any FT_Mamba part:
the 1st sub-part normalizes the input of the FT_Mamba part to obtain intermediate data J1;
the 2nd sub-part performs sequence modeling on J1 in the frequency dimension to obtain intermediate data J2;
the 3rd sub-part superimposes the input of the FT_Mamba part with J2 to obtain intermediate data J3;
the 4th sub-part performs sequence modeling on J3 in the time dimension to obtain intermediate data J4;
the 5th sub-part superimposes J3 and J4 to obtain the output of the FT_Mamba part.
Referring to fig. 6, which shows the first arrangement of the FT_Mamba part, the processing can be formulated as:

J1 = LN(In0);
J2 = F_Mamba(J1);  J3 = In0 + J2;
J4 = T_Mamba(J3);  Out0 = J3 + J4;

where In0 denotes the input of the FT_Mamba part, Out0 its output, LN(·) the processing of the normalization layer, F_Mamba(·) the processing of the F_Mamba part, and T_Mamba(·) the processing of the T_Mamba part. A sketch of this unit follows.
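A sketch of one FT_Mamba unit in the first arrangement; BiMambaLayer and UniMambaLayer refer to the Mamba-layer sketches given further below, and folding the unused axis into the batch is an implementation assumption.

```python
import torch.nn as nn

class FTMamba(nn.Module):
    """FT-Mamba unit sketch: LayerNorm, a frequency-direction scan with a
    residual (J3 = In + F_Mamba(J1)), then a time-direction scan with a
    second residual (Out = J3 + T_Mamba(J3))."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.f_mamba = BiMambaLayer(dim)   # bidirectional, over the M band axis
        self.t_mamba = UniMambaLayer(dim)  # unidirectional (causal), over the T time axis

    def forward(self, x):                  # x: (B, M, T, D)
        B, M, T, D = x.shape
        j1 = self.norm(x)
        f = self.f_mamba(j1.transpose(1, 2).reshape(B * T, M, D))
        j3 = x + f.reshape(B, T, M, D).transpose(1, 2)   # residual over the band axis
        t = self.t_mamba(j3.reshape(B * M, T, D))
        return j3 + t.reshape(B, M, T, D)                # residual over the time axis
```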
The F_Mamba part and the T_Mamba part are described below.
3021. The F_Mamba part may adopt the design of FIG. 7, which includes 2 Transpose layers, 1 Unfold layer, 1 F_Mamba layer, and 1 one-dimensional transposed convolution layer.
In any F_Mamba part:
the 1st Transpose layer tensor-converts the input of the F_Mamba part to obtain intermediate data R1;
the Unfold layer upsamples R1 to obtain intermediate data R2;
the F_Mamba layer performs sequence modeling on R2 through bidirectional Mamba to obtain intermediate data R3;
the one-dimensional transposed convolution layer performs one-dimensional transposed convolution on R3 to obtain intermediate data R4;
the 2nd Transpose layer tensor-converts R4 to obtain the output of the F_Mamba part.
That is, the processing of the F_Mamba part can be formulated as:

R1 = Tran(In1);  R2 = Unfold(R1);
R3 = F_Ma(R2);  R4 = Deconv(R3);  Out1 = Tran(R4);

where In1 denotes the input of the F_Mamba part, Out1 its output, Tran(·) the processing of the Transpose layers, Unfold(·) the processing of the Unfold layer, F_Ma(·) the processing of the F_Mamba layer, and Deconv(·) the processing of the one-dimensional transposed convolution layer.
Referring to fig. 7, the F_Mamba layer includes 1 flip layer, 2 Mamba layers, 2 Norm layers, 2 superimposing layers, 1 Constant layer, and 1 FC layer.
In any F_Mamba layer:
the 1st Mamba layer performs sequence modeling on the input of the F_Mamba layer to obtain intermediate data P1;
the 1st Norm layer normalizes P1 to obtain intermediate data P2;
the 1st superimposing layer superimposes the input of the F_Mamba layer with P2 to obtain intermediate data P3;
the flip layer reverses the input of the F_Mamba layer to obtain intermediate data P4;
the 2nd Mamba layer models P4 to obtain intermediate data P5;
the 2nd Norm layer normalizes P5 to obtain intermediate data P6;
the 2nd superimposing layer superimposes P4 and P6 to obtain intermediate data P7;
the Constant layer performs tensor adjustment on P3 and P7 to obtain intermediate data P8;
the FC layer linearly maps P8 to obtain the output of the F_Mamba layer.
That is, the processing of the F_Mamba layer can be formulated as:

P1 = Ma(in1);  P2 = LN(P1);  P3 = in1 + P2;
P4 = Flip(in1);  P5 = Ma(P4);  P6 = LN(P5);  P7 = P4 + P6;
P8 = Constant(P3, P7);  out1 = FC(P8);

where in1 denotes the input of the F_Mamba layer, out1 its output, Ma(·) the processing of the Mamba layers, LN(·) the processing of the Norm layers, Flip(·) the processing of the flip layer, Constant(·) the processing of the Constant layer, and FC(·) the processing of the FC layer. A sketch of this bidirectional layer, together with the unidirectional T_Mamba layer described further below, follows.
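The bidirectional layer can be sketched as below; the `mamba_ssm` package is an assumed dependency (any SSM block with a (batch, length, dim) interface would do), and interpreting the Constant layer as a concatenation of the re-aligned directions is an assumption.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency

class BiMambaLayer(nn.Module):
    """F_Mamba layer sketch: one Mamba scans forward, a second scans the
    flipped sequence, each with Norm + residual; the two directions are
    re-aligned, joined (the 'Constant' tensor adjustment, assumed to be a
    concatenation), and linearly mapped by the FC layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.fwd, self.bwd = Mamba(d_model=dim), Mamba(d_model=dim)
        self.norm_f, self.norm_b = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x):                                    # x: (B, L, dim)
        p3 = x + self.norm_f(self.fwd(x))                    # P3 = in1 + LN(Ma(in1))
        p4 = torch.flip(x, dims=[1])                         # flip layer -> P4
        p7 = p4 + self.norm_b(self.bwd(p4))                  # P7 = P4 + LN(Ma(P4))
        p8 = torch.cat([p3, torch.flip(p7, dims=[1])], -1)   # Constant layer -> P8
        return self.fc(p8)                                   # out1 = FC(P8)

class UniMambaLayer(nn.Module):
    """T_Mamba layer sketch: out2 = in2 + FC(LN(Ma(in2)))."""
    def __init__(self, dim: int):
        super().__init__()
        self.mamba = Mamba(d_model=dim)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, L, dim)
        return x + self.fc(self.norm(self.mamba(x)))
```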
3022. The T_Mamba part may adopt the design of fig. 8, which includes 2 Transpose layers, 1 one-dimensional padding layer, 1 Unfold layer, 1 T_Mamba layer, and 1 one-dimensional transposed convolution layer.
In any T_Mamba part:
the 1st Transpose layer tensor-converts the input of the T_Mamba part to obtain intermediate data T1;
the one-dimensional padding layer pads T1 to obtain intermediate data T2;
the Unfold layer extracts abstract features from T2 to obtain intermediate data T3;
the T_Mamba layer performs sequence modeling on T3 through unidirectional Mamba to obtain intermediate data T4;
the one-dimensional transposed convolution layer performs one-dimensional transposed convolution on T4 to obtain intermediate data T5;
the 2nd Transpose layer tensor-converts T5 to obtain the output of the T_Mamba part.
That is, the processing of the T_Mamba part can be formulated as:

T1 = Tran(In2);  T2 = Pad(T1);  T3 = Unfold(T2);
T4 = T_Ma(T3);  T5 = Deconv(T4);  Out2 = Tran(T5);

where In2 denotes the input of the T_Mamba part, Out2 its output, Tran(·) the processing of the Transpose layers, Pad(·) the processing of the one-dimensional padding layer, Unfold(·) the processing of the Unfold layer, T_Ma(·) the processing of the T_Mamba layer, and Deconv(·) the processing of the one-dimensional transposed convolution layer.
Referring to fig. 8, the T_Mamba layer includes 1 Mamba layer, 1 Norm layer, 1 FC layer, and 1 superimposing layer.
In any T_Mamba layer:
the Mamba layer models the input of the T_Mamba layer to obtain intermediate data Q1;
the Norm layer normalizes Q1 to obtain intermediate data Q2;
the FC layer linearly maps Q2 to obtain intermediate data Q3;
the superimposing layer superimposes the input of the T_Mamba layer with Q3 to obtain the output of the T_Mamba layer.
That is, the processing of the T_Mamba layer can be formulated as:

Q1 = Ma(in2);  Q2 = LN(Q1);  Q3 = FC(Q2);  out2 = in2 + Q3;

where in2 denotes the input of the T_Mamba layer, out2 its output, and Ma(·), LN(·), and FC(·) denote the processing of the Mamba layer, the Norm layer, and the FC layer, respectively.
(IV) The complex spectrum feature modeling part aims to restore the complex spectrum features to the original dimension for subsequent synthesis.
The complex spectral feature modeling part may adopt the design of FIG. 9, which includes 1 F1E-Real layer, 1 F2E-Imag layer, and 1 superimposing layer.
In the complex spectral feature modeling part:
the F1E-Real layer performs feature modeling on C to obtain a real-part intermediate feature c1;
the F2E-Imag layer performs feature modeling on C to obtain an imaginary-part intermediate feature c2;
the superimposing layer superimposes c1 and c2 to obtain D.
That is, the processing of the complex spectral feature modeling part can be formulated as:

c1 = F1E_Real(C);  c2 = F2E_Imag(C);  D = c1 + c2;

where F1E_Real(·) denotes the processing of the F1E-Real layer and F2E_Imag(·) the processing of the F2E-Imag layer.
401. The F1E-Real layer may adopt the design of FIG. 10, which includes 3 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions, and 1 Sigmoid activation function.
In the F1E-Real layer:
the 1st two-dimensional convolution layer upsamples C to obtain intermediate data u1;
the Norm layer normalizes u1 to obtain intermediate data u2;
the 1st PReLU activation function introduces a nonlinearity into u2 to obtain intermediate data u3;
the 2nd two-dimensional convolution layer performs two-dimensional convolution on u3 to obtain intermediate data u4;
the Sigmoid activation function introduces a gating nonlinearity into u4 to obtain intermediate data u5;
the 3rd two-dimensional convolution layer performs feature extraction on u5 to obtain intermediate data u6;
the 2nd PReLU activation function introduces a nonlinearity into u6 to obtain c1.
That is, the processing of the F1E-Real layer can be formulated as:

u1 = Conv2d(C);  u2 = LN(u1);  u3 = PReLU(u2);
u4 = Conv2d(u3);  u5 = Sigmoid(u4);
u6 = Conv2d(u5);  c1 = PReLU(u6);

where PReLU(·) denotes the processing of the PReLU activation functions, Conv2d(·) the processing of the two-dimensional convolution layers, LN(·) the processing of the Norm layer, and Sigmoid(·) the processing of the Sigmoid activation function.
402. The F2E-Imag layer may adopt the design of FIG. 10, which includes 4 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions, 1 Sigmoid activation function, 1 Tanh activation function, and 1 superimposing layer.
In the F2E-Imag layer:
the 1st two-dimensional convolution layer upsamples C to obtain intermediate data U1;
the Norm layer normalizes U1 to obtain intermediate data U2;
the 1st PReLU activation function introduces a nonlinearity into U2 to obtain intermediate data U3;
the 2nd two-dimensional convolution layer performs two-dimensional convolution on U3 to obtain intermediate data U4;
the Sigmoid activation function introduces a gating nonlinearity into U4 to obtain intermediate data U5;
the 3rd two-dimensional convolution layer performs two-dimensional convolution on U3 to obtain intermediate data U6;
the Tanh activation function introduces a nonlinearity into U6 to obtain intermediate data U7;
the superimposing layer superimposes U5 and U7 to obtain intermediate data U8;
the 4th two-dimensional convolution layer performs feature extraction on U8 to obtain intermediate data U9;
the 2nd PReLU activation function introduces a nonlinearity into U9 to obtain c2.
That is, the processing of the F2E-Imag layer can be formulated as:

U1 = Conv2d(C);  U2 = LN(U1);  U3 = PReLU(U2);
U4 = Conv2d(U3);  U5 = Sigmoid(U4);
U6 = Conv2d(U3);  U7 = Tanh(U6);
U8 = U5 + U7;  U9 = Conv2d(U8);  c2 = PReLU(U9);

where PReLU(·) denotes the processing of the PReLU activation functions, Conv2d(·) the processing of the two-dimensional convolution layers, LN(·) the processing of the Norm layer, Sigmoid(·) the processing of the Sigmoid activation function, and Tanh(·) the processing of the Tanh activation function. A sketch of this gated decoder pattern follows.
(V) The amplitude spectrum feature modeling part aims to restore the amplitude spectrum features to the original dimension for subsequent synthesis.
The amplitude spectrum feature modeling part may adopt the design of fig. 11, which includes 1 F2E-Mask layer and 1 matrix multiplication layer.
In the amplitude spectrum feature modeling part:
the F2E-Mask layer models H to obtain an amplitude spectrum feature mask H';
the matrix multiplication layer performs matrix multiplication on H' and E to obtain L.
That is, the processing of the amplitude spectrum feature modeling part can be formulated as:

H' = F2E_Mask(H);  L = H' ⊗ E;

where F2E_Mask(·) denotes the processing of the F2E-Mask layer and ⊗ denotes matrix multiplication.
The F2E-Mask layer may adopt the design of FIG. 12, which includes 4 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions, 1 Sigmoid activation function, 1 Tanh activation function, and 1 superimposing layer.
In the F2E-Mask layer:
the 1st two-dimensional convolution layer upsamples H to obtain intermediate data h1;
the Norm layer normalizes h1 to obtain intermediate data h2;
the 1st PReLU activation function introduces a nonlinearity into h2 to obtain intermediate data h3;
the 2nd two-dimensional convolution layer performs two-dimensional convolution on h3 to obtain intermediate data h4;
the Sigmoid activation function introduces a gating nonlinearity into h4 to obtain intermediate data h5;
the 3rd two-dimensional convolution layer performs two-dimensional convolution on h3 to obtain intermediate data h6;
the Tanh activation function introduces a nonlinearity into h6 to obtain intermediate data h7;
the superimposing layer superimposes h5 and h7 to obtain intermediate data h8;
the 4th two-dimensional convolution layer performs feature extraction on h8 to obtain intermediate data h9;
the 2nd PReLU activation function introduces a nonlinearity into h9 to obtain H'.
That is, the processing of the F2E-Mask layer can be formulated as:

h1 = Conv2d(H);  h2 = LN(h1);  h3 = PReLU(h2);
h4 = Conv2d(h3);  h5 = Sigmoid(h4);
h6 = Conv2d(h3);  h7 = Tanh(h6);
h8 = h5 + h7;  h9 = Conv2d(h8);  H' = PReLU(h9);

where PReLU(·) denotes the processing of the PReLU activation functions, Conv2d(·) the processing of the two-dimensional convolution layers, LN(·) the processing of the Norm layer, Sigmoid(·) the processing of the Sigmoid activation function, and Tanh(·) the processing of the Tanh activation function.
(VI) The band synthesis part aims to recombine the divided bands into a complex spectrum.
The band synthesis part may adopt the design of fig. 13, which includes 1 superimposing layer, 1 block disassembly layer, 1 normalization layer, 1 FC layer, 1 Tanh activation function, 1 GLU layer, and 1 band merging layer.
In the band synthesis part:
the superimposing layer superimposes D and L to obtain a complex spectral feature block V0;
the block disassembly layer disassembles V0 into M original complex spectral tensors V1;
the normalization layer normalizes V1 to obtain normalized complex spectral tensors V2;
the FC layer linearly maps V2 to obtain doubled complex spectral tensors V3;
the Tanh activation function introduces a nonlinearity into V3 to obtain activated complex spectral tensors V4;
the GLU layer reduces the dimension of V4 to obtain the final split-band masks V5;
the band merging layer merges V5 along the frequency dimension to obtain Z';
where V1 = [v1_1, v1_2, …, v1_M], and v1_m denotes the m-th original complex spectral tensor in V1;
V2 = [v2_1, v2_2, …, v2_M], and v2_m denotes the m-th normalized complex spectral tensor in V2;
V3 = [v3_1, v3_2, …, v3_M], and v3_m denotes the m-th doubled complex spectral tensor in V3;
V4 = [v4_1, v4_2, …, v4_M], and v4_m denotes the m-th activated complex spectral tensor in V4;
V5 = [v5_1, v5_2, …, v5_M], and v5_m denotes the m-th final split-band mask in V5.
That is, the processing of the band synthesis part can be formulated as:

V0 = D + L;
V1 = Exp(V0);  V2 = LN(V1);  V3 = FC(V2);
V4 = Tanh(V3);  V5 = GLU(V4);
Z' = Merge(V5);

where Exp(·) denotes the processing of the block disassembly layer, LN(·) the normalization layer, FC(·) the FC layer, Tanh(·) the Tanh activation function, GLU(·) the GLU layer, and Merge(·) the band merging layer.
In this embodiment 1, M is 20 and the GLU reduces V5 to the split-band sizes [2,3,3,3,3,3,3,8,8,8,8,8,8,8,8,12,16,16,16,17], matching the preset specification of the band division parts. A sketch of this band synthesis part follows.
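A sketch of this synthesis step under the same assumptions as before (per-band Linear and LayerNorm, and a GLU halving the doubled width back to 2 values per bin for the real and imaginary mask parts):

```python
import torch
import torch.nn as nn

BAND_WIDTHS = [2, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 12, 16, 16, 16, 17]

class BandMerge(nn.Module):
    """Band synthesis sketch: superimpose D and L, then per band apply
    LayerNorm, an FC layer that doubles the output width, Tanh, and a GLU
    (halving it again); finally merge the per-band complex masks back to
    the full 161-bin spectrum."""
    def __init__(self, feat: int = 64):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(feat) for _ in BAND_WIDTHS)
        # 2*w mask values per band (real and imaginary per bin), doubled
        # once more so the GLU can gate them back down
        self.fcs = nn.ModuleList(nn.Linear(feat, 4 * w) for w in BAND_WIDTHS)
        self.glu = nn.GLU(dim=-1)

    def forward(self, d: torch.Tensor, l_amp: torch.Tensor) -> torch.Tensor:
        v0 = d + l_amp                             # D + L; (B, M, T, feat)
        masks = []
        for m, (norm, fc) in enumerate(zip(self.norms, self.fcs)):
            v5 = self.glu(torch.tanh(fc(norm(v0[:, m]))))   # (B, T, 2*w_m)
            masks.append(v5)
        z = torch.cat(masks, dim=-1)               # (B, T, 2*161)
        B, T, _ = z.shape
        return z.view(B, T, 161, 2).permute(0, 2, 1, 3)     # (B, F=161, T, re/im)
```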
It should be noted that the present method requires the use of a trained speech enhancement network whose network parameters are optimal. The voice enhancement network is typically trained based on existing known voice datasets to ensure the network training effect.
The "RI+Mag" loss function Loss is employed when training the speech enhancement network, so that both the phase-bearing complex components and the amplitude component are optimized. The "RI+Mag" loss function Loss is expressed as:

Loss = β · L_RI + (1 − β) · L_Mag;

where L_RI denotes the minimum mean square error loss function of the complex spectrum, L_Mag denotes the minimum mean square error loss function of the magnitude spectrum, and β denotes the weight coefficient, typically 0.5. A sketch of this loss follows.
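A sketch of this loss under the stated assumption that the two MSE terms are mixed convexly by β:

```python
import torch
import torch.nn.functional as F

def ri_mag_loss(est: torch.Tensor, ref: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """'RI+Mag' loss sketch: MSE on the real/imaginary parts (L_RI) plus
    MSE on the magnitudes (L_Mag), weighted by beta."""
    # est, ref: complex spectrograms of identical shape, e.g. from torch.stft
    l_ri = F.mse_loss(torch.view_as_real(est), torch.view_as_real(ref))
    l_mag = F.mse_loss(est.abs(), ref.abs())
    return beta * l_ri + (1.0 - beta) * l_mag
```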
Simulation verification
The existing public datasets WSJ0-SI84, DNS-Challenge, VoiceBank, and DEMAND are employed.
1. An ablation experiment was performed on BSDB-Net based on WSJ0-SI84 + DNS-Challenge to evaluate the effectiveness of the dual-branch structural design.
The CEN branch of BSDB-Net was removed and the MEN branch retained and trained alone (denoted BSDB-MEN); conversely, the MEN branch was removed and the CEN branch retained and trained alone (denoted BSDB-CEN).
The performance of BSDB-MEN, BSDB-CEN, BSDB-Net was compared and the results are shown in Table 1:
Table 1 comparison of ablation experiments
Here PESQ is a perceptual evaluation index of narrowband and wideband speech quality (the larger the value, the better); ESTOI is an extended version of the short-time objective intelligibility metric STOI for measuring intelligibility (the larger, the better); and the remaining index evaluates the degree of speech distortion (the larger, the better).
As shown in Table 1, the dual-path structure outperforms the single-path structures on all indices. This means that the coordinated efforts of the CEN branch and the MEN branch improve the quality of the target speech: the CEN branch filters out the main noise as a rough estimate, while the MEN branch continuously supplements speech information, improving the overall performance of the system.
2. A replacement experiment was performed on BSDB-Net based on WSJ0-SI84 + DNS-Challenge to evaluate the effectiveness of the sequence modeling part design.
The sequence modeling part of BSDB-Net was replaced with an LSTM (denoted BSDB-LSTM) and with a Transformer (denoted BSDB-Transformer).
The performance of BSDB-LSTM, BSDB-Transformer, BSDB-Net was compared and the results are shown in Table 2:
table 2 comparative results of substitution experiments
Here PESQ is a perceptual evaluation index of narrowband and wideband speech quality (the larger, the better); MACs denote network complexity; and Parameters denotes the number of network parameters.
As shown in Table 2, BSDB-Net greatly reduces network complexity and the number of network parameters while maintaining optimal performance. Although the performance improvement of BSDB-Net over BSDB-Transformer is modest, the network complexity and the number of network parameters are significantly reduced.
3. The complexity of BSDB-Net was compared with that of 10 other models (ConvTasNet, DPRNN, DDAEC, LSTM, CRN, GCRN, DCCRN, FullSubNet, CTSNet, GaGNet) based on WSJ0-SI84 + DNS-Challenge, and the results are shown in Table 3.
Table 3 complexity comparison results
Here PESQ is a perceptual evaluation index of narrowband and wideband speech quality (the larger, the better), and MACs denote network complexity.
As shown in Table 3, BSDB-Net is about 8 times lower in average computational complexity than other models and has good performance.
4. BSDB-Net was compared with 14 other models (SEGAN, MMSEGAN, MetricGAN, SRTNET, Wavenet, PHASEN, MHSA-SPK, DCCRN, TSTNN, S4DSE, FDFnet, CTS-net, GaGnet, CompNet) based on VoiceBank + DEMAND, and the results are shown in Table 4.
Table 4 results of performance comparisons
Here PESQ is a perceptual evaluation index of narrowband and wideband speech quality (the larger, the better); STOI measures short-time objective intelligibility (the larger, the better); and CSIG, CBAK, and COVL are indices for evaluating speech quality (the larger, the better).
As shown in Table 4, BSDB-Net achieves average improvements of 0.32, 1.5%, 0.31, 0.17, and 0.35 over the other models in PESQ, STOI, CSIG, CBAK, and COVL, respectively, also illustrating the general applicability of the present method.
Example 2
Embodiment 2 provides a mono speech enhancement device based on a dual-branch network, which uses the mono speech enhancement method based on the dual-branch network disclosed in embodiment 1.
The single-channel voice enhancement device based on the dual-branch network comprises a preprocessing module, a voice enhancement module and a post-processing module.
The preprocessing module is used for acquiring noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0.
The voice enhancement module is used for inputting X0 and Y0 into a trained voice enhancement network for processing to obtain an enhanced complex spectrum Z'.
The post-processing module is used for carrying out mask processing on Z and Z' first and then carrying out inverse Fourier transform to obtain enhanced voice.
Since the present apparatus uses the monaural speech enhancement method based on the dual-branch network in embodiment 1, the same effect is also obtained and will not be repeated here.
Example 3
Embodiment 3 discloses a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the dual-branch network-based mono speech enhancement method disclosed in embodiment 1 when executing the computer program.
Embodiment 3 also discloses a readable storage medium having stored therein computer program instructions which, when read and executed by a processor, perform the steps of the dual-branch network-based mono speech enhancement method disclosed in embodiment 1.
Embodiment 3 also discloses a computer program product comprising a computer program. The computer program, when executed by a processor, implements the steps of the dual-branch network-based mono speech enhancement method disclosed in embodiment 1.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A method for mono speech enhancement based on a dual-branch network, comprising:
Firstly, obtaining noisy speech, performing Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0;
inputting X0 and Y0 into a trained voice enhancement network for processing to obtain an enhanced complex spectrum Z';
The voice enhancement network comprises 2 frequency band dividing parts, 2 feature extracting parts, 1 sequence modeling part, 1 complex spectrum feature modeling part, 1 amplitude spectrum feature modeling part and 1 frequency band synthesizing part;
The method comprises the steps of carrying out frequency band segmentation processing on X0 by a 1st frequency band segmentation part to obtain an intermediate complex spectrum characteristic A, carrying out frequency band segmentation processing on Y0 by a 2nd frequency band segmentation part to obtain an intermediate amplitude spectrum characteristic E, extracting an intermediate characteristic B based on A and E by a 1st characteristic extraction part, extracting an intermediate characteristic F based on A and E by a 2nd characteristic extraction part, carrying out sequence modeling on B and F by a sequence modeling part constructed based on Mamba to obtain an intermediate characteristic C and an intermediate characteristic H, carrying out characteristic modeling on C by a complex spectrum characteristic modeling part to obtain a final complex spectrum characteristic D, carrying out characteristic modeling on H by an amplitude spectrum characteristic modeling part to obtain a final amplitude spectrum characteristic L, and carrying out synthesis processing on D and L by a frequency band synthesis part to obtain Z';
and thirdly, performing mask processing on Z and Z', and performing inverse Fourier transform to obtain enhanced voice.
2. The method for enhancing mono speech based on the dual-branch network according to claim 1, wherein the band division part comprises 1 spectrum segmentation layer, 1 normalization layer, 1 FC layer, and 1 dimension merging layer;
in the 1st band division part, the spectrum segmentation layer segments X0 according to a preset specification to obtain M original complex spectrum bands X1, the normalization layer normalizes X1 to obtain M normalized complex spectral features X2, the FC layer unifies X2 in the frequency dimension to obtain M unified complex spectral features X3, and the dimension merging layer merges X3 in the frequency dimension to obtain A;
where X1 = [x1_1, x1_2, …, x1_M], and x1_m denotes the m-th original complex band in X1, m ∈ [1, M];
X2 = [x2_1, x2_2, …, x2_M], and x2_m denotes the m-th normalized complex spectral feature in X2;
X3 = [x3_1, x3_2, …, x3_M], and x3_m denotes the m-th unified complex spectral feature in X3;
in the 2nd band division part, the spectrum segmentation layer segments Y0 according to the preset specification to obtain M original amplitude bands Y1, the normalization layer normalizes Y1 to obtain M normalized amplitude spectral features Y2, the FC layer unifies Y2 in the frequency dimension to obtain M unified amplitude spectral features Y3, and the dimension merging layer merges Y3 in the frequency dimension to obtain E;
where Y1 = [y1_1, y1_2, …, y1_M], and y1_m denotes the m-th original amplitude band in Y1;
Y2 = [y2_1, y2_2, …, y2_M], and y2_m denotes the m-th normalized amplitude spectral feature in Y2;
Y3 = [y3_1, y3_2, …, y3_M], and y3_m denotes the m-th unified amplitude spectral feature in Y3.
3. The method for enhancing mono speech based on the dual-branch network according to claim 1, wherein the feature extraction part comprises 1 two-dimensional convolution layer, 1 normalization layer, 1 PReLU activation function, and 1 information interaction layer;
in the 1st feature extraction part, the two-dimensional convolution layer performs feature extraction on A to obtain complex spectrum information a1, the normalization layer normalizes a1 to obtain complex spectrum information a2, the PReLU activation function introduces a nonlinearity into a2 to obtain complex spectrum information a3, and the information interaction layer performs interaction processing on a3 and E to obtain B;
in the 2nd feature extraction part, the two-dimensional convolution layer performs feature extraction on E to obtain amplitude spectrum information e1, the normalization layer normalizes e1 to obtain amplitude spectrum information e2, the PReLU activation function introduces a nonlinearity into e2 to obtain amplitude spectrum information e3, and the information interaction layer performs interaction processing on e3 and A to obtain F.
4. The method for enhancing mono speech based on the two-branch network according to claim 3, wherein the sequence modeling part comprises N Mamba model parts with identical structures;
the n-th Mamba model part comprises 1 information interaction layer, 2 superposition layers and 2 FT-Mamba parts;
in the n-th Mamba model part, the information interaction layer is used for carrying out interaction processing on the intermediate spectrum feature W_B n-1 and the intermediate spectrum feature W_F n-1 to obtain an intermediate spectrum feature U n; the 1st superposition layer is used for superposing U n and W_B n-1 to obtain an intermediate spectrum feature U_B n; the 2nd superposition layer is used for superposing U n and W_F n-1 to obtain an intermediate spectrum feature U_F n; the 1st FT-Mamba part is used for carrying out sequence modeling on U_B n based on Mamba to obtain an intermediate spectrum feature W_B n; the 2nd FT-Mamba part is used for carrying out sequence modeling on U_F n based on Mamba to obtain an intermediate spectrum feature W_F n; and n ∈ [1, N];
wherein B is taken as the intermediate spectrum feature W_B 0 and F is taken as the intermediate spectrum feature W_F 0;
the intermediate spectrum feature W_B N is C and the intermediate spectrum feature W_F N is H.
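Reading the garbled per-branch symbols as separate features W_B and W_F for the complex-spectrum and amplitude branches (an assumption adopted above), one of the N identical blocks of claim 4 can be sketched as follows; the interaction and FT-Mamba modules are taken as given callables.

import torch.nn as nn

class MambaModelPart(nn.Module):
    # One block: the two branch features interact, the shared result is
    # residually added to each branch, and each branch is then refined by
    # its own FT-Mamba sequence model.
    def __init__(self, interact, ft_mamba_b, ft_mamba_f):
        super().__init__()
        self.interact = interact     # dual-input, single-output module (claim 5)
        self.ft_b = ft_mamba_b       # FT-Mamba part of the complex-spectrum branch
        self.ft_f = ft_mamba_f       # FT-Mamba part of the amplitude branch

    def forward(self, w_b, w_f):     # W_B n-1, W_F n-1
        u = self.interact(w_b, w_f)  # U n
        u_b, u_f = u + w_b, u + w_f  # the two superposition layers
        return self.ft_b(u_b), self.ft_f(u_f)   # W_B n, W_F n

Stacking N such blocks with B and F as W_B 0 and W_F 0 then yields C = W_B N and H = W_F N.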
5. The method for enhancing mono speech based on the dual-branch network according to claim 3 or 4, wherein the information interaction layer comprises 1 joint layer, 1 sub two-dimensional convolution layer, 1 sub normalization layer and 1 sub PReLU activation function;
in any information interaction layer, the joint layer is used for connecting the two inputs of the information interaction layer together to obtain intermediate data G1; the sub two-dimensional convolution layer is used for carrying out feature extraction on G1 to obtain intermediate data G2; the sub normalization layer is used for normalizing G2 to obtain intermediate data G3; and the sub PReLU activation function is used for introducing nonlinear features into G3 to obtain the single output of the information interaction layer.
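A minimal sketch of the information interaction layer of claim 5, together with a feature extraction part of claim 3 that ends in it. Channel counts, kernel sizes, and the normalization type are assumptions, not the patent's stated values; tensors are assumed shaped (batch, channels, time, frequency).

import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    # Join the two inputs on the channel axis, then Conv2d -> norm -> PReLU
    # down to a single output.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1),  # sub 2-D convolution
            nn.GroupNorm(1, ch),                              # sub normalization (assumed type)
            nn.PReLU(),                                       # sub PReLU
        )

    def forward(self, x, y):                                  # dual input
        return self.body(torch.cat([x, y], dim=1))            # single output

class FeatureExtraction(nn.Module):
    # Claim 3: Conv2d -> norm -> PReLU on one branch, then interact with the other.
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, ch)
        self.act = nn.PReLU()
        self.interact = InteractionLayer(ch)

    def forward(self, a, e):                                  # own branch a, other branch e
        a3 = self.act(self.norm(self.conv(a)))                # a1 -> a2 -> a3
        return self.interact(a3, e)                           # B (or F)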
6. The method for enhancing mono speech based on the two-branch network according to claim 4, wherein the FT-Mamba part comprises 5 sub-parts;
wherein the 1st sub-part is a normalization layer, the 2nd sub-part is an F-Mamba part, the 3rd sub-part is a superposition layer, the 4th sub-part is a T-Mamba part, and the 5th sub-part is a superposition layer;
or the 1st sub-part is a normalization layer, the 2nd sub-part is a T-Mamba part, the 3rd sub-part is a superposition layer, the 4th sub-part is an F-Mamba part, and the 5th sub-part is a superposition layer;
in any FT-Mamba part, the 1st sub-part is used for normalizing the input of the FT-Mamba part to obtain intermediate data J1; the 2nd sub-part is used for carrying out sequence modeling on J1 to obtain intermediate data J2 (in the frequency dimension when the 2nd sub-part is an F-Mamba part, in the time dimension when it is a T-Mamba part); the 3rd sub-part is used for superposing the input of the FT-Mamba part and J2 to obtain intermediate data J3; the 4th sub-part is used for carrying out sequence modeling on J3 in the remaining dimension to obtain intermediate data J4; and the 5th sub-part is used for superposing J3 and J4 to obtain the output of the FT-Mamba part.
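Under the first ordering of claim 6 (frequency modeling first, then time), an FT-Mamba part reduces to two residual stages around a shared pre-normalization. A sketch, taking the two inner sequence models as given callables:

import torch.nn as nn

class FTMamba(nn.Module):
    # Claim 6, first ordering: norm -> frequency modeling -> residual ->
    # time modeling -> residual.
    def __init__(self, d_model, f_mamba, t_mamba):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.f_mamba = f_mamba       # sequence model over the frequency axis
        self.t_mamba = t_mamba       # sequence model over the time axis

    def forward(self, x):
        j2 = self.f_mamba(self.norm(x))   # J1 -> J2
        j3 = x + j2                       # 3rd sub-part (superposition)
        j4 = self.t_mamba(j3)             # J4
        return j3 + j4                    # 5th sub-part (superposition)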
7. The method for enhancing mono speech based on the two-branch network according to claim 6, wherein the F-Mamba part comprises 2 Transpose layers, 1 Unfold layer, 1 F-Mamba layer and 1 one-dimensional transposed convolution layer;
in any F-Mamba part, the 1st Transpose layer is used for carrying out tensor transformation on the input of the F-Mamba part to obtain intermediate data R1; the Unfold layer is used for upsampling R1 to obtain intermediate data R2; the F-Mamba layer is used for carrying out sequence modeling on R2 through a bidirectional Mamba to obtain intermediate data R3; the one-dimensional transposed convolution layer is used for carrying out one-dimensional transposed convolution on R3 to obtain intermediate data R4; and the 2nd Transpose layer is used for carrying out tensor transformation on R4 to obtain the output of the F-Mamba part;
the F-Mamba layer comprises 1 flip layer, 2 Mamba layers, 2 Norm layers, 2 superposition layers, 1 Constant layer and 1 Linear layer;
in any F-Mamba layer, the 1st Mamba layer is used for carrying out sequence modeling on the input of the F-Mamba layer to obtain intermediate data P1; the 1st Norm layer is used for normalizing P1 to obtain intermediate data P2; the 1st superposition layer is used for superposing the input of the F-Mamba layer and P2 to obtain intermediate data P3; the flip layer is used for reversing the input of the F-Mamba layer to obtain intermediate data P4; the 2nd Mamba layer is used for carrying out sequence modeling on P4 to obtain intermediate data P5; the 2nd Norm layer is used for normalizing P5 to obtain intermediate data P6; the 2nd superposition layer is used for superposing P4 and P6 to obtain intermediate data P7; the Constant layer is used for carrying out tensor adjustment on P3 and P7 to obtain intermediate data P8; and the Linear layer is used for carrying out linear mapping on P8 to obtain the output of the F-Mamba layer;
the T-Mamba part comprises 2 Transpose layers, 1 one-dimensional padding layer, 1 Unfold layer, 1 T-Mamba layer and 1 one-dimensional transposed convolution layer;
in any T-Mamba part, the 1st Transpose layer is used for carrying out tensor transformation on the input of the T-Mamba part to obtain intermediate data T1; the one-dimensional padding layer is used for padding T1 to obtain intermediate data T2; the Unfold layer is used for carrying out abstract feature extraction on T2 to obtain intermediate data T3; the T-Mamba layer is used for carrying out sequence modeling on T3 through a unidirectional Mamba to obtain intermediate data T4; the one-dimensional transposed convolution layer is used for carrying out one-dimensional transposed convolution on T4 to obtain intermediate data T5; and the 2nd Transpose layer is used for carrying out tensor transformation on T5 to obtain the output of the T-Mamba part;
wherein the T-Mamba layer comprises 1 Mamba layer, 1 Norm layer, 1 Linear layer and 1 superposition layer;
in any T-Mamba layer, the Mamba layer is used for carrying out sequence modeling on the input of the T-Mamba layer to obtain intermediate data Q1; the Norm layer is used for normalizing Q1 to obtain intermediate data Q2; the Linear layer is used for carrying out linear mapping on Q2 to obtain intermediate data Q3; and the superposition layer is used for superposing the input of the T-Mamba layer and Q3 to obtain the output of the T-Mamba layer.
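The F-Mamba and T-Mamba layers of claim 7 are, in essence, a bidirectional and a unidirectional residual Mamba block. The sketch below takes a factory for the inner sequence model (e.g. mamba_ssm.Mamba, or any module mapping (batch, seq, d) to (batch, seq, d)); reading the "Constant layer" as a join of P3 and P7, and flipping the reversed branch back before joining, are assumptions.

import torch
import torch.nn as nn

class FMambaLayer(nn.Module):
    # Bidirectional: a forward pass and a flipped pass, each with post-norm
    # and a residual superposition, joined and linearly mapped back to d_model.
    def __init__(self, d_model, make_seq_model):
        super().__init__()
        self.fwd, self.bwd = make_seq_model(d_model), make_seq_model(d_model)
        self.norm_f, self.norm_b = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)          # the Linear layer

    def forward(self, x):                                    # x: (batch, seq, d_model)
        p3 = x + self.norm_f(self.fwd(x))                    # forward branch, P3
        p4 = torch.flip(x, dims=[1])                         # flip layer, P4
        p7 = p4 + self.norm_b(self.bwd(p4))                  # reversed branch, P7
        p8 = torch.cat([p3, torch.flip(p7, dims=[1])], dim=-1)  # re-align, then join (assumed)
        return self.proj(p8)

class TMambaLayer(nn.Module):
    # Unidirectional: Mamba -> norm -> Linear, with one residual superposition.
    def __init__(self, d_model, make_seq_model):
        super().__init__()
        self.seq = make_seq_model(d_model)
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.proj(self.norm(self.seq(x)))         # input + Q3

For a dependency-free smoke test, FMambaLayer(64, lambda d: nn.Identity()) runs as written; substituting a real Mamba block recovers the intended behavior.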
8. The method for enhancing mono speech based on the dual-branch network according to claim 1, wherein the complex spectrum feature modeling part comprises 1 F1E-Real layer, 1 F2E-Imag layer and 1 superposition layer;
in the complex spectrum feature modeling part, the F1E-Real layer is used for carrying out feature modeling on C to obtain a real-part intermediate feature C1; the F2E-Imag layer is used for carrying out feature modeling on C to obtain an imaginary-part intermediate feature C2; and the superposition layer is used for superposing C1 and C2 to obtain D;
the F1E-Real layer comprises 3 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions and 1 Sigmoid activation function;
in the F1E-Real layer, the 1st two-dimensional convolution layer is used for upsampling C to obtain intermediate data U1; the Norm layer is used for normalizing U1 to obtain intermediate data U2; the 1st PReLU activation function is used for introducing nonlinear features into U2 to obtain intermediate data U3; the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution on U3 to obtain intermediate data U4; the Sigmoid activation function is used for introducing nonlinear features into U4 to obtain intermediate data U5; the 3rd two-dimensional convolution layer is used for carrying out feature extraction on U5 to obtain intermediate data U6; and the 2nd PReLU activation function is used for introducing nonlinear features into U6 to obtain C1;
the F2E-Imag layer comprises 4 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions, 1 Sigmoid activation function, 1 Tanh activation function and 1 superposition layer;
in the F2E-Imag layer, the 1st two-dimensional convolution layer is used for upsampling C to obtain intermediate data U1; the Norm layer is used for normalizing U1 to obtain intermediate data U2; the 1st PReLU activation function is used for introducing nonlinear features into U2 to obtain intermediate data U3; the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution on U3 to obtain intermediate data U4; the Sigmoid activation function is used for introducing nonlinear features into U4 to obtain intermediate data U5; the 3rd two-dimensional convolution layer is used for carrying out two-dimensional convolution on U3 to obtain intermediate data U6; the Tanh activation function is used for introducing nonlinear features into U6 to obtain intermediate data U7; the superposition layer is used for superposing U5 and U7 to obtain intermediate data U8; the 4th two-dimensional convolution layer is used for carrying out feature extraction on U8 to obtain intermediate data U9; and the 2nd PReLU activation function is used for introducing nonlinear features into U9 to obtain C2;
the amplitude spectrum feature modeling part comprises 1 F2E-Mask layer and 1 matrix multiplication layer;
in the amplitude spectrum feature modeling part, the F2E-Mask layer is used for carrying out feature modeling on H to obtain an amplitude spectrum feature mask H'; and the matrix multiplication layer is used for carrying out matrix multiplication on H' and E to obtain L;
the F2E-Mask layer comprises 4 two-dimensional convolution layers, 1 Norm layer, 2 PReLU activation functions, 1 Sigmoid activation function, 1 Tanh activation function and 1 superposition layer;
in the F2E-Mask layer, the 1st two-dimensional convolution layer is used for upsampling H to obtain intermediate data H1; the Norm layer is used for normalizing H1 to obtain intermediate data H2; the 1st PReLU activation function is used for introducing nonlinear features into H2 to obtain intermediate data H3; the 2nd two-dimensional convolution layer is used for carrying out two-dimensional convolution on H3 to obtain intermediate data H4; the Sigmoid activation function is used for introducing nonlinear features into H4 to obtain intermediate data H5; the 3rd two-dimensional convolution layer is used for carrying out two-dimensional convolution on H3 to obtain intermediate data H6; the Tanh activation function is used for introducing nonlinear features into H6 to obtain intermediate data H7; the superposition layer is used for superposing H5 and H7 to obtain intermediate data H8; the 4th two-dimensional convolution layer is used for carrying out feature extraction on H8 to obtain intermediate data H9; and the 2nd PReLU activation function is used for introducing nonlinear features into H9 to obtain H'.
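The F2E-Imag and F2E-Mask layers of claim 8 share one pattern: a stem convolution with normalization and PReLU, two parallel gated convolution paths (Sigmoid and Tanh) that are superposed, and a final convolution with PReLU. A hedged sketch, with all channel counts and kernel sizes assumed, and the "matrix multiplication" of the mask with E read as an element-wise product:

import torch
import torch.nn as nn

class F2EBlock(nn.Module):
    # Shared pattern of the F2E-Imag / F2E-Mask layers.
    def __init__(self, ch):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),    # 1st conv ("upsampling")
            nn.GroupNorm(1, ch),                # Norm layer (assumed type)
            nn.PReLU(),
        )
        self.gate_s = nn.Conv2d(ch, ch, 3, padding=1)   # 2nd conv -> Sigmoid
        self.gate_t = nn.Conv2d(ch, ch, 3, padding=1)   # 3rd conv -> Tanh
        self.head = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU())

    def forward(self, h):
        h3 = self.stem(h)                                                   # H1..H3
        h8 = torch.sigmoid(self.gate_s(h3)) + torch.tanh(self.gate_t(h3))   # H5 + H7
        return self.head(h8)                                                # H' (or C2)

def amplitude_branch(h, e, mask_block):
    # Claim 8's amplitude path: the mask H' scales the band feature E to give L
    # (element-wise reading of the matrix multiplication layer, an assumption).
    return mask_block(h) * e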
9. The method for enhancing mono speech based on the dual-branch network according to claim 1, wherein the frequency band synthesis part comprises 1 superposition layer, 1 block disassembly layer, 1 normalization layer, 1 Linear layer, 1 Tanh activation function, 1 GLU layer and 1 frequency band merging layer;
in the frequency band synthesis part, the superposition layer is used for superposing D and L to obtain a complex spectrum feature block V0; the block disassembly layer is used for disassembling V0 to obtain M original complex spectrum tensors V1; the normalization layer is used for normalizing V1 to obtain normalized complex spectrum tensors V2; the Linear layer is used for carrying out linear transformation on V2 to obtain doubled complex spectrum tensors V3; the Tanh activation function is used for introducing nonlinear features into V3 to obtain activated complex spectrum tensors V4; the GLU layer is used for carrying out dimension reduction on V4 to obtain final split-band masks V5; and the frequency band merging layer is used for merging V5 in the frequency dimension to obtain Z';
wherein V1 = [v11, v12, …, v1M], v1m represents the m-th original complex spectrum tensor in V1;
V2 = [v21, v22, …, v2M], v2m represents the m-th normalized complex spectrum tensor in V2;
V3 = [v31, v32, …, v3M], v3m represents the m-th doubled complex spectrum tensor in V3;
V4 = [v41, v42, …, v4M], v4m represents the m-th activated complex spectrum tensor in V4;
V5 = [v51, v52, …, v5M], v5m represents the m-th final split-band mask in V5.
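A sketch of the band synthesis part of claim 9, mirroring the band-split sketch after claim 2; the per-band widths and unified dimension are assumptions. The Linear layer doubles each band to width 2w, and the GLU halves it back to w before the bands are concatenated over frequency.

import torch
import torch.nn as nn

class BandMerge(nn.Module):
    # Per band: norm (V2) -> Linear to twice the band width (V3) -> Tanh (V4)
    # -> GLU back down to the band width (V5) -> concatenate over frequency.
    def __init__(self, band_widths, d_model):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in band_widths)
        self.proj = nn.ModuleList(nn.Linear(d_model, 2 * w) for w in band_widths)
        self.glu = nn.GLU(dim=-1)

    def forward(self, x):                        # x: (batch, M, time, d_model), i.e. D + L
        masks = [self.glu(torch.tanh(p(n(x[:, m]))))
                 for m, (n, p) in enumerate(zip(self.norms, self.proj))]
        return torch.cat(masks, dim=-1)          # (batch, time, F), i.e. Z'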
10. A dual-branch network-based mono speech enhancement apparatus using the dual-branch network-based mono speech enhancement method according to any one of claims 1-9;
the dual-branch network-based mono speech enhancement apparatus comprises:
the preprocessing module, which is used for acquiring noisy speech, carrying out Fourier transform on the noisy speech to obtain an original speech spectrum Z, and decoupling Z to obtain an original complex spectrum X0 and an original amplitude spectrum Y0;
the speech enhancement module, which is used for inputting X0 and Y0 into the trained speech enhancement network for processing to obtain an enhanced complex spectrum Z';
And
the post-processing module, which is used for carrying out mask processing on Z and Z' first and then carrying out inverse Fourier transform to obtain the enhanced speech.
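The three device modules of claim 10 map naturally onto an STFT / network / iSTFT pipeline. The sketch below treats the trained network as a given callable and reads the "mask processing on Z and Z'" as a complex multiplication; both the network interface and that reading are assumptions, not the patent's stated definitions.

import torch

def enhance(noisy, net, n_fft=512, hop=128):
    # Preprocessing module: Fourier transform, then decouple Z into the
    # complex spectrum X0 (real/imag channels) and the amplitude spectrum Y0.
    window = torch.hann_window(n_fft)
    z = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)
    x0 = torch.view_as_real(z)      # (batch, F, T, 2)
    y0 = z.abs()                    # (batch, F, T)

    # Speech enhancement module: the trained dual-branch network yields Z'
    # (assumed here to be returned as a complex tensor shaped like z).
    z_prime = net(x0, y0)

    # Post-processing module: mask Z with Z', then invert the transform.
    return torch.istft(z * z_prime, n_fft, hop, window=window)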
CN202411533326.0A 2024-10-31 2024-10-31 A method and device for monophonic speech enhancement based on dual-branch network Active CN119049489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411533326.0A CN119049489B (en) 2024-10-31 2024-10-31 A method and device for monophonic speech enhancement based on dual-branch network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411533326.0A CN119049489B (en) 2024-10-31 2024-10-31 A method and device for monophonic speech enhancement based on dual-branch network

Publications (2)

Publication Number Publication Date
CN119049489A (en) 2024-11-29
CN119049489B (en) 2025-01-03

Family

ID=93585577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411533326.0A Active CN119049489B (en) 2024-10-31 2024-10-31 A method and device for monophonic speech enhancement based on dual-branch network

Country Status (1)

Country Link
CN (1) CN119049489B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219109A (en) * 2023-10-17 2023-12-12 南京邮电大学 Double-branch voice enhancement algorithm based on structured state space sequence model
CN117558288A (en) * 2023-11-13 2024-02-13 广州大学 Training method, device, equipment and storage medium for single-channel speech enhancement model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571080B (en) * 2021-02-08 2024-11-08 腾讯科技(深圳)有限公司 Speech enhancement method, device, equipment and storage medium
CN114974292A (en) * 2022-05-23 2022-08-30 维沃移动通信有限公司 Audio enhancement method, apparatus, electronic device, and readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219109A (en) * 2023-10-17 2023-12-12 南京邮电大学 Double-branch voice enhancement algorithm based on structured state space sequence model
CN117558288A (en) * 2023-11-13 2024-02-13 广州大学 Training method, device, equipment and storage medium for single-channel speech enhancement model

Also Published As

Publication number Publication date
CN119049489A (en) 2024-11-29

Similar Documents

Publication Publication Date Title
CN111081268A (en) A Phase-Correlated Shared Deep Convolutional Neural Network Speech Enhancement Method
ES2278338T3 (en) DEVICE AND PROCEDURE FOR PROCESSING A SIGNAL.
Sakiyama et al. Spectral graph wavelets and filter banks with low approximation error
Venkataramani et al. Adaptive front-ends for end-to-end source separation
CN102663695A (en) DR image denoising method based on wavelet transformation and system thereof
JP2007526511A (en) Method and apparatus for blind separation of multipath multichannel mixed signals in the frequency domain
CN116994564B (en) Voice data processing method and processing device
CN106601266A (en) Echo cancellation method, device and system
CN113409216A (en) Image restoration method based on frequency band self-adaptive restoration model
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
Li et al. Deeplabv3+ vision transformer for visual bird sound denoising
CN116682444A (en) Single-channel voice enhancement method based on waveform spectrum fusion network
CN115273883A (en) Convolution cyclic neural network, and voice enhancement method and device
CN115295002A (en) Single-channel speech enhancement method based on interactive time-frequency attention mechanism
Islam et al. Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask
US9847085B2 (en) Filtering in the transformed domain
Stoeva et al. On the dual frame induced by an invertible frame multiplier
CN119049489B (en) A method and device for monophonic speech enhancement based on dual-branch network
CN114360571A (en) Reference-Based Speech Enhancement Methods
CN118969006A (en) A two-stage vocal accompaniment separation method based on separation first and then compensation
Yu et al. FSI-Net: A dual-stage full-and sub-band integration network for full-band speech enhancement
Ullah et al. Single channel speech dereverberation and separation using RPCA and SNMF
CN117219109A (en) Double-branch voice enhancement algorithm based on structured state space sequence model
CN116913303A (en) Single-channel voice enhancement method based on step-by-step amplitude compensation network
Hoang et al. Multi-stage temporal representation learning via global and local perspectives for real-time speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant