
CN114464201A - A single-channel speech enhancement method based on attention mechanism and convolutional neural network - Google Patents

A single-channel speech enhancement method based on attention mechanism and convolutional neural network

Info

Publication number
CN114464201A
CN114464201A
Authority
CN
China
Prior art keywords
speech
module
neural network
output
noisy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210125478.1A
Other languages
Chinese (zh)
Inventor
沈学利
田桂源
马琳琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University
Priority to CN202210125478.1A
Publication of CN114464201A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a single-channel speech enhancement method based on an attention mechanism and a convolutional neural network, comprising a training phase and an enhancement phase. Training phase: noise and clean speech are first synthesized into noisy speech; features are then extracted from the noisy speech; the clean speech and the feature-extracted noisy speech are then fed together into a neural network model to learn the mapping relationship between noisy speech and clean speech; after the training phase ends, the trained model is saved. Enhancement phase: features are first extracted from the noisy speech, which is then fed into the trained model for speech enhancement; finally, the enhanced speech is output. The method enhances noisy speech directly in the time domain, which not only saves the computation time of the Fourier transform but also preserves the phase information of the enhanced speech as far as possible, achieving a better noise reduction effect.

Description

A single-channel speech enhancement method based on an attention mechanism and a convolutional neural network

Technical Field

The invention belongs to the technical field of speech processing, and in particular relates to a single-channel speech enhancement method based on an attention mechanism and a convolutional neural network.

Background Art

Speech enhancement is a technique for recovering clean speech from noisy speech. By the number of channels, it divides into single-channel and multi-channel techniques; by the processing domain, into time-domain and frequency-domain speech enhancement; and by approach, into methods based on signal knowledge and methods based on machine learning. With the rapid development of computer technology, deep learning has proved helpful for solving problems across many fields, so deep-learning-based speech enhancement is becoming a research hotspot in the speech field.

Traditional single-channel speech enhancement research usually has to make certain assumptions about the interaction between the noise signal and the speech signal; precisely these prerequisite assumptions limit system performance, giving poor applicability and enhancement quality. Deep-learning-based speech enhancement needs no assumptions about the speech or noise signals: from large amounts of training data it learns the relationship between clean speech and noise directly, overcoming the inherent defects of traditional speech enhancement algorithms and achieving better denoising and generalization.

Speech enhancement methods based on convolutional neural networks can capture only the local information of speech and cannot exploit its global information efficiently, which is unfavorable for enhancement. Conversely, attention-based networks can extract the contextual information of speech but handle its local information poorly. In addition, most popular methods use the short-time Fourier transform to move noisy speech into the frequency domain for processing; frequency-domain enhancement generally ignores the phase of the enhanced speech and synthesizes the output directly with the phase of the noisy speech, and this loss of phase information also restricts the enhancement quality.

Summary of the Invention

In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a single-channel speech enhancement method based on an attention mechanism and a convolutional neural network that enhances noisy speech directly in the time domain, which not only saves the computation time of the Fourier transform but also preserves the phase information of the enhanced speech as far as possible, achieving a better noise reduction effect.

To solve the above technical problems, the present invention is realized through the following technical solutions:

The present invention provides a single-channel speech enhancement method based on an attention mechanism and a convolutional neural network, comprising a training phase and an enhancement phase:

Training phase: noise and clean speech are first synthesized into noisy speech; features are then extracted from the noisy speech; the clean speech and the feature-extracted noisy speech are then fed together into a neural network model to learn the mapping relationship between noisy speech and clean speech; after the training phase ends, the trained model is saved.

Enhancement phase: features are first extracted from the noisy speech, which is then fed into the trained model for speech enhancement; finally, the enhanced speech is output.

Preferably, the neural network model comprises an encoding module, a noise-reduction module and a decoding module. The noisy speech is first processed by the encoding module and then fed into the noise-reduction module; the output of the encoding module is then multiplied by the output of the noise-reduction module and finally fed into the decoding module to obtain the enhanced speech.

Further, the output of the encoding module is first processed by the attention module in the noise-reduction module and then output after processing by the convolution module. In the attention module, the speech features pass through layer normalization, multi-head self-attention and Dropout, and are then added pointwise to the output of the encoding module before being output. In the convolution module, the speech features pass through layer normalization, pointwise convolution, the GLU activation function, depthwise convolution, batch normalization, the Swish activation function, pointwise convolution and Dropout, and are then added pointwise to the output of the attention module before being output.

Further, in the decoding module: first, two-dimensional deconvolution, batch normalization and the PReLU activation function change the size of the speech features to [B, 128, K, S]; overlap-add then changes it to [B, 128, L]; finally, one-dimensional deconvolution restores the size to [B, 1, L].

From the above, the single-channel speech enhancement method based on an attention mechanism and a convolutional neural network of the present invention has at least the following beneficial effects:

1. The deep-learning-based method of the present invention enhances speech in the time domain, saving the computation time of the Fourier transform while preserving the phase information of the target enhanced speech as far as possible. Moreover, the proposed attention module and convolution module combine the advantages of attention and convolution: while breaking the limitation that convolutional neural networks cannot compute in parallel, the attention mechanism grants the model the ability to process longer speech sequences, so that the speech information can be modeled both locally and globally.

2. The present invention, based on deep learning, adopts a framework built on an attention mechanism and a convolutional neural network to enhance noisy speech in the time domain. In this process, the network structure directly learns the mapping relationship from noisy speech to clean speech, and the convolutional neural network and the attention mechanism fully extract the local and global information of the noisy speech signal. At the same time, enhancing the noisy speech directly in the time domain not only saves the computation time of the Fourier transform but also preserves the phase information of the enhanced speech as far as possible, achieving a better noise reduction effect.

The above description is merely an overview of the technical solution of the present invention. So that the technical means of the present invention may be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, the preferred embodiments are described in detail below in conjunction with the accompanying drawings.

Brief Description of the Drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings of the embodiments are briefly introduced below.

FIG. 1 is a flowchart of the speech enhancement method of the present invention;

FIG. 2 is a diagram of the neural network model of the present invention;

FIG. 3 is a structural diagram of the encoding module of the present invention;

FIG. 4 is a diagram of the chunking process of the present invention;

FIG. 5 is a structural diagram of the attention module of the present invention;

FIG. 6 is a structural diagram of the convolution module of the present invention;

FIG. 7 is a structural diagram of the decoding module of the present invention.

Detailed Description of the Embodiments

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part of this specification and illustrate the principles of the invention by way of example; other aspects, features and advantages of the present invention will become apparent from this detailed description. In the referenced drawings, the same or similar parts in different figures are denoted by the same reference numerals.

The present invention aims to make full use of the advantage of deep learning methods, which require no assumptions about the noise and speech signals, to enhance noisy speech directly in the time domain, saving the computation time of the Fourier transform while preserving the phase information of the enhanced speech as far as possible, and to use a convolutional neural network together with an attention mechanism to better extract the local and global features of the speech signal, striving for a better noise reduction effect. The method is carried out in the following steps:

The speech enhancement of the present invention proceeds in two stages, a training stage and an enhancement stage, as shown in FIG. 1.

In the training stage, noise and clean speech are first synthesized into noisy speech; features are then extracted from the noisy speech; the clean speech and the feature-extracted noisy speech are then fed together into the neural network model to learn the mapping relationship between noisy speech and clean speech. After the training stage ends, the trained model is saved.

In the enhancement stage, features are first extracted from the noisy speech, which is then fed into the trained model for speech enhancement; finally, the enhanced speech is output.
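
The two stages map onto a standard supervised loop. Below is a minimal PyTorch-style sketch, not the patent's own code: the additive mixing and the epoch count are assumptions, and the feature-extraction step is folded into the model's encoder here.

    import torch

    def train(model, loss_fn, optimizer, pairs, epochs=100):
        # pairs yields (clean, noise) waveform tensors of equal length [B, 1, L].
        for _ in range(epochs):
            for clean, noise in pairs:
                noisy = clean + noise            # synthesize noisy speech by additive mixing (assumption)
                enhanced = model(noisy)          # learn the noisy-to-clean mapping
                loss = loss_fn(enhanced, clean)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        torch.save(model.state_dict(), "enhancer.pt")   # save the trained model

    def enhance(model, noisy):
        # Enhancement stage: run the trained model and return the enhanced speech.
        model.eval()
        with torch.no_grad():
            return model(noisy)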

The neural network model proposed by the present invention is shown in FIG. 2. The model consists of three parts: an encoding module, a decoding module and a noise-reduction module. The encoder-decoder pair plays a role similar to the Fourier transform, mapping the speech signal into a high-dimensional space where the noise-reduction module can denoise it. The noisy speech is first processed by the encoding module and then fed into the noise-reduction module; the output of the encoding module is then multiplied by the output of the noise-reduction module and finally fed into the decoding module to obtain the enhanced speech.
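
As a sketch of this data flow (assuming hypothetical Encoder, Denoiser and Decoder submodules of the kind detailed below), the whole model reduces to a few lines:

    import torch.nn as nn

    class SpeechEnhancer(nn.Module):
        # Encoding module -> noise-reduction module -> elementwise product with
        # the encoder output -> decoding module, as described above.
        def __init__(self, encoder, denoiser, decoder):
            super().__init__()
            self.encoder, self.denoiser, self.decoder = encoder, denoiser, decoder

        def forward(self, noisy):                 # noisy: [B, 1, L] time-domain waveform
            feats = self.encoder(noisy)           # [B, 256, K, S]
            mask = self.denoiser(feats)           # [B, 256, K, S]
            return self.decoder(feats * mask)     # [B, 1, L] enhanced waveform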

The internal structure of the encoding module is shown in FIG. 3. Noisy speech of size [B, 1, L] is fed into the network model, where B is the batch size, i.e. the amount of data fed into the neural network per training step, and L is the length of the speech signal. One-dimensional convolution first changes the size to [B, 128, L]; chunking then changes it to [B, 128, K, S]; finally, two-dimensional convolution, batch normalization and the PReLU activation function are applied, at which point the output of the encoding module has size [B, 256, K, S].
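
A minimal sketch of this encoder follows. The patent fixes the channel counts (1 -> 128 -> 256) but not the kernel sizes, so the kernel sizes and the frame length below are assumptions; chunk() is the segmentation helper sketched next.

    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, frame_len=64):                     # frame_len = K = 2D (assumption)
            super().__init__()
            self.conv1d = nn.Conv1d(1, 128, kernel_size=1)    # [B, 1, L] -> [B, 128, L]
            self.frame_len = frame_len
            self.conv2d = nn.Sequential(                      # [B, 128, K, S] -> [B, 256, K, S]
                nn.Conv2d(128, 256, kernel_size=1),
                nn.BatchNorm2d(256),
                nn.PReLU(),
            )

        def forward(self, x):
            x = self.conv1d(x)                   # one-dimensional convolution
            x = chunk(x, self.frame_len)         # chunking, see the sketch below
            return self.conv2d(x)                # 2-D conv + batch norm + PReLU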

The chunking process is shown in FIG. 4. First, the one-dimensional speech sequence is split into frames with a frame length of 2D and a frame shift of D. Each frame of speech data then becomes a separate data block, and these blocks are stacked to form a two-dimensional data matrix. Here S is the number of blocks into which a sequence of length L can be divided, and K is the frame length, equal to 2D.
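
A sketch of the chunking step; the tail-padding policy is an assumption, since the patent only specifies a frame length of 2D and a frame shift of D.

    import torch.nn.functional as F

    def chunk(x, frame_len):
        # x: [B, C, L] -> [B, C, K, S], with K = frame_len and hop D = K // 2.
        # Assumes L >= frame_len; pads the tail so the frames tile L exactly.
        hop = frame_len // 2
        pad = (hop - (x.shape[-1] - frame_len) % hop) % hop
        x = F.pad(x, (0, pad))
        frames = x.unfold(2, frame_len, hop)     # [B, C, S, K]: each frame is one block
        return frames.permute(0, 1, 3, 2)        # stack the blocks -> [B, C, K, S]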

The output of the encoding module is first processed by the attention module in the noise-reduction module and then output after processing by the convolution module. Throughout the noise-reduction module, the feature size remains [B, 256, K, S]. The attention module is shown in FIG. 5: the speech features pass through layer normalization, multi-head self-attention and Dropout, and are then added pointwise to the output of the encoding module before being output.
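
A sketch of the attention module, applied to one chunk treated as a sequence of 256-dimensional feature vectors; how the [B, 256, K, S] tensor is reshaped into such sequences, the head count and the dropout rate are assumptions.

    import torch.nn as nn

    class AttentionBlock(nn.Module):
        def __init__(self, dim=256, heads=4, p=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.drop = nn.Dropout(p)

        def forward(self, x):                    # x: [B, T, dim]
            y = self.norm(x)                     # layer normalization
            y, _ = self.mhsa(y, y, y)            # multi-head self-attention
            return x + self.drop(y)              # pointwise addition with the block input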

The convolution module is shown in FIG. 6: the speech features pass through layer normalization, pointwise convolution, the GLU activation function, depthwise convolution, batch normalization, the Swish activation function, pointwise convolution and Dropout, and are then added pointwise to the output of the attention module before being output.
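
A sketch of the convolution module in the same sequence layout; the depthwise kernel size and dropout rate are assumptions, and nn.SiLU is PyTorch's implementation of Swish.

    import torch.nn as nn

    class ConvBlock(nn.Module):
        def __init__(self, dim=256, kernel=31, p=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.pw1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)   # pointwise; doubled channels feed the GLU gate
            self.glu = nn.GLU(dim=1)
            self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)  # depthwise
            self.bn = nn.BatchNorm1d(dim)
            self.swish = nn.SiLU()
            self.pw2 = nn.Conv1d(dim, dim, kernel_size=1)       # pointwise
            self.drop = nn.Dropout(p)

        def forward(self, x):                    # x: [B, T, dim]
            y = self.norm(x).transpose(1, 2)     # [B, dim, T] for the 1-D convolutions
            y = self.glu(self.pw1(y))
            y = self.swish(self.bn(self.dw(y)))
            y = self.drop(self.pw2(y)).transpose(1, 2)
            return x + y                         # pointwise addition with the attention module's output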

The output of the encoding module is multiplied by the output of the noise-reduction module and fed into the decoding module at size [B, 256, K, S]. The decoding module is shown in FIG. 7: first, two-dimensional deconvolution, batch normalization and the PReLU activation function change the size of the speech features to [B, 128, K, S]; overlap-add then changes it to [B, 128, L]; finally, one-dimensional deconvolution restores the size to [B, 1, L]. This completes the processing of the neural network model, and the enhanced speech is output.

Overlap-add is the inverse of chunking. The overlapping portions of adjacent data blocks in the two-dimensional data matrix are merged by averaging; once all blocks are merged, the two-dimensional matrix is transformed back into a one-dimensional speech sequence.
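
A sketch of overlap-add as the averaging inverse of chunk(); the explicit loop is for clarity, and a production version would use torch.nn.functional.fold instead.

    import torch

    def overlap_add(frames):
        # frames: [B, C, K, S] with hop D = K // 2 -> sequence [B, C, L].
        B, C, K, S = frames.shape
        hop = K // 2
        L = (S - 1) * hop + K
        out = frames.new_zeros(B, C, L)
        count = frames.new_zeros(1, 1, L)
        for s in range(S):
            out[:, :, s * hop : s * hop + K] += frames[:, :, :, s]
            count[:, :, s * hop : s * hop + K] += 1
        return out / count                       # overlapping samples are merged by averaging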

The VoiceBank and DEMAND speech data are chosen as an example for the experiments. The selected dataset consists of 30 speakers, of whom 28 are used for training and 2 for testing. The training data consists of 11572 pairs of noisy and clean speech, and the test data of 824 pairs.

The loss function adopted by the present invention combines the time domain and the frequency domain. A time-frequency loss supervises the model to learn more information and thereby obtain higher speech intelligibility and perceptual quality. It is defined as:

loss = α·loss_F + (1 - α)·loss_T

loss_F = (1 / (T·F)) · Σ_{t=1..T} Σ_{f=1..F} [ (X_r(t, f) - X̂_r(t, f))² + (X_i(t, f) - X̂_i(t, f))² ]

loss_T = (1 / N) · Σ_{n=1..N} (x(n) - x̂(n))²

where X and X̂ denote the spectra of the clean speech and the enhanced speech, the subscripts r and i denote the real and imaginary parts of a complex value, T is the number of frames, F is the number of frequency bins, and N is the number of samples. α is a tunable parameter, set to 0.2 in the present invention.
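
A sketch of this joint loss, assuming the squared-error form written above; the STFT parameters are assumptions, and waveforms are taken to be [B, L] tensors.

    import torch

    def combined_loss(enhanced, clean, alpha=0.2, n_fft=512):
        loss_t = torch.mean((enhanced - clean) ** 2)             # time-domain term over the N samples
        window = torch.hann_window(n_fft, device=enhanced.device)
        X_hat = torch.stft(enhanced, n_fft, window=window, return_complex=True)
        X = torch.stft(clean, n_fft, window=window, return_complex=True)
        loss_f = torch.mean((X.real - X_hat.real) ** 2           # real and imaginary
                            + (X.imag - X_hat.imag) ** 2)        # spectral differences
        return alpha * loss_f + (1 - alpha) * loss_t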

To verify the effectiveness of the proposed method, it is compared with other speech enhancement methods: the SEGAN model, based on a generative adversarial network, and the Wave-U-Net model, based on U-Net.

Table 1. Comparison of the present invention with other algorithms under different evaluation metrics

Metric   Noisy   SEGAN   Wave-U-Net   Present invention
PESQ     1.97    2.16    2.40         2.42
CSIG     3.35    3.48    3.52         3.82
SSNR     1.68    7.73    9.97         10.2

As can be seen from the table, the proposed method scores higher than the comparison algorithms on the PESQ, CSIG and SSNR metrics, showing that it has certain advantages.

The deep-learning-based method of the present invention enhances speech in the time domain, saving the computation time of the Fourier transform while preserving the phase information of the target enhanced speech as far as possible. Moreover, the proposed attention module and convolution module combine the advantages of attention and convolution: while breaking the limitation that convolutional neural networks cannot compute in parallel, the attention mechanism grants the model the ability to process longer speech sequences, so that the speech information can be modeled both locally and globally.

Experiments compared the PESQ, CSIG and SSNR values of the proposed method with those of the other two methods. As shown in Table 1, the single-channel speech enhancement algorithm of the present invention achieves better noise reduction performance than the other two algorithms, showing that the proposed method has certain advantages.

The above is merely a preferred embodiment of the present invention and of course cannot be used to limit the scope of its rights. It should be noted that those of ordinary skill in the art may make several improvements and variations without departing from the principles of the present invention, and such improvements and variations are also regarded as falling within the protection scope of the present invention.

Claims (4)

1. A single-channel speech enhancement method based on an attention mechanism and a convolutional neural network, characterized by comprising a training phase and an enhancement phase: the training phase: first synthesizing noise and clean speech into noisy speech, then performing feature extraction on the noisy speech, then feeding the clean speech together with the feature-extracted noisy speech into a neural network model to learn the mapping relationship between noisy speech and clean speech, and saving the trained model after the training phase ends; the enhancement phase: first performing feature extraction on the noisy speech, then feeding it into the trained model for speech enhancement, and finally outputting the enhanced speech.

2. The single-channel speech enhancement method based on an attention mechanism and a convolutional neural network according to claim 1, characterized in that the neural network model comprises an encoding module, a noise-reduction module and a decoding module; the noisy speech is first processed by the encoding module and then fed into the noise-reduction module; the output of the encoding module is then multiplied by the output of the noise-reduction module and finally fed into the decoding module to obtain the enhanced speech.

3. The single-channel speech enhancement method based on an attention mechanism and a convolutional neural network according to claim 2, characterized in that the output of the encoding module is first processed by the attention module in the noise-reduction module and then output after processing by the convolution module; in the attention module, the speech features pass through layer normalization, multi-head self-attention and Dropout and are then added pointwise to the output of the encoding module before being output; in the convolution module, the speech features pass through layer normalization, pointwise convolution, the GLU activation function, depthwise convolution, batch normalization, the Swish activation function, pointwise convolution and Dropout and are then added pointwise to the output of the attention module before being output.

4. The single-channel speech enhancement method based on an attention mechanism and a convolutional neural network according to claim 2, characterized in that, in the decoding module: first, the speech features are processed by two-dimensional deconvolution, batch normalization and the PReLU activation function, changing their size to [B, 128, K, S]; the speech features are then overlap-added, changing their size to [B, 128, L]; finally, one-dimensional deconvolution restores their size to [B, 1, L].
CN202210125478.1A 2022-02-10 2022-02-10 A single-channel speech enhancement method based on attention mechanism and convolutional neural network Pending CN114464201A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210125478.1A | 2022-02-10 | 2022-02-10 | A single-channel speech enhancement method based on attention mechanism and convolutional neural network

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210125478.1A | 2022-02-10 | 2022-02-10 | A single-channel speech enhancement method based on attention mechanism and convolutional neural network

Publications (1)

Publication Number Publication Date
CN114464201A (en) 2022-05-10

Family

ID=81413571

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210125478.1A | A single-channel speech enhancement method based on attention mechanism and convolutional neural network | 2022-02-10 | 2022-02-10

Country Status (1)

Country Link
CN (1) CN114464201A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5185848A (en) * 1988-12-14 1993-02-09 Hitachi, Ltd. Noise reduction system using neural network
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN113889132A (en) * 2021-11-10 2022-01-04 清华大学苏州汽车研究院(相城) A kind of speech enhancement method, apparatus, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAI WANG: "TSTNN: Two-Stage Transformer Based Neural Network for Speech Enhancement in the Time Domain", IEEE, 19 May 2021 *
常新旭 et al.: "Speech Enhancement Method Fusing a Multi-Head Self-Attention Mechanism" (融合多头自注意力机制的语音增强方法), Journal of Xidian University, vol. 47, no. 1, February 2020 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115938377A (en) * 2022-11-18 2023-04-07 武汉轻工大学 Attention-based complex convolutional neural network speech enhancement method and system

Similar Documents

Publication Publication Date Title
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
CN112331224B (en) Lightweight time-domain convolutional network speech enhancement method and system
CN110796027B (en) A sound scene recognition method based on dense convolutional neural network model
CN102799892A (en) Mel frequency cepstrum coefficient (MFCC) underwater target feature extraction and recognition method
CN115731924B (en) Single-channel time-domain bird sound separation method, device and computer-readable storage medium
CN111862978A (en) A voice wake-up method and system based on improved MFCC coefficients
CN108922514B (en) Robust feature extraction method based on low-frequency log spectrum
CN114464201A (en) A single-channel speech enhancement method based on attention mechanism and convolutional neural network
CN118212929A (en) A personalized Ambisonics speech enhancement method
Wei et al. Iifc-net: A monaural speech enhancement network with high-order information interaction and feature calibration
CN111860246A (en) A Data Augmentation Method for Heart Sound Signal Classification for Deep Convolutional Neural Networks
CN104867493B (en) Multifractal Dimension end-point detecting method based on wavelet transformation
Lee et al. DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing
CN114842863B (en) Signal enhancement method based on multi-branch-dynamic merging network
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
CN116434759B (en) A speaker recognition method based on SRS-CL network
CN117219109A (en) Double-branch voice enhancement algorithm based on structured state space sequence model
CN110070887B (en) A voice feature reconstruction method and device
US20230260530A1 (en) Apparatus for providing a processed audio signal, a method for providing a processed audio signal, an apparatus for providing neural network parameters and a method for providing neural network parameters
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
CN117219120A (en) Mechanical equipment abnormality detection method and system based on time-frequency domain audio enhancement
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN112927700A (en) Blind audio watermark embedding and extracting method and system
Zhou et al. Improved Encoder-Decoder architecture with Human-like perception attention for monaural speech enhancement
Wu et al. Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination