CN118335092A

CN118335092A - Voice compression method and system based on multi-scale residual error attention

Info

Publication number: CN118335092A
Application number: CN202410748437.7A
Authority: CN
Inventors: 李晔; 于兴业; 吝灵霞; 张鹏; 蔡田雨
Original assignee: Qilu University of Technology; National Supercomputing Center in Jinan
Current assignee: Qilu University of Technology; National Supercomputing Center in Jinan
Priority date: 2024-06-12
Filing date: 2024-06-12
Publication date: 2024-07-12
Anticipated expiration: 2044-06-12
Also published as: CN118335092B

Abstract

The invention belongs to the technical field of voice signal processing, and provides a voice compression method and a voice compression system based on multi-scale residual error attention, wherein the voice compression method comprises the steps of obtaining a voice signal; performing convolution operation on the voice signal to obtain a first feature, and performing operation on the first feature to obtain residual error and identity mapping of the first feature; adding the residual error and the identity mapping to obtain a first output characteristic, extracting the first output characteristic, obtaining an attention score through multiple operations, multiplying the attention score by the residual error and the identity mapping respectively, and obtaining a third output characteristic through multiple operations; performing multistage iterative quantization on the third output characteristic to obtain a first vector, finding a corresponding quantized vector in a codebook according to the index of the received first vector by a second network, and adding all the quantized vectors to obtain a reconstructed vector; and decoding the reconstructed vector to output the synthesized voice, and judging the authenticity of the generated voice through a discriminator. The invention can improve the quality of the synthesized voice.

Description

Speech compression method and system based on multi-scale residual attention

技术领域Technical Field

本公开涉及语音信号处理技术领域，具体涉及了一种基于多尺度残差注意力的语音压缩方法及系统。The present disclosure relates to the technical field of speech signal processing, and in particular to a speech compression method and system based on multi-scale residual attention.

背景技术Background technique

本部分的陈述仅仅是提供了与本公开相关的背景技术信息，不必然构成在先技术。The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

基于深度学习的超低速率语音压缩编码技术也被称为神经声码器。低速率语音编码技术在卫星通信、短波通信、水声通信以及保密通信等领域中有着广泛的应用需求。比如，在极端恶劣的山区通信环境中，通信设备不仅要抵抗极端低温和强风的考验，还要在有限的能源供应下保持稳定运行，语音编码的编码速率往往要低于2400bps。当语音的编码速率降低时，语音合成质量会受到影响，因此语音压缩技术尤为重要。Ultra-low-rate speech compression coding technology based on deep learning is also called neural vocoder. Low-rate speech coding technology has a wide range of application needs in satellite communications, shortwave communications, underwater acoustic communications, and confidential communications. For example, in extremely harsh mountain communication environments, communication equipment must not only withstand extreme low temperatures and strong winds, but also maintain stable operation under limited energy supply. The coding rate of speech coding is often lower than 2400bps. When the speech coding rate is reduced, the quality of speech synthesis will be affected, so speech compression technology is particularly important.

语音信号是一种复杂的信号，它包含了丰富的信息，如音调、节奏、音色等，这些信息在不同的频率和时间维度上表现出来。目前的语音编码方法通常只关注某一尺度的特征，而忽略了多尺度信息，导致语音合成质量受限，丧失特征信息的完整性。另外，层级信息没有得到充分关注，导致模型不能充分提取语音中的细节信息。因此，有必要探索更加全面、综合的特征表示方法，以提高系统对语音信号的理解和表达能力。Speech signals are complex signals that contain rich information, such as pitch, rhythm, timbre, etc., which are expressed in different frequency and time dimensions. Current speech coding methods usually only focus on features at a certain scale, while ignoring multi-scale information, resulting in limited speech synthesis quality and loss of feature information integrity. In addition, hierarchical information has not received sufficient attention, resulting in the model being unable to fully extract detailed information in speech. Therefore, it is necessary to explore more comprehensive and integrated feature representation methods to improve the system's ability to understand and express speech signals.

发明内容Summary of the invention

针对上述缺陷，本公开提出了一种基于多尺度残差注意力的语音压缩方法及系统，将多尺度残差注意力技术引入到低速率神经声码器中，利用多尺度残差注意力来学习更精细的特征，以此来提高合成语音质量。In response to the above-mentioned defects, the present invention proposes a speech compression method and system based on multi-scale residual attention, introduces multi-scale residual attention technology into a low-rate neural vocoder, and uses multi-scale residual attention to learn more refined features, so as to improve the quality of synthesized speech.

为了实现上述目的，本公开采用如下技术方案：In order to achieve the above objectives, the present disclosure adopts the following technical solutions:

本公开第一方面提供了一种基于多尺度残差注意力的语音压缩方法，包括以下步骤：The first aspect of the present disclosure provides a speech compression method based on multi-scale residual attention, comprising the following steps:

S1获取低速率下的多帧语音信号；S1 obtains multi-frame speech signals at a low rate;

S2将语音信号输入第一网络中进行卷积操作，得到第一特征，对第一特征进行多次操作，得到第一特征的残差和第一特征的恒等映射；S2 inputs the speech signal into the first network to perform a convolution operation to obtain a first feature, performs multiple operations on the first feature, and obtains a residual of the first feature and an identity mapping of the first feature;

S3将残差和恒等映射相加得到第一输出特征，提取第一输出特征的全局和局部特征，根据全局和局部特征得到融合特征，对融合特征进行操作，得到融合特征的注意力分数，将注意力分数分别与残差和恒等映射相乘，并将得到的两个结果相加得到第二输出特征；S3 adds the residual and the identity mapping to obtain a first output feature, extracts global and local features of the first output feature, obtains a fused feature according to the global and local features, operates the fused feature to obtain an attention score of the fused feature, multiplies the attention score by the residual and the identity mapping respectively, and adds the two results to obtain a second output feature;

S4将第二输出特征输入全波段特征提取器中，得到第三输出特征；S4 inputs the second output feature into a full-band feature extractor to obtain a third output feature;

S5对第三输出特征进行多级迭代量化，得到第一矢量，将第一矢量的索引传输到第二网络，第二网络根据接收到的索引，在码本中找到相应的量化矢量，将所有量化矢量相加，得到重构矢量；S5 performs multi-level iterative quantization on the third output feature to obtain a first vector, transmits the index of the first vector to the second network, and the second network finds the corresponding quantization vector in the codebook according to the received index, adds all the quantization vectors, and obtains a reconstructed vector;

S6对重构矢量解码输出合成语音，并对生成语音的真伪通过判别器进行判断。S6 decodes the reconstructed vector to output the synthesized speech, and determines the authenticity of the generated speech through a discriminator.

作为进一步的实现方式，对第一特征进行多次操作，得到第一特征的残差，具体为：对第一特征进行两组卷积核大小分别为三和五的卷积操作，并将组间卷积两两相乘，对卷积后的特征进行拼接，经过一维卷积进行多尺度融合，得到第一特征的残差。As a further implementation method, the first feature is operated multiple times to obtain the residual of the first feature. Specifically, two groups of convolution operations with convolution kernel sizes of three and five are performed on the first feature, and the convolution operations between the groups are multiplied by two. The convolved features are concatenated, and multi-scale fusion is performed through one-dimensional convolution to obtain the residual of the first feature.

作为进一步的实现方式，对第一特征进行跳过连接操作，得到第一特征的恒等映射。As a further implementation manner, a skip connection operation is performed on the first feature to obtain an identity mapping of the first feature.

作为进一步的实现方式，提取第一输出特征的全局特征，具体为：As a further implementation, the global feature of the first output feature is extracted, specifically:

对第一输出特征进行平均池化操作，对平均池化后的第一输出特征进行两次卷积核为1的卷积操作，并对卷积后的特征进行批量归一化处理。An average pooling operation is performed on the first output feature, and a convolution operation with a convolution kernel of 1 is performed twice on the first output feature after average pooling, and batch normalization is performed on the convolved feature.

作为进一步的实现方式，提取第一输出特征的局部特征，具体为：As a further implementation, extracting local features of the first output feature is specifically as follows:

对第一输出特征进行两次卷积核为1的卷积操作，对卷积后的特征进行批量归一化处理。The first output feature is convolved twice with a convolution kernel of 1, and the convolved features are batch normalized.

作为进一步的实现方式，在将第二输出特征输入全波段特征提取器中之前还包括以下步骤：As a further implementation, before the second output feature is input into the full-band feature extractor, the following steps are also included:

将第二输出特征再次输入多尺度残差注意力块中进行特征提取，重复操作多次。The second output feature is input into the multi-scale residual attention block again for feature extraction, and the operation is repeated multiple times.

作为进一步的实现方式，对生成语音的真伪通过判别器进行判断，具体为：As a further implementation method, the authenticity of the generated speech is judged by a discriminator, specifically:

引入多尺度STFT鉴别器和多周期判别器来判断生成语音的真假，通过对抗训练促使生成器产生更加真实的合成语音。Multi-scale STFT discriminators and multi-cycle discriminators are introduced to judge the authenticity of generated speech, and adversarial training is used to encourage the generator to produce more realistic synthetic speech.

本公开第二方面提供了一种基于多尺度残差注意力的语音压缩系统，包括：A second aspect of the present disclosure provides a speech compression system based on multi-scale residual attention, comprising:

数据获取模块，被配置为：获取低速率下的多帧语音信号；The data acquisition module is configured to: acquire a multi-frame speech signal at a low rate;

多尺度残差模块，被配置为：将语音信号输入第一网络中进行卷积操作，得到第一特征，对第一特征进行多次操作，得到第一特征的残差和第一特征的恒等映射；The multi-scale residual module is configured to: input the speech signal into the first network to perform a convolution operation to obtain a first feature, perform multiple operations on the first feature to obtain a residual of the first feature and an identity mapping of the first feature;

通道注意力模块，被配置为：将残差和恒等映射相加得到第一输出特征，提取第一输出特征的全局和局部特征，根据全局和局部特征得到融合特征，对融合特征进行操作，得到融合特征的注意力分数，将注意力分数分别与残差和恒等映射相乘，并将得到的两个结果相加得到第二输出特征；A channel attention module is configured to: add the residual and the identity mapping to obtain a first output feature, extract global and local features of the first output feature, obtain a fused feature according to the global and local features, operate the fused feature to obtain an attention score of the fused feature, multiply the attention score by the residual and the identity mapping respectively, and add the two results to obtain a second output feature;

全波段特征提取模块，被配置为：将第二输出特征输入全波段特征提取器中，得到第三输出特征；The full-band feature extraction module is configured to: input the second output feature into the full-band feature extractor to obtain a third output feature;

量化和解量化模块，被配置为：对第三输出特征进行多级迭代量化，得到第一矢量，将第一矢量的索引传输到第二网络，第二网络根据接收到的索引，在码本中找到相应的量化矢量，将所有量化矢量相加，得到重构矢量；The quantization and dequantization module is configured to: perform multi-level iterative quantization on the third output feature to obtain a first vector, transmit the index of the first vector to the second network, and the second network finds the corresponding quantization vector in the codebook according to the received index, and adds all the quantization vectors to obtain a reconstructed vector;

语音合成模块，被配置为：对重构矢量解码输出合成语音，并对生成语音的真伪通过判别器进行判断。The speech synthesis module is configured to: decode the reconstructed vector to output the synthesized speech, and judge the authenticity of the generated speech through a discriminator.

本公开第三方面提供了一种介质，其上存储有程序，该程序被处理器执行时实现本公开第一方面所述的基于多尺度残差注意力的语音压缩方法中的步骤。The third aspect of the present disclosure provides a medium on which a program is stored. When the program is executed by a processor, the steps in the speech compression method based on multi-scale residual attention described in the first aspect of the present disclosure are implemented.

本公开第四方面提供了一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的程序，所述处理器执行所述程序时实现本公开第一方面所述的基于多尺度残差注意力的语音压缩方法中的步骤。The fourth aspect of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps in the multi-scale residual attention-based speech compression method described in the first aspect of the present disclosure are implemented.

与现有技术相比，本公开的有益效果为：Compared with the prior art, the present invention has the following beneficial effects:

本公开提出的基于多尺度残差注意力的语音压缩方法，将多尺度残差注意力技术引入到低速率神经声码器中，利用多尺度残差注意力来学习更重要的特征。多尺度残差注意力块包含多尺度残差块和注意力融合模块两部分，多尺度残差块通过交叉学习来捕获不同尺度的语音细节特征，并对其进行融合。注意力特征融合模块通过对残差块和恒等映射施加不同注意力，以解决层级信息不一致的问题，来进一步关注更精细的特征。采用多级矢量量化技术对编码特征进行量化，将量化索引传送到解码端。解码端根据接收到的索引在码本中检索相应的量化矢量，以重构输入矢量，从而重建语音。通过引入多尺度残差注意力，提高了语音的合成质量。The speech compression method based on multi-scale residual attention proposed in the present disclosure introduces multi-scale residual attention technology into the low-rate neural vocoder, and uses multi-scale residual attention to learn more important features. The multi-scale residual attention block consists of two parts: a multi-scale residual block and an attention fusion module. The multi-scale residual block captures speech detail features of different scales through cross-learning and fuses them. The attention feature fusion module further focuses on finer features by applying different attention to the residual block and the identity mapping to solve the problem of inconsistent hierarchical information. The multi-level vector quantization technology is used to quantize the encoded features, and the quantization index is transmitted to the decoding end. The decoding end retrieves the corresponding quantization vector in the codebook according to the received index to reconstruct the input vector and thus reconstruct the speech. By introducing multi-scale residual attention, the synthesis quality of speech is improved.

本发明附加方面的优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本发明的实践了解到。Advantages of additional aspects of the present invention will be given in part in the following description, and in part will become obvious from the following description, or will be learned through practice of the present invention.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

构成本公开的一部分的说明书附图用来提供对本公开的进一步理解，本公开的示意性实施例及其说明用于解释本公开，并不构成对本公开的不当限定。The accompanying drawings constituting a part of the present disclosure are used to provide a further understanding of the present disclosure. The illustrative embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation on the present disclosure.

图1是本公开的神经声码器编解码的结构示意图；FIG1 is a schematic diagram of the structure of the neural vocoder codec disclosed in the present invention;

图2是本公开的多级矢量量化的流程图；FIG2 is a flowchart of multi-stage vector quantization of the present disclosure;

图3是本公开的基于多尺度残差注意力的语音压缩方法流程图；FIG3 is a flow chart of a speech compression method based on multi-scale residual attention disclosed in the present invention;

图4是本公开的多尺度残差块的结构示意图；FIG4 is a schematic diagram of the structure of a multi-scale residual block disclosed in the present invention;

图5是本公开的通道注意力块的结构示意图。FIG5 is a schematic diagram of the structure of the channel attention block of the present invention.

具体实施方式Detailed ways

下面结合附图与实施例对本公开作进一步说明。The present disclosure is further described below in conjunction with the accompanying drawings and embodiments.

应该指出，以下详细说明都是例示性的，旨在对本公开提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本公开所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed descriptions are all illustrative and intended to provide further explanation of the present disclosure. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as those commonly understood by those skilled in the art to which the present disclosure belongs.

在不冲突的情况下，本公开中的实施例及实施例中的特征可以相互组合。In the absence of conflict, the embodiments in the present disclosure and the features in the embodiments may be combined with each other.

实施例一Embodiment 1

如图1所示，本公开实施例一提供了一种基于多尺度残差注意力的语音压缩方法，包括以下步骤：As shown in FIG1 , the first embodiment of the present disclosure provides a speech compression method based on multi-scale residual attention, comprising the following steps:

首先，获取的语音信号为8KHz采样的语音，语音序列由表示，为语音通道数，为语音的总样本点，其中d为语音的持续时间，为语音的采样率。First, the acquired speech signal is a speech sampled at 8KHz, and the speech sequence is composed of express, is the number of voice channels, is the total sample points of speech, where d is the duration of speech, is the sampling rate of the speech.

如图3所示，本公开的第一网络（编码器）由一维卷积、多尺度残差注意力块、全波段特征提取器和门控循环单元组成。首先，语音信号经过通道为=32、内核大小为K的一维卷积，得到第一特征，接着将第一特征依次输入四个多尺度残差注意力块，每个多尺度残差注意力块均由多尺度残差块和通道注意力块两部分组成。其中，如图4所示，多尺度残差块包含2组卷积核大小分别为3和5的卷积，组间卷积两两相乘，并对其卷积后的特征进行拼接，经过卷积核大小为1的一维卷积进行多尺度融合，得到残差R，跳过连接获得输入特征的恒等映射I。随后经过如图5所示的通道注意力模块，将残差R与恒等映射相加得到输出特征M。通道注意力模块对M进行全局和局部特征提取，对于全局特征，首先经过输出为1的平均池化，然后经过2层卷积核为1的卷积层，并对其卷积后的特征进行批量归一化处理。对于局部特征，经过两层卷积核为1的卷积层，同样对卷积后的特征进行批量归一化处理。将全局特征和局部特征进行相加得到融合特征T，并对其进行Sigmoid激活，得到融合特征的注意力分数。随后将注意力分数与残差相乘得到，将注意力分数差值与恒等映射相乘得到，最终相加得到多尺度残差注意力块输出，即第二输出特征。多尺度特征融合模块后面紧跟一个具有下采样的一维卷积，步长为S，卷积核为步幅S的两倍，下采样过程中通道数加倍。As shown in FIG3 , the first network (encoder) of the present disclosure is composed of a one-dimensional convolution, a multi-scale residual attention block, a full-band feature extractor, and a gated recurrent unit. First, the speech signal passes through the channel =32, and a one-dimensional convolution with a kernel size of K to obtain the first feature, and then the first feature is sequentially input into four multi-scale residual attention blocks, each of which consists of a multi-scale residual block and a channel attention block. As shown in FIG4, the multi-scale residual block contains two groups of convolutions with kernel sizes of 3 and 5, respectively. The convolutions between the groups are multiplied by each other, and the features after convolution are spliced. After multi-scale fusion through a one-dimensional convolution with a kernel size of 1, the residual R is obtained, and the identity map I of the input feature is obtained by skipping the connection. Then, after the channel attention module shown in FIG5, the residual R is added to the identity map to obtain the output feature M. The channel attention module extracts global and local features from M. For global features, it first undergoes average pooling with an output of 1, then passes through two layers of convolution layers with a convolution kernel of 1, and the features after convolution are batch normalized. For local features, after two layers of convolution layers with a convolution kernel of 1, the features after convolution are also batch normalized. The global features and local features are added together to obtain the fused feature T, and Sigmoid activation is performed on it to obtain the attention score of the fused feature. Then multiply the attention score by the residual to get , multiply the attention score difference by the identity mapping to get , and finally add them together to get the multi-scale residual attention block output , which is the second output feature. The multi-scale feature fusion module is followed by a one-dimensional convolution with downsampling, with a stride of S and a convolution kernel twice the stride S. The number of channels is doubled during the downsampling process.

将下采样后的第二输出特征再次输入多尺度残差注意力块中进行特征提取，重复操作多次，在本实施例中优选4个多尺度残差注意力块依次进行特征提取。The downsampled second output feature is input into the multi-scale residual attention block again for feature extraction, and the operation is repeated multiple times. In this embodiment, four multi-scale residual attention blocks are preferably used to perform feature extraction in sequence.

将经过多次多尺度残差注意力块的特征提取后的特征输入全波段特征提取器，全波段特征提取器由四组具有不同扩张因子的时序卷积网络块堆叠组成，内核为3，扩张因子为D，通道为。其中时序卷积网络块由输入一维卷积、深度可分离卷积和输出一维Conv组成，相邻的卷积之间插入了参数化PReLU激活函数和归一化层，并加入残差连接。全波段特征提取器后紧跟一个门控循环单元，通道数为。最后，经过内核大小为 K和输出通道为的一维卷积层，得到第三输出特征。其中=32，K=7，S=(1，4，5，8)，D=(1，2，5，9)，=512。The features extracted by multiple multi-scale residual attention blocks are input into the full-band feature extractor, which consists of four stacked temporal convolutional network blocks with different dilation factors, with a kernel of 3, a dilation factor of D, and a channel of The temporal convolutional network block consists of an input one-dimensional convolution, a depth-separable convolution, and an output one-dimensional Conv. A parameterized PReLU activation function and a normalization layer are inserted between adjacent convolutions, and a residual connection is added. The full-band feature extractor is followed by a gated recurrent unit with a channel number of Finally, after the kernel size is K and the output channel is The one-dimensional convolution layer of , obtains the third output feature. =32, K=7, S=(1, 4, 5, 8), D=(1, 2, 5, 9), =512.

如图2所示，利用多级量化器对第三输出特征进行量化。设计码本大小为M，码本维度为C，潜在帧向量个数为N。每个码本可以编码比特。对于码本初始化，使用K-Means算法对M个簇进行聚类得到初始化的码本。首先，获取编码后的潜在帧向量，多级量化器对潜在帧向量z进行量化。例如，第一级量化器得到最匹配的量化矢量。第二级量化器得到最匹配的量化矢量，第三级量化器得到最匹配的量化矢量，剩下的量化器依次进行迭代，其中码本搜索算法采用欧氏距离。最终，将量化后的索引传送到第二网络。其中M=256，C=512，N=50。As shown in Figure 2, the third output feature is quantized using a multi-level quantizer. The codebook size is designed to be M, the codebook dimension is C, and the number of potential frame vectors is N. Each codebook can encode For codebook initialization, the K-Means algorithm is used to cluster M clusters to obtain the initialized codebook First, obtain the encoded latent frame vector , the multi-level quantizer quantizes the potential frame vector z. For example, the first-level quantizer obtains the best matching quantization vector The second-level quantizer obtains the most matching quantization vector , the third-level quantizer obtains the most matching quantization vector , the remaining quantizers are iterated in turn, and the codebook search algorithm uses the Euclidean distance. The index is transmitted to the second network. Where M=256, C=512, N=50.

将量化后的矢量打包成二进制字节流进行传输，对传输的二进制字节进行解包，第二网络（解码端）根据索引在多级量化器中得到解量化后的特征，依次相加得到最终潜在帧向量。The quantized vector is packaged into a binary byte stream for transmission, and the transmitted binary bytes are unpacked. The second network (decoding end) obtains the dequantized features in the multi-level quantizer according to the index. , and add them up in sequence to get the final potential frame vector .

将解量化后的语音特征输入到与第一网络（编码器）结构相同但对称倒置的第二网络（解码器）中重建语音。首先通过内核大小为7的一维卷积，通道为。紧跟门控循环单元和全波段特征提取器，通道保持不变。然后，经过四个多尺度残差注意力模块，与编码端不同的是，每个多尺度残差注意力模块后面紧跟步幅为E，内核为步幅E的两倍的一维转置卷积，在上采样过程中通道数减倍。最后通过内核为7的一维卷积，通道数为。最后得到重建语音。其中=512，E=(8,5,4,1)，=32。The dequantized speech features The speech is reconstructed by inputting it into the second network (decoder) which has the same structure as the first network (encoder) but is symmetrically inverted. First, it is convolved in one dimension with a kernel size of 7 and a channel of . Following the gated recurrent unit and the full-band feature extractor, the channels remain unchanged. Then, after four multi-scale residual attention modules, each multi-scale residual attention module is followed by a one-dimensional transposed convolution with a stride of E and a kernel twice the stride of E, and the number of channels is doubled during the upsampling process. Finally, through a one-dimensional convolution with a kernel of 7, the number of channels is Finally, we get the reconstructed speech. =512, E=(8,5,4,1), =32.

为了进一步提高生成器的性能，引入多尺度STFT鉴别器(MS-STFT)和多周期判别器(MPD)来判断生成语音的真假，通过对抗训练促使生成器产生更加真实的合成语音。MS-STFT鉴别器由对多尺度复值STFT进行操作的相同结构化网络组成，其中实部和虚部连接。每个子网络由一个二维卷积层组成（使用具有32个通道的内核大小3x8），然后是二维卷积，在时间维度上按D的扩张率增加，在频率轴上步幅为2。内核大小为3x3和步幅(1,1)的最终二维卷积提供了最终预测。我们使用5个不同的尺度，STFT窗口长度为(2048,1024,51,256,128)。MPD是子鉴别器的混合，每个子鉴别器只接受输入语音的等间距样本。空间作为周期p给出。子鉴别器旨在通过查看输入音频的不同部分来捕获彼此不同的隐式结构。我们将周期p设置为[2,3,5,7,11]以避免尽可能重叠。首先将长度为T的1D原始音频重塑为高度T/p和宽度p的2D数据，然后将2D卷积应用于重塑的数据。在MPD的每个卷积层中，将宽度轴上的核大小限制为1，以独立处理周期样本。To further improve the performance of the generator, a multi-scale STFT discriminator (MS-STFT) and a multi-period discriminator (MPD) are introduced to judge the authenticity of the generated speech, and the generator is encouraged to produce more realistic synthetic speech through adversarial training. The MS-STFT discriminator consists of the same structured network operating on a multi-scale complex-valued STFT, where the real and imaginary parts are connected. Each sub-network consists of a 2D convolution layer (using a kernel size of 3x8 with 32 channels), followed by a 2D convolution, increasing at a dilation rate of D in the time dimension and a stride of 2 in the frequency axis. The final 2D convolution with a kernel size of 3x3 and a stride of (1,1) provides the final prediction. We use 5 different scales with STFT window lengths of (2048, 1024, 51, 256, 128). MPD is a mixture of sub-discriminators, each of which only accepts equally spaced samples of the input speech. The space is given as the period p. The sub-discriminators are designed to capture different implicit structures from each other by looking at different parts of the input audio. We set the period p to [2, 3, 5, 7, 11] to avoid overlap as much as possible. The 1D original audio of length T is first reshaped into 2D data of height T/p and width p, and then 2D convolution is applied to the reshaped data. In each convolution layer of MPD, the kernel size on the width axis is restricted to 1 to process periodic samples independently.

实施例二Embodiment 2

本公开实施例二提供了一种基于多尺度残差注意力的语音压缩系统，包括：Embodiment 2 of the present disclosure provides a speech compression system based on multi-scale residual attention, including:

更详细的步骤与实施例一中的相同，这里不再赘述。The more detailed steps are the same as those in the first embodiment and will not be repeated here.

实施例三Embodiment 3

本公开实施例三提供了一种介质，其上存储有程序，该程序被处理器执行时实现如本公开实施例一所述的基于多尺度残差注意力的语音压缩方法中的步骤。Embodiment 3 of the present disclosure provides a medium on which a program is stored. When the program is executed by a processor, the steps in the speech compression method based on multi-scale residual attention as described in Embodiment 1 of the present disclosure are implemented.

实施例四Embodiment 4

本公开实施例四提供了一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的程序，所述处理器执行所述程序时实现如本公开实施例一所述的基于多尺度残差注意力的语音压缩方法中的步骤。Embodiment 4 of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps in the speech compression method based on multi-scale residual attention as described in Embodiment 1 of the present disclosure are implemented.

以上所述仅为本公开的优选实施例而已，并不用于限制本公开，对于本领域的技术人员来说，本公开可以有各种更改和变化。凡在本公开的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本公开的保护范围之内。The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.

Claims

1. A speech compression method based on multi-scale residual attention, characterized in that it comprises the following steps:

S1 obtains multi-frame speech signals at a low rate;

S2 inputs the speech signal into the first network to perform a convolution operation to obtain a first feature, performs multiple operations on the first feature, and obtains a residual of the first feature and an identity mapping of the first feature;

S3 adds the residual and the identity mapping to obtain a first output feature, extracts global and local features of the first output feature, obtains a fused feature according to the global and local features, operates the fused feature to obtain an attention score of the fused feature, multiplies the attention score by the residual and the identity mapping respectively, and adds the two results to obtain a second output feature;

S4 inputs the second output feature into a full-band feature extractor to obtain a third output feature;

S5 performs multi-level iterative quantization on the third output feature to obtain a first vector, transmits the index of the first vector to the second network, and the second network finds the corresponding quantization vector in the codebook according to the received index, adds all the quantization vectors, and obtains a reconstructed vector;

S6 decodes the reconstructed vector to output the synthesized speech, and determines the authenticity of the generated speech through a discriminator.

2. The speech compression method based on multi-scale residual attention as described in claim 1 is characterized in that the first feature is operated multiple times to obtain the residual of the first feature, specifically: two groups of convolution operations with convolution kernel sizes of three and five are performed on the first feature, and the convolution operations between the groups are multiplied by two, the features after convolution are spliced, and multi-scale fusion is performed through one-dimensional convolution to obtain the residual of the first feature.

3. The speech compression method based on multi-scale residual attention as described in claim 2 is characterized in that a skip connection operation is performed on the first feature to obtain an identity mapping of the first feature.

4. The speech compression method based on multi-scale residual attention as claimed in claim 1, characterized in that the global feature of the first output feature is extracted, specifically:

An average pooling operation is performed on the first output feature, and a convolution operation with a convolution kernel of 1 is performed twice on the first output feature after average pooling, and batch normalization is performed on the convolved feature.

5. The speech compression method based on multi-scale residual attention as claimed in claim 4, characterized in that the local features of the first output features are extracted, specifically:

The first output feature is convolved twice with a convolution kernel of 1, and the convolved features are batch normalized.

6. The speech compression method based on multi-scale residual attention as described in claim 1, characterized in that the second output feature is input into the full-band feature extractor, and further comprises the following steps:

The second output feature is input into the multi-scale residual attention block again for feature extraction, and the operation is repeated multiple times.

7. The speech compression method based on multi-scale residual attention as claimed in claim 1, characterized in that the authenticity of the generated speech is judged by a discriminator, specifically:

Multi-scale STFT discriminators and multi-cycle discriminators are introduced to judge the authenticity of generated speech, and adversarial training is used to encourage the generator to produce more realistic synthetic speech.

8. A speech compression system based on multi-scale residual attention, characterized by comprising:

The data acquisition module is configured to: acquire a multi-frame speech signal at a low rate;

The multi-scale residual module is configured to: input the speech signal into the first network to perform a convolution operation to obtain a first feature, perform multiple operations on the first feature to obtain a residual of the first feature and an identity mapping of the first feature;

A channel attention module is configured to: add the residual and the identity mapping to obtain a first output feature, extract global and local features of the first output feature, obtain a fused feature according to the global and local features, operate the fused feature to obtain an attention score of the fused feature, multiply the attention score by the residual and the identity mapping respectively, and add the two results to obtain a second output feature;

The full-band feature extraction module is configured to: input the second output feature into the full-band feature extractor to obtain a third output feature;

The quantization and dequantization module is configured to: perform multi-level iterative quantization on the third output feature to obtain a first vector, transmit the index of the first vector to the second network, and the second network finds the corresponding quantization vector in the codebook according to the received index, and adds all the quantization vectors to obtain a reconstructed vector;

The speech synthesis module is configured to: decode the reconstructed vector to output the synthesized speech, and judge the authenticity of the generated speech through a discriminator.

9. A medium having a program stored thereon, characterized in that when the program is executed by a processor, the steps in the speech compression method based on multi-scale residual attention as described in any one of claims 1 to 7 are implemented.

10. An electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps in the multi-scale residual attention-based speech compression method as described in any one of claims 1 to 7 are implemented.