CN110619886A - End-to-end voice enhancement method for low-resource Tujia language - Google Patents
- Publication number: CN110619886A (application CN201910966022.6A)
- Authority: CN (China)
- Prior art keywords: tujia, language, network, corpus, speech
- Legal status: Granted (the status listed is an assumption and is not a legal conclusion)
Classifications
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208: Noise filtering
- G10L21/0216: Noise filtering characterised by the method used for estimating noise
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
Abstract
The invention discloses an end-to-end speech enhancement method for the low-resource Tujia language. It belongs to the field of speech signal processing and relates to speech enhancement technology for low-resource languages. Addressing the diversity, randomness and non-stationarity of the environmental noise in Tujia language data, the method achieves fast, end-to-end speech enhancement. Based on a deep convolutional generative adversarial network, an end-to-end speech enhancement model for low-resource Tujia is built and used for fast enhancement, effectively removing the environmental noise from Tujia speech with almost no distortion.
Description
Technical Field

The invention belongs to the field of speech signal processing and relates to speech enhancement technology for low-resource languages, in particular to an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network.

Background

Speech enhancement is the preprocessing stage of digital speech processing. Its aim is to extract the clean original speech signal from a noisy speech signal as far as possible, and it serves two purposes: first, to suppress background noise, improve speech quality and relieve listening fatigue, which is a subjective measure; second, to improve the intelligibility of speech, which is an objective measure. Speech recognition technology has now reached the practical stage, but many recognition systems place high demands on the acoustic environment. In practical applications, environmental noise degrades the performance of speech processing systems, so speech enhancement can effectively counter noise pollution and improve the accuracy of speech recognition systems. Speech enhancement systems are already widely used in speech communication, multimedia technology and related fields.

Traditional speech enhancement algorithms include spectral subtraction, which is computationally cheap and gives simple control over speech distortion and residual noise but tends to leave musical noise, and adaptive filters such as the Wiener and Kalman filters, which require some characteristics or statistics of the noise to be known. Time-domain subspace decomposition can also be used for speech enhancement, but it works better at low signal-to-noise ratios or with white noise. With the rapid development of deep learning, deep neural network approaches to speech enhancement have also attracted wide attention; they have clear advantages over traditional methods for non-stationary noise, but most deep network models are trained with supervision and rely on large amounts of labelled data and long training times.
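For reference, a minimal spectral-subtraction sketch (one of the traditional baselines mentioned above, not the method of the invention) might look like the following, assuming the first few frames of the recording contain only noise:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, n_fft=512):
    """Minimal spectral subtraction: estimate the noise magnitude spectrum from
    the first few frames (assumed to be speech-free) and subtract it everywhere."""
    _, _, spec = stft(noisy, fs, nperseg=n_fft)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)                   # half-wave rectification
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=n_fft)
    return enhanced
```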
Tujia, the language handed down through generations of the Tujia people in China, carries a rich cultural heritage. However, because the number of speakers has fallen sharply, the oral transmission of the language has been interrupted, and since it has no written form it already faces the danger of extinction. The area where Tujia is used is also extremely limited, and the regions where it is best preserved lie in remote, poorly connected mountain valleys. Under these conditions not only is the amount of data that can be collected very limited, but it is also difficult to find a professional recording studio: the investigation and collection of Tujia takes place in natural environments, so it is unavoidable that the audio files contain noise. Noise such as animal calls, motor vehicles, electrical hum from the recording equipment, and several people talking at once drowns the useful speech information and affects the subsequent tasks of Tujia annotation and speech recognition.

To obtain a high-quality corpus, removing the noise from Tujia speech data is a challenging research problem. With existing speech denoising methods it is difficult to support Tujia annotation and speech recognition, and the accuracy of Tujia speech recognition is low.

Summary of the Invention

To overcome the shortcomings of the prior art described above, the invention provides an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network. Addressing the diversity, randomness and non-stationarity of the environmental noise in Tujia data, it achieves fast, end-to-end speech enhancement.

The invention can lay a research foundation for a digital language resource library, improve the accuracy of subsequent speech recognition, and help phoneticians record and preserve endangered languages, presenting the language and its cultural content more directly and vividly; it is of practical significance for the protection and transmission of language and culture.

The technical solution provided by the invention is as follows:

An end-to-end speech enhancement method for the low-resource Tujia language which, based on a deep convolutional generative adversarial network, builds an end-to-end speech enhancement model for low-resource Tujia, achieves fast end-to-end enhancement of Tujia speech and effectively removes the environmental noise from Tujia speech, comprising the following steps:

1) Build the Tujia corpus: classify and segment the Tujia recordings to obtain the original noisy Tujia corpus and the original clean Tujia corpus, and cut pure-noise segments out of the original noisy corpus:

11) First, the Tujia corpus is divided into two parts according to the quality of the recordings: noise-free data (the original clean Tujia corpus) and noisy data (the original noisy Tujia corpus). In the noisy data, the unvoiced gaps between sentences also contain environmental noise, so these segments can be cut out with a speech processing tool (such as the ELAN software) to obtain pure-noise segments. Specifically, in the original noisy Tujia corpus noise is present both where there is speech and where there is none; the segments without speech are cut out and used as pure Tujia noise segments.

12) Both the noisy and the noise-free Tujia data are long narrations (long spoken passages), which have to be segmented with a speech processing tool (for example a script for Praat, the cross-platform phonetics software). The independent short utterances obtained after segmentation are still divided into the same two classes: the original noisy Tujia corpus and the original clean Tujia corpus.
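The segmentation in this method is done with ELAN and Praat scripts; purely as an illustration, a simple energy-based splitter (an assumed alternative tool chain using librosa, not the authors' scripts) could cut a long narration into utterance-level chunks:

```python
import librosa  # assumed tooling, used only for loading and silence-based splitting

def split_long_recording(path, top_db=35, min_len_s=0.5):
    """Split a long narration into utterance-level chunks at silent regions."""
    y, sr = librosa.load(path, sr=16000)
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent (start, end) pairs
    return [y[s:e] for s, e in intervals if (e - s) / sr >= min_len_s], sr
```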
2) Extend the corpus:

Because the amount of Tujia speech data is limited, a Chinese speech dataset (for example the Tsinghua University 30-hour thchs30 Chinese speech dataset) is used as extension data for Tujia, referred to as the original clean Chinese corpus, to compensate for the shortage of Tujia speech data. The pure-noise segments cut out in step 11) are added to the original clean Tujia corpus and to the original clean Chinese corpus respectively, producing two new corpora, referred to as the synthetic noisy Tujia corpus and the synthetic noisy Chinese corpus.

3) Build the end-to-end speech enhancement model:

The invention uses a Deep Convolutional Generative Adversarial Network (DCGAN) to build the end-to-end Tujia speech enhancement model and to enhance the Tujia corpus;

The end-to-end Tujia speech enhancement model consists of a generator network and a discriminator network; the generator uses an end-to-end fully convolutional encoder-decoder structure. In the adversarial training setup, the enhanced speech and the real clean speech are fed into the discriminator for classification; the discriminator tries as hard as possible to tell whether its input is real or generated, and this feedback is passed to the generator, so that the enhancement model gradually adjusts its output waveform towards the real distribution until the discriminator can no longer distinguish the authenticity of the input signal, thereby removing the noise. The invention adds spectral normalization (SN) to every convolutional layer of the network, constraining the Lipschitz constant of the network by limiting the spectral norm of each layer. During model training, unbalanced learning rates make training more stable, i.e. the generator and the discriminator are given their own learning rates and different update rates.
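PyTorch, for example, provides a spectral-normalization wrapper; a sketch of how it could be attached to a 1-D convolution (an illustrative assumption, not code taken from the patent) is:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv1d(in_ch, out_ch, kernel_size=31, stride=2):
    """1-D convolution whose weight is spectrally normalized, bounding the layer's
    spectral norm and hence helping constrain the network's Lipschitz constant."""
    conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride, padding=kernel_size // 2)
    return spectral_norm(conv)
```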
Specifically, the following operations are performed:

31) First, the time-domain waveforms of the synthetic noisy Chinese corpus are used as input to the generator network. The waveform is framed with an overlapping sliding window; in this implementation the window length is 1 second with 500 ms overlap between frames. The frames then pass through the 11 convolutional layers of the generator's encoding stage to obtain a compressed vector, which enters the generator's decoding stage. The decoding stage mirrors the encoding stage: it has 11 deconvolution layers whose kernel parameters match those of the corresponding convolutional layers. Each deconvolution layer receives both the output of the previous deconvolution layer and the output of the symmetric convolutional layer in the encoding stage; the two results are combined by weighted addition and passed to the next deconvolution layer, finally yielding the enhanced clean Chinese speech;
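A sliding-window framer matching the stated settings (1 s windows shifted by 0.5 s; the 16 kHz sampling rate is an assumption) could be sketched as:

```python
import numpy as np

def frame_waveform(wave, sr=16000, win_s=1.0, hop_s=0.5):
    """Cut a waveform into overlapping 1 s frames shifted by 0.5 s (500 ms overlap)."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = []
    for start in range(0, max(len(wave) - win, 0) + 1, hop):
        chunk = wave[start:start + win]
        if len(chunk) < win:                       # zero-pad the last frame
            chunk = np.pad(chunk, (0, win - len(chunk)))
        frames.append(chunk)
    return np.stack(frames)                        # shape: (n_frames, win)
```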
32) The discriminator network receives the original clean Chinese corpus and the enhanced Chinese speech obtained in step 31), classifies them through several convolutional layers (outputting 0 or 1), and passes the result to the generator network. The generator computes the loss value from its loss function and back-propagates it to update the weights of every layer, then starts a new round of enhancement training on the noisy corpus; the discriminator keeps receiving the generator's enhanced output and computes its loss from its loss function. This is iterated until the discriminator can no longer tell where its input came from (its output settles at 0.5), at which point the end-to-end speech enhancement model is obtained;

4) The speech enhancement model obtained in step 3) is fine-tuned and trained further to obtain the end-to-end Tujia speech enhancement model, called Fine-tuning DCGAN (FDCGAN). Concretely, the original clean Tujia corpus from step 1) and the synthetic noisy Tujia corpus from step 2) are fed as training data into the end-to-end speech enhancement model obtained in step 3), and the learning rates and batch size of the model are modified for this training, finally yielding the trained end-to-end Tujia speech enhancement model FDCGAN;

5) The Tujia data to be enhanced is fed into the trained end-to-end Tujia speech enhancement model FDCGAN obtained in step 4), which outputs the enhanced Tujia speech.

In the specific implementation, the invention uses the original noisy Tujia corpus from step 1) as test data to test the Tujia speech enhancement model obtained in step 4), and uses speech quality evaluation tools to validate and evaluate the Tujia speech enhancement model provided by the invention.

Compared with the prior art, the beneficial effects of the invention are as follows:

Addressing the diversity, randomness and non-stationarity of the environmental noise in Tujia speech data, the invention proposes a speech enhancement model based on an improved deep convolutional generative adversarial network. It performs fast enhancement and processes Tujia audio files end to end. Because Tujia is a low-resource language with a very limited amount of data, the invention uses a Chinese speech dataset as an extension, which makes the model generalize better. Compared with the prior art, the invention adds spectral normalization (SN) to every convolutional layer, constraining the Lipschitz constant of the network by limiting the spectral norm of each layer. During model training, unbalanced learning rates make training more stable, i.e. the generator and the discriminator are given their own learning rates and different update rates. Comparison with existing mainstream speech enhancement methods shows that the method effectively removes the environmental noise from Tujia speech with almost no distortion.

Brief Description of the Drawings
Fig. 1 is a flow chart of a specific embodiment of the method of the invention.

Fig. 2 is a schematic diagram of the structure of the end-to-end speech enhancement model used in the embodiment of the invention.

Fig. 3 shows the change of the loss function value of the generator network during training in the embodiment of the invention.

Fig. 4 shows the change of the loss function value of the discriminator network during training in the embodiment of the invention.

In Figs. 3-4 the horizontal axis is the number of iterations (passes) and the vertical axis is the loss function value (loss).

Fig. 5 is a spectrogram of Tujia speech before enhancement in the embodiment of the invention.

Fig. 6 is a spectrogram of Tujia speech after enhancement in the embodiment of the invention.

In Figs. 5-6 the horizontal axis is time (Time) and the vertical axis is frequency (Hz).
Detailed Description

The invention is further described below through an embodiment with reference to the drawings, but this does not limit the scope of the invention in any way.

The following embodiment uses Tujia data comprising 27 short spoken narratives with a total duration of 7 hours, 8 minutes and 59 seconds, together with the thchs30 Chinese corpus, recorded by 25 speakers with a total duration of more than 30 hours, to describe in detail how the speech enhancement method provided by the invention is carried out.

A flow chart of the implementation is shown in Fig. 1. The invention provides an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network. Because Tujia is a low-resource language, the experimental data is extended to build the database; a deep convolutional network is combined with adversarial training to enhance the speech signal, and the model is fine-tuned and retrained on the Tujia data, so the final model generalizes better and enhances better. The model takes the raw speech signal as input and outputs the enhanced speech signal; the end-to-end approach preserves the phase details of the original signal in the time domain. In the deep convolutional generative adversarial network every convolutional layer uses spectral normalization, and the training cost is reduced by modifying the loss function and the network layer parameters. Unbalanced learning rates are used when training the generator and the discriminator, which makes the training of both more stable. The specific implementation steps are as follows:

Data preprocessing and database construction:

1) The Tujia dataset is divided into two parts, one with noise and one without. Using the ELAN software (an annotation tool for creating, editing, visualizing and searching annotations on video and audio data) and Praat scripts (Praat is a cross-platform phonetics package), the speech data is segmented into short utterances, called the original noisy Tujia corpus and the original clean Tujia corpus, and the noise segments in the noisy data are cut out manually. The noise types include rooster crowing, chick calls, motor vehicle noise, electrical interference from the recording equipment, and other noise; their counts are shown in Table 1:

Table 1. Types and counts of noise in the Tujia data

2) The noise segments are overlaid onto the original clean Tujia corpus and the original clean Chinese corpus with the sox audio conversion and processing tool. The noise is superimposed by choosing a start position at random among the samples and injecting the different noise types into each speaker's recordings of the original clean Chinese corpus in proportion to the share of each noise type in the total number of noise segments, as shown in equation (1):

where Ni is the number of segments of noise type i, Mj is the number of recordings of the j-th speaker in the original clean Chinese corpus, and mij is the number of recordings of the j-th speaker in the thchs30 corpus into which noise type i is injected; in this implementation i = 1, ..., 5 and j = 1, ..., 25. Noise is injected into the original clean Tujia corpus in the same way, and the new corpora obtained in this way are called the synthetic noisy Tujia corpus and the synthetic noisy Chinese corpus.
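Equation (1) itself is not reproduced in this text; from the symbol definitions it plausibly allocates noise types in proportion to their counts, i.e. $m_{ij} = \frac{N_i}{\sum_k N_k} M_j$ (an assumed reconstruction). The injection was done with sox; a Python sketch of the same idea (illustrative only) is:

```python
import numpy as np

def inject_noise(clean, noise, rng=np.random.default_rng()):
    """Overlay one noise clip on a clean utterance, starting at a random sample."""
    if len(noise) >= len(clean):
        noise, start = noise[:len(clean)], 0
    else:
        start = int(rng.integers(0, len(clean) - len(noise) + 1))
    noisy = clean.copy()
    noisy[start:start + len(noise)] += noise          # simple additive mixing
    return noisy

def noise_allocation(noise_counts, recordings_per_speaker):
    """m_ij = N_i / sum(N) * M_j: recordings of speaker j receiving noise type i."""
    total = sum(noise_counts)
    return [[round(n_i / total * m_j) for m_j in recordings_per_speaker]
            for n_i in noise_counts]
```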
The speech enhancement model is trained as follows; the end-to-end speech enhancement model is shown in Fig. 2:

1) The speech waveforms (denoted z) of the noisy Chinese corpus are fed into the generator network. Its encoder consists of 11 one-dimensional strided convolutional layers with kernel width 31 and stride 2, with 16, 32, 32, 64, 64, 128, 128, 256, 256, 512 and 1024 filters per layer respectively. The decoder mirrors the encoder and also contains 11 deconvolution layers with the same parameters. The arrows in Fig. 2 denote skip connections: the information of a convolutional feature map is passed to the corresponding deconvolution layer, which also receives the output of the previous deconvolution layer; the two results are combined by weighted addition and passed to the next deconvolution layer, which avoids the loss of detail. Each convolutional layer uses the PReLU activation function. The output of the generator is the enhanced Chinese speech waveform, denoted G(z).
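A PyTorch sketch of a generator with this shape (11 strided 1-D convolutions of kernel width 31 and stride 2 with the listed filter counts, PReLU activations, and a mirrored decoder with weighted skip connections) is given below; the padding, the learnable skip weights, the output activation and the use of PyTorch are assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]

class Generator(nn.Module):
    def __init__(self, kernel=31):
        super().__init__()
        enc, in_ch = [], 1
        for out_ch in FILTERS:                                   # 11 strided conv layers
            enc.append(nn.Sequential(
                spectral_norm(nn.Conv1d(in_ch, out_ch, kernel, stride=2, padding=kernel // 2)),
                nn.PReLU()))
            in_ch = out_ch
        self.encoder = nn.ModuleList(enc)

        dec, channels = [], list(reversed(FILTERS[:-1])) + [1]   # mirror of the encoder
        in_ch = FILTERS[-1]
        for out_ch in channels:
            dec.append(nn.Sequential(
                spectral_norm(nn.ConvTranspose1d(in_ch, out_ch, kernel, stride=2,
                                                 padding=kernel // 2, output_padding=1)),
                nn.PReLU()))
            in_ch = out_ch
        self.decoder = nn.ModuleList(dec)
        # learnable weights for the weighted skip connections (an assumption)
        self.skip_w = nn.ParameterList([nn.Parameter(torch.tensor(0.5)) for _ in FILTERS[:-1]])

    def forward(self, z):                                        # z: (batch, 1, 16384)
        skips = []
        for layer in self.encoder:
            z = layer(z)
            skips.append(z)
        x = z                                                    # compressed representation
        for i, layer in enumerate(self.decoder):
            x = layer(x)
            if i < len(self.skip_w):                             # add the mirrored encoder output
                x = x + self.skip_w[i] * skips[-(i + 2)]
        return torch.tanh(x)                                     # enhanced waveform G(z)
```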
2) The generated enhanced Chinese speech G(z) and the original clean Chinese speech are fed into the discriminator network. The discriminator is a one-dimensional binary-classification convolutional network with two channels for the input sources, each channel holding 16384 samples; its last layer is a 1×1 convolution, and every layer uses the LeakyReLU non-linear activation function with alpha 0.3. The loss function LD of the discriminator is written as equation (2):
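Equation (2) is not reproduced in this text; a standard adversarial discriminator loss consistent with the symbol definitions that follow would be (an assumed reconstruction, not necessarily the patent's exact formula):

$$L_D = -\,\mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big] - \mathbb{E}_{z \sim P_z}\big[\log\big(1 - D(G(z))\big)\big]$$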
where x is the clean speech, Pdata is the distribution that the clean speech x follows, z is the noisy speech, and Pz is the distribution that the noisy speech z follows. If the input is G(z), the discriminator output D(G(z)) should be 0; if the input is x, the discriminator output D(x) should be 1.

3) The discriminator passes its decision to the generator, and the generator computes its loss function LG according to equation (3):
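Equation (3) is likewise not reproduced here; under the same assumption of a standard adversarial formulation (SEGAN-style systems often add an L1 term $\lambda\lVert G(z) - x \rVert_1$ towards the clean reference, which is a further assumption), the generator loss would be:

$$L_G = -\,\mathbb{E}_{z \sim P_z}\big[\log D\big(G(z)\big)\big]$$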
The two networks back-propagate the loss values and update the weights of their layers until D(G(z)) = D(x) = 0.5, i.e. the discriminator can no longer tell whether the input signal is the original clean speech or the clean speech enhanced by the generator; training is then complete.

4) The generator and the discriminator are set to update at a 1:1 rate; while one of them is being trained, the other is kept frozen. The generator learning rate is 0.0001, the discriminator learning rate is 0.0003, and the batch size is 24. During model training, the invention uses spectral normalization and unbalanced learning rates to make training more stable. Spectral normalization limits the spectral norm of each layer to constrain the Lipschitz constant of the discriminator; since the gradient of a Lipschitz-continuous function has a bounded upper limit, the function is smoother, parameter changes during optimization of the neural network are more stable, and gradient explosion is less likely. The parameters of the generator and the discriminator are updated with learning rates a(n) and b(n) respectively, as expressed in equations (4) and (5):
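Equations (4) and (5) are not reproduced in this text. In the two-time-scale update rule they describe, with learning rates a(n) for the generator and b(n) for the discriminator, the updates would take the assumed form

$$\theta_{n+1} = \theta_n + a(n)\,\big(h(\theta_n, \omega_n) + M^{(\theta)}_n\big) \quad (4)$$

$$\omega_{n+1} = \omega_n + b(n)\,\big(g(\theta_n, \omega_n) + M^{(\omega)}_n\big) \quad (5)$$

where $M^{(\theta)}_n$ and $M^{(\omega)}_n$ denote the random terms referred to below.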
where θn, h(θn, ωn) and the associated random vector are the parameter vector, stochastic gradient and random term of the n-th update of the generator network, and ωn, g(θn, ωn) and the associated random vector are the parameter vector, stochastic gradient and random term of the n-th update of the discriminator network.

5) After the speech enhancement model has been trained on the Chinese corpus as described above, the synthetic noisy Tujia corpus and the original clean Tujia corpus are fed in; with the other parameters unchanged, the generator learning rate is set to 0.00006, the discriminator learning rate to 0.0001 and the batch size to 16, and the model is trained again so that it generalizes better. The changes of the loss functions of the generator and the discriminator during training are shown in Fig. 3 and Fig. 4.
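A sketch of the two-stage training with unbalanced learning rates and 1:1 alternating updates follows; the Adam optimizer, the binary cross-entropy loss and a sigmoid-ended, single-input discriminator are simplifying assumptions, while the learning rates and batch sizes follow the values stated above:

```python
import torch
import torch.nn as nn

def make_optimizers(G, D, lr_g=1e-4, lr_d=3e-4):
    """Unbalanced learning rates: the discriminator takes larger steps than the generator."""
    return (torch.optim.Adam(G.parameters(), lr=lr_g),
            torch.optim.Adam(D.parameters(), lr=lr_d))

bce = nn.BCELoss()  # D is assumed to end in a sigmoid so that BCE applies

def train_step(G, D, noisy, clean, opt_g, opt_d):
    """One 1:1 alternating update; each network is effectively frozen during the
    other's half-step because only one optimizer is stepped at a time."""
    # discriminator step: real clean speech -> 1, enhanced speech -> 0
    opt_d.zero_grad()
    real, fake = D(clean), D(G(noisy).detach())
    loss_d = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    loss_d.backward()
    opt_d.step()
    # generator step: try to make the discriminator output 1 for enhanced speech
    opt_g.zero_grad()
    fake = D(G(noisy))
    loss_g = bce(fake, torch.ones_like(fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Stage 1: pre-train on the synthetic noisy Chinese corpus (batch size 24, lr 1e-4 / 3e-4).
# Stage 2: fine-tune on the Tujia corpora with smaller steps (batch size 16, lr 6e-5 / 1e-4):
# opt_g, opt_d = make_optimizers(G, D, lr_g=6e-5, lr_d=1e-4)
```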
6) Finally the model is tested with the original noisy Tujia corpus. Fig. 5 is the spectrogram of noisy Tujia speech without enhancement and Fig. 6 is the spectrogram after enhancement; the comparison shows that the proposed method effectively removes the environmental noise in the Tujia data.

The enhancement method proposed for Tujia speech data is compared with commonly used traditional speech enhancement methods and with a speech enhancement method based on a deep recurrent neural network. The evaluation metrics are the Perceptual Evaluation of Speech Quality (PESQ) and the Mean Opinion Score, Listening Quality Objective (MOS-LQO). PESQ is a typical algorithm for speech quality evaluation; it uses a linear scoring scale and is widely used, with values between -0.5 and 4.5 that express the quality of the output speech relative to the input test speech: the higher the score, the better the speech quality. The evaluation results are shown in Table 2:

Table 2. Comparison of evaluation results for different enhancement methods

The results in Table 2 show that the end-to-end speech enhancement method based on a deep convolutional generative adversarial network proposed by the invention effectively removes the environmental noise in Tujia speech, achieves a better enhancement effect, and lays a stable foundation for speech recognition.
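Scores such as those summarized in Table 2 can be computed with, for example, the open-source pesq package (an assumed tool, not named in the patent; the file names are hypothetical and the audio is assumed to be sampled at 16 kHz):

```python
from pesq import pesq      # pip install pesq
import soundfile as sf

ref, sr = sf.read("tujia_clean.wav")          # hypothetical reference file
deg, _ = sf.read("tujia_enhanced.wav")        # hypothetical enhanced file
score = pesq(sr, ref, deg, "wb")              # wide-band PESQ, roughly -0.5 .. 4.5
print(f"PESQ = {score:.2f}")
```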
It should be noted that the purpose of disclosing the embodiment is to help further understanding of the invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what is disclosed in the embodiment; the scope of protection claimed is defined by the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910966022.6A CN110619886B (en) | 2019-10-11 | 2019-10-11 | An end-to-end speech enhancement method for low-resource Tujia language |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110619886A true CN110619886A (en) | 2019-12-27 |
CN110619886B CN110619886B (en) | 2022-03-22 |
Family ID: 68925699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910966022.6A Active CN110619886B (en) | 2019-10-11 | 2019-10-11 | An end-to-end speech enhancement method for low-resource Tujia language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110619886B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190214038A1 (en) * | 2016-05-06 | 2019-07-11 | Eers Global Technologies Inc. | Device and method for improving the quality of in-ear microphone signals in noisy environments |
CN107293289A (en) * | 2017-06-13 | 2017-10-24 | 南京医科大学 | A kind of speech production method that confrontation network is generated based on depth convolution |
CN109147810A (en) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network |
CN109326302A (en) * | 2018-11-14 | 2019-02-12 | 桂林电子科技大学 | A speech enhancement method based on voiceprint comparison and generative adversarial network |
CN109524020A (en) * | 2018-11-20 | 2019-03-26 | 上海海事大学 | A kind of speech enhan-cement processing method |
CN110111803A (en) * | 2019-05-09 | 2019-08-09 | 南京工程学院 | Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference |
Non-Patent Citations (3)
Title |
---|
NEIL SHAH ET AL: "Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network", 《PROCEEDINGS, APSIPA ANNUAL SUMMIT AND CONFERENCE 2018》 * |
- 徐一峰: "生成对抗网络理论模型和应用综述" [A survey of generative adversarial network models and applications], 《金华职业技术学院学报》 (Journal of Jinhua Polytechnic) *
- 袁文浩等: "利用生成噪声提高语音增强方法的泛化能力" [Improving the generalization ability of speech enhancement methods with generated noise], 《电子学报》 (Acta Electronica Sinica) *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112002343A (en) * | 2020-08-18 | 2020-11-27 | 海尔优家智能科技(北京)有限公司 | Speech purity recognition method and device, storage medium and electronic device |
CN112002343B (en) * | 2020-08-18 | 2024-01-23 | 海尔优家智能科技(北京)有限公司 | Speech purity recognition method and device, storage medium and electronic device |
CN112185417A (en) * | 2020-10-21 | 2021-01-05 | 平安科技(深圳)有限公司 | Method and device for detecting artificially synthesized voice, computer equipment and storage medium |
CN112185417B (en) * | 2020-10-21 | 2024-05-10 | 平安科技(深圳)有限公司 | Method and device for detecting artificial synthesized voice, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110619886B (en) | 2022-03-22 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |