
CN116469394A - A Robust Speaker Recognition Method Based on Spectral Graph Denoising and Adversarial Learning - Google Patents


Info

Publication number
CN116469394A
CN116469394A
Authority
CN
China
Prior art keywords
spectrogram
speaker
mel
network
tdnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310425824.2A
Other languages
Chinese (zh)
Other versions
CN116469394B (en)
Inventor
张烨
常浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN202310425824.2A priority Critical patent/CN116469394B/en
Publication of CN116469394A publication Critical patent/CN116469394A/en
Application granted granted Critical
Publication of CN116469394B publication Critical patent/CN116469394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms of the same speech after noise addition are collected. A U-shaped network (U-Net) with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to remove the noise interference from the Mel spectrogram of the noisy speech signal, yielding an enhanced Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is then trained with a least-squares loss function: the time-delay neural network (TDNN) serves as the generator of the TDNN-CGAN and extracts deep features from the enhanced Mel spectrogram, while a multi-layer perceptron (MLP) serves as the discriminator. Finally, a speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments. The deep features extracted from noisy speech are driven close to those extracted from clean speech, which improves the performance of the speaker recognition system in noisy environments.

Description

Robust speaker recognition method based on spectrogram denoising and adversarial learning
Technical Field
The invention belongs to the technical field of speech processing and relates to a robust speaker recognition method based on spectrogram denoising and adversarial learning.
Background
In real environments, the speech input to a speaker recognition system is often disturbed by various background noises and reverberation. Noise added to clean speech obscures acoustic details and degrades speech intelligibility and quality, so the performance of the speaker recognition system drops. Common ways to improve robustness are to train the system on a dataset consisting of both clean and noisy data, or to add a speech enhancement front end, where speech enhancement refers to extracting the useful speech signal from the noisy background after the signal has been corrupted by noise. However, speech enhancement may distort the speech and can even reduce recognition performance. Since neural networks have strong feature extraction ability, they can instead be used to extract noise-free frequency-domain features directly from the frequency-domain features of noise-corrupted speech. In addition, generative adversarial networks (GANs) are widely studied and have been applied to many speech and audio tasks, mainly for domain conversion and for generating more realistic data distributions, and GANs show potential for extracting noise-robust features.
Disclosure of Invention
The invention aims to provide a robust speaker recognition method based on spectrogram denoising and adversarial learning, so as to solve the problems described in the background.
First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms of the same speech after noise addition are collected. A U-shaped network (U-Net) with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to extract an enhanced Mel spectrogram from the noisy Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is trained with a least-squares loss function, with a multi-layer perceptron (MLP) as the discriminator and the time-delay neural network (TDNN) as the generator that extracts deep features from the enhanced Mel spectrogram. Finally, an MLP-based speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments.
The specific steps of the speaker recognition method are as follows:
step one: will clean the speech s c Adding noise n to obtain noisy speech s n =s c +n, using Hamming window to pass clean speech s c Noise-containing speech s n Dividing into short frames, extracting Mel eigenvectors from each frame, and respectively forming two eigenvectors:wherein x is c (t)、x n (T) Mel feature vectors respectively representing the T-th frame of a Mel spectrogram of clean, noisy speech, T represents the number of speech frames, T e 1, a., T, the superscript T denotes the transpose and D denotes the dimension of the feature vector.
Step two: input the noisy Mel spectrogram X_n of the noisy speech into the U-Net with a multi-stage encoder-decoder structure to obtain X_n* = Enhance(X_n), where Enhance(·) denotes extracting the enhanced Mel spectrogram X_n* from the noisy Mel spectrogram X_n. The U-Net is trained with a mean-square-error loss as the spectrogram denoising loss, whose expression is L_MSE = (1/T) Σ_{t=1}^{T} ||x_n*(t) − x_c(t)||², where x_n*(t) is the t-th frame of the enhanced Mel spectrogram.
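The spectrogram denoising loss of step two can be written down directly. A minimal numpy sketch, assuming the squared frame errors are averaged over the T frames:

```python
import numpy as np

def denoise_loss(X_c, X_n_enh):
    """Mean-square-error spectrogram denoising loss:
    L = (1/T) * sum_t ||x_c(t) - x_n*(t)||^2  over T frames of D mel bins."""
    assert X_c.shape == X_n_enh.shape
    return np.mean(np.sum((X_c - X_n_enh) ** 2, axis=1))

X_c = np.ones((100, 80))       # clean mel spectrogram, T=100 frames, D=80 bins
X_n_enh = np.zeros((100, 80))  # enhanced mel spectrogram from the U-Net
print(denoise_loss(X_c, X_n_enh))  # 80.0
```

The loss is zero exactly when the U-Net output matches the clean spectrogram frame by frame, which is the training target.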
step three: clean mel spectrogram X c Enhanced mel profile X n * Respectively inputting TDNN-CGAN, respectively extracting clean Mel spectrogram and enhancing depth characteristic E of Mel spectrogram by generator-Time Delay Neural Network (TDNN) in TDNN-CGAN c =G(X c )、E n =G(X n * ). Inputting the extracted depth features into a discriminant-multi-layer perceptron (MLP), G (-) and D (-) representing the outputs of the generator and discriminant, respectively, and generating a least squares solution into an countermeasure network (LSGAN)The expression of the discriminant loss is as follows:
network parameters of the fixed authentication network are used for enhancing depth characteristic E of the Mel spectrogram n The input discriminator calculates the generation loss in the countermeasure learning, and is used for training the generator TDNN so that the depth characteristic extracted from the noisy speech is more approximate to the depth characteristic of the clean speech, and the expression of the generation loss is as follows:
step four: depth feature E to be extracted from enhanced mel spectrogram n And inputting the speaker classifier, and training the speaker classifier through cross entropy loss to realize speaker recognition under a noise environment, namely robust speaker recognition.
The beneficial effects of the invention are as follows:
according to the invention, through spectrogram denoising and countermeasure learning, a U-Net Mel spectrogram enhancement network is adopted, and a joint training scheme of the countermeasure network TDNN-CGAN and the speaker classifier is generated based on the condition of a time delay neural network, so that the depth characteristic extracted from noisy voice is close to the depth characteristic extracted from clean voice, and the performance of the speaker recognition system in a noise environment is improved.
Drawings
Fig. 1 is a schematic diagram of the robust speaker recognition method based on spectrogram denoising and adversarial learning according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions and advantages more apparent. The specific embodiments described herein only illustrate the technical solution of the invention and are not to be construed as limiting it.
As shown in fig. 1, the invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a clean speech signal is mixed with a noise signal to obtain a noisy speech signal, and frequency-domain features (here, Mel spectrograms) of both signals are extracted. Second, the Mel spectrogram of the noisy speech is fed into a U-Net-based spectrogram enhancement network, which removes the noise interference and outputs an enhanced Mel spectrogram. The Mel spectrogram of the clean speech and the enhanced Mel spectrogram of the noisy speech are then fed into the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the generator encodes them into deep features of the clean and noisy speech signals. Adversarial learning with the discriminator then drives the enhanced deep features extracted from the noisy speech closer to the deep features of the clean speech. Finally, the enhanced deep features are fed into a speaker classifier, realizing speaker recognition in noisy environments.
The invention will be further illustrated by the following implementation steps.
Step one: first, voice activity detection (VAD) is applied to a segment of speaker speech to remove silence, and a 3 s segment is cut out as the clean speech. A 3 s noise segment taken at random from a noise database is added linearly to obtain a noisy copy of the clean speech. The clean speech and its noisy copy are then pre-emphasized, framed with a Hamming window, and Mel features are extracted, giving the Mel spectrogram of the clean speech X_c = [x_c(1), …, x_c(T)]^T ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n = [x_n(1), …, x_n(T)]^T ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms respectively, T is the number of speech frames, t ∈ {1, …, T}, the superscript T denotes the transpose, and D is the dimension of the feature vector.
Step two: the Mel spectrogram X_n of the noisy speech signal is the input of the U-Net-based spectrogram denoising network. The network has a multi-stage encoder-decoder structure: in the encoding stage, the input feature map passes through 5 convolution layers in sequence for feature compression, producing a hidden vector c; in the decoding stage, the hidden vector c passes through 5 deconvolution layers in sequence for feature reconstruction, producing the enhanced feature X_n* = Enhance(X_n), where Enhance(·) denotes the U-Net-based spectrogram denoising process that extracts the enhanced Mel spectrogram X_n* from the Mel spectrogram X_n of the noisy speech.
The convolution layers of the encoding stage are all 2-D convolutions, with input channel numbers 1, 16, 32 and 64 and output channel numbers 16, 32, 64 and 64, and each convolution layer has a 4×1 kernel. The layers of the decoding stage are all 2-D deconvolutions, with input channel numbers 64, 128, 64 and 32 and output channel numbers 64, 32, 16 and 1, and each deconvolution layer also has a 4×1 kernel. A PReLU activation function follows every convolution layer. In addition, each encoding layer is connected to its corresponding decoding layer, bypassing the feature compression performed in the middle of the model and passing the fine-grained information of the feature map directly to the decoding stage.
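The channel bookkeeping of the skip connections can be illustrated as follows. The patent does not state how encoder and decoder feature maps are combined; concatenation along the channel axis is a common U-Net choice and would explain a decoder layer seeing 128 input channels when the deepest encoder output has 64. The spatial sizes below are placeholders:

```python
import numpy as np

# Hypothetical shapes (channels, time, mel bins): the skip connection
# concatenates the homologous encoder feature map (64 ch) with the
# decoder's previous output (64 ch), giving a 128-channel decoder input.
enc_out = np.zeros((64, 50, 80))   # deepest encoder feature map
dec_prev = np.zeros((64, 50, 80))  # first decoder stage output
dec_in = np.concatenate([dec_prev, enc_out], axis=0)
print(dec_in.shape)  # (128, 50, 80)
```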
Step three: the clean Mel spectrogram X_c and the enhanced Mel spectrogram X_n* are fed into the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, giving the deep feature E_c of the clean Mel spectrogram and the deep feature E_n of the enhanced Mel spectrogram. In the encoding network, the input feature map passes through 4 1-D convolution layers in sequence; each of the 4 convolution layers is followed by batch normalization and a Dropout layer whose parameter p is set to 0.1. The output of the l-th convolution layer is F_l ∈ R^{T'×D'}, where l = 1, 2, 3, 4, T' denotes the number of frames and D' denotes the per-frame deep feature dimension.
Considering that in a neural network the deep features extracted by every layer contain information about the original input, the outputs of the convolution layers are added element-wise to aggregate the features, giving the aggregated feature F = F_1 + F_2 + F_3 + F_4 ∈ R^{T'×D'}. Statistics pooling is then applied to the aggregated feature: its mean and standard deviation over the frames are concatenated to give the utterance-level feature E_statistics ∈ R^{2D'}. Finally, a fully connected layer converts E_statistics into a fixed 256-dimensional vector, which is the deep feature extracted by the encoder.
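The aggregation and statistics-pooling step can be sketched with numpy. Only the element-wise addition and the concatenation of mean and standard deviation follow the text; the layer count of 4 is from the text, while T' = 200 and D' = 512 are illustrative assumptions:

```python
import numpy as np

def statistics_pooling(layer_outputs):
    """Aggregate per-layer TDNN features by element-wise addition, then
    pool over frames by concatenating mean and standard deviation,
    turning a (T', D') sequence into a fixed 2*D' utterance vector."""
    agg = np.sum(layer_outputs, axis=0)   # (T', D') aggregated feature F
    mean = agg.mean(axis=0)               # (D',) per-dimension mean
    std = agg.std(axis=0)                 # (D',) per-dimension std
    return np.concatenate([mean, std])    # (2*D',) utterance-level E_statistics

rng = np.random.default_rng(1)
layers = rng.standard_normal((4, 200, 512))  # 4 conv layers, T'=200, D'=512
e_stats = statistics_pooling(layers)
print(e_stats.shape)  # (1024,)
```

A final fully connected layer (not shown) would map this 2D'-dimensional vector to the fixed 256-dimensional deep feature.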
Step four: through adversarial learning, the discriminator is trained to correctly distinguish E_c from E_n; the discriminator is then fixed and the generator TDNN is trained. The discriminator consists of three fully connected layers, the last of which has 2 output nodes. The objective function for training the discriminator is L_D = ½ E[(D(E_c) − 1)²] + ½ E[(D(E_n))²].
step five: depth features extracted from the enhanced mel-spectrum are input to a speaker classifier composed of a fully connected layer and a Softmax layer, which is trained by cross entropy loss.
Step six:
(1) Following steps one and two, the noisy Mel spectrogram of the noisy speech is fed into the U-Net-based spectrogram denoising network to obtain the enhanced Mel spectrogram, and the mean-square error (MSE) against the Mel spectrogram of the clean speech is computed to train the U-Net-based spectrogram denoising network.
(2) Following steps three and four, the discriminator in the TDNN-CGAN is first trained with the discrimination loss; its network parameters are then fixed, the generation loss is computed, and the spectrogram enhancement network and the generator in the TDNN-CGAN are trained by back-propagation.
(3) Following step five, the cross-entropy loss is computed from the true speaker label of the utterance and the prediction of the speaker classifier, and the spectrogram enhancement network, the generator and the speaker classifier are trained by back-propagation.
(4) The above training alternates until the network loss converges; training is then stopped and the U-Net spectrogram enhancement network model, the generator network model of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the network model of the speaker classifier are saved.
Step seven: using the U-Net spectrogram enhancement network model, the generator network model of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the speaker classifier model saved in step six, the speaker identity of noisy speech is recognized according to steps two, three and five, realizing speaker recognition in noisy environments.
To verify the performance of the speaker recognition method, 340 speakers are selected from the Aishell-1 dataset, with 40 utterances per speaker, each cut to a 3 s segment. For each speaker, 20 utterances serve as the clean speech of the training set; they are randomly mixed with noise from the Musan noise dataset at one of the signal-to-noise ratios 0, 5, 10, 15 and 20 dB to obtain noisy copies, and the clean speech together with the noisy copies forms the training set. The remaining 20 utterances are mixed with three different noise types from the Musan noise dataset, either without noise or at the five signal-to-noise ratios 0, 5, 10, 15 and 20 dB, yielding 16 test sets used to compute the speaker recognition accuracy; the higher the accuracy, the better the recognition performance of the speaker recognition system.
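The random mixing at a chosen signal-to-noise ratio used to build these training and test sets can be sketched as follows; the 16 kHz sampling rate and Gaussian stand-ins for speech and noise are assumptions for illustration:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the target SNR in dB, as when
    building noisy copies of clean utterances (e.g. Aishell-1 + Musan)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # choose a scale so that p_clean / (scale^2 * p_noise) = 10^(snr_db/10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(2)
clean = rng.standard_normal(48000)   # 3 s segment at 16 kHz
noise = rng.standard_normal(48000)
mixed = mix_at_snr(clean, noise, snr_db=0)
# at 0 dB the scaled noise power matches the clean-speech power
```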
In addition, a baseline speaker recognition system consisting of a TDNN and a speaker classifier is constructed and trained with a cross-entropy loss, and a second system consisting of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN and a speaker classifier is constructed and trained with adversarial learning and a cross-entropy loss. The test results used to evaluate the effectiveness of the robust speaker recognition system based on feature enhancement and adversarial learning are shown in table 1.
As can be seen from table 1, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves 99.68% speaker recognition accuracy without added noise, and 89.72%, 92.50% and 91.47% under the three different noise types at a signal-to-noise ratio of 0 dB. The TDNN + speaker classifier achieves 98.24% without added noise and 79.63%, 83.82% and 86.29% under the three noise types at 0 dB; the TDNN-CGAN + speaker classifier achieves 99.12% without added noise and 86.93%, 90.11% and 87.64% under the three noise types at 0 dB.
With the same training data, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves higher speaker recognition accuracy than the two baseline systems in both the noise-free and the noisy environments, i.e. the best robust speaker recognition performance. The robust speaker recognition system based on spectrogram denoising and adversarial learning of the present invention is therefore effective.
TABLE 1 speaker recognition accuracy
The foregoing describes only preferred embodiments of the present invention in specific detail and is not to be construed as limiting its scope. Modifications, improvements and substitutions made by those skilled in the art without departing from the spirit of the invention fall within its scope, which is to be determined by the appended claims.

Claims (1)

1. A robust speaker recognition method based on spectrogram denoising and adversarial learning, characterized in that: a U-Net with a multi-stage encoder-decoder structure removes the noise interference from the Mel spectrogram of a noisy speech signal to obtain an enhanced Mel spectrogram; a conditional generative adversarial network TDNN-CGAN based on a time-delay neural network extracts the deep features of the enhanced Mel spectrogram; and the obtained deep features are input into a speaker classifier to identify the speaker;
the speaker recognition method comprises the following specific steps:
(1) noise is added to clean speech to obtain noisy speech, which is framed and windowed and from which Mel spectrograms are extracted, giving the Mel spectrogram of the clean speech X_c = [x_c(1), …, x_c(T)]^T ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n = [x_n(1), …, x_n(T)]^T ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms respectively, T represents the number of speech frames, t ∈ {1, …, T}, the superscript T denotes a transpose, and D represents the dimension of the per-frame Mel feature vector;
(2) X_n is input into the U-Net spectrogram enhancement network to obtain the enhanced Mel spectrogram X_n*; the U-Net spectrogram enhancement network is trained with a mean-square-error loss as the spectrogram enhancement loss, whose expression is L_MSE = (1/T) Σ_{t=1}^{T} ||x_n*(t) − x_c(t)||²;
(3) X_c and X_n* are respectively input into the TDNN-CGAN, and the generator in the TDNN-CGAN extracts the deep feature E_c = G(X_c) of X_c and the deep feature E_n = G(X_n*) of X_n*; E_c and E_n are respectively input into the discriminator in the TDNN-CGAN, which is trained with the discrimination loss of the least-squares generative adversarial network, whose expression is L_D = ½ E[(D(E_c) − 1)²] + ½ E[(D(E_n))²];
g (-) and D (-) represent the output of the generator and arbiter, respectively; fixing network parameters of the discriminator, will E n The input discriminator generates a generation loss training generator in the countermeasure network according to least squares, and the expression of the generation loss is as follows:
(4) E_c and E_n are input into the speaker classifier, which is trained with a cross-entropy loss to identify the speaker.
CN202310425824.2A 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning Active CN116469394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310425824.2A CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310425824.2A CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Publications (2)

Publication Number Publication Date
CN116469394A true CN116469394A (en) 2023-07-21
CN116469394B CN116469394B (en) 2025-09-02

Family

ID=87176661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310425824.2A Active CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Country Status (1)

Country Link
CN (1) CN116469394B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037843A (en) * 2023-09-11 2023-11-10 中南大学 A speech adversarial sample generation method, device, terminal equipment and medium
CN117194897A (en) * 2023-09-12 2023-12-08 上海交通大学 RFID-based speech perception method
CN118506792A (en) * 2024-07-18 2024-08-16 青岛科技大学 Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN
CN119207433A (en) * 2024-09-19 2024-12-27 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119811361A (en) * 2025-02-20 2025-04-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN119943057A (en) * 2025-01-07 2025-05-06 武汉大学 A method and system for voice adversarial defense against speaker recognition system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN114530156A (en) * 2022-02-25 2022-05-24 国家电网有限公司 Generation countermeasure network optimization method and system for short voice speaker confirmation
US20220208198A1 (en) * 2019-04-01 2022-06-30 Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN114822555A (en) * 2022-03-29 2022-07-29 南昌大学 A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
US20220208198A1 (en) * 2019-04-01 2022-06-30 Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN114530156A (en) * 2022-02-25 2022-05-24 国家电网有限公司 Generation countermeasure network optimization method and system for short voice speaker confirmation
CN114822555A (en) * 2022-03-29 2022-07-29 南昌大学 A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SNYDER, D et al.: "DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION", 2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 3 May 2017 (2017-05-03) *
ZHANG Jiacheng: "Speaker recognition based on multi-scale frequency-domain features and parallel neural networks", China Master's Theses Full-text Database, Information Science and Technology, 15 February 2023 (2023-02-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037843A (en) * 2023-09-11 2023-11-10 中南大学 A speech adversarial sample generation method, device, terminal equipment and medium
CN117194897A (en) * 2023-09-12 2023-12-08 上海交通大学 RFID-based speech perception method
CN117194897B (en) * 2023-09-12 2025-11-11 上海交通大学 Voice sensing method based on RFID
CN118506792A (en) * 2024-07-18 2024-08-16 青岛科技大学 Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN
CN118506792B (en) * 2024-07-18 2024-10-18 青岛科技大学 Marine mammal call data enhancement method based on improved Inception block and SACGAN
CN119207433A (en) * 2024-09-19 2024-12-27 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119207433B (en) * 2024-09-19 2025-10-10 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119943057A (en) * 2025-01-07 2025-05-06 武汉大学 A method and system for voice adversarial defense against speaker recognition system
CN119811361A (en) * 2025-02-20 2025-04-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN119811361B (en) * 2025-02-20 2025-10-10 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium

Also Published As

Publication number Publication date
CN116469394B (en) 2025-09-02

Similar Documents

Publication Publication Date Title
CN116469394B (en) A robust speaker recognition method based on spectrogram denoising and adversarial learning
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
Yang et al. Characterizing speech adversarial examples using self-attention u-net enhancement
Zhang et al. X-TaSNet: Robust and accurate time-domain speaker extraction network
CN108447495B (en) A Deep Learning Speech Enhancement Method Based on Comprehensive Feature Set
Do et al. Speech source separation using variational autoencoder and bandpass filter
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
CN114360571B (en) Reference-based speech enhancement method
CN118212929A (en) A personalized Ambisonics speech enhancement method
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
Chao et al. Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR
Tu et al. DNN training based on classic gain function for single-channel speech enhancement and recognition
Wu et al. A fused speech enhancement framework for robust speaker verification
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Le et al. Personalized speech enhancement combining band-split rnn and speaker attentive module
TWI749547B (en) Speech enhancement system based on deep learning
Al-Ali et al. Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions
CN120319258A (en) A speech enhancement method based on bispectral nonlinear feature coupling
CN111681649B (en) Speech recognition method, interactive system and performance management system including the system
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
CN116648747A (en) Device for providing a processed audio signal, method for providing a processed audio signal, device for providing neural network parameters and method for providing neural network parameters
Lan et al. Embedding encoder-decoder with attention mechanism for monaural speech enhancement
CN116229992A (en) A voice lie detection method, device, medium and equipment
Li et al. Aligning noisy-clean speech pairs at feature and embedding levels for learning noise-invariant speaker representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant