
CN116469394A - A Robust Speaker Recognition Method Based on Spectral Graph Denoising and Adversarial Learning - Google Patents


Info

Publication number
CN116469394A
CN116469394A
Authority
CN
China
Prior art keywords
spectrogram
speaker
mel
network
tdnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310425824.2A
Other languages
Chinese (zh)
Other versions
CN116469394B (en)
Inventor
张烨
常浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN202310425824.2A priority Critical patent/CN116469394B/en
Publication of CN116469394A publication Critical patent/CN116469394A/en
Application granted granted Critical
Publication of CN116469394B publication Critical patent/CN116469394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms of the same speech after noise addition are collected. A U-shaped network (U-Net) with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to remove the noise interference from the Mel spectrogram of the noisy speech signal, yielding an enhanced Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is then trained with a least-squares loss function: the time-delay neural network (TDNN) serves as the generator of the TDNN-CGAN and extracts deep features from the enhanced Mel spectrogram, while a multi-layer perceptron (MLP) serves as the discriminator. Finally, a speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments. The deep features extracted from noisy speech are driven close to those extracted from clean speech, which improves the performance of the speaker recognition system in noisy environments.

Description

Robust speaker recognition method based on spectrogram denoising and adversarial learning
Technical Field
The invention belongs to the technical field of speech processing and relates to a robust speaker recognition method based on spectrogram denoising and adversarial learning.
Background
In real environments, the speech input to a speaker recognition system is often disturbed by various background noises and reverberation. Noise added to clean speech obscures acoustic details and degrades speech intelligibility and quality, so the performance of the speaker recognition system drops. Common ways to improve robustness are to train the system on a dataset consisting of both clean and noisy data, or to add a speech enhancement front end, where speech enhancement refers to extracting the useful speech signal from the noisy background after the signal has been corrupted by noise. However, speech enhancement may distort the speech and can even reduce recognition performance. Since neural networks have strong feature extraction ability, they can instead be used to extract noise-free frequency-domain features directly from the frequency-domain features of noise-corrupted speech. In addition, generative adversarial networks (GANs) are widely studied and have been applied to many speech and audio tasks, mainly for domain conversion and for generating more realistic data distributions, and GANs show potential for extracting noise-robust features.
Disclosure of Invention
The invention aims to provide a robust speaker recognition method based on spectrogram denoising and adversarial learning, so as to solve the problems described in the background.
First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms of the same speech after noise addition are collected. A U-shaped network (U-Net) with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to extract an enhanced Mel spectrogram from the noisy Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is trained with a least-squares loss function, with a multi-layer perceptron (MLP) as the discriminator and the time-delay neural network (TDNN) as the generator that extracts deep features from the enhanced Mel spectrogram. Finally, an MLP-based speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments.
The specific steps of the speaker recognition method are as follows:
step one: will clean the speech s c Adding noise n to obtain noisy speech s n =s c +n, using Hamming window to pass clean speech s c Noise-containing speech s n Dividing into short frames, extracting Mel eigenvectors from each frame, and respectively forming two eigenvectors:wherein x is c (t)、x n (T) Mel feature vectors respectively representing the T-th frame of a Mel spectrogram of clean, noisy speech, T represents the number of speech frames, T e 1, a., T, the superscript T denotes the transpose and D denotes the dimension of the feature vector.
Step two: input the noisy Mel spectrogram X_n of the noisy speech into the U-Net with a multi-stage encoder-decoder structure to obtain X_n* = Enhance(X_n), where Enhance(·) denotes extracting the enhanced Mel spectrogram X_n* from the noisy Mel spectrogram X_n. The U-Net is trained with a mean-square-error loss as the spectrogram denoising loss, whose expression is L_MSE = (1/T) Σ_{t=1}^{T} ||x_n*(t) − x_c(t)||², where x_n*(t) is the t-th frame of the enhanced Mel spectrogram.
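The spectrogram denoising loss of step two can be written down directly. A minimal numpy sketch, assuming the squared frame errors are averaged over the T frames:

```python
import numpy as np

def denoise_loss(X_c, X_n_enh):
    """Mean-square-error spectrogram denoising loss:
    L = (1/T) * sum_t ||x_c(t) - x_n*(t)||^2  over T frames of D mel bins."""
    assert X_c.shape == X_n_enh.shape
    return np.mean(np.sum((X_c - X_n_enh) ** 2, axis=1))

X_c = np.ones((100, 80))       # clean mel spectrogram, T=100 frames, D=80 bins
X_n_enh = np.zeros((100, 80))  # enhanced mel spectrogram from the U-Net
print(denoise_loss(X_c, X_n_enh))  # 80.0
```

The loss is zero exactly when the U-Net output matches the clean spectrogram frame by frame, which is the training target.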
step three: clean mel spectrogram X c Enhanced mel profile X n * Respectively inputting TDNN-CGAN, respectively extracting clean Mel spectrogram and enhancing depth characteristic E of Mel spectrogram by generator-Time Delay Neural Network (TDNN) in TDNN-CGAN c =G(X c )、E n =G(X n * ). Inputting the extracted depth features into a discriminant-multi-layer perceptron (MLP), G (-) and D (-) representing the outputs of the generator and discriminant, respectively, and generating a least squares solution into an countermeasure network (LSGAN)The expression of the discriminant loss is as follows:
network parameters of the fixed authentication network are used for enhancing depth characteristic E of the Mel spectrogram n The input discriminator calculates the generation loss in the countermeasure learning, and is used for training the generator TDNN so that the depth characteristic extracted from the noisy speech is more approximate to the depth characteristic of the clean speech, and the expression of the generation loss is as follows:
step four: depth feature E to be extracted from enhanced mel spectrogram n And inputting the speaker classifier, and training the speaker classifier through cross entropy loss to realize speaker recognition under a noise environment, namely robust speaker recognition.
The beneficial effects of the invention are as follows:
according to the invention, through spectrogram denoising and countermeasure learning, a U-Net Mel spectrogram enhancement network is adopted, and a joint training scheme of the countermeasure network TDNN-CGAN and the speaker classifier is generated based on the condition of a time delay neural network, so that the depth characteristic extracted from noisy voice is close to the depth characteristic extracted from clean voice, and the performance of the speaker recognition system in a noise environment is improved.
Drawings
Fig. 1 is a schematic diagram of the robust speaker recognition method based on spectrogram denoising and adversarial learning according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions and advantages more apparent. The specific embodiments described herein only illustrate the technical solution of the invention and are not to be construed as limiting it.
As shown in fig. 1, the invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a clean speech signal is mixed with a noise signal to obtain a noisy speech signal, and frequency-domain features (here, Mel spectrograms) of both signals are extracted. Second, the Mel spectrogram of the noisy speech is fed into a U-Net-based spectrogram enhancement network, which removes the noise interference and outputs an enhanced Mel spectrogram. The Mel spectrogram of the clean speech and the enhanced Mel spectrogram of the noisy speech are then fed into the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the generator encodes them into deep features of the clean and noisy speech signals. Adversarial learning with the discriminator then drives the enhanced deep features extracted from the noisy speech closer to the deep features of the clean speech. Finally, the enhanced deep features are fed into a speaker classifier, realizing speaker recognition in noisy environments.
The invention will be further illustrated by the following implementation steps.
Step one: first, voice activity detection (VAD) is applied to a segment of speaker speech to remove silence, and a 3 s segment is cut out as the clean speech. A 3 s noise segment taken at random from a noise database is added linearly to obtain a noisy copy of the clean speech. The clean speech and its noisy copy are then pre-emphasized, framed with a Hamming window, and Mel features are extracted, giving the Mel spectrogram of the clean speech X_c = [x_c(1), …, x_c(T)]^T ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n = [x_n(1), …, x_n(T)]^T ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms respectively, T is the number of speech frames, t ∈ {1, …, T}, the superscript T denotes the transpose, and D is the dimension of the feature vector.
Step two: the Mel spectrogram X_n of the noisy speech signal is the input of the U-Net-based spectrogram denoising network. The network has a multi-stage encoder-decoder structure: in the encoding stage, the input feature map passes through 5 convolution layers in sequence for feature compression, producing a hidden vector c; in the decoding stage, the hidden vector c passes through 5 deconvolution layers in sequence for feature reconstruction, producing the enhanced feature X_n* = Enhance(X_n), where Enhance(·) denotes the U-Net-based spectrogram denoising process that extracts the enhanced Mel spectrogram X_n* from the Mel spectrogram X_n of the noisy speech.
The convolution layers of the encoding stage are all 2-D convolutions, with input channel numbers 1, 16, 32 and 64 and output channel numbers 16, 32, 64 and 64, and each convolution layer has a 4×1 kernel. The layers of the decoding stage are all 2-D deconvolutions, with input channel numbers 64, 128, 64 and 32 and output channel numbers 64, 32, 16 and 1, and each deconvolution layer also has a 4×1 kernel. A PReLU activation function follows every convolution layer. In addition, each encoding layer is connected to its corresponding decoding layer, bypassing the feature compression performed in the middle of the model and passing the fine-grained information of the feature map directly to the decoding stage.
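The channel bookkeeping of the skip connections can be illustrated as follows. The patent does not state how encoder and decoder feature maps are combined; concatenation along the channel axis is a common U-Net choice and would explain a decoder layer seeing 128 input channels when the deepest encoder output has 64. The spatial sizes below are placeholders:

```python
import numpy as np

# Hypothetical shapes (channels, time, mel bins): the skip connection
# concatenates the homologous encoder feature map (64 ch) with the
# decoder's previous output (64 ch), giving a 128-channel decoder input.
enc_out = np.zeros((64, 50, 80))   # deepest encoder feature map
dec_prev = np.zeros((64, 50, 80))  # first decoder stage output
dec_in = np.concatenate([dec_prev, enc_out], axis=0)
print(dec_in.shape)  # (128, 50, 80)
```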
Step three: the clean Mel spectrogram X_c and the enhanced Mel spectrogram X_n* are fed into the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, giving the deep feature E_c of the clean Mel spectrogram and the deep feature E_n of the enhanced Mel spectrogram. In the encoding network, the input feature map passes through 4 1-D convolution layers in sequence; each of the 4 convolution layers is followed by batch normalization and a Dropout layer whose parameter p is set to 0.1. The output of the l-th convolution layer is F_l ∈ R^{T'×D'}, where l = 1, 2, 3, 4, T' denotes the number of frames and D' denotes the per-frame deep feature dimension.
Considering that in a neural network the deep features extracted by every layer contain information about the original input, the outputs of the convolution layers are added element-wise to aggregate the features, giving the aggregated feature F = F_1 + F_2 + F_3 + F_4 ∈ R^{T'×D'}. Statistics pooling is then applied to the aggregated feature: its mean and standard deviation over the frames are concatenated to give the utterance-level feature E_statistics ∈ R^{2D'}. Finally, a fully connected layer converts E_statistics into a fixed 256-dimensional vector, which is the deep feature extracted by the encoder.
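The aggregation and statistics-pooling step can be sketched with numpy. Only the element-wise addition and the concatenation of mean and standard deviation follow the text; the layer count of 4 is from the text, while T' = 200 and D' = 512 are illustrative assumptions:

```python
import numpy as np

def statistics_pooling(layer_outputs):
    """Aggregate per-layer TDNN features by element-wise addition, then
    pool over frames by concatenating mean and standard deviation,
    turning a (T', D') sequence into a fixed 2*D' utterance vector."""
    agg = np.sum(layer_outputs, axis=0)   # (T', D') aggregated feature F
    mean = agg.mean(axis=0)               # (D',) per-dimension mean
    std = agg.std(axis=0)                 # (D',) per-dimension std
    return np.concatenate([mean, std])    # (2*D',) utterance-level E_statistics

rng = np.random.default_rng(1)
layers = rng.standard_normal((4, 200, 512))  # 4 conv layers, T'=200, D'=512
e_stats = statistics_pooling(layers)
print(e_stats.shape)  # (1024,)
```

A final fully connected layer (not shown) would map this 2D'-dimensional vector to the fixed 256-dimensional deep feature.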
Step four: through adversarial learning, the discriminator is trained to correctly distinguish E_c from E_n; the discriminator is then fixed and the generator TDNN is trained. The discriminator consists of three fully connected layers, the last of which has 2 output nodes. The objective function for training the discriminator is L_D = ½ E[(D(E_c) − 1)²] + ½ E[(D(E_n))²].
step five: depth features extracted from the enhanced mel-spectrum are input to a speaker classifier composed of a fully connected layer and a Softmax layer, which is trained by cross entropy loss.
Step six:
(1) Following steps one and two, the noisy Mel spectrogram of the noisy speech is fed into the U-Net-based spectrogram denoising network to obtain the enhanced Mel spectrogram, and the mean-square error (MSE) against the Mel spectrogram of the clean speech is computed to train the U-Net-based spectrogram denoising network.
(2) Following steps three and four, the discriminator in the TDNN-CGAN is first trained with the discrimination loss; its network parameters are then fixed, the generation loss is computed, and the spectrogram enhancement network and the generator in the TDNN-CGAN are trained by back-propagation.
(3) Following step five, the cross-entropy loss is computed from the true speaker label of the utterance and the prediction of the speaker classifier, and the spectrogram enhancement network, the generator and the speaker classifier are trained by back-propagation.
(4) The above training alternates until the network loss converges; training is then stopped and the U-Net spectrogram enhancement network model, the generator network model of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the network model of the speaker classifier are saved.
Step seven: using the U-Net spectrogram enhancement network model, the generator network model of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN, and the speaker classifier model saved in step six, the speaker identity of noisy speech is recognized according to steps two, three and five, realizing speaker recognition in noisy environments.
To verify the performance of the speaker recognition method, 340 speakers are selected from the Aishell-1 dataset, with 40 utterances per speaker, each cut to a 3 s segment. For each speaker, 20 utterances serve as the clean speech of the training set; they are randomly mixed with noise from the Musan noise dataset at one of the signal-to-noise ratios 0, 5, 10, 15 and 20 dB to obtain noisy copies, and the clean speech together with the noisy copies forms the training set. The remaining 20 utterances are mixed with three different noise types from the Musan noise dataset, either without noise or at the five signal-to-noise ratios 0, 5, 10, 15 and 20 dB, yielding 16 test sets used to compute the speaker recognition accuracy; the higher the accuracy, the better the recognition performance of the speaker recognition system.
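The random mixing at a chosen signal-to-noise ratio used to build these training and test sets can be sketched as follows; the 16 kHz sampling rate and Gaussian stand-ins for speech and noise are assumptions for illustration:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the target SNR in dB, as when
    building noisy copies of clean utterances (e.g. Aishell-1 + Musan)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # choose a scale so that p_clean / (scale^2 * p_noise) = 10^(snr_db/10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(2)
clean = rng.standard_normal(48000)   # 3 s segment at 16 kHz
noise = rng.standard_normal(48000)
mixed = mix_at_snr(clean, noise, snr_db=0)
# at 0 dB the scaled noise power matches the clean-speech power
```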
In addition, a baseline speaker recognition system consisting of a TDNN and a speaker classifier is constructed and trained with a cross-entropy loss, and a second system consisting of the time-delay-neural-network-based conditional generative adversarial network TDNN-CGAN and a speaker classifier is constructed and trained with adversarial learning and a cross-entropy loss. The test results used to evaluate the effectiveness of the robust speaker recognition system based on feature enhancement and adversarial learning are shown in table 1.
As can be seen from table 1, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves 99.68% speaker recognition accuracy without added noise, and 89.72%, 92.50% and 91.47% under the three different noise types at a signal-to-noise ratio of 0 dB. The TDNN + speaker classifier achieves 98.24% without added noise and 79.63%, 83.82% and 86.29% under the three noise types at 0 dB; the TDNN-CGAN + speaker classifier achieves 99.12% without added noise and 86.93%, 90.11% and 87.64% under the three noise types at 0 dB.
With the same training data, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves higher speaker recognition accuracy than the two baseline systems in both the noise-free and the noisy environments, i.e. the best robust speaker recognition performance. The robust speaker recognition system based on spectrogram denoising and adversarial learning of the present invention is therefore effective.
TABLE 1 speaker recognition accuracy
The foregoing describes only preferred embodiments of the present invention in specific detail and is not to be construed as limiting its scope. Modifications, improvements and substitutions made by those skilled in the art without departing from the spirit of the invention fall within its scope, which is to be determined by the appended claims.

Claims (1)

1. A robust speaker recognition method based on spectrogram denoising and adversarial learning, characterized in that: a U-Net with a multi-stage encoder-decoder structure removes the noise interference from the Mel spectrogram of a noisy speech signal to obtain an enhanced Mel spectrogram; a conditional generative adversarial network TDNN-CGAN based on a time-delay neural network extracts the deep features of the enhanced Mel spectrogram; and the obtained deep features are input into a speaker classifier to identify the speaker;
the speaker recognition method comprises the following specific steps:
(1) noise is added to clean speech to obtain noisy speech, which is framed and windowed and from which Mel spectrograms are extracted, giving the Mel spectrogram of the clean speech X_c = [x_c(1), …, x_c(T)]^T ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n = [x_n(1), …, x_n(T)]^T ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms respectively, T represents the number of speech frames, t ∈ {1, …, T}, the superscript T denotes a transpose, and D represents the dimension of the per-frame Mel feature vector;
(2) X_n is input into the U-Net spectrogram enhancement network to obtain the enhanced Mel spectrogram X_n*; the U-Net spectrogram enhancement network is trained with a mean-square-error loss as the spectrogram enhancement loss, whose expression is L_MSE = (1/T) Σ_{t=1}^{T} ||x_n*(t) − x_c(t)||²;
(3) X_c and X_n* are respectively input into the TDNN-CGAN, and the generator in the TDNN-CGAN extracts the deep feature E_c = G(X_c) of X_c and the deep feature E_n = G(X_n*) of X_n*; E_c and E_n are respectively input into the discriminator in the TDNN-CGAN, which is trained with the discrimination loss of the least-squares generative adversarial network, whose expression is L_D = ½ E[(D(E_c) − 1)²] + ½ E[(D(E_n))²];
g (-) and D (-) represent the output of the generator and arbiter, respectively; fixing network parameters of the discriminator, will E n The input discriminator generates a generation loss training generator in the countermeasure network according to least squares, and the expression of the generation loss is as follows:
(4) E_c and E_n are input into the speaker classifier, which is trained with a cross-entropy loss to identify the speaker.
CN202310425824.2A 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning Active CN116469394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310425824.2A CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310425824.2A CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Publications (2)

Publication Number Publication Date
CN116469394A true CN116469394A (en) 2023-07-21
CN116469394B CN116469394B (en) 2025-09-02

Family

ID=87176661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310425824.2A Active CN116469394B (en) 2023-04-20 2023-04-20 A robust speaker recognition method based on spectrogram denoising and adversarial learning

Country Status (1)

Country Link
CN (1) CN116469394B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037843A (en) * 2023-09-11 2023-11-10 中南大学 A speech adversarial sample generation method, device, terminal equipment and medium
CN117194897A (en) * 2023-09-12 2023-12-08 上海交通大学 RFID-based speech perception method
CN118506792A (en) * 2024-07-18 2024-08-16 青岛科技大学 Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN
CN119207433A (en) * 2024-09-19 2024-12-27 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119811361A (en) * 2025-02-20 2025-04-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN119943057A (en) * 2025-01-07 2025-05-06 武汉大学 A method and system for voice adversarial defense against speaker recognition system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN114530156A (en) * 2022-02-25 2022-05-24 国家电网有限公司 Generation countermeasure network optimization method and system for short voice speaker confirmation
US20220208198A1 (en) * 2019-04-01 2022-06-30 Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN114822555A (en) * 2022-03-29 2022-07-29 南昌大学 A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
US20220208198A1 (en) * 2019-04-01 2022-06-30 Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN114187898A (en) * 2021-12-31 2022-03-15 电子科技大学 An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure
CN114530156A (en) * 2022-02-25 2022-05-24 国家电网有限公司 Generation countermeasure network optimization method and system for short voice speaker confirmation
CN114822555A (en) * 2022-03-29 2022-07-29 南昌大学 A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SNYDER, D et al.: "DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION", 2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 3 May 2017 (2017-05-03) *
ZHANG Jiacheng: "Speaker recognition based on multi-scale frequency-domain features and parallel neural networks", China Master's Theses Full-text Database, Information Science and Technology, 15 February 2023 (2023-02-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037843A (en) * 2023-09-11 2023-11-10 中南大学 A speech adversarial sample generation method, device, terminal equipment and medium
CN117194897A (en) * 2023-09-12 2023-12-08 上海交通大学 RFID-based speech perception method
CN117194897B (en) * 2023-09-12 2025-11-11 上海交通大学 Voice sensing method based on RFID
CN118506792A (en) * 2024-07-18 2024-08-16 青岛科技大学 Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN
CN118506792B (en) * 2024-07-18 2024-10-18 青岛科技大学 Marine mammal call data enhancement method based on improved Inception block and SACGAN
CN119207433A (en) * 2024-09-19 2024-12-27 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119207433B (en) * 2024-09-19 2025-10-10 华中师范大学 Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios
CN119943057A (en) * 2025-01-07 2025-05-06 武汉大学 A method and system for voice adversarial defense against speaker recognition system
CN119811361A (en) * 2025-02-20 2025-04-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN119811361B (en) * 2025-02-20 2025-10-10 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium

Also Published As

Publication number Publication date
CN116469394B (en) 2025-09-02

Similar Documents

Publication Publication Date Title
CN116469394B (en) A robust speaker recognition method based on spectrogram denoising and adversarial learning
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
Yang et al. Characterizing speech adversarial examples using self-attention u-net enhancement
Zhang et al. X-TaSNet: Robust and accurate time-domain speaker extraction network
CN108447495B (en) A Deep Learning Speech Enhancement Method Based on Comprehensive Feature Set
Do et al. Speech source separation using variational autoencoder and bandpass filter
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
CN114360571B (en) Reference-based speech enhancement method
CN118212929A (en) A personalized Ambisonics speech enhancement method
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
Chao et al. Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR
Tu et al. DNN training based on classic gain function for single-channel speech enhancement and recognition
Wu et al. A fused speech enhancement framework for robust speaker verification
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Le et al. Personalized speech enhancement combining band-split rnn and speaker attentive module
TWI749547B (en) Speech enhancement system based on deep learning
Al-Ali et al. Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions
CN120319258A (en) A speech enhancement method based on bispectral nonlinear feature coupling
CN111681649B (en) Speech recognition method, interactive system and performance management system including the system
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
CN116648747A (en) Device for providing a processed audio signal, method for providing a processed audio signal, device for providing neural network parameters and method for providing neural network parameters
Lan et al. Embedding encoder-decoder with attention mechanism for monaural speech enhancement
CN116229992A (en) A voice lie detection method, device, medium and equipment
Li et al. Aligning noisy-clean speech pairs at feature and embedding levels for learning noise-invariant speaker representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant