CN116469394A - A Robust Speaker Recognition Method Based on Spectrogram Denoising and Adversarial Learning - Google Patents
A Robust Speaker Recognition Method Based on Spectrogram Denoising and Adversarial Learning
- Publication number
- CN116469394A CN116469394A CN202310425824.2A CN202310425824A CN116469394A CN 116469394 A CN116469394 A CN 116469394A CN 202310425824 A CN202310425824 A CN 202310425824A CN 116469394 A CN116469394 A CN 116469394A
- Authority
- CN
- China
- Prior art keywords
- spectrogram
- speaker
- mel
- network
- tdnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms obtained by adding noise to the clean speech are collected. A U-shaped network (U-Net) with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to remove the noise interference from the Mel spectrogram of the noisy speech signal, yielding an enhanced Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is then trained with a least-squares loss function: the time-delay neural network (TDNN) serves as the generator in the TDNN-CGAN, extracting deep features from the enhanced Mel spectrogram, and a multi-layer perceptron (MLP) serves as the discriminator. Finally, a speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments. The deep features extracted from noisy speech are driven close to those extracted from clean speech, which improves the performance of the speaker recognition system in noisy environments.
Description
Technical Field
The invention belongs to the technical field of speech processing and relates to a robust speaker recognition method based on spectrogram denoising and adversarial learning.
Background
In real environments, the speech fed to a speaker recognition system is often corrupted by background noise and reverberation. Noise added to clean speech obscures acoustic detail and reduces speech intelligibility and quality, degrading the performance of the speaker recognition system. Common approaches to improving robustness either train the system on a dataset containing both clean and noisy data, or add a speech-enhancement front end, where speech enhancement refers to extracting the useful speech signal from a noise-corrupted background. However, speech enhancement can introduce distortion and may even reduce recognition performance. Because neural networks have strong feature-extraction capability, they can instead be used to extract noise-free frequency-domain features directly from the frequency-domain features of noise-corrupted speech. In addition, generative adversarial networks (GANs) are currently widely studied and have been applied to many speech- and audio-related tasks, mainly domain conversion and generating more realistic data distributions, so GANs hold potential for extracting noise-robust features.
Disclosure of Invention
The invention aims to provide a robust speaker recognition method based on spectrogram denoising and adversarial learning, so as to solve the problems described in the background section.
First, a dataset of Mel spectrograms of clean speech and a dataset of noisy Mel spectrograms obtained by adding noise to the clean speech are collected. A U-Net with a multi-stage encoder-decoder structure is trained with a mean-square-error loss function to extract an enhanced Mel spectrogram from the noisy Mel spectrogram. A conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) is trained with a least-squares loss function, with a multi-layer perceptron (MLP) as the discriminator and the time-delay neural network (TDNN) as the generator that extracts deep features from the enhanced Mel spectrogram. Finally, an MLP-based speaker classifier is trained with a cross-entropy loss to identify the speaker, realizing speaker recognition in noisy environments.
The specific steps of the speaker recognition method are as follows:
step one: will clean the speech s c Adding noise n to obtain noisy speech s n =s c +n, using Hamming window to pass clean speech s c Noise-containing speech s n Dividing into short frames, extracting Mel eigenvectors from each frame, and respectively forming two eigenvectors:wherein x is c (t)、x n (T) Mel feature vectors respectively representing the T-th frame of a Mel spectrogram of clean, noisy speech, T represents the number of speech frames, T e 1, a., T, the superscript T denotes the transpose and D denotes the dimension of the feature vector.
Step two: Input the noisy Mel spectrogram X_n of the noisy speech into the U-Net with its multi-stage encoder-decoder structure to obtain X_n* = Enhance(X_n), where Enhance(·) denotes extracting the enhanced Mel spectrogram X_n* from the noisy Mel spectrogram X_n. The U-Net is trained with a mean-square-error loss as the spectrogram denoising loss:

$$L_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \lVert x_c(t) - x_n^*(t) \rVert_2^2,$$

where x_n^*(t) is the t-th frame of X_n*.
step three: clean mel spectrogram X c Enhanced mel profile X n * Respectively inputting TDNN-CGAN, respectively extracting clean Mel spectrogram and enhancing depth characteristic E of Mel spectrogram by generator-Time Delay Neural Network (TDNN) in TDNN-CGAN c =G(X c )、E n =G(X n * ). Inputting the extracted depth features into a discriminant-multi-layer perceptron (MLP), G (-) and D (-) representing the outputs of the generator and discriminant, respectively, and generating a least squares solution into an countermeasure network (LSGAN)The expression of the discriminant loss is as follows:
network parameters of the fixed authentication network are used for enhancing depth characteristic E of the Mel spectrogram n The input discriminator calculates the generation loss in the countermeasure learning, and is used for training the generator TDNN so that the depth characteristic extracted from the noisy speech is more approximate to the depth characteristic of the clean speech, and the expression of the generation loss is as follows:
step four: depth feature E to be extracted from enhanced mel spectrogram n And inputting the speaker classifier, and training the speaker classifier through cross entropy loss to realize speaker recognition under a noise environment, namely robust speaker recognition.
The beneficial effects of the invention are as follows:
Through spectrogram denoising and adversarial learning, the invention combines a U-Net Mel-spectrogram enhancement network with a joint training scheme for the TDNN-based conditional generative adversarial network TDNN-CGAN and the speaker classifier, so that the deep features extracted from noisy speech approach those extracted from clean speech, improving the performance of the speaker recognition system in noisy environments.
Drawings
Fig. 1 is a schematic diagram of the robust speaker recognition method based on spectrogram denoising and adversarial learning according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only illustrate the technical solution of the invention and are not to be construed as limiting it.
As shown in fig. 1, the invention provides a robust speaker recognition method based on spectrogram denoising and adversarial learning. First, a clean speech signal is mixed with a noise signal to obtain a noisy speech signal, and frequency-domain features (here, Mel spectrograms) are extracted from both. Second, the Mel spectrogram extracted from the noisy speech is input to a U-Net-based spectrogram enhancement network, which removes the noise interference and yields an enhanced Mel spectrogram. The Mel spectrogram of the clean speech and the enhanced Mel spectrogram of the noisy speech are then input to the TDNN-based conditional generative adversarial network TDNN-CGAN, whose generator encodes them into the deep features of the clean and noisy speech signals. Adversarial learning with the discriminator then drives the enhanced deep features extracted from the noisy speech closer to the deep features of the clean speech. Finally, the enhanced deep features are input to the speaker classifier, realizing speaker recognition in a noisy environment.
The invention will be further illustrated by the following implementation steps.
Step one: Voice activity detection (VAD) is first applied to a segment of speaker speech to remove silent sections, and a 3 s segment is kept as clean speech. A 3 s noise segment is drawn at random from a noise database and added linearly to produce a noisy copy of the clean speech. The clean speech and the noisy copy are then pre-emphasized, framed with a Hamming window, and Mel features are extracted, giving the Mel spectrogram of the clean speech X_c ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms respectively, T is the number of speech frames, t ∈ {1, ..., T}, the superscript T denotes transpose, and D is the dimension of each feature vector.
Step two: meier spectrogram X containing noise of voice signal n Input as a U-net based spectrogram denoising networkThe network has a multistage coding-decoding structure, in the coding stage, an input characteristic graph firstly sequentially passes through 5 layers of convolution layers to perform characteristic compression to obtain a hidden layer vector c, in the decoding stage, the hidden layer vector c sequentially passes through 5 layers of deconvolution layers to perform characteristic reconstruction to obtain an enhanced characteristic X n * =Enhance(X n ) Wherein enhancement (·) represents a U-net based spectral denoising process from a mel spectrum X of noisy speech n Extraction enhancement mel spectrogram X n * 。
The convolutional layers of the encoding stage all use 2-D convolutions, with input channel counts 1, 16, 32 and 64 and output channel counts 16, 32, 64 and 64, and each convolutional layer has a 4×1 kernel. The layers of the decoding stage are all 2-D deconvolutions, with input channel counts 64, 128, 64 and 32 and output channel counts 64, 32, 16 and 1, and each deconvolution layer likewise has a 4×1 kernel. A PReLU activation follows every convolutional layer. Each encoding layer is also connected to its corresponding decoding layer, bypassing the feature compression performed in the middle of the model and passing the fine-grained information of the feature map directly to the decoding stage.
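A minimal PyTorch sketch of this encoder-decoder. The channel counts reproduce those listed above (the text says five layers but lists four channel pairs, so four stages are shown); the stride-2 downsampling along frequency and the exact placement of the skip concatenations are assumptions, chosen so that the listed decoder input widths (64, 128, 64, 32) work out:

```python
import torch
import torch.nn as nn

class UNetDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        enc_ch = [(1, 16), (16, 32), (32, 64), (64, 64)]
        dec_ch = [(64, 64), (128, 32), (64, 16), (32, 1)]
        # 4x1 kernels compress/expand along frequency only; PReLU after each layer.
        self.enc = nn.ModuleList(
            nn.Sequential(nn.Conv2d(i, o, (4, 1), stride=(2, 1), padding=(1, 0)), nn.PReLU())
            for i, o in enc_ch)
        self.dec = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose2d(i, o, (4, 1), stride=(2, 1), padding=(1, 0)), nn.PReLU())
            for i, o in dec_ch)

    def forward(self, x):          # x: (batch, 1, D, T); D divisible by 16 in this sketch
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        # Decoder input widths 64, 128, 64, 32 match the text when the first
        # decoder stage takes the bottleneck alone and later stages concatenate
        # the mirror-image encoder output (the skip connection).
        x = self.dec[0](x)
        for layer, s in zip(self.dec[1:], reversed(skips[:-1])):
            x = layer(torch.cat([x, s], dim=1))
        return x
```

The stride-2 convolution and its transpose are exact inverses for even frequency sizes, so the output spectrogram has the same shape as the input.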
Step three: The clean Mel spectrogram X_c and the enhanced Mel spectrogram X_n* are input to the TDNN-based conditional generative adversarial network TDNN-CGAN, yielding the deep features E_c of the clean Mel spectrogram and E_n of the enhanced Mel spectrogram respectively. In the encoding network, the input feature map passes through four 1-D convolutional layers in sequence; each convolutional layer is followed by batch normalization and a Dropout layer whose parameter p is set to 0.1. The output of the l-th convolutional layer (l = 1, 2, 3, 4) is a T'×D' feature map, where T' denotes the number of frames and D' the depth-feature dimension per frame.
Considering that in a neural network the deep features extracted by each layer contain information about the original input, the outputs of the four convolutional layers are summed element-wise (linear addition) to achieve feature aggregation. The aggregated features are then passed through statistics pooling: their mean and standard deviation over time are concatenated to obtain an utterance-level feature E_statistics. Finally, a fully connected layer converts E_statistics into a fixed 256-dimensional vector, which serves as the deep feature extracted by the encoder.
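The generator encoder described here can be sketched in PyTorch: four 1-D convolutional (TDNN) layers with batch normalization and Dropout(p = 0.1), linear addition of the layer outputs for feature aggregation, statistics pooling (mean and standard deviation over time), and a fully connected projection to 256 dimensions. The hidden width of 128 and the kernel size of 3 are assumptions; the patent fixes only the layer count, the Dropout rate and the final 256 dimensions:

```python
import torch
import torch.nn as nn

class TDNNEncoder(nn.Module):
    def __init__(self, in_dim=64, hid=128, emb_dim=256, p=0.1):
        super().__init__()
        self.layers = nn.ModuleList()
        d = in_dim
        for _ in range(4):
            # 1-D conv over time with batch norm and Dropout, as described.
            self.layers.append(nn.Sequential(
                nn.Conv1d(d, hid, kernel_size=3, padding=1),
                nn.BatchNorm1d(hid),
                nn.Dropout(p)))
            d = hid
        self.fc = nn.Linear(2 * hid, emb_dim)

    def forward(self, x):                     # x: (batch, D, T)
        agg = 0
        for layer in self.layers:
            x = layer(x)
            agg = agg + x                     # linear addition = feature aggregation
        # Statistics pooling: concatenate mean and std over the time axis.
        stats = torch.cat([agg.mean(dim=2), agg.std(dim=2)], dim=1)
        return self.fc(stats)                 # fixed 256-dim deep feature
```

Feeding a Mel spectrogram of any length T produces the same fixed-size embedding, which is what allows E_c and E_n to be compared by the discriminator.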
Step four: through countermeasure learning, the discriminators are trained to correctly discriminate E c And E is n . Then fixing a discriminant, training a generator TDNN, wherein the discriminant consists of three fully connected layers, and the last layer is provided with 2 output nodes, and the objective function of the training discriminant is expressed as:
step five: depth features extracted from the enhanced mel-spectrum are input to a speaker classifier composed of a fully connected layer and a Softmax layer, which is trained by cross entropy loss.
Step six:
(1) Following steps one and two, the noisy Mel spectrogram of the noisy speech is input to the U-Net-based spectrogram denoising network to obtain the enhanced Mel spectrogram, and the mean square error (MSE) loss against the Mel spectrogram of the clean speech is computed to train the U-Net-based spectrogram denoising network.
(2) Following steps three and four, the discriminator in the TDNN-CGAN is first trained with the discrimination loss; the discriminator's network parameters are then fixed, the generation loss is computed, and the spectrogram enhancement network and the generator in the TDNN-CGAN are trained by back-propagation.
(3) Following step five, the cross-entropy loss is computed from the speaker's ground-truth label for the utterance and the prediction of the speaker classifier, and the spectrogram enhancement network, the generator and the speaker classifier are trained by back-propagation.
(4) Training alternates in this way until the network loss converges, at which point training stops and the U-Net spectrogram enhancement network model, the generator network model of the TDNN-based conditional generative adversarial network TDNN-CGAN, and the speaker classifier network model are saved.
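One alternating update of stages (1)-(3) above can be sketched in PyTorch. The module and optimizer names are placeholders, and the equal weighting of the three losses is an assumption; the patent only states that the discriminator is trained first and that the enhancement network, generator and classifier are then updated jointly by back-propagation:

```python
import torch
import torch.nn.functional as F

def train_step(unet, tdnn, disc, clf, opt_d, opt_g, x_clean, x_noisy, labels):
    # (1) Discriminator update: distinguish clean-speech features E_c from
    # enhanced features E_n (the generator is held fixed via no_grad).
    with torch.no_grad():
        e_c = tdnn(x_clean)
        e_n = tdnn(unet(x_noisy))
    d_loss = 0.5 * ((disc(e_c) - 1) ** 2).mean() + 0.5 * (disc(e_n) ** 2).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (2)+(3) Joint update of U-Net, TDNN generator and classifier; the
    # discriminator's parameters are left out of opt_g, i.e. held fixed.
    x_enh = unet(x_noisy)
    e_n = tdnn(x_enh)
    mse = F.mse_loss(x_enh, x_clean)               # spectrogram denoising loss
    g_adv = 0.5 * ((disc(e_n) - 1) ** 2).mean()    # LSGAN generation loss
    ce = F.cross_entropy(clf(e_n), labels)         # speaker classification loss
    g_loss = mse + g_adv + ce                      # equal weighting: an assumption
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

`opt_d` optimizes only the discriminator's parameters and `opt_g` only those of the U-Net, the TDNN and the classifier, which realizes the "fix the discriminator, then train the rest" alternation without any manual freezing.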
Step seven: Using the U-Net spectrogram enhancement network model, the generator network model of the TDNN-based conditional generative adversarial network TDNN-CGAN and the speaker classifier network model saved in step six, the speaker identity of noisy speech is recognized according to steps two, three and five, realizing speaker recognition in a noisy environment.
To verify the performance of the speaker recognition method, 340 speakers are selected from the Aishell-1 dataset with 40 utterances each, every utterance cut to a 3 s segment. For each speaker, 20 utterances serve as the clean speech of the training set and are mixed at random with noise from the Musan noise dataset at one of the signal-to-noise ratios 0, 5, 10, 15 and 20 dB to obtain noisy copies; the clean speech and the noisy copies together form the training set. The remaining 20 utterances form the test sets: one clean set plus mixtures with three different Musan noise types at each of the five signal-to-noise ratios 0, 5, 10, 15 and 20 dB, giving 16 test sets in total. These are used to compute the speaker recognition accuracy; the higher the accuracy, the better the recognition performance of the speaker recognition system.
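The construction of a noisy copy at a chosen signal-to-noise ratio can be sketched as follows; the power-ratio scaling used here is the conventional definition of SNR, which the text does not spell out:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise segment so that the clean/noise power ratio
    # matches the requested SNR (in dB), then add it linearly.
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Applying this at each of 0, 5, 10, 15 and 20 dB to a 3 s clean segment and a random 3 s noise segment reproduces the noisy-copy construction described above.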
In addition, two baselines are constructed: a speaker recognition system consisting of a TDNN network and a speaker classifier, trained with a cross-entropy loss; and a speaker recognition system consisting of the TDNN-based conditional generative adversarial network TDNN-CGAN and a speaker classifier, trained with adversarial learning and a cross-entropy loss. The test results evaluating the effectiveness of the robust speaker recognition system based on feature enhancement and adversarial learning are shown in Table 1.
As can be seen from Table 1, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves 99.68% speaker recognition accuracy without added noise, and 89.72%, 92.50% and 91.47% under the three different noise types at 0 dB signal-to-noise ratio. The TDNN + speaker classifier achieves 98.24% without added noise and 79.63%, 83.82% and 86.29% under the three noise types at 0 dB; the TDNN-CGAN + speaker classifier achieves 99.12% without added noise and 86.93%, 90.11% and 87.64% under the three noise types at 0 dB.
With the same training data, the robust speaker recognition method based on spectrogram denoising and adversarial learning (U-Net spectrogram enhancement network + TDNN-CGAN + speaker classifier) achieves higher speaker recognition accuracy than both baselines in noise-free and noisy environments, obtaining the best robust speaker recognition performance. The robust speaker recognition system based on spectrogram denoising and adversarial learning of the present invention is therefore effective.
TABLE 1. Speaker recognition accuracy
The foregoing describes only preferred embodiments of the present invention in specific detail and is not to be construed as limiting its scope. Modifications, improvements and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and all such changes fall within its scope. Accordingly, the scope of protection of the present invention is determined by the appended claims.
Claims (1)
1. A robust speaker recognition method based on spectrogram denoising and adversarial learning, characterized in that: a U-shaped network (U-Net) with a multi-stage encoder-decoder structure removes the noise interference from the Mel spectrogram of a noisy speech signal, yielding an enhanced Mel spectrogram; a conditional generative adversarial network based on a time-delay neural network (TDNN-CGAN) extracts the deep features of the enhanced Mel spectrogram; and the resulting deep features are input to a speaker classifier to identify the speaker;
the speaker recognition method comprises the following specific steps:
(1) Noise is added to clean speech to obtain noisy speech; after framing and windowing, Mel spectrograms are extracted, giving the Mel spectrogram of the clean speech X_c ∈ R^{T×D} and the Mel spectrogram of the noisy speech X_n ∈ R^{T×D}, where x_c(t) and x_n(t) are the Mel feature vectors of the t-th frame of the clean and noisy Mel spectrograms, T is the number of speech frames, t ∈ {1, ..., T}, the superscript T denotes transpose, and D is the dimension of the per-frame Mel feature vector;
(2) X_n is input to the U-Net spectrogram enhancement network to obtain the enhanced Mel spectrogram X_n*; the U-Net spectrogram enhancement network is trained with a mean-square-error loss as the spectrogram enhancement loss:

$$L_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \lVert x_c(t) - x_n^*(t) \rVert_2^2;$$
(3) X_c and X_n* are input to the TDNN-CGAN; the generator in the TDNN-CGAN extracts the deep feature E_c = G(X_c) of X_c and the deep feature E_n = G(X_n*) of X_n*; E_c and E_n are input to the discriminator in the TDNN-CGAN, which is trained with the discrimination loss of the least-squares generative adversarial network:

$$L_D = \frac{1}{2}\,\mathbb{E}\big[(D(E_c) - 1)^2\big] + \frac{1}{2}\,\mathbb{E}\big[D(E_n)^2\big];$$
g (-) and D (-) represent the output of the generator and arbiter, respectively; fixing network parameters of the discriminator, will E n The input discriminator generates a generation loss training generator in the countermeasure network according to least squares, and the expression of the generation loss is as follows:
(4) E_c and E_n are input to the speaker classifier, which is trained with a cross-entropy loss to identify the speaker.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310425824.2A CN116469394B (en) | 2023-04-20 | 2023-04-20 | A robust speaker recognition method based on spectrogram denoising and adversarial learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116469394A true CN116469394A (en) | 2023-07-21 |
| CN116469394B CN116469394B (en) | 2025-09-02 |
Family
ID=87176661
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310425824.2A Active CN116469394B (en) | 2023-04-20 | 2023-04-20 | A robust speaker recognition method based on spectrogram denoising and adversarial learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116469394B (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117037843A (en) * | 2023-09-11 | 2023-11-10 | 中南大学 | A speech adversarial sample generation method, device, terminal equipment and medium |
| CN117194897A (en) * | 2023-09-12 | 2023-12-08 | 上海交通大学 | RFID-based speech perception method |
| CN118506792A (en) * | 2024-07-18 | 2024-08-16 | 青岛科技大学 | Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN |
| CN119207433A (en) * | 2024-09-19 | 2024-12-27 | 华中师范大学 | Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios |
| CN119811361A (en) * | 2025-02-20 | 2025-04-11 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
| CN119943057A (en) * | 2025-01-07 | 2025-05-06 | 武汉大学 | A method and system for voice adversarial defense against speaker recognition system |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102034472A (en) * | 2009-09-28 | 2011-04-27 | 戴红霞 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
| CN110619885A (en) * | 2019-08-15 | 2019-12-27 | 西北工业大学 | Method for generating confrontation network voice enhancement based on deep complete convolution neural network |
| CN112992157A (en) * | 2021-02-08 | 2021-06-18 | 贵州师范大学 | Neural network noisy line identification method based on residual error and batch normalization |
| CN114187898A (en) * | 2021-12-31 | 2022-03-15 | 电子科技大学 | An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure |
| CN114530156A (en) * | 2022-02-25 | 2022-05-24 | 国家电网有限公司 | Generation countermeasure network optimization method and system for short voice speaker confirmation |
| US20220208198A1 (en) * | 2019-04-01 | 2022-06-30 | Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University | Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments |
| CN114822555A (en) * | 2022-03-29 | 2022-07-29 | 南昌大学 | A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102034472A (en) * | 2009-09-28 | 2011-04-27 | 戴红霞 | Speaker recognition method based on Gaussian mixture model embedded with time delay neural network |
| US20220208198A1 (en) * | 2019-04-01 | 2022-06-30 | Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University | Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments |
| CN110619885A (en) * | 2019-08-15 | 2019-12-27 | 西北工业大学 | Method for generating confrontation network voice enhancement based on deep complete convolution neural network |
| CN112992157A (en) * | 2021-02-08 | 2021-06-18 | 贵州师范大学 | Neural network noisy line identification method based on residual error and batch normalization |
| CN114187898A (en) * | 2021-12-31 | 2022-03-15 | 电子科技大学 | An End-to-End Speech Recognition Method Based on Fusion Neural Network Structure |
| CN114530156A (en) * | 2022-02-25 | 2022-05-24 | 国家电网有限公司 | Generation countermeasure network optimization method and system for short voice speaker confirmation |
| CN114822555A (en) * | 2022-03-29 | 2022-07-29 | 南昌大学 | A Speaker Recognition Method Based on Cross-Gated Parallel Convolutional Networks |
Non-Patent Citations (2)
| Title |
|---|
| SNYDER, D. et al.: "DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION", 2016 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2016), 3 May 2017 (2017-05-03) * |
| ZHANG Jiacheng (张嘉诚): "Speaker Recognition Based on Multi-Scale Frequency-Domain Features and Parallel Neural Networks", China Master's Theses Full-text Database, Information Science and Technology, 15 February 2023 (2023-02-15) * |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117037843A (en) * | 2023-09-11 | 2023-11-10 | 中南大学 | A speech adversarial sample generation method, device, terminal equipment and medium |
| CN117194897A (en) * | 2023-09-12 | 2023-12-08 | 上海交通大学 | RFID-based speech perception method |
| CN117194897B (en) * | 2023-09-12 | 2025-11-11 | 上海交通大学 | Voice sensing method based on RFID |
| CN118506792A (en) * | 2024-07-18 | 2024-08-16 | 青岛科技大学 | Marine mammal sound data enhancement method based on improved Inception blocks and SACGAN |
| CN118506792B (en) * | 2024-07-18 | 2024-10-18 | 青岛科技大学 | Marine mammal call data enhancement method based on improved Inception block and SACGAN |
| CN119207433A (en) * | 2024-09-19 | 2024-12-27 | 华中师范大学 | Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios |
| CN119207433B (en) * | 2024-09-19 | 2025-10-10 | 华中师范大学 | Domain-adaptive speaker verification method with self-supervised adversarial training in complex scenarios |
| CN119943057A (en) * | 2025-01-07 | 2025-05-06 | 武汉大学 | A method and system for voice adversarial defense against speaker recognition system |
| CN119811361A (en) * | 2025-02-20 | 2025-04-11 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
| CN119811361B (en) * | 2025-02-20 | 2025-10-10 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116469394B (en) | 2025-09-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN116469394B (en) | A robust speaker recognition method based on spectrogram denoising and adversarial learning | |
| Lin et al. | Speech enhancement using multi-stage self-attentive temporal convolutional networks | |
| Yang et al. | Characterizing speech adversarial examples using self-attention u-net enhancement | |
| Zhang et al. | X-TaSNet: Robust and accurate time-domain speaker extraction network | |
| CN108447495B (en) | A Deep Learning Speech Enhancement Method Based on Comprehensive Feature Set | |
| Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
| Mun et al. | The sound of my voice: Speaker representation loss for target voice separation | |
| Ganapathy | Multivariate autoregressive spectrogram modeling for noisy speech recognition | |
| CN114360571B (en) | Reference-based speech enhancement method | |
| CN118212929A (en) | A personalized Ambisonics speech enhancement method | |
| Do et al. | Speech Separation in the Frequency Domain with Autoencoder. | |
| Chao et al. | Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR | |
| Tu et al. | DNN training based on classic gain function for single-channel speech enhancement and recognition | |
| Wu et al. | A fused speech enhancement framework for robust speaker verification | |
| CN113066483B (en) | Sparse continuous constraint-based method for generating countermeasure network voice enhancement | |
| Le et al. | Personalized speech enhancement combining band-split rnn and speaker attentive module | |
| TWI749547B (en) | Speech enhancement system based on deep learning | |
| Al-Ali et al. | Enhanced forensic speaker verification using multi-run ICA in the presence of environmental noise and reverberation conditions | |
| CN120319258A (en) | A speech enhancement method based on bispectral nonlinear feature coupling | |
| CN111681649B (en) | Speech recognition method, interactive system and performance management system including the system | |
| Singh et al. | Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition | |
| CN116648747A (en) | Device for providing a processed audio signal, method for providing a processed audio signal, device for providing neural network parameters and method for providing neural network parameters | |
| Lan et al. | Embedding encoder-decoder with attention mechanism for monaural speech enhancement | |
| CN116229992A (en) | A voice lie detection method, device, medium and equipment | |
| Li et al. | Aligning noisy-clean speech pairs at feature and embedding levels for learning noise-invariant speaker representations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |