
CN111583954B - Speaker independent single-channel voice separation method - Google Patents

Speaker independent single-channel voice separation method Download PDF

Info

Publication number
CN111583954B
CN111583954B
Authority
CN
China
Prior art keywords
training
real
speech
model
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010401151.3A
Other languages
Chinese (zh)
Other versions
CN111583954A (en
Inventor
张文
宋君强
任开军
李小勇
邓科峰
周翱隆
汪祥
任小丽
邵成成
吴国溧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010401151.3A priority Critical patent/CN111583954B/en
Publication of CN111583954A publication Critical patent/CN111583954A/en
Application granted granted Critical
Publication of CN111583954B publication Critical patent/CN111583954B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speaker-independent single-channel speech separation method, which comprises the following steps: preparing a data set and performing data preprocessing; establishing a monaural speech separation model based on complex ideal floating value masking; adopting sentence-level permutation invariance training when training the monaural speech separation model; and inputting mixed speech data into the trained model for speech separation. The method effectively and accurately estimates the complex ideal floating value mask through sentence-level permutation invariance training: a bidirectional long short-term memory neural network structure is used to estimate the complex ideal floating value mask, and the criterion of sentence-level permutation invariance training is further used to solve the label ambiguity problem, so that single-channel speech separation achieves a better separation effect.

Description

Speaker independent single-channel voice separation method
Technical Field
The invention belongs to the technical field of intelligent voice processing, and particularly relates to speaker independent single-channel voice separation based on sentence-level permutation invariance training and complex ideal floating value masking.
Background
The objective of the speech source separation task is to extract a plurality of speech source signals from a mixed speech signal containing two or more speech sources, one for each speaker. In general, the speech separation problem can be divided into mono (i.e., single channel) and array-based (i.e., multi-channel) source separation problems, depending on the number of microphones or channels. For the former problem, the mainstream research method is to extract the target voice or remove the interference signal from the mixed signal based on the acoustic characteristics and statistical characteristics of the target voice and the interference signal. In the multi-channel speech separation problem, spatial information is available in addition to the acoustic and statistical properties of the signal. The mono speech separation problem remains very challenging because only one speech recording is available and the spatial information that can be extracted is very limited.
Since the 1990s, researchers have developed many approaches to the monaural speech separation problem. Before the deep learning era, classical single-channel speech separation methods could be divided into three categories: model-based methods, blind source separation (BSS) methods and computational auditory scene analysis (CASA) methods. However, these methods have limited effectiveness in processing the sound sources of multi-source mixed speech captured in real environments, owing to numerous difficulties including the wide variety of noise in mixed speech, low signal-to-noise-ratio conditions, and limited computational resources. Therefore, in a real environment it is difficult to consistently obtain a high-quality target speech signal with the above methods.
Recently, researchers have used regression models in deep neural networks (DNN) to solve the source separation problem and, particularly for the monaural case, have achieved very good performance gains. Depending on the training target, DNN-based monaural source separation methods can be divided into three categories: masking-based methods, mapping-based methods and signal approximation (SA)-based methods. In practice, masking-based approaches can be trained to yield more accurate neural network models than mapping-based approaches.
The first masking-based training target applied in supervised speech separation was the ideal binary mask (IBM), which was inspired by the masking effect of sound and the exclusive allocation principle in auditory scene analysis. Many researchers have used the IBM as a training target and achieved good speech separation results. However, because the IBM takes only the values 0 or 1 in each time-frequency (T-F) unit, this hard decision is not flexible enough, and the speech signals separated by IBM-based methods are distorted. To address this, researchers proposed the ideal floating value mask (IRM), which sets the value of each T-F unit to the ratio of the target source energy to the mixed speech energy, to improve on the IBM. The target speech separated using IRM-based methods is generally of better quality than that obtained with the IBM.
Although these DNN-based methods achieve good performance, both the IBM and the IRM use only the magnitude information of the target signal when separating and synthesizing clean speech signals, because earlier studies considered the phase spectrum unimportant for speech separation. However, recent studies by Erdogan et al. found that phase information is beneficial for predicting accurate masks and signal estimates, and they proposed a phase-sensitive masking (PSM) approach that is significantly better than the IBM and IRM. In addition, Williamson et al. estimated the complex ideal floating value mask (cIRM) using both magnitude and phase spectral information in the complex domain.
In the speech separation task, if neither the target speaker nor the interfering speakers are allowed to change between the training data and the test data, the task is speaker-dependent speech separation; if the target speaker is fixed but the interferers are allowed to change, it is called target-dependent speech separation. Similarly, if the speakers are not required to be the same between the training data and the test data, it is called speaker-independent speech separation, which is the least constrained case. The label ambiguity (or permutation) problem is the main cause of the poor performance of speaker-independent speech separation algorithms in prior studies. In a speaker-independent scenario, the speech separation model has multiple outputs, each representing one sound source. When several speakers talk at the same time and their voices overlap, assigning the separated speech components to the correct sound sources is a difficult problem; researchers proposed the permutation invariant training (PIT) model to solve this problem and achieved good results.
Disclosure of Invention
In view of the above, the present invention aims to provide a speaker-independent single-channel speech separation method based on sentence-level permutation invariance training (uPIT) and complex ideal floating value masking (cIRM). The method effectively and accurately estimates the cIRM through sentence-level permutation invariance training: specifically, it employs a bidirectional long short-term memory recurrent neural network (Bi-LSTM RNN) structure to estimate the complex ideal floating value mask cIRM, and further uses the criterion of sentence-level permutation invariance training (uPIT) to solve the label ambiguity problem.
To achieve this purpose, the invention adopts the following technical scheme. The speaker-independent single-channel speech separation method comprises the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a monaural speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the monaural speech separation model;
and step 4, inputting the mixed speech into the trained model for speech separation.
Specifically, the data set in step 1 is the WSJ0-2mix data set, which includes a training set, a validation set and a test set. Two speakers are randomly selected from the WSJ0 training set si_tr_s, and sentences are randomly selected from the recordings of the two speakers and mixed; the signal-to-noise ratio of the two sentences during mixing ranges from 0 dB to 5 dB, with the specific value chosen at random. All speech data are preprocessed by short-time Fourier transform to obtain 129-dimensional complex spectra.
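As a concrete illustration, the preprocessing described above can be sketched as follows. This is a minimal example assuming librosa and numpy are used; the wav file paths and the mixing routine are illustrative, not part of the patent. With a 16 kHz sampling rate, a 16 ms frame and an 8 ms shift correspond to n_fft = 256 and hop = 128, which yields the 129-dimensional complex spectrum.

```python
# Minimal preprocessing sketch (assumed tooling: librosa + numpy).
import numpy as np
import librosa

SR = 16000
N_FFT = 256          # 16 ms frame at 16 kHz -> 129 frequency bins
HOP = 128            # 8 ms frame shift

def mix_at_random_snr(s1, s2, snr_db):
    """Scale s2 so that the s1-to-s2 energy ratio equals snr_db, then mix."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    gain = np.sqrt(np.sum(s1 ** 2) / (np.sum(s2 ** 2) * 10 ** (snr_db / 10)))
    return s1 + gain * s2

def complex_spectrum(wave):
    """129-dimensional complex spectrum, shape (129, frames)."""
    return librosa.stft(wave, n_fft=N_FFT, hop_length=HOP, win_length=N_FFT)

s1, _ = librosa.load("speaker1.wav", sr=SR)   # illustrative paths
s2, _ = librosa.load("speaker2.wav", sr=SR)
snr = np.random.uniform(0.0, 5.0)             # 0-5 dB, chosen at random
Y = complex_spectrum(mix_at_random_snr(s1, s2, snr))
```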
Specifically, the monaural speech separation model takes a Y-shaped bidirectional long short-term memory recurrent neural network as its framework model and contains 3 layers, with 896 neuron nodes in each hidden layer. When the data stream is passed from a lower layer into a higher-layer network, the model applies random dropout with a probability of 0.5. When separating the mixed speech of |S| speakers, the network model has |S| output streams. To avoid the vanishing-gradient problem, the data are imported in turn into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons. The input data of the model is a three-dimensional tensor of shape D×T×129, where D denotes the number of samples selected in one training pass (the number of samples used for each pass is fixed), T denotes the maximum number of frames among the training sentences contained in each pass, and 129 is the number of frequency points, i.e. the 129-dimensional complex spectrum obtained by short-time Fourier transform of the speech data with a frame length of 16 ms and a frame shift of 8 ms. The output of the model consists of |S| mask estimates, and the dimension of each mask-estimate vector is T×129.
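A minimal PyTorch sketch of one branch of this model is given below. The exact layer wiring is an assumed reading of the description (3 Bi-LSTM layers of 896 units per direction with dropout 0.5, a linear layer and a ReLU layer of |S|×1792 units, and |S| mask outputs of 129 bins per frame); the input feature choice and the final projection layer are assumptions for illustration only.

```python
# Assumed sketch of one branch of the Y-shaped Bi-LSTM model.
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    def __init__(self, n_src=2, n_freq=129, hidden=896):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_freq, hidden_size=hidden,
                             num_layers=3, dropout=0.5,
                             batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, n_src * 2 * hidden)  # |S| x 1792
        self.relu = nn.ReLU()
        self.out = nn.Linear(n_src * 2 * hidden, n_src * n_freq)  # assumed projection
        self.n_src, self.n_freq = n_src, n_freq

    def forward(self, x):                 # x: (D, T, 129) input features
        h, _ = self.blstm(x)              # (D, T, 1792)
        h = self.relu(self.linear(h))     # (D, T, |S| * 1792)
        m = self.out(h)                   # (D, T, |S| * 129)
        return m.view(x.size(0), x.size(1), self.n_src, self.n_freq)

# The Y shape: two such branches take the same mixture spectrum as input,
# one estimating the real-part masks and the other the imaginary-part masks.
real_branch, imag_branch = MaskBranch(), MaskBranch()
```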
Specifically, in the training process of the model in step 3, the training target is the complex ideal floating value mask, which contains a real part and an imaginary part. The bidirectional long short-term memory recurrent neural network has two outputs, one for predicting the real component and the other for predicting the imaginary component, and the two networks predicting the real and imaginary components are optimized separately.
In the training phase, short-time Fourier transform is applied to the clean source speech and the mixed speech; the real and imaginary parts of the transformed speech sources are then used to compute the compressed real-part mask cIRM′r and the compressed imaginary-part mask cIRM′c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the estimate of the time-frequency mask is optimized by minimizing the mean square error between the label values and the network outputs. After multiple iterations, training stops when the mean square error falls within a certain range or another stopping condition is triggered; training is then complete, and the parameters of the bidirectional long short-term memory recurrent neural network at that point are saved for use in the test stage;
in the model test stage, the short-time Fourier transform of the mixed speech is likewise obtained and used as the input of the network model obtained in the training stage. The two output values of the network model are restored using the inverse function, yielding the estimates of the real-part and imaginary-part masks of the target source speech. The real and imaginary parts of the estimated signal are obtained by multiplying these mask estimates by the short-time Fourier transform of the mixed speech, and the separated speech signal is then reconstructed using the inverse Fourier transform.
Specifically, the real part of the complex ideal floating value mask is expressed as:
cIRMr = (Yr·Sr + Yc·Sc) / (Yr² + Yc²)
the imaginary part is expressed as:
cIRMc = (Yr·Sc - Yc·Sr) / (Yr² + Yc²)
and thus the complex ideal floating value mask is expressed as:
cIRM = (Yr·Sr + Yc·Sc)/(Yr² + Yc²) + j·(Yr·Sc - Yc·Sr)/(Yr² + Yc²)
where Yr and Yc are the real and imaginary parts of the mixed speech signal after short-time Fourier transform, Sr and Sc are the real and imaginary parts of the clean source speech signal after short-time Fourier transform, and Yr, Yc, Sr and Sc all take values in (-∞, +∞).
The real-part mask cIRM′r and the imaginary-part mask cIRM′c are uniformly expressed as
cIRM′x = K·(1 - e^(-C·cIRMx)) / (1 + e^(-C·cIRMx))
where x is r or c, denoting the real or imaginary part; the compression operation limits the mask values to [-K, K], K is a preset value, and the parameter C controls the steepness of the compression;
the inverse function is expressed as:
cIRMx = -(1/C)·log((K - Ox) / (K + Ox))
where cIRMx denotes the estimate of the uncompressed mask and Ox is the output of the deep neural network model.
Preferably, when the sentence-level permutation invariance training model is adopted, the real-part cost function is defined as:
Jr = (1/B) Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ*(i)(t,f) |²
where B = T×N×S is the total number of time-frequency units over all sound sources, T is the total number of sentence frames of all sound sources, N is the window length or frame length, and S denotes the number of sound sources; the speech signals are analysed on time-frequency (T-F) units after short-time Fourier transform, where t denotes the time index and f the frequency index. Ôi,r denotes the i-th output stream of the network in the training phase (|S| streams in total), i.e. the estimate of the real component of the i-th ideal floating value mask; cIRM′r,φ*(i) denotes the label value of the ideal-floating-value-mask real component assigned to the i-th estimate under the label permutation that minimizes the cost of sentence-level speech separation; and φ* is the permutation that minimizes the sentence-level separation cost, defined as:
φ* = argmin_{φ∈P} Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ(i)(t,f) |²
where S denotes the number of sound sources, P is the symmetric group of degree S, i.e. the set of all S! permutations, φ denotes one of these permutations, and cIRM′r,φ(i) denotes the i-th label value under permutation φ. The imaginary-part cost function Jc, computed with the permutation of the real-component label values held fixed, is trained in the same way as the real component.
The method designs and implements a Y-shaped bidirectional long short-term memory neural network (Bi-LSTM RNN) as the model architecture and uses the complex ideal floating value mask cIRM as the training target, making full use of the magnitude and phase information of the speech signal to obtain more accurate estimates. Sentence-level permutation invariance training is adopted to solve the label ambiguity problem of speaker-independent speech separation. This is the first work to combine the complex ideal floating value mask cIRM and the sentence-level permutation invariance (uPIT) method into a single model, giving speaker-independent single-channel speech separation a better separation effect.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a block diagram of a Y-shaped bi-directional long short term memory neural network of the present invention;
FIG. 3 is a block diagram of a model for single-channel speech separation of two speakers according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When a DNN regression model is used to solve the source separation problem, mainstream masking-based training targets include the ideal floating value mask (IRM), the phase-sensitive mask (PSM) and the complex ideal floating value mask (cIRM). These methods are briefly described below.
(1) Ideal floating value mask (IRM, Ideal Ratio Mask)
The speech signal is sampled at a certain frequency; at discrete time m, the target speech signal, the interfering signal and the mixed speech signal sequences may be denoted s(m), i(m) and y(m) = s(m) + i(m), respectively. After a short-time Fourier transform (STFT), they can be expressed as S(t, f), I(t, f) and Y(t, f) = S(t, f) + I(t, f), where f is the frequency index and t is the time-frame index. Given Y(t, f), the goal of monaural speech separation is to recover S(t, f) for each target sound source. With an ideal time-frequency (T-F) mask M(t, f), the spectrum of the target speech can be reconstructed as follows:
S(t,f)=Y(t,f)*M(t,f) (1)
where ". x" denotes a complex multiplication, the masking value M (t, f) at time frame t and frequency f may be expressed as:
M(t,f) = ( |S(t,f)|² / ( |S(t,f)|² + |I(t,f)|² ) )^β   (2)
where β is an adjustment parameter for scaling the mask value, and |S(t,f)| and |I(t,f)| denote the magnitude spectra of the target speech signal and the interference signal, respectively. In addition, |S(t,f)|² and |I(t,f)|² denote the target speech power spectrum and the interference signal power spectrum within the T-F unit. Typically, β is chosen to be 0.5.
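For illustration, equation (2) can be written as a small numpy helper; the magnitude-spectrogram inputs and the small epsilon added for numerical safety are assumptions of this sketch.

```python
# Equation (2) as a numpy helper; S_mag and I_mag are magnitude spectrograms
# of identical shape, beta = 0.5 as in the text.
import numpy as np

def ideal_ratio_mask(S_mag, I_mag, beta=0.5):
    return (S_mag ** 2 / (S_mag ** 2 + I_mag ** 2 + 1e-12)) ** beta
```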
Obviously, only the magnitude information is used in computing the IRM, and the phase information of the target speech signal is ignored during speech reconstruction. To compensate for this shortcoming of the IRM, researchers have proposed the PSM and the cIRM.
(2) Phase-Sensitive Mask (PSM)
In polar coordinates, the STFT of a speech signal can be defined as equation (3).
S(t,f) = |S(t,f)| e^(jθ_S(t,f))   (3)
where |S(t,f)| denotes the magnitude response and θ_S(t,f) the phase response of the speech signal at time t and frequency f; this representation is typically used when enhancement or separation is performed on noisy speech after a short-time Fourier transform. In polar coordinates the PSM, an IRM that incorporates a phase measure, becomes easy to understand as an extension of the IRM:
PSM(t,f) = ( |S(t,f)| / |Y(t,f)| ) · cos(θ_Y - θ_S)   (4)
where θ_Y and θ_S denote the phase of the mixed speech and of the target speech in the T-F unit, respectively. Because the PSM includes the phase difference between the mixed speech and the target speech, it yields a higher SNR and produces a better estimate of the target speech than the IRM. Clearly, |S(t,f)|/|Y(t,f)| and |cos(θ_Y - θ_S)| take values in (0, 1), and cos(θ_Y - θ_S) may take negative values.
(3) Complex ideal floating value mask (cIRM, Complex Ideal Ratio Mask)
The cIRM is a complex-valued time-frequency mask that is computed using the real and imaginary parts of the target and mixed speech signals after a short-time Fourier transform. The short-time Fourier transform results and the cIRM of the mixed speech and clean signal are defined as follows:
Y(t,f)=Yr(t,f)+jYc(t,f) (5)
S(t,f)=Sr(t,f)+jSc(t,f) (6)
cIRM(t,f)=cIRMr(t,f)+jcIRMc(t,f) (7)
where the subscripts r and c denote the real and imaginary parts, respectively. For convenience, the frequency index f and time-frame index t are omitted below; Y, S and cIRM are still defined for each time-frequency unit. Thus, in the complex domain, equation (1) can be further rewritten as:
Sr+jSc=(Yr+jYc)*(cIRMr+jcIRMc) (8)
Sr=cIRMr*Yr-cIRMc*Yc (9)
Sc=cIRMr*Yc+cIRMc*Yr (10)
Using equations (9) and (10), the real and imaginary parts of the cIRM can be derived:
cIRMr = (Yr·Sr + Yc·Sc) / (Yr² + Yc²)   (11)
cIRMc = (Yr·Sc - Yc·Sr) / (Yr² + Yc²)   (12)
Therefore, the defining formula of the cIRM is obtained as:
cIRM = (Yr·Sr + Yc·Sc)/(Yr² + Yc²) + j·(Yr·Sc - Yc·Sr)/(Yr² + Yc²)   (13)
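Equations (11)-(13) translate directly into a small numpy helper; the eps term guarding against division by zero is an added assumption of this sketch.

```python
# Equations (11)-(12) as a numpy helper; Y and S are complex STFTs of the
# mixed and clean speech. Returns the real and imaginary parts of the cIRM.
import numpy as np

def complex_ideal_ratio_mask(Y, S, eps=1e-12):
    Yr, Yc = Y.real, Y.imag
    Sr, Sc = S.real, S.imag
    denom = Yr ** 2 + Yc ** 2 + eps
    cirm_r = (Yr * Sr + Yc * Sc) / denom
    cirm_c = (Yr * Sc - Yc * Sr) / denom
    return cirm_r, cirm_c
```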
Note that Yr, Yc, Sr and Sc all take values in (-∞, +∞), which means that cIRMr ∈ ℝ and cIRMc ∈ ℝ, i.e. the cIRM is unbounded, whereas the range of the IRM mentioned earlier is [0, 1], which is very advantageous for DNN-based training. The cIRM is therefore compressed using the following hyperbolic tangent function:
cIRM′x = K·(1 - e^(-C·cIRMx)) / (1 + e^(-C·cIRMx))   (14)
where x is r or c, denoting the real or imaginary part; the compression operation limits the mask values to [-K, K], and the parameter C controls its steepness. In experiments, several pairs of K and C values were evaluated, and the DNN-based sound source separation model performed best with K = 10 and C = 0.05. In the training phase, the training labels are compressed cIRMs, and the model output values are also compressed. Likewise, in the test phase, where the DNN output is the compressed estimate of the mask rather than the original mask, the following inverse function is used to recover the estimate of the uncompressed mask:
cIRMx = -(1/C)·log((K - Ox) / (K + Ox))   (15)
where cIRMx denotes the estimate of the uncompressed mask and Ox is the DNN output.
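Equations (14) and (15) can be sketched as a pair of numpy helpers, using K = 10 and C = 0.05 as reported above; the clipping inside uncompress_mask is an added numerical safeguard, not part of the original formulas.

```python
# Equations (14) and (15) as numpy helpers.
import numpy as np

def compress_mask(m, K=10.0, C=0.05):
    return K * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

def uncompress_mask(o, K=10.0, C=0.05):
    # o is the compressed network output; clip slightly inside (-K, K)
    # to keep the logarithm finite (added safeguard).
    o = np.clip(o, -K + 1e-6, K - 1e-6)
    return (-1.0 / C) * np.log((K - o) / (K + o))
```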
According to Lee's research, it is difficult to estimate the phase directly and accurately because it lacks a clear structure; it is therefore hard to reconstruct accurate speech by estimating the magnitude and phase separately. In theory, estimating the cIRM yields accurate estimates of the real and imaginary parts, which contain both magnitude and phase information, so the cIRM can estimate the source speech more accurately than the PSM.
Thus, as shown in fig. 1, a speaker independent single-channel speech separation method includes the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a monaural speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the monaural speech separation model;
and step 4, inputting the mixed speech into the trained model for speech separation.
In fact, various network architectures are very effective in processing speech signals, and DNN- or RNN-based approaches have been widely used to solve the monaural speech separation problem. In particular, LSTM RNN networks, which operate on a sentence frame by frame, can effectively exploit historical information in the time sequence and are often used to process time-series speech data. In addition, related research has shown that LSTM RNN networks can improve the generalization ability of speech separation methods across speakers. If a bidirectional LSTM (Bi-LSTM) RNN network is used, the past and future information around a given frame is stacked over the whole speech sentence and passed to the next layer, and its performance is superior to a unidirectional LSTM RNN network when time series are involved. Therefore, the invention adopts a bidirectional long short-term memory neural network (Bi-LSTM RNN) as the network framework model.
Since the training target is the complex ideal floating value mask, which contains real and imaginary components, the Bi-LSTM RNN network has two outputs, one for predicting the real component and the other for predicting the imaginary component. The invention designs and implements a Y-shaped neural network architecture to obtain this training target, as shown in FIG. 2. The input features are the STFT spectrum of the mixed speech, and the two networks that predict the real and imaginary components are optimized separately. In contrast, the IRM-based and PSM-based Bi-LSTM RNN models each have a single output.
The example given in this embodiment is two-speaker speech separation based on the complex ideal floating value mask cIRM as the target. When the sentence-level permutation invariance (uPIT) method is used, the invention takes the mean square error (MSE) between the predicted values output by the Bi-LSTM RNN network in the uPIT module and the compressed target masks (i.e., the label values) of the clean speech signals as the cost function. Thus, the real-part cost function of the pcIRM-based approach can be defined as:
Jr = (1/B) Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ*(i)(t,f) |²   (16)
where B = T×N×S is the total number of time-frequency units over all sound sources, T is the total number of sentence frames of all sound sources, N is the window length (or frame length), Ôi,r denotes the i-th output stream of the network, i.e. the estimate of the real component of the i-th cIRM, and cIRM′r,φ*(i) denotes the corresponding compressed label value. The permutation that minimizes the cost value of sentence-level speech separation can be defined as:
φ* = argmin_{φ∈P} Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ(i)(t,f) |²   (17)
Note that S denotes the number of sound sources; in equation (17), P is the symmetric group of degree S, i.e. the set of all S! permutations, and φ denotes one of these permutations. Similarly, the imaginary component cIRMc of the predicted cIRM and its cost function Jc are trained in the same way as the real component. Likewise, the cost functions of the uPIT-based IRM model and the uPIT-based PSM model may be defined by equations (18) and (19), respectively:
J_IRM = (1/B) Σ_{i=1..S} Σ_{t,f} | M̂i(t,f) - IRMφ*(i)(t,f) |²   (18)
J_PSM = (1/B) Σ_{i=1..S} Σ_{t,f} | M̂i(t,f) - PSMφ*(i)(t,f) |²   (19)
where M̂i denotes the i-th estimated mask and IRMφ*(i) and PSMφ*(i) denote the corresponding label masks under the minimizing permutation.
For the masking-based methods that do not use uPIT, the order of the target sound sources is fixed, so there is only one assignment between the estimated speech and the target speech; the cost function has the same form as in uPIT but does not involve searching for the minimum cost permutation.
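A minimal sketch of the sentence-level permutation search in equations (16)-(17) is shown below: the MSE is computed for every permutation of the label masks and the smallest one is kept, and the chosen permutation is returned so that it can be held fixed when computing the imaginary-part cost, as described above. The tensor shapes and the brute-force search are assumptions of this sketch.

```python
# uPIT-style permutation search over mask estimates (sketch).
import itertools
import torch

def upit_mse(est, ref):
    """est, ref: tensors of shape (S, T, F). Returns the minimum MSE over all
    label permutations and the permutation that achieves it, so the same
    permutation can be reused for the imaginary-part cost."""
    S = est.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(S)):
        loss = torch.mean((est - ref[list(perm)]) ** 2)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```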
As shown in FIG. 3, which depicts the model structure for single-channel speech separation of two speakers, in the training stage short-time Fourier transform is applied to the clean source speech and the mixed speech; the real and imaginary parts of the transformed speech sources are then used to compute the compressed real-part mask cIRM′r and the compressed imaginary-part mask cIRM′c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the estimate of the time-frequency mask is optimized by minimizing the mean square error between the label values and the network outputs. After multiple iterations, training stops when the mean square error falls within a certain range or another stopping condition is triggered; the parameters of the network at that point are saved for use in the test stage.
in the testing phase, the short-time Fourier transform results of the mixed speech are also obtained and then used as inputs to the network models Bi-LSTM RNN1 and Bi-LSTM RNN2 obtained in the training phase. The output values of the two networks are subjected to a restoration process using equation (15), thereby obtaining estimated values of the real part mask and the imaginary part mask of the target source speech, respectively. The real and imaginary parts of the estimated signal are obtained by multiplying the real and imaginary masked estimates by the STFT value of the mixed speech. Then, the signal is reconstructed by using the inverse Fourier transform to obtain a separated voice signal.
The invention uses the WSJ0-2mix data set to evaluate the single-channel speech separation model, with a sampling rate of 16 kHz; the speech signal is transformed by short-time Fourier transform into a 129-dimensional complex spectrum used as input. The WSJ0-2mix data set is derived from the WSJ0 corpus, which includes a training set (si_tr_s) and two validation sets (si_dt_05 and si_et_05). The training set si_tr_s contains 101 speakers, each of whom recorded roughly 90 to 140 sentences, each about 5 seconds long.
The generated WSJ0-2mix data set includes a training set, a validation set and a test set. The 30 h training set and the 10 h validation set were obtained by randomly selecting two speakers from the WSJ0 training set si_tr_s (which contains 49 males and 51 females) and randomly selecting sentences from the recordings of the two speakers for mixing; the signal-to-noise ratio (SNR) of the two sentences during mixing ranges from 0 dB to 5 dB, and the SNR value is also chosen at random. The 5 h test set was generated in the same way from the data in the WSJ0 validation sets si_dt_05 and si_et_05, which include 7 women and 11 men. These 18 speakers in the WSJ0-2mix test set are not included in the training set, so the experiments are speaker independent.
In the experiments, all vanilla-DNN-based methods contain 3 hidden layers with 1792 neuron nodes per hidden layer, and all bidirectional LSTM RNN-based methods also contain 3 hidden layers, with 896 neuron nodes per hidden layer, so that all models have a similar number of parameters. To avoid overfitting, all models apply random dropout with a probability of 0.5 as the data stream passes from lower layers into higher layers. When separating the mixed speech of |S| speakers, the network model has |S| output streams. In this embodiment, |S| is set to 2; the WSJ0-2mix data set used in the experiments is likewise generated by mixing the voices of two different speakers, so the network model has two outputs, and most current research separates the mixed speech of two speakers. To avoid the vanishing-gradient problem, the data are imported in turn into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons.
The input to all models is the same: a 129-dimensional complex spectrum obtained by short-time Fourier transform of the mixed speech, with a frame length of 16 ms and a frame shift of 8 ms. Specifically, the input data is a three-dimensional tensor of shape D×T×129, where D denotes the number of samples selected in one training pass (the batch size), which is fixed at 8 sentences per pass; T denotes the maximum number of frames among the training sentences in each pass; and 129 is the number of frequency points. The output of all models consists of |S| mask estimates, each of dimension T×129.
For the cIRM-based complex ideal floating value masking approach, the outputs of the model are the estimates of the real and imaginary components of the cIRM, corresponding to two Bi-LSTM RNN networks trained with the MSE cost functions Jr and Jc, respectively. The experiments use the Adam optimization algorithm to optimize the DNN and Bi-LSTM RNN models, with the weight decay set to 10⁻⁵; the learning rate is not fixed and is adjusted according to the effect of network training. When the learning rate falls below 10⁻¹⁰, the training process terminates automatically. Furthermore, the batch size is set to 8, meaning that 8 speech segments are randomly selected from the data set and loaded into the model for each training pass. The number of iterations is set to 100. During the experiments, the training set is used to train the model, and the validation set is used only to control the learning rate.
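The training configuration above can be sketched as follows; the initial learning rate and the run_one_epoch helper are hypothetical placeholders, and real_branch/imag_branch refer to the model sketch given earlier.

```python
# Training-configuration sketch: Adam with weight decay 1e-5, batch size 8
# (handled inside the data loader), up to 100 epochs, validation-MSE-driven
# learning-rate reduction, early stop once the rate falls below 1e-10.
import torch

model_params = list(real_branch.parameters()) + list(imag_branch.parameters())
optimizer = torch.optim.Adam(model_params, lr=1e-3, weight_decay=1e-5)  # lr=1e-3 is an assumed starting value
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)

for epoch in range(100):                   # iteration count from the text
    val_mse = run_one_epoch(optimizer)     # hypothetical helper: trains one epoch, returns validation MSE
    scheduler.step(val_mse)                # validation set only drives the learning rate
    if optimizer.param_groups[0]["lr"] < 1e-10:
        break                              # stop once the rate drops below 1e-10
```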
The speech separation methods with the cIRM, IRM and PSM as training targets were trained on the mixed speech data set WSJ0-2mix using both the uPIT method and the conventional training method, with MSE as the evaluation index during training. The results show that the MSE of the conventional training method decreases slowly and, from the tenth iteration on, remains almost constant, which is likely due to the permutation problem. With the uPIT method, the MSE converges rapidly. There is a huge gap between the MSE of the cIRM model trained with the conventional method and that of the pcIRM model, which demonstrates the effectiveness of the pcIRM model in solving the label permutation problem.
On the same data set, the differences in the training process of the vanilla DNN-based model and the pcIRM model when the uPIT method was employed were also compared in the experiments. On the training and validation sets, the MSE of both methods decreased rapidly and showed almost the same trend, and the MSE based on the Bi-LSTM model (i.e., pcIRM) was much smaller than that of the vanilla DNN model, indicating that the method of the present invention is more efficient than the vanilla DNN based method in processing time series context information.
The performance of speech separation algorithms is typically evaluated with three metrics: short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and signal-to-distortion ratio (SDR). STOI and PESQ measure the intelligibility score and the perceived speech quality score, respectively. SDR is a comprehensive index that evaluates the overall separation performance, so the method adopts SDR as the evaluation index to assess its potential for improving speech separation performance.
Analysis of the experimental results reveals the following. First, under the current experimental settings, the method of the invention achieves better separation performance when separating male-female mixed speech than in same-gender mixed speech scenarios. Meanwhile, the PSM-based Bi-LSTM RNN model achieves almost the largest SDR improvement under both the conventional training method and the uPIT method, which shows the effectiveness of phase information in improving speech separation performance. Second, compared with the conventional training method, the uPIT-based methods obtain better results under the different training targets, highlighting the advantages of the uPIT method. Furthermore, the SDR score of the Bi-LSTM RNN-based model in this task is higher than that of the vanilla-DNN-based model, demonstrating the strong ability of the Bi-LSTM RNN to capture time-series information.
Likewise, the analysis of the results shows that, although gender information is not explicitly used in training the model, the method of the invention achieves a better SDR improvement for the separation of mixed speech from speakers of different genders. As training proceeds, the IRM- and PSM-based approaches approach the results of the oracle IRM and oracle PSM for opposite-gender mixed speech, showing that the training effect gradually moves towards the performance limit under the current experimental data and settings. This is consistent with the findings of other researchers and indicates that separating same-gender mixed speech remains very challenging, with large room for performance improvement, making it one of the problems worth studying now and in the future.
Furthermore, the SDR score of the oracle cIRM is more than six times that of the oracle IRM and almost five times that of the oracle PSM method, while the pcIRM achieves better results than the PSM-based methods except in the case of same-gender mixed speech separation. Analyzing the reasons: on the one hand, as mentioned earlier, cIRMr ∈ ℝ and cIRMc ∈ ℝ, so their values are unbounded. On the other hand, because the real and imaginary parts of the complex spectrum both have structural characteristics, the Y-shaped Bi-LSTM RNN was designed and implemented (the framework of the Y-shaped neural network is shown in FIG. 2), with two independent networks optimizing the real and imaginary parts separately. The Bi-LSTM RNN network used to optimize the MSE of the imaginary component does not train as well. In theory, the real component of the complex spectrum obtained after short-time Fourier transform of the speech signal can be understood as a projection of the spectrum, which is exactly the definition of the PSM, whereas the definition corresponding to the imaginary component is hard to understand intuitively. Therefore, the imaginary-component model is harder to train well than the real-component model. Nevertheless, compared with the cIRM-based methods in other literature, the model of the invention obtains a higher SDR improvement, showing the effectiveness of the method.
The present invention proposes a new approach to the speaker-independent monaural source separation problem. A Y-shaped Bi-LSTM RNN network is designed and implemented as the training framework of the method; the real and imaginary components of the mixed speech complex spectrum are trained and optimized separately, effectively exploiting the magnitude and phase information of the speech signal. Sentence-level permutation invariant training is adopted to realize the cIRM estimation, which solves the label permutation and speaker tracking problems simultaneously. On the WSJ0-2mix data set, the speech separation performance of the method was verified with the SDR index and compared with popular existing methods; the experimental results show the effectiveness of the method, further demonstrate the importance of phase information for the speech separation task, and show that the uPIT method handles the permutation problem in the speech separation task well.

Claims (5)

1. A speaker-independent single-channel speech separation method, characterized by comprising the following steps:
Step 1, preparing a data set and performing data preprocessing;
Step 2, establishing a monaural speech separation model based on complex ideal floating value masking;
Step 3, adopting sentence-level permutation invariance training when training the monaural speech separation model;
Step 4, inputting the mixed speech into the trained model for speech separation;
wherein the data set in step 1 is the WSJ0-2mix data set, which includes a training set, a validation set and a test set and is obtained by randomly selecting two speakers from the WSJ0 training set si_tr_s and randomly selecting sentences from the recordings of the two speakers for mixing, the signal-to-noise ratio of the two sentences during mixing ranging from 0 dB to 5 dB with the specific value chosen at random, and all speech data are preprocessed by short-time Fourier transform to obtain 129-dimensional complex spectra;
the monaural speech separation model takes a Y-shaped bidirectional long short-term memory recurrent neural network as the framework model and contains 3 layers, with 896 neuron nodes in each hidden layer; when the data stream is passed from a lower layer into a higher-layer network, the model applies random dropout with a probability of 0.5; when separating the mixed speech of |S| speakers, the network model has |S| output streams; to avoid the vanishing-gradient problem, the data are imported in turn into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons; the input data of the model is a three-dimensional tensor of shape D×T×129, where D denotes the number of samples selected in one training pass (the number of samples used for each pass being fixed), T denotes the maximum number of frames among the training sentences contained in each pass, and 129 is the number of frequency points, i.e. the 129-dimensional complex spectrum obtained by short-time Fourier transform of the speech data with a frame length of 16 ms and a frame shift of 8 ms; the output of the model consists of |S| mask estimates, and the dimension of each mask-estimate vector is T×129;
during training of the model in step 3, the training target is the complex ideal floating value mask, which contains a real part and an imaginary part; the bidirectional long short-term memory recurrent neural network has two outputs, one for predicting the real component and the other for predicting the imaginary component, and the two networks predicting the real and imaginary components are optimized separately.

2. The speaker-independent single-channel speech separation method according to claim 1, characterized in that, in the model training stage, short-time Fourier transform is applied to the clean source speech and the mixed speech, and the real and imaginary parts of the transformed speech sources are then used to compute the compressed real-part mask cIRM′r and the compressed imaginary-part mask cIRM′c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network; at each iteration, the estimate of the time-frequency mask is optimized by minimizing the mean square error between the label values and the network outputs; after multiple iterations, training stops when the mean square error falls within a certain range or another stopping condition is triggered, training is complete, and the parameters of the bidirectional long short-term memory recurrent neural network at that point are saved for use in the test stage;
in the model test stage, the short-time Fourier transform of the mixed speech is likewise obtained and used as the input of the network model obtained in the training stage; the two output values of the network model are restored using the inverse function, yielding the estimates of the real-part and imaginary-part masks of the target source speech; the real and imaginary parts of the estimated signal are obtained by multiplying these mask estimates by the short-time Fourier transform of the mixed speech, and the separated speech signal is then reconstructed using the inverse Fourier transform.

3. The speaker-independent single-channel speech separation method according to claim 1 or 2, characterized in that the real part of the complex ideal floating value mask is expressed as:
cIRMr = (Yr·Sr + Yc·Sc) / (Yr² + Yc²)
the imaginary part is expressed as:
cIRMc = (Yr·Sc - Yc·Sr) / (Yr² + Yc²)
and thus the complex ideal floating value mask is expressed as:
cIRM = (Yr·Sr + Yc·Sc)/(Yr² + Yc²) + j·(Yr·Sc - Yc·Sr)/(Yr² + Yc²)
where Yr and Yc are the real and imaginary parts of the mixed speech after short-time Fourier transform, Sr and Sc are the real and imaginary parts of the clean source speech signal after short-time Fourier transform, and Yr, Yc, Sr and Sc all take values in (-∞, +∞).

4. The speaker-independent single-channel speech separation method according to claim 3, characterized in that the real-part mask cIRM′r and the imaginary-part mask cIRM′c are uniformly expressed as
cIRM′x = K·(1 - e^(-C·cIRMx)) / (1 + e^(-C·cIRMx))
where x is r or c, denoting the real or imaginary part; the compression operation limits the mask values to [-K, K], K is a preset value, and the parameter C controls the steepness of the compression;
the inverse function is expressed as:
cIRMx = -(1/C)·log((K - Ox) / (K + Ox))
where cIRMx denotes the estimate of the uncompressed mask and Ox is the output of the deep neural network model.

5. The speaker-independent single-channel speech separation method according to claim 1 or 4, characterized in that, when the sentence-level permutation invariance training model is adopted, the real-part cost function is defined as:
Jr = (1/B) Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ*(i)(t,f) |²
where B = T×N×S is the total number of time-frequency units over all sound sources, T is the total number of sentence frames of all sound sources, N is the window length or frame length, S denotes the number of sound sources, t denotes the time-frame index and f the frequency index; Ôi,r denotes the i-th output stream of the network in the training stage, i.e. the estimate of the real component of the i-th ideal floating value mask; cIRM′r,φ*(i) denotes the label value of the ideal-floating-value-mask real component corresponding to the i-th estimate under the label permutation that minimizes the cost of sentence-level speech separation; and φ* is the permutation that minimizes the sentence-level speech separation cost, defined as:
φ* = argmin_{φ∈P} Σ_{i=1..S} Σ_{t,f} | Ôi,r(t,f) - cIRM′r,φ(i)(t,f) |²
where S denotes the number of sound sources, P is the symmetric group of degree S containing all S! permutations, φ denotes one of these permutations, and cIRM′r,φ(i) denotes the i-th label value when the permutation of the real-component label values is fixed; the imaginary-part cost function Jc is trained in the same way as the real component.
CN202010401151.3A 2020-05-12 2020-05-12 Speaker independent single-channel voice separation method Active CN111583954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010401151.3A CN111583954B (en) 2020-05-12 2020-05-12 Speaker independent single-channel voice separation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010401151.3A CN111583954B (en) 2020-05-12 2020-05-12 Speaker independent single-channel voice separation method

Publications (2)

Publication Number Publication Date
CN111583954A CN111583954A (en) 2020-08-25
CN111583954B true CN111583954B (en) 2021-03-30

Family

ID=72112661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010401151.3A Active CN111583954B (en) 2020-05-12 2020-05-12 Speaker independent single-channel voice separation method

Country Status (1)

Country Link
CN (1) CN111583954B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111899757B (en) * 2020-09-29 2021-01-12 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN112435655B (en) * 2020-10-16 2023-11-07 北京紫光青藤微系统有限公司 Data acquisition and model training method and device for isolated word speech recognition
CN112201276B (en) * 2020-11-11 2022-04-29 东南大学 Microphone array speech separation method based on TC-ResNet network
CN114822583B (en) * 2021-01-28 2024-11-22 中国科学院声学研究所 A single-channel sound source separation method using kernelized auditory model
CN113271272B (en) * 2021-05-13 2022-09-13 侯小琪 Single-channel time-frequency aliasing signal blind separation method based on residual error neural network
CN113259283B (en) * 2021-05-13 2022-08-26 侯小琪 Single-channel time-frequency aliasing signal blind separation method based on recurrent neural network
CN113707172B (en) * 2021-06-02 2024-02-09 西安电子科技大学 Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN113288150B (en) * 2021-06-25 2022-09-27 杭州电子科技大学 Channel selection method based on fatigue electroencephalogram combination characteristics
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113611292B (en) * 2021-08-06 2023-11-10 思必驰科技股份有限公司 Optimization method and system for short-time Fourier change for voice separation and recognition
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN116701921B (en) * 2023-08-08 2023-10-20 电子科技大学 Multi-channel timing signal adaptive noise suppression circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373583A (en) * 2016-09-28 2017-02-01 北京大学 Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN111091847A (en) * 2019-12-09 2020-05-01 北京计算机技术及应用研究所 Deep clustering voice separation method based on improvement

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096961B (en) * 2014-05-06 2019-02-01 华为技术有限公司 Speech separating method and device
US10249305B2 (en) * 2016-05-19 2019-04-02 Microsoft Technology Licensing, Llc Permutation invariant training for talker-independent multi-talker speech separation
WO2019104229A1 (en) * 2017-11-22 2019-05-31 Google Llc Audio-visual speech separation
JP6927419B2 (en) * 2018-04-12 2021-08-25 日本電信電話株式会社 Estimator, learning device, estimation method, learning method and program
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
US10699700B2 (en) * 2018-07-31 2020-06-30 Tencent Technology (Shenzhen) Company Limited Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks
CN110459238B (en) * 2019-04-12 2020-11-20 腾讯科技(深圳)有限公司 Voice separation method, voice recognition method and related equipment
CN110321810A (en) * 2019-06-14 2019-10-11 华南师范大学 Single channel signal two-way separation method, device, storage medium and processor
CN110634502B (en) * 2019-09-06 2022-02-11 南京邮电大学 Single-channel speech separation algorithm based on deep neural network
CN111128197B (en) * 2019-12-25 2022-05-13 北京邮电大学 Multi-speaker voice separation method based on voiceprint features and generation confrontation learning
CN111128209B (en) * 2019-12-28 2022-05-10 天津大学 Speech enhancement method based on mixed masking learning target

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106373583A (en) * 2016-09-28 2017-02-01 北京大学 Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN110120227A (en) * 2019-04-26 2019-08-13 天津大学 A kind of depth stacks the speech separating method of residual error network
CN111091847A (en) * 2019-12-09 2020-05-01 北京计算机技术及应用研究所 Deep clustering voice separation method based on improvement

Also Published As

Publication number Publication date
CN111583954A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN111583954B (en) Speaker independent single-channel voice separation method
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
WO2021139294A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
Yuliani et al. Speech enhancement using deep learning methods: A review
Shi et al. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation.
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN108962229B (en) A single-channel, unsupervised method for target speaker speech extraction
Geng et al. End-to-end speech enhancement based on discrete cosine transform
Le et al. Inference skipping for more efficient real-time speech enhancement with parallel RNNs
Li et al. Sams-net: A sliced attention-based neural network for music source separation
CN113539293B (en) Single-channel voice separation method based on convolutional neural network and joint optimization
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
Wichern et al. Low-Latency approximation of bidirectional recurrent networks for speech denoising.
Hou et al. Multi-task learning for end-to-end noise-robust bandwidth extension
Fan et al. Real-time single-channel speech enhancement based on causal attention mechanism
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
CN118398033A (en) A speech-based emotion recognition method, system, device and storage medium
Sofer et al. CNN self-attention voice activity detector
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Fan et al. Deep attention fusion feature for speech separation with end-to-end post-filter method
Wang et al. Cross-domain diffusion based speech enhancement for very noisy speech
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
CN118248159A (en) A joint training method for speech enhancement model based on frequency subband

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant