CN111583954B - Speaker independent single-channel voice separation method - Google Patents
Speaker independent single-channel voice separation method
- Publication number
- CN111583954B (application CN202010401151.3A)
- Authority
- CN
- China
- Prior art keywords
- training
- real
- speech
- model
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a speaker-independent single-channel speech separation method, which comprises the following steps: preparing a data set and carrying out data preprocessing; establishing a single-channel speech separation model based on complex ideal floating value masking; adopting sentence-level permutation invariance training when training the single-channel speech separation model; and inputting the mixed speech data into the trained model for speech separation. The method effectively and accurately estimates the complex ideal floating value mask through sentence-level permutation invariance training, adopts a bidirectional long short-term memory neural network structure to estimate the complex ideal floating value mask, and further solves the label ambiguity problem by using the sentence-level permutation invariance training criterion, so that single-channel speech separation achieves a better effect.
Description
Technical Field
The invention belongs to the technical field of intelligent voice processing, and particularly relates to speaker independent single-channel voice separation based on sentence-level permutation invariance training and complex ideal floating value masking.
Background
The objective of the speech source separation task is to extract a plurality of speech source signals from a mixed speech signal containing two or more speech sources, one for each speaker. In general, the speech separation problem can be divided into mono (i.e., single channel) and array-based (i.e., multi-channel) source separation problems, depending on the number of microphones or channels. For the former problem, the mainstream research method is to extract the target voice or remove the interference signal from the mixed signal based on the acoustic characteristics and statistical characteristics of the target voice and the interference signal. In the multi-channel speech separation problem, spatial information is available in addition to the acoustic and statistical properties of the signal. The mono speech separation problem remains very challenging because only one speech recording is available and the spatial information that can be extracted is very limited.
Since the 1990s, researchers have developed many approaches to the monophonic speech separation problem. Before the deep learning era, classical single-channel speech separation methods could be divided into three categories: model-based methods, Blind Source Separation (BSS) methods and Computational Auditory Scene Analysis (CASA) methods. However, these methods are of limited effectiveness when processing sound sources in multi-source mixed speech captured in real environments, because of the numerous difficulties involved, including the wide variety of noise in mixed speech, low signal-to-noise environments, and limited computational resources. Therefore, in a real environment it is difficult to consistently obtain a high-quality target speech signal with the above methods.
Recently, researchers have used regression models in Deep Neural Networks (DNN) to solve the source separation problem and, particularly for the mono case, have achieved very good performance gains. Depending on the training objective, DNN-based mono source separation methods can be divided into three categories, namely masking-based methods, mapping-based methods and Signal Approximation (SA)-based methods. In comparison, masking-based methods can be trained to yield more accurate neural network models than mapping-based methods.
The first masking-based training target applied in supervised speech separation approaches was Ideal Binary Masking (IBM), which was inspired by the auditory masking effect and the exclusive allocation principle in auditory scene analysis. Many researchers have used IBM as a training target and obtained good speech separation results. However, because the IBM takes only the values 0 or 1 in each time-frequency (T-F) unit, this hard decision is not flexible enough, and the speech signals separated by IBM-based methods are distorted. For this reason, researchers proposed ideal floating value masking (IRM, the ideal ratio mask) to improve on IBM, setting the value of each T-F unit to the ratio of the energy of the target sound source to the energy of the mixed speech. The target speech signal separated with an IRM-based method is generally of better quality than with IBM.
Although these DNN-based methods achieve good performance, both IBM and IRM use only the amplitude information of the target signal when separating and synthesizing clean speech signals, since earlier studies assumed the phase spectrum was not important for speech separation. However, recent studies by Erdogan et al. have found that phase information is beneficial for predicting accurate masks and signal estimates, and they proposed a Phase Sensitive Masking (PSM) based approach that is significantly better than IBM and IRM. In addition, Williamson et al. estimate the complex ideal floating value mask (cIRM) using both the magnitude and the phase spectral information in the complex domain.
In the speech separation task, if the target speaker and the interfering speakers are the same in the training data and the test data, the task is speaker-dependent speech separation; if the target speaker is fixed but the interferers are allowed to change, it is called target-dependent speech separation. Similarly, if the speakers are not required to be the same between the training data and the test data, it is called speaker-independent speech separation, which is the least constrained case. The label ambiguity (or permutation) problem has been the main cause of the poor performance of speaker-independent speech separation algorithms in prior studies. In a speaker-independent scenario, the speech separation model has multiple outputs, where each output represents one sound source. When several speakers talk at the same time and their voices overlap, how to assign the separated speech components to each sound source becomes a troublesome problem. Researchers have proposed permutation invariant training (PIT) models to solve this problem and obtained good results.
Disclosure of Invention
In view of the above, the present invention aims to provide a speaker-independent single-channel speech separation method based on sentence-level permutation invariance training (uPIT) and complex ideal floating value masking (cIRM), which effectively and accurately implements cIRM estimation through sentence-level permutation invariance training (uPIT). Specifically, the speaker-independent single-channel speech separation method employs a bidirectional Long Short-Term Memory neural network (Bi-LSTM RNN) structure to estimate the complex ideal floating value mask cIRM, and further uses the sentence-level permutation invariance training (uPIT) criterion to solve the label ambiguity problem.
To achieve this purpose, the invention adopts the following technical scheme: a speaker-independent single-channel speech separation method, comprising the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a single-channel speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the single-channel speech separation model;
and step 4, inputting the mixed speech into the trained model to perform speech separation.
Specifically, the data set in step 1 is the WSJ0-2mix data set, which includes a training set, a validation set and a test set. Two speakers are randomly selected from the WSJ0 training set si_tr_s, sentences are randomly selected from the recordings of the two speakers and mixed, the signal-to-noise ratio of the two sentences during mixing ranges from 0 dB to 5 dB with the specific value chosen at random, and all speech data are preprocessed by a short-time Fourier transform to obtain 129-dimensional complex spectra.
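As an illustration of this preprocessing step, the following sketch (the author's own, not code from the patent) mixes two utterances at a random SNR between 0 dB and 5 dB and computes the 129-dimensional complex spectrum; a 16 ms frame at a 16 kHz sampling rate corresponds to 256 samples, giving 256/2 + 1 = 129 frequency bins. The function names and the use of SciPy are assumptions made for this example.

```python
import numpy as np
from scipy.signal import stft

def mix_at_random_snr(s1, s2, low_db=0.0, high_db=5.0, rng=np.random):
    """Scale s2 so that the s1-to-s2 power ratio equals a random SNR in [low_db, high_db]."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    snr_db = rng.uniform(low_db, high_db)
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    s2 = s2 * np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    return s1, s2, s1 + s2

def complex_spectrum(x, fs=16000):
    """129-dimensional complex spectrum: 16 ms frames (256 samples) with an 8 ms shift."""
    _, _, Z = stft(x, fs=fs, nperseg=256, noverlap=128)
    return Z.T  # shape (T, 129), complex-valued
```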
Specifically, the single-channel speech separation model takes a Y-shaped bidirectional long short-term memory recurrent neural network as the framework model and comprises 3 hidden layers, each with 896 neuron nodes. When the data stream is passed from a lower layer to a higher layer of the network, the model applies random dropout with a dropout probability of 0.5. When separating the mixed speech of |S| speakers, the network model has |S| output streams. To avoid the vanishing-gradient problem, the data are fed in turn into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons. The input of the model is a three-dimensional tensor of shape D×T×129, where D is the number of samples selected in one training batch (fixed for every batch), T is the maximum number of frames among the training sentences in the batch, and 129 is the number of frequency points, i.e. the 129-dimensional complex spectrum obtained by a short-time Fourier transform of the speech data with a frame length of 16 ms and a frame shift of 8 ms. The output of the model consists of |S| mask estimates, each of dimension T×129.
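A minimal PyTorch sketch of one possible realization of this architecture is given below, assuming the Y shape consists of two parallel Bi-LSTM branches fed by the same mixture features (one branch for the real-part masks, one for the imaginary-part masks, as described in the next paragraph). The class names and the exact wiring of the branches are the author's assumptions; the patent's FIG. 2 is authoritative.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """One branch: 3-layer Bi-LSTM (896 units, dropout 0.5) followed by the
    |S|x1792 linear + ReLU layers and an output layer producing |S| masks of size 129."""
    def __init__(self, n_freq=129, hidden=896, n_layers=3, n_src=2, dropout=0.5):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_freq, hidden_size=hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True, dropout=dropout)
        self.linear = nn.Linear(2 * hidden, n_src * 1792)
        self.relu = nn.ReLU()
        self.out = nn.Linear(n_src * 1792, n_src * n_freq)
        self.n_src, self.n_freq = n_src, n_freq

    def forward(self, x):                # x: (D, T, 129) input features
        h, _ = self.blstm(x)             # (D, T, 2*896)
        h = self.relu(self.linear(h))    # (D, T, |S|*1792)
        m = self.out(h)                  # (D, T, |S|*129)
        return m.view(x.size(0), x.size(1), self.n_src, self.n_freq)

class YShapedSeparator(nn.Module):
    """Two branches on the same input: one predicts the real-part masks,
    the other the imaginary-part masks (optimized separately per the description)."""
    def __init__(self, n_src=2):
        super().__init__()
        self.real_branch = MaskBranch(n_src=n_src)
        self.imag_branch = MaskBranch(n_src=n_src)

    def forward(self, mix_features):
        return self.real_branch(mix_features), self.imag_branch(mix_features)
```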
Specifically, in the training process of the model in step 3, the training target is the complex ideal floating value mask, which comprises a real part and an imaginary part; the bidirectional long short-term memory recurrent neural network therefore has two outputs, one for predicting the real component and the other for predicting the imaginary component, and the two networks predicting the real and imaginary components are optimized separately.
In the training phase, the clean source speech and the mixed speech are subjected to a short-time Fourier transform, and the real and imaginary parts of the transformed speech sources are respectively used to calculate the compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the time-frequency mask estimates are optimized by minimizing the mean square error between the label values and the network outputs; after multiple iterations, training stops when the mean square error falls within a preset range or another stopping condition is triggered, and the parameters of the bidirectional long short-term memory recurrent neural network at that moment are saved for use in the testing stage;
in the model testing stage, the short-time Fourier transform of the mixed speech is likewise computed and used as the input of the network model obtained in the training stage. The two output values of the network model are restored with the inverse function to obtain, respectively, the estimates of the real-part mask and the imaginary-part mask of the target source speech; the real and imaginary parts of the estimated signal are obtained by multiplying these mask estimates with the short-time Fourier transform values of the mixed speech, and the signal is then reconstructed with the inverse Fourier transform to obtain the separated speech signal.
Specifically, the real part of the complex ideal floating value mask is represented as:

cIRMr = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2)

the imaginary part is represented as:

cIRMc = (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2)

thus, the complex ideal floating value mask is represented as:

cIRM = cIRMr + j*cIRMc = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) + j*(Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2)

wherein Yr and Yc are the real and imaginary parts of the mixed speech signal after the short-time Fourier transform, and Sr and Sc are respectively the real and imaginary parts of the clean source speech signal after the short-time Fourier transform; Yr, Yc, Sr and Sc take values over the whole set of real numbers, so cIRMr and cIRMc are unbounded.

The compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c are uniformly expressed as

cIRM'x = K * (1 - e^(-C*cIRMx)) / (1 + e^(-C*cIRMx))

wherein x is r or c, denoting the real part or the imaginary part; the compression operation limits the mask value to [-K, K], K is a preset value, and the parameter C controls the steepness of the compression;

the inverse function is expressed as:

cIRMx = -(1/C) * ln( (K - Ox) / (K + Ox) )

wherein cIRMx represents the estimate of the uncompressed mask and Ox is the output of the deep neural network model.
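The following NumPy sketch (the author's own illustration, not text from the patent) computes the cIRM from the mixture and clean-source STFTs and implements the compression and its inverse exactly as written above; the default K = 10 and C = 0.05 follow the values reported later in the description, and the small epsilon guarding the denominator is an added assumption.

```python
import numpy as np

def cirm(Y, S, eps=1e-12):
    """Complex ideal floating value mask from mixture STFT Y and clean-source STFT S."""
    Yr, Yc, Sr, Sc = Y.real, Y.imag, S.real, S.imag
    denom = Yr ** 2 + Yc ** 2 + eps        # eps avoids division by zero in silent T-F units
    cirm_r = (Yr * Sr + Yc * Sc) / denom
    cirm_c = (Yr * Sc - Yc * Sr) / denom
    return cirm_r, cirm_c

def compress(m, K=10.0, C=0.05):
    """Hyperbolic-tangent compression restricting the mask to (-K, K)."""
    return K * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

def uncompress(o, K=10.0, C=0.05):
    """Inverse function recovering the uncompressed mask from a compressed output o."""
    return -(1.0 / C) * np.log((K - o) / (K + o))
```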
Preferably, when the sentence-level permutation invariance training model is adopted, the real-part cost function is defined as:

Jr = (1/B) * Σ(i=1..|S|) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ*(i)(t,f) )^2

where B is the total number of time-frequency units over all sound sources, T is the total number of sentence frames of all sound sources, N is the window length or frame length, and |S| represents the number of sound sources; the signals are analyzed on units of the time-frequency (T-F) domain obtained after the short-time Fourier transform of the speech signal, where t is the time index and f is the frequency index. ĉIRMr,i denotes the i-th output stream of the network in the training phase (|S| in total), i.e. the estimate of the real component of the i-th ideal floating value mask, and cIRM'r,φ*(i) denotes the label value of the ideal-floating-value-mask real component assigned to the i-th estimate under the label permutation φ* that minimizes the cost of sentence-level speech separation, which is defined as:

φ* = argmin(φ∈P) Σ(i=1..|S|) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ(i)(t,f) )^2

wherein |S| represents the number of sound sources, P is the symmetric group of degree |S|, i.e. the set containing all |S|! permutations, φ represents one of these permutations, and cIRM'r,φ(i) represents the label value assigned to the i-th output under permutation φ. Similarly, with the arrangement of the real-component label values fixed, the imaginary-part cost function Jc is trained in the same way as the real-component part.
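To make the permutation search concrete, the sketch below (an illustrative implementation by the author, not the patent's code) enumerates all |S|! label assignments and returns the one with the smallest mean square error; per the criterion above, the permutation found for the real components is then reused for the imaginary components.

```python
import itertools
import numpy as np

def upit_mse(est_masks, label_masks):
    """est_masks, label_masks: arrays of shape (S, T, F).
    Returns (minimum MSE, best label permutation)."""
    n_src = est_masks.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        loss = np.mean((est_masks - label_masks[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```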
The method designs and implements a Y-shaped bidirectional long short-term memory neural network (Bi-LSTM RNN) as the model architecture and uses the complex ideal floating value mask cIRM as the model training target, making full use of the amplitude and phase information of the speech signal, so that a more accurate estimation result can be obtained. The label ambiguity problem of speaker-independent speech separation is solved by sentence-level permutation invariance training, and this work combines complex ideal floating value masking (cIRM) and the sentence-level permutation invariance (uPIT) method into an integral model for the first time, so that the separation of speaker-independent single-channel speech achieves a better effect.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a block diagram of a Y-shaped bi-directional long short term memory neural network of the present invention;
FIG. 3 is a block diagram of a model for single-channel speech separation of two speakers according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When using a DNN regression model to solve the source separation problem, the mainstream masking-based training objectives include the ideal floating value mask IRM, the phase sensitive mask PSM and the complex ideal floating value mask cIRM. These methods are briefly described below.
(1) Ideal float masking (IRM Ideal Ratio Mask)
The speech signal is sampled at a certain frequency; at discrete time m, the target speech signal, the interfering signal and the mixed speech signal sequence may be represented as s(m), i(m) and y(m) = s(m) + i(m), respectively. After a Short Time Fourier Transform (STFT) they can be expressed as S(t,f), I(t,f) and Y(t,f) = S(t,f) + I(t,f), respectively, where f is the frequency index and t is the time-frame index. Given Y(t,f), the goal of monophonic speech separation is to recover S(t,f) for each target sound source. With an ideal time-frequency (T-F) mask M(t,f), the spectrum of the target speech can be reconstructed as follows:
S(t,f)=Y(t,f)*M(t,f) (1)
where "*" denotes complex multiplication. The masking value M(t,f) at time frame t and frequency f may be expressed as:

M(t,f) = ( |S(t,f)|^2 / (|S(t,f)|^2 + |I(t,f)|^2) )^β (2)

where β is an adjustment parameter used to scale the masking value, and |S(t,f)| and |I(t,f)| represent the magnitude spectra of the target speech signal and of the interference signal, respectively. In addition, |S(t,f)|^2 and |I(t,f)|^2 represent the target speech power spectrum and the interference signal power spectrum within the T-F unit. Typically, β is chosen to be 0.5.
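A one-line NumPy illustration of equation (2) with β = 0.5 (the author's sketch; the small epsilon is an added safeguard for empty T-F units):

```python
import numpy as np

def irm(S, I, beta=0.5, eps=1e-12):
    """IRM from the complex STFTs of the target S and the interference I."""
    ps, pi = np.abs(S) ** 2, np.abs(I) ** 2
    return (ps / (ps + pi + eps)) ** beta
```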
Obviously, in the process of calculating the IRM, only the amplitude information is utilized, and the phase information of the target speech signal is ignored in the speech reconstruction. To compensate for the lack of IRM, researchers have proposed PSM and crirm.
(2) Phase Sensitive Mask (PSM, Phase-Sensitive Mask)
In polar coordinates, the STFT of a speech signal can be defined as equation (3):

S(t,f) = |S(t,f)| * e^(jθS(t,f)) (3)

where |S(t,f)| represents the amplitude response and θS(t,f) the phase response of the speech signal at time t and frequency f; this representation is typically used when performing enhancement or separation operations on noisy speech after a short-time Fourier transform. In polar coordinates the PSM, which is an IRM extended with a phase measure, becomes easy to understand:

PSM(t,f) = ( |S(t,f)| / |Y(t,f)| ) * cos(∠Y - ∠S) (4)

where ∠Y and ∠S respectively denote the mixed-speech phase and the target-speech phase in the T-F unit. Including the phase difference between the mixed speech and the target speech in the PSM results in a higher SNR and produces a better estimate of the target speech than the IRM. Clearly, |cos(∠Y - ∠S)| takes values in (0, 1), and cos(∠Y - ∠S) may take a negative value.
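For comparison, an equally small sketch of equation (4) (again the author's illustration, with an epsilon added to guard against zero-magnitude mixture bins):

```python
import numpy as np

def psm(S, Y, eps=1e-12):
    """PSM from the complex STFTs of the target S and the mixture Y."""
    return (np.abs(S) / (np.abs(Y) + eps)) * np.cos(np.angle(Y) - np.angle(S))
```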
(3) Complex ideal floating value masking (cIRM, Complex Ideal Ratio Mask)
The cIRM is a complex-valued time-frequency mask that is computed using the real and imaginary parts of the target and mixed speech signals after a short-time Fourier transform. The short-time Fourier transform results and the cIRM of the mixed speech and the clean signal are defined as follows:
Y(t,f)=Yr(t,f)+jYc(t,f) (5)
S(t,f)=Sr(t,f)+jSc(t,f) (6)
cIRM(t,f)=cIRMr(t,f)+jcIRMc(t,f) (7)
where the subscripts r and c respectively denote the real and imaginary parts. For convenience, the frequency index f and the time-frame index t are omitted below, although Y, S and cIRM are still defined for each time-frequency unit. Thus, in the complex domain, equation (1) can be further rewritten as:
Sr+jSc=(Yr+jYc)*(cIRMr+jcIRMc) (8)
Sr=cIRMr*Yr-cIRMc*Yc (9)
Sc=cIRMr*Yc+cIRMc*Yr (10)
Using equations (9) and (10), the real and imaginary parts of the cIRM can be derived:

cIRMr = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) (11)

cIRMc = (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2) (12)

Therefore, we can obtain the defining formula of the cIRM as:

cIRM = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) + j * (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2) (13)

It is noted that Yr, Yc, Sr and Sc range over the whole set of real numbers, which means that cIRMr and cIRMc are unbounded, whereas the range of the IRM mentioned before is [0, 1], which is very advantageous for DNN-model-based training. Thus, the cIRM is compressed using the following hyperbolic tangent function:

cIRM'x = K * (1 - e^(-C*cIRMx)) / (1 + e^(-C*cIRMx)) (14)
where x is r or c, representing the real or imaginary part; the compression operation limits the masking value to [-K, K], and the parameter C controls its steepness. In experiments, several sets of K and C values were evaluated, and the DNN-based sound source separation model performed best when K was 10 and C was 0.05. In the training phase, the training labels are compressed cIRMs, and the model output values are also compressed values. Similarly, in the testing phase, where the DNN output is the compressed estimate of the mask rather than the original mask, we use the following inverse function to recover the estimate of the uncompressed mask:

cIRMx = -(1/C) * ln( (K - Ox) / (K + Ox) ) (15)

where cIRMx represents the estimate of the uncompressed mask and Ox is the DNN output.
According to Lee's research, it is found that it is difficult to directly and accurately estimate the phase without a clear structure. Therefore, it is difficult to reconstruct accurate speech by separately estimating the amplitude and phase. Theoretically, accurate estimates of the imaginary and real parts, including amplitude and phase information, can be obtained by estimating the cIRM, which is superior to PSM in more accurately estimating the source speech.
Thus, as shown in fig. 1, a speaker independent single-channel speech separation method includes the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a single-channel speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the single-channel speech separation model;
and step 4, inputting the mixed speech into the trained model to perform speech separation.
In fact, various network architectures are very efficient in processing speech signals, and DNN or RNN based approaches have been widely used to solve the mono speech separation problem. In particular, LSTM RNN networks, which operate on a statement frame by frame, can effectively utilize historical information in the time sequence, often used to process time sequence related speech data. In addition, relevant research has shown that the LSTM RNN network can improve the generalization ability of speech separation methods to speakers. If a Bi-directional LSTM (called Bi-LSTM) RNN network is used, the past and future information with respect to a certain frame is stacked and passed to the next layer throughout the speech sentence, and the performance is superior to the uni-directional LSTM RNN network when processing of time series is involved. Therefore, the invention adopts a bidirectional long-short term memory neural network (Bi-LSTM RNN) as a network framework model.
Since the training target is a complex ideal floating value mask, which contains real and imaginary components, the Bi-LSTM RNN network has two outputs, one for predicting the real component and the other for predicting the imaginary component. The invention designs and realizes a Y-shaped neural network architecture to obtain the training target, as shown in figure 2. Where the input features are the STFT spectrum of the mixed speech, the two networks that predict the real and imaginary components are optimized separately. In contrast, the output of the IRM-based and PSM-based Bi-LSTM RNN models are both single outputs.
The example given in this embodiment is two-speaker speech separation based on complex ideal floating value masking with the cIRM target. When using the sentence-level permutation invariance training (uPIT) method, the present invention takes the Mean Square Error (MSE) between the predicted value output by the Bi-LSTM RNN network in the uPIT module and the compressed target mask (i.e. the label value) of the clean speech signal as the cost function. Thus, the real-part cost function of the pcIRM-based approach can be defined as:

Jr = (1/B) * Σ(i=1..S) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ*(i)(t,f) )^2 (16)

where B is the total number of time-frequency units over all sound sources, T is the total number of sentence frames for all sound sources, and N is the window length (or frame length); ĉIRMr,i is the i-th real-component mask estimate output by the network and cIRM'r,φ*(i) is the corresponding label under the permutation φ* that minimizes the cost value of sentence-level speech separation, which can be defined as:

φ* = argmin(φ∈P) Σ(i=1..S) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ(i)(t,f) )^2 (17)

Note that S represents the number of sound sources, and P in equation (17) is the symmetric group of degree S, i.e. the set containing all S! permutations, with φ representing one of these permutations. Similarly, the training processes of the imaginary component ĉIRMc of the predicted cIRM and of the imaginary cost function Jc are the same as for the real-component part. Likewise, the cost functions of the uPIT-based IRM model and the uPIT-based PSM model may be defined by equations (18) and (19), respectively.
For masking-based methods that do not use uPIT, the order of the target sound sources is fixed, so there is only one pairing between the estimated speech and the target speech; the cost function has the same form as in uPIT but does not involve the search for the minimum cost value.
As shown in FIG. 3, which depicts the model structure for single-channel speech separation of two speakers, in the training stage the clean source speech and the mixed speech are subjected to a short-time Fourier transform, and the real and imaginary parts of the transformed speech sources are respectively used to calculate the compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the time-frequency mask estimates are optimized by minimizing the mean square error between the label values and the network outputs; after multiple iterations, training stops when the mean square error falls within a preset range or another stopping condition is triggered, and the parameters of the bidirectional long short-term memory recurrent neural network at that moment are saved for use in the testing stage;
in the testing phase, the short-time Fourier transform results of the mixed speech are also obtained and then used as inputs to the network models Bi-LSTM RNN1 and Bi-LSTM RNN2 obtained in the training phase. The output values of the two networks are subjected to a restoration process using equation (15), thereby obtaining estimated values of the real part mask and the imaginary part mask of the target source speech, respectively. The real and imaginary parts of the estimated signal are obtained by multiplying the real and imaginary masked estimates by the STFT value of the mixed speech. Then, the signal is reconstructed by using the inverse Fourier transform to obtain a separated voice signal.
The invention uses the WSJ0-2mix data set to evaluate the single-channel speech separation model; the sampling frequency is 16 kHz, and each speech signal is converted by a short-time Fourier transform into a 129-dimensional complex spectrum used as input. The WSJ0-2mix dataset was derived from the WSJ0 corpus. The WSJ0 corpus includes a training set (si_tr_s) and two validation sets (si_dt_05 and si_et_05). The training set si_tr_s contains 101 speakers, each of whom recorded roughly 90 to 140 sentences, each with a duration of about 5 seconds.
The generated WSJ0-2mix data set includes a training set, a validation set and a test set. The 30 h training set and the 10 h validation set were obtained by randomly selecting two speakers from the WSJ0 training set si_tr_s (which contains 49 males and 51 females) and randomly selecting sentences from the recordings of the two speakers to mix, where the signal-to-noise ratio (SNR) of the two sentences in the mixture ranges from 0 dB to 5 dB and the SNR is also chosen at random. The 5 h test set was generated using the data in the WSJ0 validation sets si_dt_05 and si_et_05, which include 7 women and 11 men, with the same construction method as the 30 h training set. These 18 speakers in the WSJ0-2mix test set are not included in the training set, so the experiments are speaker independent.
In the experiments, all vanilla-DNN-based methods comprise 3 hidden layers with 1792 neuron nodes per hidden layer, and all bidirectional-LSTM-RNN-based methods also comprise 3 hidden layers with 896 neuron nodes per hidden layer, so that all models have a similar number of parameters. To avoid overfitting, all models apply random dropout with a dropout probability of 0.5 when the data stream passes from lower layers into higher layers of the network. When separating the mixed speech of |S| speakers, the network model has |S| output streams. In this embodiment |S| is set to 2, since the data set WSJ0-2mix used in this experiment is also generated by mixing the voices of two different speakers, so the network model has two outputs; most research today addresses separating the mixed speech of two speakers. In order to avoid the vanishing-gradient problem, the data are sequentially fed into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons.
The input to all models is the same: a 129-dimensional complex spectrum obtained by a short-time Fourier transform of the mixed speech, with a frame length of 16 ms and a frame shift of 8 ms. Specifically, the input data is a three-dimensional tensor of shape D × T × 129, where D represents the number of samples (the batch size) selected in one training step; the number of samples used for each training step is fixed, and the number of sentences included in one batch is 8. T represents the maximum number of frames among the training sentences in the batch, and 129 is the number of frequency points. The output of all models consists of |S| mask estimates, each of dimension T × 129.
For the complex ideal floating value masking (cIRM)-based method, the output of the model consists of an estimate of the real component of the cIRM and an estimate of the imaginary component, corresponding to two Bi-LSTM RNN networks that are each trained with an MSE cost function (Jr and Jc, respectively). The experiments adopt the Adam optimization algorithm to optimize the DNN and Bi-LSTM RNN models; the weight decay is set to 10^-5, and the learning rate is not fixed but adjusted according to the effect of network training. When the learning rate falls below 10^-10, the training process is automatically terminated. Furthermore, the batch size is set to 8, which means that 8 pieces of speech data are randomly selected from the data set and loaded into the model for each training step. The number of iterations is set to 100. During the experiments, the training set is used to train the model, and the validation set is only used to control the learning rate.
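The settings above can be put together in a simple training loop such as the following PyTorch sketch (the author's own; the model, loss and validation callables are placeholders assumed from the earlier sketches, and the two branch losses are summed here for brevity although the description optimizes the real-part and imaginary-part networks separately).

```python
import torch

def train(model, loader, upit_loss, val_mse_fn, epochs=100, lr=1e-3, min_lr=1e-10):
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5)  # lr follows validation MSE
    for epoch in range(epochs):
        for feats, real_labels, imag_labels in loader:        # batches of 8 utterances
            opt.zero_grad()
            real_out, imag_out = model(feats)
            loss = upit_loss(real_out, real_labels) + upit_loss(imag_out, imag_labels)
            loss.backward()
            opt.step()
        sched.step(val_mse_fn(model))                         # validation set only controls the lr
        if opt.param_groups[0]["lr"] < min_lr:
            break                                             # stop once the lr drops below 1e-10
```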
The mixed speech data set WSJ0-2mix was trained with speech separation methods taking the cIRM, the IRM and the PSM as training targets, under both the uPIT method and the conventional training method, with MSE used as the evaluation metric during training. The results show that the MSE of the conventional training method decreases slowly and, from the tenth iteration onward, remains almost constant, which is likely due to the permutation problem. With the uPIT method, the MSE converges rapidly. A large gap exists between the MSE of the cIRM model trained with the conventional method and the MSE of the pcIRM model, which demonstrates the effectiveness of the uPIT-based cIRM model (pcIRM) in solving the label permutation problem.
On the same data set, the differences in the training process of the vanilla DNN-based model and the pcIRM model when the uPIT method was employed were also compared in the experiments. On the training and validation sets, the MSE of both methods decreased rapidly and showed almost the same trend, and the MSE based on the Bi-LSTM model (i.e., pcIRM) was much smaller than that of the vanilla DNN model, indicating that the method of the present invention is more efficient than the vanilla DNN based method in processing time series context information.
The performance of speech separation algorithms is typically evaluated using three metrics: short-time objective intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and the signal-to-distortion ratio (SDR). STOI and PESQ measure the intelligibility score and the perceived speech quality score, respectively. SDR is a comprehensive metric that can evaluate the overall separation performance, so the method adopts SDR as the evaluation metric to assess its potential for improving speech separation performance.
Analysis of the experimental results reveals the following. First, under the current experimental settings, the method of the invention obtains better separation performance in the scenario of separating male-female mixed speech than in the same-gender mixed speech scenario. Meanwhile, the PSM-based Bi-LSTM RNN model obtains almost the largest SDR improvement under both the conventional training method and the uPIT method, which shows the effectiveness of phase information in improving speech separation performance. Second, compared with the conventional training method, the uPIT-based methods obtain better results under the different training targets, highlighting the advantages of the uPIT method. Furthermore, the SDR score of the Bi-LSTM-RNN-based model in this task is higher than that of the vanilla-DNN-based model, demonstrating the strong ability of the Bi-LSTM RNN to capture time-series information.
Also, as can be seen from the analysis of the results, although gender information is not explicitly used in the training process of the model, the method of the present invention achieves a better SDR improvement for the separation of mixed speech of different-gender speakers. As the number of training epochs increases, the IBM- and PSM-based approaches approach the results of oracle IRM and oracle PSM in the case of different-gender mixed speech, showing that the training effect gradually moves toward the performance limit under the current experimental data and settings. This experimental result is consistent with the conclusions of other researchers and indicates that the task of separating same-gender mixed speech remains very challenging, with large room for performance improvement, and is one of the problems worth studying now and in the future.
Furthermore, the SDR score of oracle cIRM is more than six times that of oracle IRM and almost five times that of the oracle PSM method, while pcIRM achieves better results than the PSM-based methods except in some mixed-speech separation cases. Analyzing the reasons: on the one hand, as mentioned earlier, cIRMr ∈ R and cIRMc ∈ R, so their values are unbounded. On the other hand, because the real and imaginary parts of the complex spectrum both have structural characteristics, the Y-shaped Bi-LSTM RNN was designed and implemented (the architecture of the Y-shaped neural network is shown in FIG. 2), with two independent networks optimizing the real part and the imaginary part respectively. The Bi-LSTM RNN network trained to optimize the MSE of the imaginary component does not work as well. Theoretically, the real component of the complex spectrum obtained after a short-time Fourier transform of the speech signal can be understood as a projection of the spectrum, which matches the definition of the PSM, but the corresponding interpretation of the imaginary component is difficult to grasp intuitively. Therefore, the imaginary-component model is harder to train well than the real-component model. However, compared with the cIRM-based methods in other literature, the model of the present method obtains a higher SDR improvement, showing the effectiveness of the method.
The present invention proposes a new approach to the speaker-independent mono source separation problem. A Y-shaped Bi-LSTM RNN network is designed and implemented as the training framework of the method; the real and imaginary components of the mixed-speech complex spectrum are trained and optimized separately, the amplitude and phase information of the speech signal is used effectively, and cIRM estimation is realized with sentence-level permutation invariant training, which can solve the label permutation and speaker tracking problems simultaneously. Then, on the WSJ0-2mix data set, the speech separation performance of the method is verified and compared with existing popular methods using the SDR metric; the experimental results show the effectiveness of the method, further demonstrate the importance of phase information for the speech separation task, and show that the uPIT method solves the permutation problem in the speech separation task well.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010401151.3A CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010401151.3A CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583954A CN111583954A (en) | 2020-08-25 |
CN111583954B true CN111583954B (en) | 2021-03-30 |
Family
ID=72112661
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010401151.3A Active CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583954B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899757B (en) * | 2020-09-29 | 2021-01-12 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN112435655B (en) * | 2020-10-16 | 2023-11-07 | 北京紫光青藤微系统有限公司 | Data acquisition and model training method and device for isolated word speech recognition |
CN112201276B (en) * | 2020-11-11 | 2022-04-29 | 东南大学 | Microphone array speech separation method based on TC-ResNet network |
CN114822583B (en) * | 2021-01-28 | 2024-11-22 | 中国科学院声学研究所 | A single-channel sound source separation method using kernelized auditory model |
CN113271272B (en) * | 2021-05-13 | 2022-09-13 | 侯小琪 | Single-channel time-frequency aliasing signal blind separation method based on residual error neural network |
CN113259283B (en) * | 2021-05-13 | 2022-08-26 | 侯小琪 | Single-channel time-frequency aliasing signal blind separation method based on recurrent neural network |
CN113707172B (en) * | 2021-06-02 | 2024-02-09 | 西安电子科技大学 | Single-channel voice separation method, system and computer equipment of sparse orthogonal network |
CN113288150B (en) * | 2021-06-25 | 2022-09-27 | 杭州电子科技大学 | Channel selection method based on fatigue electroencephalogram combination characteristics |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
CN113611292B (en) * | 2021-08-06 | 2023-11-10 | 思必驰科技股份有限公司 | Optimization method and system for short-time Fourier change for voice separation and recognition |
CN114446316B (en) * | 2022-01-27 | 2024-03-12 | 腾讯科技(深圳)有限公司 | Audio separation method, training method, device and equipment of audio separation model |
CN116701921B (en) * | 2023-08-08 | 2023-10-20 | 电子科技大学 | Multi-channel timing signal adaptive noise suppression circuit |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373583A (en) * | 2016-09-28 | 2017-02-01 | 北京大学 | Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
CN111091847A (en) * | 2019-12-09 | 2020-05-01 | 北京计算机技术及应用研究所 | Deep clustering voice separation method based on improvement |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105096961B (en) * | 2014-05-06 | 2019-02-01 | 华为技术有限公司 | Speech separating method and device |
US10249305B2 (en) * | 2016-05-19 | 2019-04-02 | Microsoft Technology Licensing, Llc | Permutation invariant training for talker-independent multi-talker speech separation |
WO2019104229A1 (en) * | 2017-11-22 | 2019-05-31 | Google Llc | Audio-visual speech separation |
JP6927419B2 (en) * | 2018-04-12 | 2021-08-25 | 日本電信電話株式会社 | Estimator, learning device, estimation method, learning method and program |
CN108806708A (en) * | 2018-06-13 | 2018-11-13 | 中国电子科技集团公司第三研究所 | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model |
US10699700B2 (en) * | 2018-07-31 | 2020-06-30 | Tencent Technology (Shenzhen) Company Limited | Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks |
CN110459238B (en) * | 2019-04-12 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Voice separation method, voice recognition method and related equipment |
CN110321810A (en) * | 2019-06-14 | 2019-10-11 | 华南师范大学 | Single channel signal two-way separation method, device, storage medium and processor |
CN110634502B (en) * | 2019-09-06 | 2022-02-11 | 南京邮电大学 | Single-channel speech separation algorithm based on deep neural network |
CN111128197B (en) * | 2019-12-25 | 2022-05-13 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generation confrontation learning |
CN111128209B (en) * | 2019-12-28 | 2022-05-10 | 天津大学 | Speech enhancement method based on mixed masking learning target |
- 2020-05-12: CN application CN202010401151.3A granted as patent CN111583954B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373583A (en) * | 2016-09-28 | 2017-02-01 | 北京大学 | Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
CN111091847A (en) * | 2019-12-09 | 2020-05-01 | 北京计算机技术及应用研究所 | Deep clustering voice separation method based on improvement |
Also Published As
Publication number | Publication date |
---|---|
CN111583954A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111583954B (en) | Speaker independent single-channel voice separation method | |
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
WO2021139294A1 (en) | Method and apparatus for training speech separation model, storage medium, and computer device | |
CN112735456B (en) | Speech enhancement method based on DNN-CLSTM network | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
Shi et al. | Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation. | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN108962229B (en) | A single-channel, unsupervised method for target speaker speech extraction | |
Geng et al. | End-to-end speech enhancement based on discrete cosine transform | |
Le et al. | Inference skipping for more efficient real-time speech enhancement with parallel RNNs | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN113539293B (en) | Single-channel voice separation method based on convolutional neural network and joint optimization | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
Wichern et al. | Low-Latency approximation of bidirectional recurrent networks for speech denoising. | |
Hou et al. | Multi-task learning for end-to-end noise-robust bandwidth extension | |
Fan et al. | Real-time single-channel speech enhancement based on causal attention mechanism | |
Girirajan et al. | Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network. | |
CN118398033A (en) | A speech-based emotion recognition method, system, device and storage medium | |
Sofer et al. | CNN self-attention voice activity detector | |
Cheng et al. | DNN-based speech enhancement with self-attention on feature dimension | |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
Wang et al. | Cross-domain diffusion based speech enhancement for very noisy speech | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
CN118248159A (en) | A joint training method for speech enhancement model based on frequency subband |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |