CN111583954B - Speaker independent single-channel voice separation method - Google Patents
Speaker independent single-channel voice separation method
- Publication number
- CN111583954B (application CN202010401151.3A)
- Authority
- CN
- China
- Prior art keywords
- training
- real
- speech
- model
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a speaker-independent single-channel speech separation method, which comprises the following steps: preparing a data set and carrying out data preprocessing; establishing a single-channel speech separation model based on complex ideal floating value masking; adopting sentence-level permutation invariance training when training the single-channel speech separation model; and inputting the mixed speech data into the trained model for speech separation. The method effectively and accurately estimates the complex ideal floating value mask through sentence-level permutation invariance training, adopts a bidirectional long short-term memory neural network structure to estimate the complex ideal floating value mask, and further solves the label ambiguity problem by using the sentence-level permutation invariance training criterion, so that single-channel speech separation achieves a better effect.
Description
Technical Field
The invention belongs to the technical field of intelligent voice processing, and particularly relates to speaker independent single-channel voice separation based on sentence-level permutation invariance training and complex ideal floating value masking.
Background
The objective of the speech source separation task is to extract a plurality of speech source signals from a mixed speech signal containing two or more speech sources, one for each speaker. In general, the speech separation problem can be divided into mono (i.e., single channel) and array-based (i.e., multi-channel) source separation problems, depending on the number of microphones or channels. For the former problem, the mainstream research method is to extract the target voice or remove the interference signal from the mixed signal based on the acoustic characteristics and statistical characteristics of the target voice and the interference signal. In the multi-channel speech separation problem, spatial information is available in addition to the acoustic and statistical properties of the signal. The mono speech separation problem remains very challenging because only one speech recording is available and the spatial information that can be extracted is very limited.
Since the 1990s, researchers have developed many approaches to the monophonic speech separation problem. Before the deep learning era, classical single-channel speech separation methods could be divided into three categories: model-based methods, Blind Source Separation (BSS) methods and Computational Auditory Scene Analysis (CASA) methods. However, these methods are of limited effectiveness when processing sound sources in multi-source mixed speech captured in real environments, because of the numerous difficulties involved, including the wide variety of noise in mixed speech, low signal-to-noise environments, and limited computational resources. Therefore, in a real environment it is difficult to consistently obtain a high-quality target speech signal with the above methods.
Recently, researchers have used regression models in Deep Neural Networks (DNN) to solve the source separation problem and, particularly for the mono case, have achieved very good performance gains. Depending on the training objective, DNN-based mono source separation methods can be divided into three categories, namely masking-based methods, mapping-based methods and Signal Approximation (SA)-based methods. In comparison, masking-based methods can be trained to yield more accurate neural network models than mapping-based methods.
The first masking-based training target applied in supervised speech separation approaches was Ideal Binary Masking (IBM), which was inspired by the auditory masking effect and the exclusive allocation principle in auditory scene analysis. Many researchers have used IBM as a training target and obtained good speech separation results. However, because the IBM takes only the values 0 or 1 in each time-frequency (T-F) unit, this hard decision is not flexible enough, and the speech signals separated by IBM-based methods are distorted. For this reason, researchers proposed ideal floating value masking (IRM, the ideal ratio mask) to improve on IBM, setting the value of each T-F unit to the ratio of the energy of the target sound source to the energy of the mixed speech. The target speech signal separated with an IRM-based method is generally of better quality than with IBM.
Although these DNN-based methods achieve good performance, both IBM and IRM use only the amplitude information of the target signal when separating and synthesizing clean speech signals, since earlier studies assumed the phase spectrum was not important for speech separation. However, recent studies by Erdogan et al. have found that phase information is beneficial for predicting accurate masks and signal estimates, and they proposed a Phase Sensitive Masking (PSM) based approach that is significantly better than IBM and IRM. In addition, Williamson et al. estimate the complex ideal floating value mask (cIRM) using both the magnitude and the phase spectral information in the complex domain.
In the speech separation task, if the target speaker and the interfering speakers are the same in the training data and the test data, the task is speaker-dependent speech separation; if the target speaker is fixed but the interferers are allowed to change, it is called target-dependent speech separation. Similarly, if the speakers are not required to be the same between the training data and the test data, it is called speaker-independent speech separation, which is the least constrained case. The label ambiguity (or permutation) problem has been the main cause of the poor performance of speaker-independent speech separation algorithms in prior studies. In a speaker-independent scenario, the speech separation model has multiple outputs, where each output represents one sound source. When several speakers talk at the same time and their voices overlap, how to assign the separated speech components to each sound source becomes a troublesome problem. Researchers have proposed permutation invariant training (PIT) models to solve this problem and obtained good results.
Disclosure of Invention
In view of the above, the present invention aims to provide a speaker-independent single-channel speech separation method based on sentence-level permutation invariance training (uPIT) and complex ideal floating value masking (cIRM), which effectively and accurately implements cIRM estimation through sentence-level permutation invariance training (uPIT). Specifically, the speaker-independent single-channel speech separation method employs a bidirectional Long Short-Term Memory neural network (Bi-LSTM RNN) structure to estimate the complex ideal floating value mask cIRM, and further uses the sentence-level permutation invariance training (uPIT) criterion to solve the label ambiguity problem.
To achieve this purpose, the invention adopts the following technical scheme: a speaker-independent single-channel speech separation method, comprising the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a single-channel speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the single-channel speech separation model;
and step 4, inputting the mixed speech into the trained model to perform speech separation.
Specifically, the data set in step 1 is the WSJ0-2mix data set, which includes a training set, a validation set and a test set. Two speakers are randomly selected from the WSJ0 training set si_tr_s, sentences are randomly selected from the recordings of the two speakers and mixed, the signal-to-noise ratio of the two sentences during mixing ranges from 0 dB to 5 dB with the specific value chosen at random, and all speech data are preprocessed by a short-time Fourier transform to obtain 129-dimensional complex spectra.
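As an illustration of this preprocessing step, the following sketch (the author's own, not code from the patent) mixes two utterances at a random SNR between 0 dB and 5 dB and computes the 129-dimensional complex spectrum; a 16 ms frame at a 16 kHz sampling rate corresponds to 256 samples, giving 256/2 + 1 = 129 frequency bins. The function names and the use of SciPy are assumptions made for this example.

```python
import numpy as np
from scipy.signal import stft

def mix_at_random_snr(s1, s2, low_db=0.0, high_db=5.0, rng=np.random):
    """Scale s2 so that the s1-to-s2 power ratio equals a random SNR in [low_db, high_db]."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    snr_db = rng.uniform(low_db, high_db)
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    s2 = s2 * np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    return s1, s2, s1 + s2

def complex_spectrum(x, fs=16000):
    """129-dimensional complex spectrum: 16 ms frames (256 samples) with an 8 ms shift."""
    _, _, Z = stft(x, fs=fs, nperseg=256, noverlap=128)
    return Z.T  # shape (T, 129), complex-valued
```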
Specifically, the single-channel speech separation model takes a Y-shaped bidirectional long short-term memory recurrent neural network as the framework model and comprises 3 hidden layers, each with 896 neuron nodes. When the data stream is passed from a lower layer to a higher layer of the network, the model applies random dropout with a dropout probability of 0.5. When separating the mixed speech of |S| speakers, the network model has |S| output streams. To avoid the vanishing-gradient problem, the data are fed in turn into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons. The input of the model is a three-dimensional tensor of shape D×T×129, where D is the number of samples selected in one training batch (fixed for every batch), T is the maximum number of frames among the training sentences in the batch, and 129 is the number of frequency points, i.e. the 129-dimensional complex spectrum obtained by a short-time Fourier transform of the speech data with a frame length of 16 ms and a frame shift of 8 ms. The output of the model consists of |S| mask estimates, each of dimension T×129.
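A minimal PyTorch sketch of one possible realization of this architecture is given below, assuming the Y shape consists of two parallel Bi-LSTM branches fed by the same mixture features (one branch for the real-part masks, one for the imaginary-part masks, as described in the next paragraph). The class names and the exact wiring of the branches are the author's assumptions; the patent's FIG. 2 is authoritative.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """One branch: 3-layer Bi-LSTM (896 units, dropout 0.5) followed by the
    |S|x1792 linear + ReLU layers and an output layer producing |S| masks of size 129."""
    def __init__(self, n_freq=129, hidden=896, n_layers=3, n_src=2, dropout=0.5):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_freq, hidden_size=hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True, dropout=dropout)
        self.linear = nn.Linear(2 * hidden, n_src * 1792)
        self.relu = nn.ReLU()
        self.out = nn.Linear(n_src * 1792, n_src * n_freq)
        self.n_src, self.n_freq = n_src, n_freq

    def forward(self, x):                # x: (D, T, 129) input features
        h, _ = self.blstm(x)             # (D, T, 2*896)
        h = self.relu(self.linear(h))    # (D, T, |S|*1792)
        m = self.out(h)                  # (D, T, |S|*129)
        return m.view(x.size(0), x.size(1), self.n_src, self.n_freq)

class YShapedSeparator(nn.Module):
    """Two branches on the same input: one predicts the real-part masks,
    the other the imaginary-part masks (optimized separately per the description)."""
    def __init__(self, n_src=2):
        super().__init__()
        self.real_branch = MaskBranch(n_src=n_src)
        self.imag_branch = MaskBranch(n_src=n_src)

    def forward(self, mix_features):
        return self.real_branch(mix_features), self.imag_branch(mix_features)
```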
Specifically, in the training process of the model in step 3, the training target is the complex ideal floating value mask, which comprises a real part and an imaginary part; the bidirectional long short-term memory recurrent neural network therefore has two outputs, one for predicting the real component and the other for predicting the imaginary component, and the two networks predicting the real and imaginary components are optimized separately.
In the training phase, the clean source speech and the mixed speech are subjected to a short-time Fourier transform, and the real and imaginary parts of the transformed speech sources are respectively used to calculate the compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the time-frequency mask estimates are optimized by minimizing the mean square error between the label values and the network outputs; after multiple iterations, training stops when the mean square error falls within a preset range or another stopping condition is triggered, and the parameters of the bidirectional long short-term memory recurrent neural network at that moment are saved for use in the testing stage;
in the model testing stage, the short-time Fourier transform of the mixed speech is likewise computed and used as the input of the network model obtained in the training stage. The two output values of the network model are restored with the inverse function to obtain, respectively, the estimates of the real-part mask and the imaginary-part mask of the target source speech; the real and imaginary parts of the estimated signal are obtained by multiplying these mask estimates with the short-time Fourier transform values of the mixed speech, and the signal is then reconstructed with the inverse Fourier transform to obtain the separated speech signal.
Specifically, the real part of the complex ideal floating value mask is represented as:

cIRMr = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2)

the imaginary part is represented as:

cIRMc = (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2)

thus, the complex ideal floating value mask is represented as:

cIRM = cIRMr + j*cIRMc = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) + j*(Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2)

wherein Yr and Yc are the real and imaginary parts of the mixed speech signal after the short-time Fourier transform, and Sr and Sc are respectively the real and imaginary parts of the clean source speech signal after the short-time Fourier transform; Yr, Yc, Sr and Sc take values over the whole set of real numbers, so cIRMr and cIRMc are unbounded.

The compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c are uniformly expressed as

cIRM'x = K * (1 - e^(-C*cIRMx)) / (1 + e^(-C*cIRMx))

wherein x is r or c, denoting the real part or the imaginary part; the compression operation limits the mask value to [-K, K], K is a preset value, and the parameter C controls the steepness of the compression;

the inverse function is expressed as:

cIRMx = -(1/C) * ln( (K - Ox) / (K + Ox) )

wherein cIRMx represents the estimate of the uncompressed mask and Ox is the output of the deep neural network model.
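The following NumPy sketch (the author's own illustration, not text from the patent) computes the cIRM from the mixture and clean-source STFTs and implements the compression and its inverse exactly as written above; the default K = 10 and C = 0.05 follow the values reported later in the description, and the small epsilon guarding the denominator is an added assumption.

```python
import numpy as np

def cirm(Y, S, eps=1e-12):
    """Complex ideal floating value mask from mixture STFT Y and clean-source STFT S."""
    Yr, Yc, Sr, Sc = Y.real, Y.imag, S.real, S.imag
    denom = Yr ** 2 + Yc ** 2 + eps        # eps avoids division by zero in silent T-F units
    cirm_r = (Yr * Sr + Yc * Sc) / denom
    cirm_c = (Yr * Sc - Yc * Sr) / denom
    return cirm_r, cirm_c

def compress(m, K=10.0, C=0.05):
    """Hyperbolic-tangent compression restricting the mask to (-K, K)."""
    return K * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

def uncompress(o, K=10.0, C=0.05):
    """Inverse function recovering the uncompressed mask from a compressed output o."""
    return -(1.0 / C) * np.log((K - o) / (K + o))
```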
Preferably, when the sentence-level permutation invariance training model is adopted, the real-part cost function is defined as:

Jr = (1/B) * Σ(i=1..|S|) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ*(i)(t,f) )^2

where B is the total number of time-frequency units over all sound sources, T is the total number of sentence frames of all sound sources, N is the window length or frame length, and |S| represents the number of sound sources; the signals are analyzed on units of the time-frequency (T-F) domain obtained after the short-time Fourier transform of the speech signal, where t is the time index and f is the frequency index. ĉIRMr,i denotes the i-th output stream of the network in the training phase (|S| in total), i.e. the estimate of the real component of the i-th ideal floating value mask, and cIRM'r,φ*(i) denotes the label value of the ideal-floating-value-mask real component assigned to the i-th estimate under the label permutation φ* that minimizes the cost of sentence-level speech separation, which is defined as:

φ* = argmin(φ∈P) Σ(i=1..|S|) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ(i)(t,f) )^2

wherein |S| represents the number of sound sources, P is the symmetric group of degree |S|, i.e. the set containing all |S|! permutations, φ represents one of these permutations, and cIRM'r,φ(i) represents the label value assigned to the i-th output under permutation φ. Similarly, with the arrangement of the real-component label values fixed, the imaginary-part cost function Jc is trained in the same way as the real-component part.
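To make the permutation search concrete, the sketch below (an illustrative implementation by the author, not the patent's code) enumerates all |S|! label assignments and returns the one with the smallest mean square error; per the criterion above, the permutation found for the real components is then reused for the imaginary components.

```python
import itertools
import numpy as np

def upit_mse(est_masks, label_masks):
    """est_masks, label_masks: arrays of shape (S, T, F).
    Returns (minimum MSE, best label permutation)."""
    n_src = est_masks.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        loss = np.mean((est_masks - label_masks[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```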
The method designs and implements a Y-shaped bidirectional long short-term memory neural network (Bi-LSTM RNN) as the model architecture and uses the complex ideal floating value mask cIRM as the model training target, making full use of the amplitude and phase information of the speech signal, so that a more accurate estimation result can be obtained. The label ambiguity problem of speaker-independent speech separation is solved by sentence-level permutation invariance training, and this work combines complex ideal floating value masking (cIRM) and the sentence-level permutation invariance (uPIT) method into an integral model for the first time, so that the separation of speaker-independent single-channel speech achieves a better effect.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a block diagram of a Y-shaped bi-directional long short term memory neural network of the present invention;
FIG. 3 is a block diagram of a model for single-channel speech separation of two speakers according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When using a DNN regression model to solve the source separation problem, the mainstream masking-based training objectives include the ideal floating value mask IRM, the phase sensitive mask PSM and the complex ideal floating value mask cIRM. These methods are briefly described below.
(1) Ideal float masking (IRM Ideal Ratio Mask)
The speech signal is sampled at a certain frequency; at discrete time m, the target speech signal, the interfering signal and the mixed speech signal sequence may be represented as s(m), i(m) and y(m) = s(m) + i(m), respectively. After a Short Time Fourier Transform (STFT) they can be expressed as S(t,f), I(t,f) and Y(t,f) = S(t,f) + I(t,f), respectively, where f is the frequency index and t is the time-frame index. Given Y(t,f), the goal of monophonic speech separation is to recover S(t,f) for each target sound source. With an ideal time-frequency (T-F) mask M(t,f), the spectrum of the target speech can be reconstructed as follows:
S(t,f)=Y(t,f)*M(t,f) (1)
where "*" denotes complex multiplication. The masking value M(t,f) at time frame t and frequency f may be expressed as:

M(t,f) = ( |S(t,f)|^2 / (|S(t,f)|^2 + |I(t,f)|^2) )^β (2)

where β is an adjustment parameter used to scale the masking value, and |S(t,f)| and |I(t,f)| represent the magnitude spectra of the target speech signal and of the interference signal, respectively. In addition, |S(t,f)|^2 and |I(t,f)|^2 represent the target speech power spectrum and the interference signal power spectrum within the T-F unit. Typically, β is chosen to be 0.5.
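A one-line NumPy illustration of equation (2) with β = 0.5 (the author's sketch; the small epsilon is an added safeguard for empty T-F units):

```python
import numpy as np

def irm(S, I, beta=0.5, eps=1e-12):
    """IRM from the complex STFTs of the target S and the interference I."""
    ps, pi = np.abs(S) ** 2, np.abs(I) ** 2
    return (ps / (ps + pi + eps)) ** beta
```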
Obviously, in the process of calculating the IRM, only the amplitude information is utilized, and the phase information of the target speech signal is ignored in the speech reconstruction. To compensate for the lack of IRM, researchers have proposed PSM and crirm.
(2) Phase Sensitive Mask (PSM, Phase-Sensitive Mask)
In polar coordinates, the STFT of a speech signal can be defined as equation (3):

S(t,f) = |S(t,f)| * e^(jθS(t,f)) (3)

where |S(t,f)| represents the amplitude response and θS(t,f) the phase response of the speech signal at time t and frequency f; this representation is typically used when performing enhancement or separation operations on noisy speech after a short-time Fourier transform. In polar coordinates the PSM, which is an IRM extended with a phase measure, becomes easy to understand:

PSM(t,f) = ( |S(t,f)| / |Y(t,f)| ) * cos(∠Y - ∠S) (4)

where ∠Y and ∠S respectively denote the mixed-speech phase and the target-speech phase in the T-F unit. Including the phase difference between the mixed speech and the target speech in the PSM results in a higher SNR and produces a better estimate of the target speech than the IRM. Clearly, |cos(∠Y - ∠S)| takes values in (0, 1), and cos(∠Y - ∠S) may take a negative value.
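For comparison, an equally small sketch of equation (4) (again the author's illustration, with an epsilon added to guard against zero-magnitude mixture bins):

```python
import numpy as np

def psm(S, Y, eps=1e-12):
    """PSM from the complex STFTs of the target S and the mixture Y."""
    return (np.abs(S) / (np.abs(Y) + eps)) * np.cos(np.angle(Y) - np.angle(S))
```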
(3) Complex ideal floating value masking (cIRM, Complex Ideal Ratio Mask)
The cIRM is a complex-valued time-frequency mask that is computed using the real and imaginary parts of the target and mixed speech signals after a short-time Fourier transform. The short-time Fourier transform results and the cIRM of the mixed speech and the clean signal are defined as follows:
Y(t,f)=Yr(t,f)+jYc(t,f) (5)
S(t,f)=Sr(t,f)+jSc(t,f) (6)
cIRM(t,f)=cIRMr(t,f)+jcIRMc(t,f) (7)
where the subscripts r and c respectively denote the real and imaginary parts. For convenience, the frequency index f and the time-frame index t are omitted below, although Y, S and cIRM are still defined for each time-frequency unit. Thus, in the complex domain, equation (1) can be further rewritten as:
Sr+jSc=(Yr+jYc)*(cIRMr+jcIRMc) (8)
Sr=cIRMr*Yr-cIRMc*Yc (9)
Sc=cIRMr*Yc+cIRMc*Yr (10)
Using equations (9) and (10), the real and imaginary parts of the cIRM can be derived:

cIRMr = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) (11)

cIRMc = (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2) (12)

Therefore, we can obtain the defining formula of the cIRM as:

cIRM = (Yr*Sr + Yc*Sc) / (Yr^2 + Yc^2) + j * (Yr*Sc - Yc*Sr) / (Yr^2 + Yc^2) (13)

It is noted that Yr, Yc, Sr and Sc range over the whole set of real numbers, which means that cIRMr and cIRMc are unbounded, whereas the range of the IRM mentioned before is [0, 1], which is very advantageous for DNN-model-based training. Thus, the cIRM is compressed using the following hyperbolic tangent function:

cIRM'x = K * (1 - e^(-C*cIRMx)) / (1 + e^(-C*cIRMx)) (14)
where x is r or c, representing the real or imaginary part; the compression operation limits the masking value to [-K, K], and the parameter C controls its steepness. In experiments, several sets of K and C values were evaluated, and the DNN-based sound source separation model performed best when K was 10 and C was 0.05. In the training phase, the training labels are compressed cIRMs, and the model output values are also compressed values. Similarly, in the testing phase, where the DNN output is the compressed estimate of the mask rather than the original mask, we use the following inverse function to recover the estimate of the uncompressed mask:

cIRMx = -(1/C) * ln( (K - Ox) / (K + Ox) ) (15)

where cIRMx represents the estimate of the uncompressed mask and Ox is the DNN output.
According to Lee's research, it is found that it is difficult to directly and accurately estimate the phase without a clear structure. Therefore, it is difficult to reconstruct accurate speech by separately estimating the amplitude and phase. Theoretically, accurate estimates of the imaginary and real parts, including amplitude and phase information, can be obtained by estimating the cIRM, which is superior to PSM in more accurately estimating the source speech.
Thus, as shown in fig. 1, a speaker independent single-channel speech separation method includes the following steps:
step 1, preparing a data set and carrying out data preprocessing;
step 2, establishing a single-channel speech separation model based on complex ideal floating value masking;
step 3, adopting sentence-level permutation invariance training when training the single-channel speech separation model;
and step 4, inputting the mixed speech into the trained model to perform speech separation.
In fact, various network architectures are very efficient in processing speech signals, and DNN or RNN based approaches have been widely used to solve the mono speech separation problem. In particular, LSTM RNN networks, which operate on a statement frame by frame, can effectively utilize historical information in the time sequence, often used to process time sequence related speech data. In addition, relevant research has shown that the LSTM RNN network can improve the generalization ability of speech separation methods to speakers. If a Bi-directional LSTM (called Bi-LSTM) RNN network is used, the past and future information with respect to a certain frame is stacked and passed to the next layer throughout the speech sentence, and the performance is superior to the uni-directional LSTM RNN network when processing of time series is involved. Therefore, the invention adopts a bidirectional long-short term memory neural network (Bi-LSTM RNN) as a network framework model.
Since the training target is a complex ideal floating value mask, which contains real and imaginary components, the Bi-LSTM RNN network has two outputs, one for predicting the real component and the other for predicting the imaginary component. The invention designs and realizes a Y-shaped neural network architecture to obtain the training target, as shown in figure 2. Where the input features are the STFT spectrum of the mixed speech, the two networks that predict the real and imaginary components are optimized separately. In contrast, the output of the IRM-based and PSM-based Bi-LSTM RNN models are both single outputs.
The example given in this embodiment is two-speaker speech separation based on complex ideal floating value masking with the cIRM target. When using the sentence-level permutation invariance training (uPIT) method, the present invention takes the Mean Square Error (MSE) between the predicted value output by the Bi-LSTM RNN network in the uPIT module and the compressed target mask (i.e. the label value) of the clean speech signal as the cost function. Thus, the real-part cost function of the pcIRM-based approach can be defined as:

Jr = (1/B) * Σ(i=1..S) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ*(i)(t,f) )^2 (16)

where B is the total number of time-frequency units over all sound sources, T is the total number of sentence frames for all sound sources, and N is the window length (or frame length); ĉIRMr,i is the i-th real-component mask estimate output by the network and cIRM'r,φ*(i) is the corresponding label under the permutation φ* that minimizes the cost value of sentence-level speech separation, which can be defined as:

φ* = argmin(φ∈P) Σ(i=1..S) Σ(t,f) ( ĉIRMr,i(t,f) - cIRM'r,φ(i)(t,f) )^2 (17)

Note that S represents the number of sound sources, and P in equation (17) is the symmetric group of degree S, i.e. the set containing all S! permutations, with φ representing one of these permutations. Similarly, the training processes of the imaginary component ĉIRMc of the predicted cIRM and of the imaginary cost function Jc are the same as for the real-component part. Likewise, the cost functions of the uPIT-based IRM model and the uPIT-based PSM model may be defined by equations (18) and (19), respectively.
For masking-based methods that do not use uPIT, the order of the target sound sources is fixed, so there is only one pairing between the estimated speech and the target speech; the cost function has the same form as in uPIT but does not involve the search for the minimum cost value.
As shown in FIG. 3, which depicts the model structure for single-channel speech separation of two speakers, in the training stage the clean source speech and the mixed speech are subjected to a short-time Fourier transform, and the real and imaginary parts of the transformed speech sources are respectively used to calculate the compressed real-part mask cIRM'r and the compressed imaginary-part mask cIRM'c, which serve as the real-part and imaginary-part training labels of the bidirectional long short-term memory recurrent neural network. At each iteration, the time-frequency mask estimates are optimized by minimizing the mean square error between the label values and the network outputs; after multiple iterations, training stops when the mean square error falls within a preset range or another stopping condition is triggered, and the parameters of the bidirectional long short-term memory recurrent neural network at that moment are saved for use in the testing stage;
in the testing phase, the short-time Fourier transform results of the mixed speech are also obtained and then used as inputs to the network models Bi-LSTM RNN1 and Bi-LSTM RNN2 obtained in the training phase. The output values of the two networks are subjected to a restoration process using equation (15), thereby obtaining estimated values of the real part mask and the imaginary part mask of the target source speech, respectively. The real and imaginary parts of the estimated signal are obtained by multiplying the real and imaginary masked estimates by the STFT value of the mixed speech. Then, the signal is reconstructed by using the inverse Fourier transform to obtain a separated voice signal.
The invention uses the WSJ0-2mix data set to evaluate the single-channel speech separation model; the sampling frequency is 16 kHz, and each speech signal is converted by a short-time Fourier transform into a 129-dimensional complex spectrum used as input. The WSJ0-2mix dataset was derived from the WSJ0 corpus. The WSJ0 corpus includes a training set (si_tr_s) and two validation sets (si_dt_05 and si_et_05). The training set si_tr_s contains 101 speakers, each of whom recorded roughly 90 to 140 sentences, each with a duration of about 5 seconds.
The generated WSJ0-2mix data set includes a training set, a validation set and a test set. The 30 h training set and the 10 h validation set were obtained by randomly selecting two speakers from the WSJ0 training set si_tr_s (which contains 49 males and 51 females) and randomly selecting sentences from the recordings of the two speakers to mix, where the signal-to-noise ratio (SNR) of the two sentences in the mixture ranges from 0 dB to 5 dB and the SNR is also chosen at random. The 5 h test set was generated using the data in the WSJ0 validation sets si_dt_05 and si_et_05, which include 7 women and 11 men, with the same construction method as the 30 h training set. These 18 speakers in the WSJ0-2mix test set are not included in the training set, so the experiments are speaker independent.
In the experiments, all vanilla-DNN-based methods comprise 3 hidden layers with 1792 neuron nodes per hidden layer, and all bidirectional-LSTM-RNN-based methods also comprise 3 hidden layers with 896 neuron nodes per hidden layer, so that all models have a similar number of parameters. To avoid overfitting, all models apply random dropout with a dropout probability of 0.5 when the data stream passes from lower layers into higher layers of the network. When separating the mixed speech of |S| speakers, the network model has |S| output streams. In this embodiment |S| is set to 2, since the data set WSJ0-2mix used in this experiment is also generated by mixing the voices of two different speakers, so the network model has two outputs; most research today addresses separating the mixed speech of two speakers. In order to avoid the vanishing-gradient problem, the data are sequentially fed into a linear layer with |S|×1792 neurons and a ReLU layer with |S|×1792 neurons.
The input to all models is the same: a 129-dimensional complex spectrum obtained by a short-time Fourier transform of the mixed speech, with a frame length of 16 ms and a frame shift of 8 ms. Specifically, the input data is a three-dimensional tensor of shape D × T × 129, where D represents the number of samples (the batch size) selected in one training step; the number of samples used for each training step is fixed, and the number of sentences included in one batch is 8. T represents the maximum number of frames among the training sentences in the batch, and 129 is the number of frequency points. The output of all models consists of |S| mask estimates, each of dimension T × 129.
For the complex ideal floating value masking (cIRM)-based method, the output of the model consists of an estimate of the real component of the cIRM and an estimate of the imaginary component, corresponding to two Bi-LSTM RNN networks that are each trained with an MSE cost function (Jr and Jc, respectively). The experiments adopt the Adam optimization algorithm to optimize the DNN and Bi-LSTM RNN models; the weight decay is set to 10^-5, and the learning rate is not fixed but adjusted according to the effect of network training. When the learning rate falls below 10^-10, the training process is automatically terminated. Furthermore, the batch size is set to 8, which means that 8 pieces of speech data are randomly selected from the data set and loaded into the model for each training step. The number of iterations is set to 100. During the experiments, the training set is used to train the model, and the validation set is only used to control the learning rate.
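The settings above can be put together in a simple training loop such as the following PyTorch sketch (the author's own; the model, loss and validation callables are placeholders assumed from the earlier sketches, and the two branch losses are summed here for brevity although the description optimizes the real-part and imaginary-part networks separately).

```python
import torch

def train(model, loader, upit_loss, val_mse_fn, epochs=100, lr=1e-3, min_lr=1e-10):
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5)  # lr follows validation MSE
    for epoch in range(epochs):
        for feats, real_labels, imag_labels in loader:        # batches of 8 utterances
            opt.zero_grad()
            real_out, imag_out = model(feats)
            loss = upit_loss(real_out, real_labels) + upit_loss(imag_out, imag_labels)
            loss.backward()
            opt.step()
        sched.step(val_mse_fn(model))                         # validation set only controls the lr
        if opt.param_groups[0]["lr"] < min_lr:
            break                                             # stop once the lr drops below 1e-10
```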
The mixed speech data set WSJ0-2mix was trained with speech separation methods taking the cIRM, the IRM and the PSM as training targets, under both the uPIT method and the conventional training method, with MSE used as the evaluation metric during training. The results show that the MSE of the conventional training method decreases slowly and, from the tenth iteration onward, remains almost constant, which is likely due to the permutation problem. With the uPIT method, the MSE converges rapidly. A large gap exists between the MSE of the cIRM model trained with the conventional method and the MSE of the pcIRM model, which demonstrates the effectiveness of the uPIT-based cIRM model (pcIRM) in solving the label permutation problem.
On the same data set, the differences in the training process of the vanilla DNN-based model and the pcIRM model when the uPIT method was employed were also compared in the experiments. On the training and validation sets, the MSE of both methods decreased rapidly and showed almost the same trend, and the MSE based on the Bi-LSTM model (i.e., pcIRM) was much smaller than that of the vanilla DNN model, indicating that the method of the present invention is more efficient than the vanilla DNN based method in processing time series context information.
The performance of speech separation algorithms is typically evaluated using three metrics: short-time objective intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and the signal-to-distortion ratio (SDR). STOI and PESQ measure the intelligibility score and the perceived speech quality score, respectively. SDR is a comprehensive metric that can evaluate the overall separation performance, so the method adopts SDR as the evaluation metric to assess its potential for improving speech separation performance.
Analysis of the experimental results reveals the following. First, under the current experimental settings, the method of the invention obtains better separation performance in the scenario of separating male-female mixed speech than in the same-gender mixed speech scenario. Meanwhile, the PSM-based Bi-LSTM RNN model obtains almost the largest SDR improvement under both the conventional training method and the uPIT method, which shows the effectiveness of phase information in improving speech separation performance. Second, compared with the conventional training method, the uPIT-based methods obtain better results under the different training targets, highlighting the advantages of the uPIT method. Furthermore, the SDR score of the Bi-LSTM-RNN-based model in this task is higher than that of the vanilla-DNN-based model, demonstrating the strong ability of the Bi-LSTM RNN to capture time-series information.
Also, as can be seen from the analysis of the results, although gender information is not explicitly used in the training process of the model, the method of the present invention achieves a better SDR improvement for the separation of mixed speech of different-gender speakers. As the number of training epochs increases, the IBM- and PSM-based approaches approach the results of oracle IRM and oracle PSM in the case of different-gender mixed speech, showing that the training effect gradually moves toward the performance limit under the current experimental data and settings. This experimental result is consistent with the conclusions of other researchers and indicates that the task of separating same-gender mixed speech remains very challenging, with large room for performance improvement, and is one of the problems worth studying now and in the future.
Furthermore, the SDR score of oracle cIRM is more than six times that of oracle IRM and almost five times that of the oracle PSM method, while pcIRM achieves better results than the PSM-based methods except in some mixed-speech separation cases. Analyzing the reasons: on the one hand, as mentioned earlier, cIRMr ∈ R and cIRMc ∈ R, so their values are unbounded. On the other hand, because the real and imaginary parts of the complex spectrum both have structural characteristics, the Y-shaped Bi-LSTM RNN was designed and implemented (the architecture of the Y-shaped neural network is shown in FIG. 2), with two independent networks optimizing the real part and the imaginary part respectively. The Bi-LSTM RNN network trained to optimize the MSE of the imaginary component does not work as well. Theoretically, the real component of the complex spectrum obtained after a short-time Fourier transform of the speech signal can be understood as a projection of the spectrum, which matches the definition of the PSM, but the corresponding interpretation of the imaginary component is difficult to grasp intuitively. Therefore, the imaginary-component model is harder to train well than the real-component model. However, compared with the cIRM-based methods in other literature, the model of the present method obtains a higher SDR improvement, showing the effectiveness of the method.
The present invention proposes a new approach to the speaker-independent mono source separation problem. A Y-shaped Bi-LSTM RNN network is designed and implemented as the training framework of the method; the real and imaginary components of the mixed-speech complex spectrum are trained and optimized separately, the amplitude and phase information of the speech signal is used effectively, and cIRM estimation is realized with sentence-level permutation invariant training, which can solve the label permutation and speaker tracking problems simultaneously. Then, on the WSJ0-2mix data set, the speech separation performance of the method is verified and compared with existing popular methods using the SDR metric; the experimental results show the effectiveness of the method, further demonstrate the importance of phase information for the speech separation task, and show that the uPIT method solves the permutation problem in the speech separation task well.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010401151.3A CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010401151.3A CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583954A CN111583954A (en) | 2020-08-25 |
CN111583954B true CN111583954B (en) | 2021-03-30 |
Family
ID=72112661
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010401151.3A Active CN111583954B (en) | 2020-05-12 | 2020-05-12 | Speaker independent single-channel voice separation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583954B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111899757B (en) * | 2020-09-29 | 2021-01-12 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN112435655B (en) * | 2020-10-16 | 2023-11-07 | 北京紫光青藤微系统有限公司 | Data acquisition and model training method and device for isolated word speech recognition |
CN112201276B (en) * | 2020-11-11 | 2022-04-29 | 东南大学 | Microphone array speech separation method based on TC-ResNet network |
CN114822583B (en) * | 2021-01-28 | 2024-11-22 | 中国科学院声学研究所 | A single-channel sound source separation method using kernelized auditory model |
CN113271272B (en) * | 2021-05-13 | 2022-09-13 | 侯小琪 | Single-channel time-frequency aliasing signal blind separation method based on residual error neural network |
CN113259283B (en) * | 2021-05-13 | 2022-08-26 | 侯小琪 | Single-channel time-frequency aliasing signal blind separation method based on recurrent neural network |
CN113707172B (en) * | 2021-06-02 | 2024-02-09 | 西安电子科技大学 | Single-channel voice separation method, system and computer equipment of sparse orthogonal network |
CN113288150B (en) * | 2021-06-25 | 2022-09-27 | 杭州电子科技大学 | Channel selection method based on fatigue electroencephalogram combination characteristics |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
CN113611292B (en) * | 2021-08-06 | 2023-11-10 | 思必驰科技股份有限公司 | Optimization method and system for short-time Fourier change for voice separation and recognition |
CN114446316B (en) * | 2022-01-27 | 2024-03-12 | 腾讯科技(深圳)有限公司 | Audio separation method, training method, device and equipment of audio separation model |
CN116701921B (en) * | 2023-08-08 | 2023-10-20 | 电子科技大学 | Multi-channel timing signal adaptive noise suppression circuit |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373583A (en) * | 2016-09-28 | 2017-02-01 | 北京大学 | Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
CN111091847A (en) * | 2019-12-09 | 2020-05-01 | 北京计算机技术及应用研究所 | Deep clustering voice separation method based on improvement |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105096961B (en) * | 2014-05-06 | 2019-02-01 | 华为技术有限公司 | Speech separating method and device |
US10249305B2 (en) * | 2016-05-19 | 2019-04-02 | Microsoft Technology Licensing, Llc | Permutation invariant training for talker-independent multi-talker speech separation |
WO2019104229A1 (en) * | 2017-11-22 | 2019-05-31 | Google Llc | Audio-visual speech separation |
JP6927419B2 (en) * | 2018-04-12 | 2021-08-25 | 日本電信電話株式会社 | Estimator, learning device, estimation method, learning method and program |
CN108806708A (en) * | 2018-06-13 | 2018-11-13 | 中国电子科技集团公司第三研究所 | Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model |
US10699700B2 (en) * | 2018-07-31 | 2020-06-30 | Tencent Technology (Shenzhen) Company Limited | Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks |
CN110459238B (en) * | 2019-04-12 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Voice separation method, voice recognition method and related equipment |
CN110321810A (en) * | 2019-06-14 | 2019-10-11 | 华南师范大学 | Single channel signal two-way separation method, device, storage medium and processor |
CN110634502B (en) * | 2019-09-06 | 2022-02-11 | 南京邮电大学 | Single-channel speech separation algorithm based on deep neural network |
CN111128197B (en) * | 2019-12-25 | 2022-05-13 | 北京邮电大学 | Multi-speaker voice separation method based on voiceprint features and generation confrontation learning |
CN111128209B (en) * | 2019-12-28 | 2022-05-10 | 天津大学 | Speech enhancement method based on mixed masking learning target |
- 2020-05-12: CN application CN202010401151.3A granted as patent CN111583954B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373583A (en) * | 2016-09-28 | 2017-02-01 | 北京大学 | Multi-Audio Object Encoding and Decoding Method Based on Ideal Soft Threshold Mask IRM |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
CN111091847A (en) * | 2019-12-09 | 2020-05-01 | 北京计算机技术及应用研究所 | Deep clustering voice separation method based on improvement |
Also Published As
Publication number | Publication date |
---|---|
CN111583954A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111583954B (en) | Speaker independent single-channel voice separation method | |
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
WO2021139294A1 (en) | Method and apparatus for training speech separation model, storage medium, and computer device | |
CN112735456B (en) | Speech enhancement method based on DNN-CLSTM network | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
Shi et al. | Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation. | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN108962229B (en) | A single-channel, unsupervised method for target speaker speech extraction | |
Geng et al. | End-to-end speech enhancement based on discrete cosine transform | |
Le et al. | Inference skipping for more efficient real-time speech enhancement with parallel RNNs | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN113539293B (en) | Single-channel voice separation method based on convolutional neural network and joint optimization | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
Wichern et al. | Low-Latency approximation of bidirectional recurrent networks for speech denoising. | |
Hou et al. | Multi-task learning for end-to-end noise-robust bandwidth extension | |
Fan et al. | Real-time single-channel speech enhancement based on causal attention mechanism | |
Girirajan et al. | Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network. | |
CN118398033A (en) | A speech-based emotion recognition method, system, device and storage medium | |
Sofer et al. | CNN self-attention voice activity detector | |
Cheng et al. | DNN-based speech enhancement with self-attention on feature dimension | |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
Wang et al. | Cross-domain diffusion based speech enhancement for very noisy speech | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
CN118248159A (en) | A joint training method for speech enhancement model based on frequency subband |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |