
CN118212929A - A personalized Ambisonics speech enhancement method - Google Patents

A personalized Ambisonics speech enhancement method

Info

Publication number
CN118212929A
Authority
CN
China
Prior art keywords
voice
signal
enhanced
target speaker
spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410480255.6A
Other languages
Chinese (zh)
Inventor
周翊
周嘉诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202410480255.6A priority Critical patent/CN118212929A/en
Publication of CN118212929A publication Critical patent/CN118212929A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)

Abstract

A personalized Ambisonics speech enhancement method comprising: acquiring voice data to be enhanced, extracting a spectrogram from the voice data, and performing a short-time Fourier transform on the voice data; inputting the spectrogram into a speaker encoder to obtain a target speaker embedding vector, which is fed to the LSTM network of a time-domain masking system; inputting the short-time Fourier transformed signal to a complex feature encoder to obtain real-part and imaginary-part spectrograms; the LSTM network processes the target speaker embedding vector together with the real-part and imaginary-part spectrograms, and the processed data are input to the FCN network to obtain the enhanced target speaker voice; the enhanced target speaker voice is multiplied with the short-time Fourier transformed signal, and an inverse short-time Fourier transform is applied to the product to obtain an enhanced clean voice signal. By constructing a target speaker encoder, the invention extracts high-dimensional features of the target speaker's voice, so that interfering voices and background noise can be removed simultaneously.

Description

Personalized Ambisonics voice enhancement method
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a personalized Ambisonics voice enhancement method.
Background
Speech recognition requires a clean speech signal from the wake-up speaker, but in a spatial audio scene the Ambisonics speech signal is affected by environmental noise and is also interfered with by the voices of other, non-wake-up speakers, so recognition performance is poor. In the speech enhancement setting, the person who utters the wake-up speech is called the target speaker, because the goal is to enhance that speaker's voice; other speakers that affect speech recognition are called non-target speakers, and the goal is to eliminate them together with the ambient noise. The voices of the non-target speakers are referred to as foreground interference. One approach to personalized speech enhancement is to first apply a speech separation system to the noisy audio in order to separate the sounds of the different speakers; if the noisy signal contains N speakers, this approach produces N outputs, with possibly an additional output for the ambient noise. The classical speech separation task must then solve two main problems. First, the number of speakers N in the recording must be determined, and it is unknown in practical scenarios. Second, optimizing the speech separation system requires handling the alignment of speaker labels, because the order of the speakers should not matter during training. With the rapid growth of computing power, deep learning has shown remarkable capability, and deep clustering, deep attractor networks, permutation invariant training and the like use deep neural networks to address these issues.
The most common and simplest approach to personalized speech enhancement is class-based: a hierarchical classification scheme is trained to estimate a class label for each pixel or super-pixel region. In the audio domain, the time-frequency elements of the speech spectrogram are segmented into regions dominated by the target speaker, based either on a classifier or on a generative model. With the rise of deep learning, class-based segmentation has also proven very successful. However, class-based approaches have some important limitations. First, assuming that the label classes are known does not fully solve the general problem, because real-world signals may contain a very large number of classes, and many objects may not have a well-defined class. A class-based deep network model requires explicit representation of classes and object instances in its output nodes, which increases complexity. Generative-model approaches may in theory be more flexible with respect to the number of model types and instances handled after training, but extending them to more general segmentation tasks is often computationally infeasible and presents greater challenges.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a personalized Ambisonics voice enhancement method, which comprises the following steps: acquiring voice data to be enhanced, extracting a LogMel spectrogram of the voice data to be enhanced, and performing a short-time Fourier transform on the voice data to be enhanced; training a speaker encoder and a time-domain masking system, wherein the time-domain masking system comprises a complex feature encoder, an LSTM network, and an FCN network;
inputting the LogMel spectrogram into the trained speaker encoder to obtain a target speaker embedding vector, and inputting the target speaker embedding vector into the LSTM (Long Short-Term Memory) network of the time-domain masking system;
inputting the short-time Fourier transformed signal to the complex feature encoder to obtain real-part and imaginary-part spectrograms; the LSTM network processes the input target speaker embedding vector and the real-part and imaginary-part spectrograms, and inputs the processed data into the FCN network to obtain the enhanced target speaker voice;
and multiplying the enhanced target speaker voice by the short-time Fourier transformed signal, and performing an inverse short-time Fourier transform on the product to obtain an enhanced clean voice signal.
The invention has the beneficial effects that:
By constructing a target speaker encoder, the invention extracts high-dimensional features of the target speaker's voice to help the network separate that voice, so that interfering speech and background noise are removed simultaneously and subsequent speech recognition is greatly improved.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of the speaker encoder of the present invention processing LogMel spectrograms;
FIG. 3 is a flow chart of the complex feature encoder of the present invention processing data;
Fig. 4 is a diagram of the FCN network architecture of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A personalized Ambisonics speech enhancement method, as shown in fig. 1, comprising: acquiring voice data to be enhanced, extracting a LogMel spectrogram of the voice data to be enhanced, and performing a short-time Fourier transform on the voice data to be enhanced; training a speaker encoder and a time-domain masking system, wherein the time-domain masking system comprises a complex feature encoder, an LSTM network, and an FCN network; inputting the LogMel spectrogram into the trained speaker encoder to obtain a target speaker embedding vector, and inputting the target speaker embedding vector into the LSTM (Long Short-Term Memory) network of the time-domain masking system; inputting the short-time Fourier transformed signal to the complex feature encoder to obtain real-part and imaginary-part spectrograms; the LSTM network processes the input target speaker embedding vector and the real-part and imaginary-part spectrograms, and inputs the processed data into the FCN network to obtain the enhanced target speaker voice; and multiplying the enhanced target speaker voice by the short-time Fourier transformed signal, and performing an inverse short-time Fourier transform on the product to obtain an enhanced clean voice signal.
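To make the processing chain above concrete, the following Python sketch (PyTorch) shows how the trained components could be chained at inference time; speaker_encoder and masking_system stand in for the trained speaker encoder and the time-domain masking system (complex feature encoder, LSTM and FCN), and all shapes, the Hann window and the hyperparameters are illustrative assumptions rather than values fixed by the invention.

```python
import torch

def enhance(foa_wave, enroll_logmel, speaker_encoder, masking_system, n_fft=512, hop=128):
    """Chain the trained parts: LogMel -> d-vector, STFT -> mask, masked STFT -> iSTFT."""
    win = torch.hann_window(n_fft)
    # complex STFT of the noisy first-order Ambisonics waveform, shape (4, F, T)
    spec = torch.stft(foa_wave, n_fft, hop_length=hop, window=win, return_complex=True)
    d_vec = speaker_encoder(enroll_logmel)        # target-speaker embedding (d-vector)
    mask = masking_system(spec, d_vec)            # complex encoder + LSTM + FCN, mask of shape (F, T)
    est = mask * spec                             # element-wise masking of every channel
    mono = est.sum(dim=0, keepdim=True)           # combine the FOA channels
    return torch.istft(mono, n_fft, hop_length=hop, window=win)
```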
In this embodiment, as shown in fig. 2, extracting the LogMel spectrogram comprises: resampling and framing the original audio signal; performing a fast Fourier transform on the audio signal of each frame to obtain frequency-domain information; weighting the frequency-domain information with a Mel filter bank to obtain the energy of each Mel frequency band; taking the logarithm of the energy of each Mel frequency band to obtain the LogMel spectrogram; and normalizing the LogMel spectrogram.
The LogMel spectrogram of the first-order Ambisonics voice signal is extracted as follows: the voice signal is converted to the frequency domain by the STFT, and the Mel-scale mapping is then applied:
Mel(f)=2595·log10(1+f/700)
where f is the linear frequency in Hz and Mel(f) is the corresponding Mel frequency; the relationship is logarithmic. Similar to the STFT spectrogram, the absolute value and the logarithm of the Mel spectrum are taken to obtain the logarithmic Mel (LogMel) spectrogram.
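A minimal sketch of the LogMel extraction described above, assuming the librosa library; the sample rate, FFT size, hop length and number of Mel bands are illustrative choices, not values fixed by the method.

```python
import numpy as np
import librosa

def logmel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Frame -> FFT -> Mel filterbank weighting -> log of band energies -> normalization."""
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels, power=2.0)
    log_mel = np.log(mel + 1e-10)                                 # log of each Mel-band energy
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)    # simple per-utterance normalization

# The Mel filterbank is built on the mapping Mel(f) = 2595 * log10(1 + f / 700).
```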
Frequency-domain speech signal processing is the most common pre-processing method for speech enhancement. Because a speech signal is short-time stationary, it can be divided into short frames and frequency-domain analysis can be carried out on each frame; the resulting frequency-domain features capture the local characteristics of the speech signal better and facilitate subsequent speech processing and analysis. After framing, the short-time Fourier transform (Short-Time Fourier Transform, STFT) of a given frame of the voice signal is calculated as:
X(n,ω)=∑m x(m)·w(n−m)·e^(−jωm)
where x(m) represents the speech signal and w(n) is a real window. As n varies, the real window slides to different positions and the short-time Fourier transform at each position is computed; the result is the linear superposition of the Fourier transforms of the different frames of a segment of speech, reflecting how the frequency content of the speech signal changes over time. In the STFT, the choice of window function has a significant impact on the analysis results. Common window functions include the rectangular window, the Hanning window, the Hamming window, and the Blackman-Harris window. The rectangular window is simple and easy to implement, but suffers from severe spectral leakage, a wide main lobe, and low resolution; the Hanning window reduces spectral leakage and gives a smoother spectrum, but has a wider main lobe and strongly attenuates the signal at the window edges; the Hamming window is similar to the Hanning window but has a slightly narrower main lobe and stronger side-lobe suppression. In practice, the window function is chosen according to the specific signal characteristics and analysis requirements.
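The framing and windowing just described can be reproduced with scipy's STFT, as in the sketch below; the Hann window, frame length and overlap are illustrative, and the round-trip check simply verifies that the inverse STFT recovers the waveform.

```python
import numpy as np
from scipy.signal import stft, istft

def frame_spectra(x, fs=16000, frame_len=512, overlap=256, window="hann"):
    """Slide a real window over x and take the FFT of each frame (the STFT)."""
    freqs, times, X = stft(x, fs=fs, window=window, nperseg=frame_len, noverlap=overlap)
    return freqs, times, X        # X[k, n]: spectrum of frame n at frequency bin k

x = np.random.randn(16000)
freqs, times, X = frame_spectra(x)
_, x_rec = istft(X, fs=16000, window="hann", nperseg=512, noverlap=256)
print(np.allclose(x, x_rec[:len(x)]))   # True: the STFT/iSTFT pair is invertible
```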
Inputting the short-time Fourier transformed signal to the complex feature encoder for processing comprises: the complex feature encoder processes the real part and the imaginary part of the voice signal obtained after the short-time Fourier transform separately; specifically, the real part of the voice signal is obtained by multiplying the amplitude spectrum by the cosine of the phase spectrum, and the imaginary part is obtained by multiplying the amplitude spectrum by the sine of the phase spectrum. The specific network structure is shown in fig. 3. A high-dimensional feature representation of the voice signal is extracted by applying several two-dimensional convolutions to the real and imaginary parts; Layer Normalization (LayerNorm) is used for normalization, and all layers use the ReLU activation function except the last layer, which uses a sigmoid activation; there are 8 convolutions in total.
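A PyTorch sketch of such a complex feature encoder: real and imaginary planes are formed from the magnitude and phase, then passed through eight Conv2d blocks with LayerNorm, ReLU on all but the last block and a sigmoid on the last. The channel counts, kernel sizes and the choice of normalizing over the frequency axis are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def real_imag(mag, phase):
    """Real part = |X| * cos(phase), imaginary part = |X| * sin(phase)."""
    return mag * torch.cos(phase), mag * torch.sin(phase)

class ComplexFeatureEncoder(nn.Module):
    """Eight Conv2d blocks over the stacked real/imaginary spectrograms (2 input channels)."""
    def __init__(self, freq_bins=257, channels=(16, 16, 32, 32, 64, 64, 128, 128)):
        super().__init__()
        layers, in_ch = [], 2                                  # real + imaginary planes
        for i, out_ch in enumerate(channels):
            layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
            layers.append(nn.LayerNorm(freq_bins))             # normalize along the frequency axis
            layers.append(nn.Sigmoid() if i == len(channels) - 1 else nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, spec):      # spec: (batch, 2, time, freq_bins)
        return self.net(spec)
```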
The LSTM network processing the input target speaker embedded vector and the real part and imaginary part spectrogram comprises the following steps: the target speaker embedded vector, the d-vector, is repeatedly concatenated to the last convolutional layer output of the complex feature encoder in each time frame, and the resulting concatenated vector input is then fed into the LSTM network, with LSTM nodes set to 400.
The FCN network processing the input data comprises: as shown in fig. 4, the FCN is a fully convolutional network in which convolutional layers replace the fully connected layers of a conventional CNN; x denotes the features input to the FCN and y denotes the output features. The network accepts the feature map output by the LSTM and outputs an estimated mask of size F × T, where F is the size of the frequency dimension and T is the size of the time-frame dimension. The mask is multiplied element-wise with the spectrogram of the original voice and the products are summed to obtain the estimated clean voice spectrogram; finally, an inverse short-time Fourier transform yields the estimated voice.
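A sketch of a mask-estimating FCN in this sense: 1x1 convolutions play the role of per-frame dense layers, so the network stays fully convolutional and outputs an F × T mask; the intermediate width is an assumption.

```python
import torch
import torch.nn as nn

class MaskFCN(nn.Module):
    """Map the LSTM feature sequence to an F x T mask using convolutions only."""
    def __init__(self, in_feat=400, freq_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_feat, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, freq_bins, kernel_size=1), nn.Sigmoid(),   # mask values in [0, 1]
        )

    def forward(self, h):                       # h: (batch, time, in_feat) from the LSTM
        return self.net(h.transpose(1, 2))      # (batch, freq_bins, time) = F x T mask
```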
In another embodiment, a more end-to-end approach to the personalized speech enhancement task is to treat it as a binary classification problem, where the positive class is the speech of the target speaker and the negative class is formed by the combination of all foreground interfering speakers and background noise. A target speaker encoder suppresses three problems at once: the unknown number of speakers, the label alignment problem, and the need to select among multiple outputs. The goal of this embodiment is to construct a speaker encoder able to map a recording of the target speaker to a high-dimensional feature embedding. An LSTM-based speaker encoder is first trained to compute the target speaker embedding vector. A time-frequency mask based system is then trained that accepts both the embedding vector of the target speaker, previously extracted with the speaker encoder, and the noisy multi-speaker Ambisonics audio signal. The system is trained to eliminate the interfering speakers and output only the voice of the target speaker.
In this embodiment, the spectrum of the voice uttered by the target speaker is denoted X_{i=1}(t,f), where i is the unique identifier of the target person and the index 1, 2, ... runs over that speaker's speech segments; the spectrum of the environmental noise is N(t,f), and the noisy input voice, in Ambisonics form, is expressed as:
Y(t,f)=X_{i=else1}(t,f)+X_j(t,f)+N(t,f)
where X_{i=else1}(t,f) represents the spectrum of the target speaker's speech segments other than the segment input to the encoder; the index j in X_j(t,f) denotes any speech segment of another speaker; Y is then the FOA signal containing the other speakers' voices, the target speaker's voice and the environmental noise. The objective of this embodiment is to eliminate the voices of the other, unrelated speakers by means of the feature vector output by the target speaker encoder, while also filtering out the environmental noise. The vector produced by the target speaker encoder can be expressed as:
d=f_speaker_encoder(X_{i=1})
where f_speaker_encoder(·) is the target speaker encoder. The encoded d-vector, of dimension 256, is input to the voice filter network to realize the speech enhancement:
X̂=g(Y,d)
where g(·) represents the speech filter network; the d-vector is fused with the spectral features of the input voice signal in the RNN module of the network; and X̂ represents the estimated clean voice of the target speaker.
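The mixture Y = X_{i=else1} + X_j + N can be simulated as in the sketch below; the first-order Ambisonics encoding assumes the ACN channel order (W, Y, Z, X) with SN3D normalization, and the source directions and SNR are arbitrary illustration values, not part of the method.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono source into first-order Ambisonics (ACN order W, Y, Z, X; SN3D)."""
    w = mono
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])                      # shape (4, samples)

def make_mixture(target, interferer, noise_foa, snr_db=5.0):
    """Noisy FOA input: target speech + interfering speech + scaled ambient noise."""
    speech = encode_foa(target, 0.5, 0.1) + encode_foa(interferer, -1.2, 0.0)
    gain = np.linalg.norm(speech) / (np.linalg.norm(noise_foa) * 10 ** (snr_db / 20) + 1e-8)
    return speech + gain * noise_foa
```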
In this embodiment, the personalized speech enhancement network is called P-DPCRN and consists of a target speaker encoder and a voice filter network; the latter uses the output of the target speaker encoder as an additional input. To reduce the processing burden of the network, it differs from DPCRN in that the decoding stage does not use a CNN structure but an FCN structure, and the network at this stage is called the voice filter network. The purpose of the speaker encoder is to generate a speaker-specific feature embedding from audio samples of the target speaker. This structure achieves good performance in text-dependent and text-independent speaker verification, speaker diarization, multi-speaker synthesis and speech-to-speech translation. The speaker encoder is a network using a 3-layer LSTM that takes as input log-Mel filterbank energies extracted from 1600 ms windows with 50% overlap, and outputs a speaker embedding, called the d-vector, with a fixed dimension of 256.
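A sketch of such a d-vector speaker encoder: a 3-layer LSTM over log-Mel filterbank frames, followed by a projection to a 256-dimensional, length-normalized embedding; the number of Mel bands, the hidden size and the projection layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """3-layer LSTM over LogMel frames -> fixed 256-dimensional d-vector."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, logmel):                  # logmel: (batch, time, n_mels)
        _, (h, _) = self.lstm(logmel)
        d = self.proj(h[-1])                    # final hidden state of the last LSTM layer
        return F.normalize(d, dim=-1)           # unit-length d-vector
```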
In this embodiment, the model is based on two principles: ideal ratio mask estimation (Ideal Ratio Mask Estimation, IRM) and neural-network B-format beamforming filter estimation:
Ideal ratio mask estimation aims to estimate an ideal masking ratio for the speech signal so that the speech components to be enhanced can be separated accurately, thereby achieving speech enhancement. The method relies on the masking effect in signal processing, i.e. at a higher signal-to-noise ratio the strong signal masks the weaker components perceived by the human ear. The IRM is computed from ratios in the frequency domain: the signal is decomposed spectrally into individual frequency components, and the energy of each component is compared with the energy of the background noise. Based on the ratio between signal and noise, it is decided which frequency components should be preserved (the target speaker's voice required by the experiment) and which should be suppressed (the noise and the non-target speakers' voices). The goal of the IRM is an accurate estimate of the signal obtained by maximizing the ratio between the desired signal and the noise. The IRM is given by:
M(t,f)=(R(t,f)/(R(t,f)+1))^β
where β is a scale factor and R(t,f) is the signal-to-noise ratio matrix, computed as:
R(t,f)=|X(t,f)|²/|N(t,f)|²
where X(t,f) represents the noise-free signal and N(t,f) is the sound signal containing the noise and the non-target speakers' voices. The IRM is widely used in deep-learning speech enhancement: the estimated clean target speaker speech is obtained by applying the mask to the noisy spectrogram while keeping the noisy phase:
X̂(t,f)=Y(t,f)·M(t,f)
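A numpy sketch of the ideal ratio mask written above; beta defaults to 0.5 here as an assumed value, and applying the mask to the noisy spectrogram keeps the noisy phase.

```python
import numpy as np

def ideal_ratio_mask(X, N, beta=0.5, eps=1e-10):
    """M(t,f) = (R / (R + 1)) ** beta, with R = |X|^2 / |N|^2."""
    R = np.abs(X) ** 2 / (np.abs(N) ** 2 + eps)
    return (R / (R + 1.0)) ** beta

def apply_mask(Y, M):
    """X_hat(t,f) = Y(t,f) * M(t,f), element-wise."""
    return Y * M
```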
The neural-network beamforming filter estimation comprises:
Speech enhancement in a spatial audio environment is evaluated for subsequent speech recognition, which generally requires a clean single-channel speech signal; the noisy Ambisonics signal therefore has to be beamformed by the neural network into a clean single-channel speech signal. The multiply-then-add operation applied after the neural-network mask can be understood as a beamforming filter applied to the FOA signal, with the filter parameters generated by the network's predictions.
The voice filter network is based on DPCRN for speech enhancement. The neural network accepts two inputs: the target speaker's specific feature d-vector, and the real-part and imaginary-part spectrograms computed from the noisy sound using the STFT. The voice filter predicts a complex mask for the spectrogram; the mask is multiplied element-wise (complex multiplication) with the input noisy spectrogram and the results are summed to realize the beamforming, producing an enhanced single-channel speech spectrogram. The enhanced time-domain waveform is then obtained via the iSTFT. The voice filter network consists of 8 convolutional layers, 2 LSTM layers and 2 fully connected layers, each with ReLU activation except the last layer, which uses a sigmoid activation. In this experiment the d-vector is injected between the convolutional layers and the LSTM layers, rather than before the convolutional layers, for two reasons. First, the d-vector is already an accurate, unique representation of the target speaker, so it does not need to be modified by a convolutional layer; the 256-dimensional d-vector acts as an "identification number" of the target speaker, uniquely characterizing properties such as timbre, from which the network can generate a mask that extracts the target speaker's voice. Second, a convolutional layer assumes time and frequency uniformity and cannot be applied to an input composed of two completely different signals.
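A sketch of the multiply-then-add beamforming step: the predicted complex mask is applied to every FOA channel of the noisy spectrogram, the channels are summed, and the iSTFT returns the enhanced single-channel waveform; the window and sizes are illustrative assumptions.

```python
import torch

def beamform_with_mask(noisy_spec, complex_mask, n_fft=512, hop=128):
    """noisy_spec: complex (4, F, T) FOA spectrogram; complex_mask: complex (4, F, T) or (F, T)."""
    filtered = noisy_spec * complex_mask           # element-wise complex multiplication
    mono_spec = filtered.sum(dim=0, keepdim=True)  # summation over channels realizes the beamforming
    win = torch.hann_window(n_fft)
    return torch.istft(mono_spec, n_fft, hop_length=hop, window=win)
```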
The experimental data of the present invention come from the VCTK and LibriSpeech datasets. In the training stage, the corpora of 2330 speakers were randomly drawn and the audio signals were cut into 2-second units; because the speech durations of the 2330 speakers differ, the speech of 2200 speakers was finally retained as the training set. Depending on its duration, the speech of each speaker is cut into 3, 4 or 5 two-second segments: 512 utterances are cut into 3 segments, 928 into 4 segments, and 760 into 5 segments, giving 9048 speech segments in total. To ensure that the features learned by the target speaker encoder are effective, the speech segments fed to the target speaker encoder must not appear in the noisy speech or in the target speech; only other segments of the same speaker may be used as material for the noisy speech. One of the 3 slices is selected as the input to the target speaker encoder, two of the 4 and 5 slices are selected as the input to the target speaker encoder, and the remaining sliced segments are used to build the noisy dataset. The noisy dataset is prepared as follows: a speech segment of the same speaker that was not input to the target speaker encoder is first chosen as the fixed target speech, and then several utterances of different speakers from the VCTK and LibriSpeech datasets are cut out and added directly to it. For the noise signals, office-like background noise was extracted from the FSD50K dataset, giving 1440 noise sound files in total, covering 14 transient noise classes and 4 continuous noise classes. The Ambisonics room impulse responses of offices from the L3DAS23 challenge were obtained by mail, so the procedure for producing the Ambisonics speech signals is the same as before and is not repeated here. The input to the target speaker encoder is a single-channel speech signal and is not converted into an Ambisonics speech signal. The target signal is the clean signal of the target speaker corresponding to the mixed noisy signal. For testing, 30 speakers were randomly drawn from the LibriSpeech dataset. In addition to the dataset built with the present invention, which contains foreground multi-speaker interference, the experiment was also evaluated on an L3DAS23 test set containing only background noise.
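A sketch of the segmentation and enrollment split described above: each speaker's audio is cut into 2-second segments, one segment is reserved as encoder material when there are 3 segments and two when there are 4 or 5, and the rest go into the noisy-mixture pool; file handling is omitted and the random seed is arbitrary.

```python
import numpy as np

def split_into_segments(wave, sr=16000, seg_sec=2.0):
    """Cut one speaker's audio into consecutive 2-second segments (3, 4 or 5 per speaker here)."""
    seg = int(seg_sec * sr)
    return [wave[i * seg:(i + 1) * seg] for i in range(len(wave) // seg)]

def encoder_and_pool(segments, rng=np.random.default_rng(0)):
    """Reserve 1 encoder segment for 3-segment speakers, 2 for 4- or 5-segment speakers."""
    k = 1 if len(segments) == 3 else 2
    order = rng.permutation(len(segments))
    encoder_segs = [segments[i] for i in order[:k]]
    pool = [segments[i] for i in order[k:]]      # used to build the noisy mixtures and targets
    return encoder_segs, pool
```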
The present invention uses a combination of the wSDR, SI-SNR and PHASEN losses as the loss function. The PHASEN loss emphasizes the phase of time-frequency points with higher amplitude, which helps the network focus on high-amplitude time-frequency points: most speech information is usually concentrated there, so emphasizing the phase at those points captures the important characteristics of the speech more effectively and improves speech quality. In addition, noise is usually more prominent at time-frequency points with lower amplitude, so emphasizing the phase of the higher-amplitude points also reduces the influence of noise on the speech. The loss is formulated as follows:
where Ŝ and S represent the estimated output of the network and the clean spectrogram, respectively; the parameter p is a compression factor set empirically to 0.3; and the remaining notation denotes the complex-valued (phase-aware) computation.
The experimental loss function is a combination of these three loss functions:
L=(1−β)(L_SI-SNR+L_wSDR)+β·L_PHASEN
where β is the weight distribution parameter of the loss function.
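A sketch of the combined loss: the SI-SNR term uses its standard definition, while the wSDR and PHASEN terms are passed in as callables because their exact formulas are not reproduced above; beta is the weighting parameter from the formula, with an arbitrary default value here.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and clean waveforms, shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return -(10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)).mean()

def combined_loss(est, ref, wsdr_loss, phasen_loss, beta=0.2):
    """L = (1 - beta) * (L_SI-SNR + L_wSDR) + beta * L_PHASEN."""
    return (1 - beta) * (si_snr_loss(est, ref) + wsdr_loss(est, ref)) + beta * phasen_loss(est, ref)
```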
The experiments of the invention use four NVIDIA GeForce RTX 3090 GPUs; the model is trained with a batch size of 12, and an AdamW optimizer is used to update the parameters. At the end of each epoch, a learning-rate scheduler sets the learning rate of the next epoch to 0.9 times the current learning rate. The results of the model performance evaluation on the blind test set are shown in the following table.
It can be seen that the baseline model UNet performs poorly under multiple foreground speaker interference, with a WER as high as 0.683, which is very detrimental to speech recognition. FOA-DPCRN performs well, but its WER score is still unsatisfactory because it does not target the cancellation of other foreground interfering speakers. The personalized speech enhancement algorithm of the invention can target the voice of the target speaker and separate it from background noise and foreground interference, so that WER and STOI also reach better performance: the WER improves on the baseline model by 0.47, the STOI is 0.318 higher than the baseline, and the PESQ improves by 21.48%.
While the foregoing describes embodiments, features and advantages of the present invention, it should be understood that the embodiments above are merely exemplary; any changes, substitutions or alterations made without departing from the spirit and principles of the invention are intended to fall within its scope.

Claims (8)

1. A personalized Ambisonics speech enhancement method, comprising: acquiring voice data to be enhanced, extracting a LogMel spectrogram of the voice data to be enhanced, and performing a short-time Fourier transform on the voice data to be enhanced; training a speaker encoder and a time-domain masking system, wherein the time-domain masking system comprises a complex feature encoder, an LSTM network, and an FCN network;
inputting the LogMel spectrogram into the trained speaker encoder to obtain a target speaker embedding vector, and inputting the target speaker embedding vector into the LSTM (Long Short-Term Memory) network of the time-domain masking system;
inputting the short-time Fourier transformed signal to the complex feature encoder to obtain real-part and imaginary-part spectrograms; the LSTM network processes the input target speaker embedding vector and the real-part and imaginary-part spectrograms, and inputs the processed data into the FCN network to obtain the enhanced target speaker voice;
and multiplying the enhanced target speaker voice by the short-time Fourier transformed signal, and performing an inverse short-time Fourier transform on the product to obtain an enhanced clean voice signal.
2. The personalized Ambisonics speech enhancement method of claim 1, wherein extracting the LogMel spectrogram comprises: resampling and framing the original audio signal; performing a fast Fourier transform on the audio signal of each frame to obtain frequency-domain information; weighting the frequency-domain information with a Mel filter bank to obtain the energy of each Mel frequency band; taking the logarithm of the energy of each Mel frequency band to obtain the LogMel spectrogram; and normalizing the LogMel spectrogram.
3. The personalized Ambisonics speech enhancement method according to claim 1, wherein inputting the LogMel spectrogram to the speaker encoder for processing comprises: transforming the voice signal to the frequency domain by the STFT and applying the Mel-scale mapping:
Mel(f)=2595·log10(1+f/700)
where f is the linear frequency in Hz and Mel(f) is the Mel frequency.
4. The personalized Ambisonics speech enhancement method according to claim 1, wherein the short-time Fourier transform of the voice data is given by:
X(n,ω)=∑m x(m)·w(n−m)·e^(−jωm)
where x(m) represents the speech signal and w(n) is a real window.
5. The personalized Ambisonics speech enhancement method according to claim 1, wherein inputting the short-time Fourier transformed signal to the complex feature encoder for processing comprises: the complex feature encoder processes the real part and the imaginary part of the voice signal obtained after the short-time Fourier transform separately; specifically, two-dimensional convolutions are applied several times to the real part and the imaginary part of the signal to extract a high-dimensional feature representation, and after each two-dimensional convolution a normalization layer and an activation function process the convolved features, yielding the real-part and imaginary-part spectrograms.
6. The personalized Ambisonics speech enhancement method according to claim 1, wherein the number of two-dimensional convolutions is 8 and the normalization layer is Layer Normalization (LayerNorm).
7. The personalized Ambisonics speech enhancement method according to claim 1, wherein the LSTM network processing the input target speaker embedding vector and the real-part and imaginary-part spectrograms comprises: repeatedly concatenating the target speaker embedding vector, i.e. the d-vector, to the output of the last convolutional layer of the complex feature encoder at every time frame; and feeding the resulting concatenated vectors into the LSTM network to obtain a hidden feature map.
8. The personalized Ambisonics speech enhancement method according to claim 1, wherein the FCN network processing the input data comprises: the FCN replaces the fully connected layers of the CNN with fully convolutional layers; the feature map output by the LSTM is input into the FCN network, and an estimated mask of size F × T is output, where F is the size of the frequency dimension and T is the size of the time-frame dimension; the mask is multiplied with the spectrogram of the original voice and summed to obtain the estimated clean voice spectrogram, and finally an inverse short-time Fourier transform yields the estimated voice.
CN202410480255.6A 2024-04-22 2024-04-22 A personalized Ambisonics speech enhancement method Pending CN118212929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410480255.6A CN118212929A (en) 2024-04-22 2024-04-22 A personalized Ambisonics speech enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410480255.6A CN118212929A (en) 2024-04-22 2024-04-22 A personalized Ambisonics speech enhancement method

Publications (1)

Publication Number Publication Date
CN118212929A true CN118212929A (en) 2024-06-18

Family

ID=91448446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410480255.6A Pending CN118212929A (en) 2024-04-22 2024-04-22 A personalized Ambisonics speech enhancement method

Country Status (1)

Country Link
CN (1) CN118212929A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118609605A (en) * 2024-08-08 2024-09-06 宁波星巡智能科技有限公司 Method, device and equipment for enhancing infant crying sound based on machine learning
CN118609605B (en) * 2024-08-08 2024-10-18 宁波星巡智能科技有限公司 Method, device and equipment for enhancing infant crying sound based on machine learning
CN119649837A (en) * 2025-02-14 2025-03-18 青岛科技大学 A sperm whale call enhancement method based on VQ-MAE network

Similar Documents

Publication Publication Date Title
Kim et al. SE-Conformer: Time-Domain Speech Enhancement Using Conformer.
Shon et al. Voiceid loss: Speech enhancement for speaker verification
CN112331218B (en) A single-channel speech separation method and device for multiple speakers
CN108899047B (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN108962229B (en) A single-channel, unsupervised method for target speaker speech extraction
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
CN118212929A (en) A personalized Ambisonics speech enhancement method
Do et al. Speech source separation using variational autoencoder and bandpass filter
Roman et al. Pitch-based monaural segregation of reverberant speech
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
WO2023001128A1 (en) Audio data processing method, apparatus and device
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
Sheeja et al. CNN-QTLBO: an optimal blind source separation and blind dereverberation scheme using lightweight CNN-QTLBO and PCDP-LDA for speech mixtures
CN114189781B (en) Noise reduction method and system for dual-microphone neural network noise reduction headphones
Baby et al. Speech dereverberation using variational autoencoders
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
TWI749547B (en) Speech enhancement system based on deep learning
CN112216301B (en) Deep clustering speech separation method based on logarithmic magnitude spectrum and interaural phase difference
Sheeja et al. Speech dereverberation and source separation using DNN-WPE and LWPR-PCA
Marti et al. Automatic speech recognition in cocktail-party situations: A specific training for separated speech
CN112908340A (en) Global-local windowing-based sound feature rapid extraction method
CN118230722A (en) Intelligent voice recognition method and system based on AI
CN115171716B (en) A method, system and electronic device for continuous speech separation based on spatial feature clustering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination