CN116913296A - Audio processing method and device - Google Patents

Audio processing method and device

Info

Publication number
CN116913296A
CN116913296A (application CN202310163454.XA)
Authority
CN
China
Prior art keywords
speech
signal
audio signal
audio
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163454.XA
Other languages
Chinese (zh)
Inventor
T-C·佐里拉
R·S·多迪帕特拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Publication of CN116913296A publication Critical patent/CN116913296A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Image Processing (AREA)
  • Control Of Amplification And Gain Control (AREA)

Abstract

Embodiments described herein relate to audio processing methods and apparatus. A method for processing an audio signal to enhance a target component of the audio signal, the method comprising: receiving a first audio signal comprising a target component in a first environment; processing the first audio signal to extract a second audio signal comprising the target component in a second environment, the second environment having less noise than the first environment; and mixing the first audio signal with the second audio signal to produce a third audio signal, the third audio signal comprising the extracted target component.

Description

Audio processing method and device
Technical Field
Embodiments described herein relate to audio processing methods and apparatus.
Background
Recently, Deep Neural Networks (DNNs) have greatly improved the accuracy of Automatic Speech Recognizers (ASRs). Current ASR systems achieve human-level performance under clean conditions; however, they are still inferior to a normal-hearing listener under noisy conditions.
There are at least two different strategies for improving the robustness of ASR against noise. One strategy relies on a large amount of data to train a multi-condition model; the other is based on using signal enhancement to clean the data. While the first approach is simple, it is costly in terms of the computational resources required to train such models and in terms of collecting labelled data. Furthermore, the accuracy of multi-condition systems can degrade in very challenging environments, such as those with competing speakers.
In general, the distortions introduced by speech enhancement limit the applicability of these methods as independent front ends of ASR, and the acoustic model of the ASR system must be retrained with matched distortion data to obtain optimal recognition performance. Alternatively, joint training enhancement and recognition systems have been proposed to mitigate distortion. However, in practical applications with very dynamic acoustic conditions, the latter approach may not work well.
Drawings
FIG. 1A is a schematic diagram of a speech processing apparatus according to an embodiment;
FIG. 1B is a schematic diagram of a speech processing apparatus according to an embodiment;
FIG. 2 is a schematic diagram of the structure of an automatic speech recognition device according to an embodiment;
FIG. 3 is a schematic diagram illustrating the structure of a speaker extraction system;
FIG. 4A is a detailed layered structure of a speech extractor;
FIG. 4B is a detailed layered structure of a time convolution block of the extractor of FIG. 4A;
FIG. 5 is a diagram showing how the intelligibility metric may be used to calculate a loss function.
Detailed Description
In an embodiment, a method for processing an audio signal to enhance a target component of the audio signal is provided, the method comprising:
receiving a first audio signal comprising a target component in a first environment;
processing the first audio signal to extract a second audio signal comprising the target component in a second environment, the second environment having less noise than the first environment; and
the first audio signal is mixed with the second audio signal to produce a third audio signal that includes the extracted target component.
The first audio signal may comprise speech from a target speaker, the target component being speech from the target speaker. In embodiments in which the first audio signal is speech, the first audio signal, the second audio signal, and the third audio signal may be referred to as a first speech signal, a second speech signal, and a third speech signal, respectively.
The above-described method for enhancing speech signals helps to address the accuracy challenges of single-channel Automatic Speech Recognition (ASR) in noisy conditions. Powerful speech enhancement front ends are available; however, they typically require retraining the ASR model to cope with processing artifacts. Embodiments described herein relate to speaker enhancement strategies for improving recognition performance without retraining an Acoustic Model (AM). This is achieved by remixing the enhancement signal with the unprocessed input to mitigate processing artifacts. The evaluation was performed using a speech denoiser based on DNN speaker extraction, trained with a perceptually motivated loss function. The results show that, without AM retraining, relative accuracy gains of about 23% and 28% were obtained, compared to the unprocessed signals, on the single-channel simulated and real CHiME-4 evaluation sets, respectively.
In the embodiments described herein, a speaker extraction (SPX) system is used for denoising. The purpose of SPX is not to recover all sources from the mix (i.e. speech separation), but to recover only the target speaker from the mix, which avoids the need to know the total number of sources in advance.
Thus, in an embodiment, the third speech signal may be provided to the ASR and text is recognized in the third signal by the ASR. However, the enhancement signal may be output to other systems, such as a voice controlled command interface. The third signal may also be output to a further audio processor.
For example, a speech extractor may be provided in a hearing aid, noise canceling earpiece or other audio processing device to enhance speech or audio signals from one or more speakers or contributors.
Although the above focuses on speech, any type of audio signal may be enhanced. For example, a music signal may be enhanced by the above system. Such an enhancement signal may then be input into, for example, a music recognition device. The enrollment (registration) signal may be a singer's voice or a short recording of a musical instrument found in the mix. Other components may also be extracted using this method.
As described above, in an embodiment, the first speech signal is a single-channel speech signal, for example from a single microphone or other single audio pickup device, such that there are no spatial cues that can be used to reduce noise in the signal. The noise may be attributed to speakers other than the target speaker. The noise may also be ordinary background noise, e.g. traffic noise or noise in a railway station.
In an embodiment, the first speech signal and the second speech signal are mixed to form a third speech signal, as follows:
z(n) = s'(n) + αy(n)
where z(n) represents the third signal, s'(n) represents the second signal, y(n) represents the first signal, and α is a multiplication factor applied to the first signal.
The multiplication factor may be controlled by a scalar factor σ (expressed in dB). In an embodiment, σ is at least 0 and at most 30. In an embodiment, the value of σ is automatically adapted based on the amount of distortion found in the second signal. In another embodiment, a no-reference speech quality model may be used to evaluate the distortion.
In another embodiment, the second signal is extracted from the first signal using a speech extractor, wherein the speech extractor has been trained using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing the intelligibility of the signal. In an embodiment, the term representing the intelligibility of the signal is a short-time objective intelligibility (STOI) metric. Instead of STOI, other speech intelligibility metrics may be used, such as the normalized-covariance-based speech transmission index, the glimpse proportion (GP), the Speech Intelligibility Index (SII), PESQ, or deep-learning methods such as STOI-Net.
In another embodiment, the term representing time domain or frequency domain distortion in the signal is a scale invariant signal to distortion ratio "SISDR". In further embodiments, the term representing distortion may be a frequency domain loss over the magnitude spectrum, such as mean square error or mask cross entropy.
In another embodiment, processing the first speech input to extract the second speech signal includes using a speech extractor configured to receive the first speech signal and a registered speech signal containing speech samples from the target speaker, the speech extractor configured to generate a mask to remove portions of the first speech signal to extract the second speech signal.
A speech extractor may be used to process the first signal as a time domain signal. The first signal may be processed by a spectral encoder, which is a trainable convolutional network configured to transform the first signal into a higher-dimensional signal. The speech extractor may include a plurality of deep convolutional layers.
In another embodiment, a method of training a speech processing apparatus for processing speech signals to enhance the speech of a target speaker is provided, the apparatus comprising:
an input for receiving a first speech signal comprising speech from a target speaker in a first environment;
a speech extractor configured to process the first speech input to extract a second speech signal, the second speech signal comprising speech from the target speaker in a second environment, the second environment being less noisy than the first environment, the speech extractor configured to receive the first speech signal and a registered speech signal comprising speech samples from the target speaker, the speech extractor configured to generate a mask to remove portions of the first speech signal to extract the second speech signal; and
a mixer for combining a first speech signal with a second speech signal to produce a third speech signal, the third speech signal comprising extracted speech from a target speaker, the method comprising:
receiving a training dataset, each member of the training dataset comprising a first signal comprising speech of a target speaker, a registration signal having sample speech from the target speaker, and clean speech from the target speaker corresponding to the first signal as a second signal; and
the speech extractor is trained using the first signal and the registration signal as inputs and the second signal as a desired output, the training using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing the intelligibility of the signal.
The speech processing apparatus may be configured for automatic speech recognition and further comprise an automatic speech recognition unit configured to receive a third speech signal and to derive text from the third speech signal, wherein the automatic speech recognition unit is trained independently of the speech extractor.
The above-described loss function may also be used for a speech extractor that does not mix the output signal with the input to produce a third enhancement signal. Thus, in another embodiment, a method of training a speech processing apparatus for processing speech signals to enhance the speech of a target speaker is provided, the apparatus comprising:
an input for receiving a first speech signal comprising speech from a target speaker in a first environment; and
a speech extractor configured to process the first speech input to extract a second speech signal, the second speech signal comprising speech from the target speaker in a second environment, the second environment being less noisy than the first environment, the speech extractor configured to receive the first speech signal and a registered speech signal comprising speech samples from the target speaker, the speech extractor configured to generate a mask to remove portions of the first speech signal to extract the second speech signal,
the method further comprises the steps of:
receiving a training dataset, each member of the training dataset comprising a first signal comprising speech of a target speaker, a registration signal having sample speech from the target speaker, and clean speech from the target speaker corresponding to the first signal as a second signal; and
the speech extractor is trained using the first signal and the registration signal as inputs and the second signal as a desired output, the training using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing the intelligibility of the signal.
In another embodiment, there is provided an audio processing apparatus for processing an audio signal to enhance a target component of the audio signal, the apparatus comprising:
an input for receiving a first audio signal comprising a target component in a first environment;
an audio signal extractor configured to process the first audio signal to extract a second audio signal, the second audio signal comprising the target component in a second environment, the second environment having less noise than the first environment; and
a mixer configured to mix the first audio signal with the second audio signal to produce a third audio signal, the third audio signal comprising the extracted target component.
The methods described above may be embodied in software or hardware, for example, as a computer-readable medium, which may be transitory or non-transitory, including instructions that when executed by a computer cause the computer to perform any of the methods described above.
FIG. 1A schematically shows a speech processing system 1 according to an embodiment. The speech processing system 1 comprises a processor 3 that executes a program 5. The speech processing system 1 further comprises a memory 7. The memory 7 stores data used by the program 5 to process speech to extract denoised speech of the target speaker. This procedure may be referred to as a speaker extraction procedure. The speech processing system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an audio input 15. The audio input 15 may be, for example, a microphone that receives an input audio signal. Alternatively, the audio input 15 may be a means for receiving a file of audio data from an external storage medium or network.
Connected to the output module 13 is an audio output 17. The audio output 17 is for outputting a processed speech signal. The audio output 17 may be, for example, a direct audio output such as a speaker, or an output of an audio data file that may be transmitted to a storage medium or via a network or the like.
In use, the speech processing system 1 receives a first audio signal via the audio input 15. The program 5 executing on the processor 3 denoises the first audio signal using data stored in the memory 7 to extract the speech of the target speaker. The data stored in the memory 7 contains audio data related to the target speaker. The denoised speech is output via the output module 13 to an audio output 17. The denoised speech is particularly suitable for input into an Automatic Speech Recognizer (ASR).
FIG. 1B schematically shows a speech processing system according to another embodiment. In this variant, the program 5 includes a speaker extraction (SPX) program 6A and an Automatic Speech Recognizer (ASR) program 6B. The speaker extraction program 6A produces the denoised speech of the target speaker. This denoised speech is then input into the ASR program 6B, where it is converted to text. In this embodiment, the output module 13 is a text output module that provides text to the output 19.
The embodiments of FIGS. 1A and 1B may be provided in a laptop or desktop computer. However, the embodiments of FIGS. 1A and 1B may also be provided in a mobile phone, a tablet computer, or any type of processor with an audio input. The system may be used, for example, to keep notes or logs without manual transcription, where users speak into a microphone to record their notes and have them automatically transcribed into text. However, in other embodiments, the text output may be used as part of a voice command interface.
Fig. 2 shows an overview of the functions performed by the processor. It should be noted that here, the output of the SPX unit 51 is provided to the ASR unit 53. However, other options are possible, for example, the output may be provided into a command interface for voice control.
In FIG. 2, an input signal A containing the voice of the target speaker is supplied to the SPX unit 51. In this example, the output of the SPX unit 51 is provided to an ASR unit 53. However, the output may also be provided to a voice control command interface. The input signal A also contains background noise, e.g. where the input signal A is collected via a mobile phone in a noisy environment.
The SPX unit 51 also receives an "enrollment speech signal" (registration signal), which is clean speech from the target speaker. The enrollment speech signal may be any speech of the target speaker and need not be specific text. The enrollment speech may be captured just prior to inference, or may be pre-collected and reused during inference. If previously collected, it may be stored in memory and retrieved at inference time. In an embodiment, the length of the enrollment signal is a few seconds (e.g., 3 s to 6 s). For training, acoustic and lexical variability in the enrollment data is beneficial. For example, enrollment data may be randomly selected from the available clean target data.
The signal from which the target voice is to be separated is then input into the SPX unit 51. Fig. 3 is a schematic diagram showing a subunit of the SPX unit 51.
First, the input signal A is supplied to the spectral encoder 61. The spectral encoder includes at least one convolution layer and is configured to transform the speech signal into a higher-dimensional space. A higher-dimensional space may better exploit speech sparsity for source separation. The dimension of the space is given by the number of convolution kernels in the CNN. In an embodiment, the number of convolution kernels (N) is at least equal to the number of input samples (L).
A first output of the spectral encoder 61 is provided to an extractor 63. The extractor is configured to calculate a mask for the target speaker. The mask is then applied by element-wise combination in the mixer 67. The extractor 63 exploits the sparsity of the speech in the spectral domain and the long-term correlation of its temporal characteristics.
In an embodiment, to achieve this, the extractor includes a plurality of deep convolutional layers that convolve over temporally separated frames to capture the long-term correlations of the temporal features. In an embodiment, the plurality of deep convolutional layers are provided with increasing dilation factors, which perform a multi-resolution analysis of the input mixture signal.
The extractor 63 also receives a representation of the enrollment (registration) speech signal, which is output by the speaker encoder 65. The speaker encoder includes one or more deep convolutional layers.
The mask output from the extractor is then applied to a second output of the spectral encoder 61 to extract the target speaker from that second output. The mixer 67 performs an element-wise combination of the second output of the spectral encoder and the output of the extractor 63.
The decoder 69 reconstructs the estimated target frame s'(n) from the masked spectral representation using a single fully connected layer (N input dimensions and L output dimensions). An overlap-add method is applied to reconstruct the entire waveform.
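As an illustration only, a minimal PyTorch sketch of such a decoder is given below. It assumes a frame length of L = 20 samples with a hop of L/2, as in the example configuration described later; the class name, argument names and the absence of any synthesis window are assumptions of this sketch rather than features of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Maps each masked N-dimensional frame back to L time-domain samples and
    reconstructs the waveform by overlap-add (hop = L // 2)."""
    def __init__(self, num_kernels: int = 256, frame_len: int = 20):
        super().__init__()
        self.frame_len = frame_len
        self.linear = nn.Linear(num_kernels, frame_len, bias=False)  # N inputs -> L outputs

    def forward(self, masked: torch.Tensor) -> torch.Tensor:
        # masked: [batch, N, frames] -> per-frame waveforms -> overlap-add
        frames = self.linear(masked.transpose(1, 2))                 # [batch, frames, L]
        hop = self.frame_len // 2
        n_frames = frames.shape[1]
        out_len = hop * (n_frames - 1) + self.frame_len
        waveform = F.fold(
            frames.transpose(1, 2),                                  # [batch, L, frames]
            output_size=(1, out_len),
            kernel_size=(1, self.frame_len),
            stride=(1, hop),
        )                                                            # [batch, 1, 1, samples]
        return waveform.squeeze(1).squeeze(1)                        # [batch, samples]
```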
A more detailed description of the speech extractor will now be given with reference to FIG. 4. The speech extractor of FIG. 4 has the basic structure of the extractor of FIG. 3, although an extractor having the structure of FIG. 3 is not limited to the exact structure of FIG. 4. FIG. 4 is an example of an extractor based on temporal convolutional network (TCN) blocks. However, other architectures are possible, for example different model architectures such as LSTMs or RNNs; extractors operating in the time domain or the frequency domain; networks using different loss functions in the time domain or frequency domain; extraction networks using different embeddings than those described above; and extraction networks that are jointly trained or in which one of the layers/units is trained separately. The above method uses the enrollment signal to bias the output; however, a system that is pre-trained for the target speaker may also be used, for example for a closed set of speakers.
To avoid any unnecessary repetition, like reference numerals are used to refer to like features throughout. The purpose of the spectral encoder 61 is to transform the input speech waveform into a higher-dimensional space (E_spec). In this example, the input speech waveform is a time-domain waveform y(n). The input signal y(n) may be a sequence of floating-point numbers in the range -1 to 1 (or it may be converted to quantization levels as integers). In an embodiment, the input signal is a digitized version of the continuous signal captured by a microphone that measures the change in sound pressure. In an embodiment, the input signal is a digital input signal. The input signal is then divided into windows of fixed duration. To illustrate this example, a time window of 1 ms to 2 ms is used. The time windows may also overlap each other.
Each time window of the input signal y(n) is then provided to the spectral encoder 61. In an embodiment, the input layer of the CNN is an array/tensor with dimensions [batch_size, 1, time_length]. The batch_size may be 1 or more. time_length is the length of the waveform (time-domain signal) in samples. For example, this may be fixed at 4 seconds, giving 64000 samples for fs = 16 kHz. The first 1D CNN of the encoder accepts this input and outputs a tensor with dimensions [batch_size, number_of_kernels, number_of_time_frames]. number_of_kernels is given (N = 256), and number_of_time_frames is calculated from the total length of the signal, the frame length (L = 20), and the hop size (L/2). Thus, in an embodiment, the 1D CNN performs both the windowing and the "spectral" analysis.
The spectral encoder 61 comprises a CNN layer 101 followed by a rectified linear unit (ReLU) 103 for activation. In this example, the CNN has N convolution kernels of size L with a stride of L/2. It should be noted that the spectral encoder operates on the input time-domain signal y(n). The function of the spectral encoder is to produce an output similar to a spectral representation, but it does not strictly convert the signal into the frequency domain. Instead, the parameters of CNN 101 are trained so that a high-dimensional sparse output is achieved.
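A minimal PyTorch sketch of such a learnable spectral encoder might look as follows; the default values N = 256 and L = 20 (stride L/2 = 10) follow the example configuration given later, while the class name and the small test snippet are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    """Learnable 'spectral' analysis: a 1-D CNN with N kernels of size L, stride L/2, then ReLU."""
    def __init__(self, num_kernels: int = 256, kernel_size: int = 20):
        super().__init__()
        # Input: [batch_size, 1, time_length]; output: [batch_size, N, number_of_time_frames]
        self.conv = nn.Conv1d(1, num_kernels, kernel_size, stride=kernel_size // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.relu(self.conv(waveform))

# Example: a 4-second utterance at 16 kHz gives 64000 samples.
if __name__ == "__main__":
    y = torch.randn(1, 1, 64000)       # digitized time-domain input y(n)
    e_spec = SpectralEncoder()(y)      # high-dimensional, non-negative representation E_spec
    print(e_spec.shape)                # torch.Size([1, 256, 6399])
```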
In the same manner as described above for y(n), the enrollment speech signal s_e(n) is also a time-domain signal. In an embodiment, this signal is converted into a digital signal in the same manner as y(n), i.e. divided into time-domain windows. In the same way as the spectral encoder 61, the speaker encoder 65 also includes a CNN layer 105 followed by a rectified linear unit (ReLU) 107 for activation.
The output of the activation layer 107 is then provided to a Time Convolutional Network (TCN) 109.
FIG. 4B shows a schematic layered structure of the TCNs, such as TCN 109. Here, the TCN block is formed by three CNN layers, parametric ReLU (PReLU) activations, and mean and variance normalization (G-NORM) across both the time and channel dimensions, scaled by trainable bias and gain parameters. The depthwise convolution (D-CONV, H convolution kernels of size P) operates independently on each input channel with a given dilation factor. The end-point 1D CNNs (B convolution kernels) are used to adjust the channel dimension.
In detail, the first layer of TCN is 1-dimensional CNN 111. Next, the G-NORM layer 113 provides mean and variance normalization across both the time dimension and the channel dimension scaled by trainable bias and gain parameters. The output of this layer 113 is then provided to the PReLU layer 115. In an embodiment, the PReLU layer is used because it can help prevent the gradient vanishing problem.
Next, the signal is processed by the D-CONV layer 117, which performs a depthwise convolution. This is followed by another G-NORM layer 119, which outputs to a second PReLU layer 121, whose output is provided to a second 1D CNN layer 123.
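For illustration, a minimal PyTorch sketch of one TCN block following the layer order just described is given below. The residual connection around the block and the use of nn.GroupNorm(1, ...) to realise the G-NORM (mean/variance normalisation over time and channels with trainable gain and bias) are assumptions of this sketch, not requirements of the embodiment.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One temporal convolution block: 1x1 CNN -> G-NORM -> PReLU -> dilated depthwise
    convolution (D-CONV) -> G-NORM -> PReLU -> 1x1 CNN, with an assumed residual connection."""
    def __init__(self, in_channels: int = 256, hidden_channels: int = 512,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2        # keep the number of time frames unchanged
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, 1),            # 1x1 CNN (B -> H channels)
            nn.GroupNorm(1, hidden_channels),                      # norm over time + channels
            nn.PReLU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size,
                      dilation=dilation, padding=pad,
                      groups=hidden_channels),                     # D-CONV: one kernel per channel
            nn.GroupNorm(1, hidden_channels),
            nn.PReLU(),
            nn.Conv1d(hidden_channels, in_channels, 1),            # 1x1 CNN back to B channels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)                                     # residual connection (assumed)
```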
The output of TCN 109 is then provided to a time-averaging operator 125 to produce a spectral representation E_spk of the registration signal.
In the extractor 63, the output E_spec of the spectral encoder 61 is channel-wise normalized and then processed by the bottleneck 1x1 CNN layer (B convolution kernels) 133 before being fed to the first TCN in a series of cascaded TCNs 135, 139. TCNs 135 and 139 each have the structure described with reference to FIG. 4B. However, the dilation of the D-CONV is varied such that the dilation factor across consecutive TCN blocks is 2^mod(i,X), where i is the block index running from 0 to XR-1. The TCN blocks are arranged in series into R groups of X TCNs each, such that there are XR TCNs in total. X is a hyper-parameter controlling how many TCNs are present in each group, and mod is the modulo operation.
In this example, the target speaker embedding E_spk is combined with the output of the second TCN using a multiplicative point-wise adaptation layer 137. However, other point-wise combinations may also be used.
After the last TCN, the mask dimension is adjusted to the dimension of the output of the spectral encoder 61 using another 1x1 CNN layer 141 with N output channels, thereby facilitating their point-wise multiplication. The output of this CNN layer 141 is passed through an activation layer 143, which generates the mask. The mask is then applied, using the mixer 67, to the second output E_spec from the spectral encoder 61 in order to extract the speech of the target speaker from E_spec.
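Purely as a sketch, the following shows how the cascaded TCN blocks, the multiplicative combination with the speaker embedding E_spk, and the final mask could be wired together, reusing the TCNBlock sketch above. The dilation schedule 2^mod(i, X) follows the description; the choice of a sigmoid activation layer, the exact normalisation applied to E_spec, the point at which E_spk is fused, and the assumption that E_spk has B dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Cascaded TCN blocks with dilation 2**(i mod X), multiplicative speaker adaptation,
    and a final 1x1 CNN + activation producing the mask that is applied to E_spec."""
    def __init__(self, spec_channels: int = 256, bottleneck: int = 256,
                 hidden: int = 512, X: int = 8, R: int = 4, fuse_after: int = 2):
        super().__init__()
        self.norm = nn.GroupNorm(1, spec_channels)                  # normalisation of E_spec (assumed type)
        self.bottleneck = nn.Conv1d(spec_channels, bottleneck, 1)   # bottleneck 1x1 CNN 133
        self.blocks = nn.ModuleList(
            [TCNBlock(bottleneck, hidden, dilation=2 ** (i % X)) for i in range(X * R)]
        )
        self.fuse_after = fuse_after                                # multiplicative adaptation 137
        self.mask_conv = nn.Conv1d(bottleneck, spec_channels, 1)    # 1x1 CNN 141 (N output channels)
        self.activation = nn.Sigmoid()                              # activation layer 143 (assumed sigmoid)

    def forward(self, e_spec: torch.Tensor, e_spk: torch.Tensor) -> torch.Tensor:
        x = self.bottleneck(self.norm(e_spec))
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i + 1 == self.fuse_after:                            # combine E_spk point-wise
                x = x * e_spk.unsqueeze(-1)                         # broadcast over time frames
        mask = self.activation(self.mask_conv(x))
        return mask * e_spec                                        # masked spectral representation
```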
The result is then provided to the decoder, which comprises a fully connected layer 147 and outputs the extracted time-domain signal s'(n).
Before returning to the overall system of fig. 2, training of the system of fig. 3 will be explained. This training is appropriate for the particular arrangement of fig. 4, but is not limited to this particular architecture.
For training, a training set is built, each member of the training set comprising three parts:
1) Input speech (y(n)) comprising the speech of the target speaker in background noise, possibly produced by other speakers or by non-speech sounds;
2) An enrollment (registration) speech sample (s_e(n)) from the target speaker; and
3) Clean speech (s(n)) of the target speaker corresponding to the input speech.
The training set will contain the three above-mentioned parts for many different target speakers. After training, the system should be able to adapt to the new speaker simply by receiving the speaker's registered speech. No retraining is required for a new speaker nor is it required to include that speaker in the original training set.
In an embodiment, the system of fig. 3 is co-trained, i.e. the spectral encoder 61, the speaker encoder 65, the extractor 63 and the decoder 69 are trained together.
In an embodiment, the training objective for the extractor is to maximize the scale-invariant signal-to-distortion ratio (SISDR), which is defined as:
SISDR(s, s') = 10 log10( ||a*s||^2 / ||a*s - s'||^2 ),  with  a = <s', s> / ||s||^2   (1)
where s' and s represent the estimated target speaker signal and the oracle target speaker signal, respectively, the oracle target speaker signal being clean speech.
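A minimal PyTorch sketch of the SISDR computation of equation (1) is shown below; the mean (DC) removal and the small epsilon are common implementation conventions rather than part of the definition above.

```python
import torch

def sisdr(s: torch.Tensor, s_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-distortion ratio in dB.
    s: clean (oracle) target signal, s_hat: estimated target signal; shape [batch, samples]."""
    s = s - s.mean(dim=-1, keepdim=True)            # remove DC offset (common convention)
    s_hat = s_hat - s_hat.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scaled target a * s.
    a = (s_hat * s).sum(dim=-1, keepdim=True) / (s.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = a * s
    noise = s_hat - target
    ratio = target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

# Training maximises the SISDR, i.e. minimises its negative:
# loss_sisdr = -sisdr(clean, estimate).mean()
```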
In another embodiment, a different loss function is used, which is:
L_new = L_SISDR(s, s') + L_STOI(s, s')   (2)
the loss function combines the SISDR loss of equation (1) above with a perceptual excitation term based on a short-term objective intelligibility (STOI) metric.
STOI is an indicator that objectively evaluates the intelligibility of noisy speech processed by time-frequency weighting, which produces a high degree of correlation with human perception. It requires a clean reference signal s (n) that is compared against the processed signal s' (n). The comparison requires about 400 milliseconds of speech and is performed in a compressed DFT-based space as described below.
FIG. 5 is a flowchart showing the steps of calculating the STOI. First, in step S201, short-time Fourier transforms (STFTs) of the reference (i.e. clean) signal and the processed signal are calculated, denoted S(k, m) and S'(k, m), respectively. The indices k and m represent the current frequency bin and time frame, respectively. Then, at S203, the one-third octave band Root Mean Square (RMS) energies are calculated, with a minimum center frequency of 150 Hz:
S_j(m) = sqrt( Σ_{k = k1(j)}^{k2(j) - 1} |S(k, m)|^2 )
where j is the one-third octave band index, and k1(j) and k2(j) are its band edges. In step S205, the one-third octave band RMS energy of the processed signal is normalized, using a context of N (e.g., 30) preceding consecutive frames, to match the energy of the clean signal.
Next, at step S207, the scaled one-third octave band RMS energy of the processed signal is clipped using the following rule:
S'_clip,j(m) = min( a * S'_j(m), (1 + 10^(-β/20)) * S_j(m) )
where a is the normalization factor from step S205 and β (e.g., -15) is the lower bound of the signal distortion. In step S209, the intermediate intelligibility score d_j(m) is calculated as the cross-correlation between the one-third octave band RMS energies of the reference signal and of the normalized and clipped processed signal.
In step S211, the final STOI measure is calculated by averaging the intermediate intelligibility scores over all bands and frames:
d = (1 / (J * M)) Σ_j Σ_m d_j(m)
where J and M are the total numbers of one-third octave bands and signal frames, respectively. The STOI loss L_STOI in equation (2) is based on this measure. Instead of STOI, other speech intelligibility metrics may be used, such as the normalized-covariance-based speech transmission index, the glimpse proportion (GP), the Speech Intelligibility Index (SII), PESQ, or deep-learning methods such as STOI-Net.
Thus, in this embodiment, the extractor is trained taking into account both the SISDR and the STOI.
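To make steps S201 to S211 concrete, the sketch below implements a simplified, differentiable STOI-style loss term in PyTorch. It follows the flowchart of FIG. 5 but omits details of the reference STOI implementation, such as resampling to 10 kHz, silent-frame removal and exact band-edge handling; the STFT parameters, the 15-band layout and the function name are assumptions for illustration. In practice a published STOI implementation may be used, and the combined training loss of equation (2) is then the sum of this term and the (negated) SISDR term.

```python
import torch

def stoi_like_loss(s: torch.Tensor, s_hat: torch.Tensor, fs: int = 16000,
                   n_fft: int = 512, hop: int = 256, n_frames: int = 30,
                   n_bands: int = 15, beta: float = -15.0, eps: float = 1e-8) -> torch.Tensor:
    """Simplified, differentiable STOI-style loss following steps S201-S211.
    s / s_hat: clean reference and processed signals, shape [batch, samples]."""
    win = torch.hann_window(n_fft, device=s.device)
    # S201: magnitude STFTs S(k, m) and S'(k, m).
    S = torch.stft(s, n_fft, hop_length=hop, window=win, return_complex=True).abs()
    Sp = torch.stft(s_hat, n_fft, hop_length=hop, window=win, return_complex=True).abs()

    # S203: one-third octave band RMS energies, minimum center frequency 150 Hz.
    freqs = torch.linspace(0.0, fs / 2, n_fft // 2 + 1, device=s.device)
    centres = 150.0 * 2.0 ** (torch.arange(n_bands, device=s.device) / 3.0)
    in_band = ((freqs[None, :] >= centres[:, None] * 2 ** (-1 / 6)) &
               (freqs[None, :] < centres[:, None] * 2 ** (1 / 6))).float()
    X = torch.sqrt(torch.einsum('jk,bkm->bjm', in_band, S.pow(2)) + eps)   # clean, [B, J, M]
    Y = torch.sqrt(torch.einsum('jk,bkm->bjm', in_band, Sp.pow(2)) + eps)  # processed

    # Split into overlapping analysis segments of N consecutive frames (about 400 ms of speech).
    Xs = X.unfold(2, n_frames, 1)                    # [B, J, M-N+1, N]
    Ys = Y.unfold(2, n_frames, 1)

    # S205: normalise the processed energies to match the clean energies per segment.
    a = torch.sqrt(Xs.pow(2).sum(-1, keepdim=True) / (Ys.pow(2).sum(-1, keepdim=True) + eps))
    Yn = a * Ys
    # S207: clip, with beta (dB) as the lower bound of the signal distortion.
    Yn = torch.minimum(Yn, Xs * (1.0 + 10.0 ** (-beta / 20.0)))

    # S209: intermediate intelligibility = correlation between clean and processed segments.
    Xc = Xs - Xs.mean(-1, keepdim=True)
    Yc = Yn - Yn.mean(-1, keepdim=True)
    d = (Xc * Yc).sum(-1) / (Xc.norm(dim=-1) * Yc.norm(dim=-1) + eps)

    # S211: average over all bands and segments; negated so that minimising improves intelligibility.
    return -d.mean(dim=(-2, -1))
```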
Returning to FIG. 2, the extracted signal output from the SPX unit 51 is provided to an Automatic Speech Recognizer (ASR) 53. However, prior to being input to the ASR 53, the extracted signal is mixed with a portion of the original input signal (A) to produce an enhanced signal (C). This allows processing artifacts in the extracted signal, which can affect the operation of the ASR, to be masked.
In an embodiment, remixing is controlled by a scalar σ (FIG. 2), which sets the multiplication factor α, and the output is calculated as z(n) = s'(n) + αy(n). How σ is selected is described below. By using the above, good results can be obtained from the ASR 53 without retraining the ASR 53 to match the processing distortion.
In an embodiment, σ is at least 0 and at most 30. In an embodiment, the value of σ is automatically adapted based on the amount of distortion found in the second signal. In another embodiment, a no-reference speech quality model may be used to evaluate distortion.
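A minimal sketch of the remixing step is given below. The exact rule linking σ to the multiplication factor α is not reproduced here; as an assumption for illustration, α is chosen so that the energy ratio between s'(n) and αy(n) equals σ dB, which is consistent with σ = ∞ meaning that no unprocessed input is added back.

```python
import numpy as np

def remix(y: np.ndarray, s_hat: np.ndarray, sigma_db: float, eps: float = 1e-12) -> np.ndarray:
    """Remix the unprocessed input y(n) with the extracted signal s'(n): z(n) = s'(n) + alpha * y(n).
    Assumption: alpha is chosen so that the energy ratio between s'(n) and alpha * y(n)
    equals sigma_db decibels (the exact selection rule is not given in the text above)."""
    if np.isinf(sigma_db):
        return s_hat                        # sigma = infinity: no unprocessed input is added back
    p_s = np.sum(s_hat ** 2)
    p_y = np.sum(y ** 2) + eps
    alpha = np.sqrt(p_s / (p_y * 10.0 ** (sigma_db / 10.0)))
    return s_hat + alpha * y

# Example: sigma = 0 dB mixes the enhancement signal and the scaled input at equal energy.
# z = remix(y, s_hat, sigma_db=0.0)
```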
To test the above, experiments were performed on CHiME-4 data, which contains simulated and real noisy speech recordings. The CHiME-4 corpus was designed to capture multi-channel speech in noisy everyday environments (such as cafés, buses, streets, or pedestrian areas) using a mobile tablet computing device. An SPX denoiser (speech extractor) is used, trained on clean Wall Street Journal (WSJ) speech artificially mixed with CHiME-4 noise.
Results are reported below for the single-channel real and simulated evaluation sets (et05) of CHiME-4 (conditions matched to the SPX training set). In addition, results are reported for mismatched test conditions using the VoiceBank-DEMAND (VBD) and WHAM! sets. For WHAM!, the max version of the test (tt) set was used, and all experiments were performed with 16 kHz data.
Performance was mainly evaluated in terms of Word Error Rate (WER); however, for some preliminary experiments, signal-to-distortion ratio (SDR) and STOI values are also reported. SDR scores were calculated using the BSSeval toolkit, and the STOI training loss was calculated using a freely available PyTorch implementation. Further details regarding the configuration of the denoising network and the ASR systems are given below.
For these tests, a speech extractor having the architecture described with reference to FIGS. 4A and 4B was used. The spectral encoder consists of a 1-D CNN with N=256 convolution kernels of size L=20 and a frame shift of 10 samples. For the extraction network, the TCN blocks are stacked with X=8 blocks repeated R=4 times. Each TCN block consists of 1x1 CNN and 1x3 depthwise convolutions with B=256 and H=512 convolution kernels, respectively. The fully connected layer in the decoder has an input dimension of 256 and an output dimension of 20.
The SPX system was trained using the clean WSJ training list from WSJ0-2mix with CHiME-4 noise artificially added. Approximately 39 hours of data were generated, with the signal-to-noise ratio (SNR) of the mixtures sampled uniformly in the range of 0 dB to 5 dB and the audio length varied randomly between 1 s and 6 s. The enrollment utterances used in training were chosen such that the enrollment signal and the mixture signal were different recordings. For the simulated CHiME-4, VBD and WHAM! test sets, the enrollment samples were selected from the available clean waveforms, while the close-talking microphone recordings were used for enrollment on the real CHiME-4 evaluation set.
Training was performed using the Adam optimizer [D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Int. Conf. Learning Repres., 2015], with an initial learning rate of 0.001, a block length of 4 seconds, and a minibatch size of 8. If the cross-validation loss did not improve over three consecutive epochs, the learning rate was halved. To avoid overfitting to the training data, all competing models were decoded at epoch 20.
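For illustration, a minimal training-loop sketch using the stated settings (Adam, initial learning rate 0.001, minibatch size 8, learning rate halved when cross-validation stops improving) is given below, reusing the sisdr and stoi_like_loss sketches above. The dataset interface, the model's (mixture, enrollment) call signature, and the equal weighting of the two loss terms are assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, valid_set, epochs: int = 20, device: str = "cuda"):
    """Sketch of SPX training: Adam, lr 0.001, minibatch size 8,
    and a halved learning rate after 3 epochs without cross-validation improvement."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)
    loader = DataLoader(train_set, batch_size=8, shuffle=True)

    for epoch in range(epochs):
        model.train()
        for mix, enroll, clean in loader:                 # y(n), s_e(n), s(n)
            mix, enroll, clean = mix.to(device), enroll.to(device), clean.to(device)
            est = model(mix, enroll)                      # estimated s'(n)
            loss = -sisdr(clean, est).mean() + stoi_like_loss(clean, est).mean()  # eq. (2)
            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            cv = sum(-sisdr(c.to(device), model(m.to(device), e.to(device))).mean().item()
                     for m, e, c in DataLoader(valid_set, batch_size=8))
        sched.step(cv)                                    # halves the lr if no improvement
```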
Two acoustic models were included in the ASR evaluation. The first model was trained on clean WSJ-SI284 data (WSJ-CLN) and has a 12-layer TDNNF topology [D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech, 2018, pp. 3743-3747], while the second model was trained on the standard noisy set from CHiME-4 (C4-ORG) and has a 14-layer TDNNF structure. The latter system employs all 6 channels from the real and simulated training sets of CHiME-4. Both models use 40-dimensional MFCCs and 100-dimensional i-vectors as acoustic features, and they are trained in KALDI using the lattice-free MMI criterion [D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, and V. Manohar, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, 2016, pp. 2751-2755]. Both a standard trigram and a more powerful RNN language model (LM) are used for decoding. After 3-fold speed perturbation, WSJ-CLN and C4-ORG have about 246 hours and 327 hours of training data, respectively.
Next, the results of research on the effectiveness of the proposed targeted speaker enhancement method to improve ASR robustness under matched and unmatched noise conditions will be presented.
Table 1. Performance of the proposed method under mismatched noisy conditions (denoising-SPX is trained on simulated noisy CHiME-4 data). WER (%) using the WSJ-CLN AM and 3-G LM.
First, the SPX denoiser was evaluated under mismatched conditions using the simulated noisy VBD and WHAM! test data (Table 1). The WER results in Table 1 use the WSJ-CLN AM and a trigram (3-G) LM. Although denoising-SPX was trained on simulated noisy CHiME-4 mixtures and therefore matches neither test set, it produced relative WER reductions of about 14% and 67% on the VBD and WHAM! test sets, respectively. The WER results in Table 1 also show that the combined SISDR and STOI training loss works better than the standard SISDR loss, although the SDR and STOI values of the normal and proposed systems are almost the same. Thus, the additional STOI term may help recover some of the temporal modulation of the speech that is distorted during enhancement. As used herein, a "normal" system is a denoising-SPX system that uses the standard SISDR training criterion (no STOI).
The next set of experiments, performed using the noisy CHiME-4 acoustic model (C4-ORG), evaluated the importance of the remixing ratio σ for ASR robustness.
Table 2. WER accuracy of the proposed speaker enhancement method for CHiME-4 for various remixing ratios σ. All results use the noisy CHiME-4 AM (C4-ORG, 3-G LM).
The results in Table 2 show that, for both the simulated and the real CHiME-4 test sets, an impressive WER reduction can be achieved by reducing the value of σ from ∞ (no unprocessed input added to the enhancement signal) to 0 dB. More specifically, the proposed denoising-SPX achieves relative WER reductions of about 28% and 33% for the simulated and real evaluation sets, respectively, simply by reducing the remixing ratio. These results are significant because neither the acoustic model nor the SPX model is retrained to produce these gains. The poorer performance of denoising-SPX at σ = ∞ compared to the unprocessed case can be attributed to the fact that the system was trained on anechoic simulated CHiME-4 noisy data, while the test sets also contain a small amount of reverberation. Another source of reduced accuracy may be the inherent distortion introduced by SPX, particularly in the case of real data. Denoising-SPX with speaker enhancement produced relative WER reductions of about 23% and 28% for the simulated and real sets, respectively.
Table 3. WER accuracy of the proposed method for CHiME-4 using the C4-ORG AM and RNN LM.
Table 3 shows the recognition accuracy of denoising-SPX for the single-channel CHiME-4 task. ASR was performed using the standard noisy C4-ORG acoustic model, and the 3-G results were rescored using an RNN-based language model.
The above embodiments illustrate a target speaker enhancement algorithm for improving ASR accuracy in noisy conditions without acoustic model retraining. Using a denoiser based on DNN speaker extraction, it was shown that remixing the noisy input with the enhancement signal achieves relative WER reductions of about 23% and 28% compared to the unprocessed case on the single-channel simulated and real CHiME-4 evaluation sets, respectively. Furthermore, experiments have shown that adding a perceptually motivated loss on top of the time-domain reconstruction loss during training of the speaker extraction system helps achieve a modest but consistent ASR accuracy gain.
The above embodiment provides at least one of the following:
(i) ASR performance of time-domain speech extraction for speech denoising of real and simulated data under matched and mismatched conditions;
(ii) a new loss function, based on an objective intelligibility index, for training a time-domain denoiser;
(iii) a speaker enhancement strategy to increase the robustness of the ASR model in noisy environments.
The embodiments described herein demonstrate that ASR accuracy under noisy conditions can be improved by using speaker enhancement strategies that do not require acoustic model retraining of distorted data. Instead of focusing on using existing enhancement algorithms to fully suppress the background, the embodiments described herein remix the enhancement signal with the raw input to mitigate processing artifacts, resulting in significant recognition accuracy gains without requiring retraining of the ASR's acoustic model.
While certain embodiments have been described, these embodiments are presented by way of example only and are not intended to limit the scope of the invention. Indeed, the novel apparatus and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the devices, methods, and products described herein may be made without departing from the spirit of the invention. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.

Claims (20)

1. A method for processing an audio signal to enhance a target component of the audio signal, the method comprising:
receiving a first audio signal comprising a target component in a first environment;
processing the first audio signal to extract a second audio signal comprising the target component in a second environment, the second environment having less noise than the first environment; and
the first audio signal is mixed with the second audio signal to produce a third audio signal that includes the extracted target component.
2. The method of claim 1, wherein the first audio signal comprises speech from a target speaker, the target component being speech from the target speaker.
3. The method of claim 2, wherein the third audio signal is provided to an automatic speech recognizer ASR, and text is recognized in the third audio signal by the ASR.
4. A method according to claim 2 or 3, wherein the noise in the second environment is noise due to a speaker other than the target speaker.
5. The method of any of claims 2 to 4, wherein the second signal is extracted from the first signal using a speech extractor, and wherein the speech extractor has been trained using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing intelligibility of the signal.
6. The method of claim 5 wherein the term representing the intelligibility of the signal is a short-time objective intelligibility, STOI, metric.
7. A method as claimed in claim 5 or 6, wherein the term representing the distortion in the signal is a scale-invariant signal-to-distortion ratio, SISDR.
8. The method of any of claims 2 to 7, wherein processing the first audio signal to extract the second audio signal comprises using a speech extractor configured to receive the first audio signal and a registered speech signal containing speech samples from a target speaker, the speech extractor configured to generate a mask to remove portions of the first audio signal to extract the second audio signal.
9. The method of claim 8, wherein the first audio signal is a time domain signal.
10. The method of claim 9, wherein the first audio signal is processed by a spectral encoder, the spectral encoder being a trainable convolutional network configured to transform the first audio signal into a higher dimensional signal.
11. The method of claim 10, wherein the speech extractor comprises a plurality of deep convolutional layers that allow for time convolution of the spectral input signal.
12. A method as claimed in any preceding claim, wherein the first audio signal is a single channel audio signal.
13. A method as claimed in any preceding claim, wherein the first audio signal and the second audio signal are mixed to form a third audio signal, as follows:
z(n)=s′(n)+αy(n)
where z (n) represents the third audio, s' (n) represents the second audio, y (n) represents the first audio, and α is a multiplication factor of the first audio signal.
14. The method of claim 13, wherein the multiplication factor is controlled by a scalar factor σ.
15. The method of claim 14, wherein σ is at least 0 and at most 30.
16. A method of training a speech processing apparatus for processing speech signals to enhance the speech of a target speaker, the apparatus comprising:
an input for receiving a first speech signal comprising speech from a target speaker in a first environment;
a speech extractor configured to process the first speech input to extract a second speech signal, the second speech signal comprising speech from the target speaker in a second environment, the second environment being less noisy than the first environment, the speech extractor configured to receive the first speech signal and a registered speech signal comprising speech samples from the target speaker, the speech extractor configured to generate a mask to remove portions of the first speech signal to extract the second speech signal; and
a mixer for combining a first speech signal with a second speech signal to produce a third speech signal, the third speech signal comprising extracted speech from a target speaker, the method comprising:
receiving a training dataset, each member of the training dataset comprising a first signal comprising speech of a target speaker, a registration signal having sample speech from the target speaker, and clean speech from the target speaker corresponding to the first signal as a second signal; and
the speech extractor is trained using the first signal and the registration signal as inputs and the second signal as a desired output, the training using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing the intelligibility of the signal.
17. The method of training a speech processing apparatus of claim 16 wherein the speech processing apparatus is configured for automatic speech recognition and further comprising an automatic speech recognition unit configured to receive a third speech signal and derive text from the third speech signal, wherein the automatic speech recognition unit is trained independently of the speech extractor.
18. A method of training a speech processing apparatus for processing speech signals to enhance the speech of a target speaker, the apparatus comprising:
an input for receiving a first speech signal comprising speech from a target speaker in a first environment; and
a speech extractor configured to process the first speech input to extract a second speech signal, the second speech signal comprising speech from the target speaker in a second environment, the second environment being less noisy than the first environment, the speech extractor configured to receive the first speech signal and a registered speech signal comprising speech samples from the target speaker, the speech extractor configured to generate a mask to remove portions of the first speech signal to extract the second speech signal,
the method further comprises the steps of:
receiving a training dataset, each member of the training dataset comprising a first signal comprising speech of a target speaker, a registration signal having sample speech from the target speaker, and clean speech from the target speaker corresponding to the first signal as a second signal; and
the speech extractor is trained using the first signal and the registration signal as inputs and the second signal as a desired output, the training using a loss function that is a combination of a term representing time-domain or frequency-domain distortion in the signal and a term representing the intelligibility of the signal.
19. An audio processing apparatus for processing an audio signal to enhance a target component of the audio signal, the apparatus comprising:
an input for receiving a first audio signal comprising a target component in a first environment;
an audio signal extractor configured to process the first audio signal to extract a second audio signal, the second audio signal comprising the target component in a second environment, the second environment having less noise than the first environment; and
a mixer configured to mix the first audio signal with the second audio signal to produce a third audio signal, the third audio signal comprising the extracted target component.
20. A computer readable medium comprising instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1 to 18.
CN202310163454.XA 2022-04-14 2023-02-16 Audio processing method and device Pending CN116913296A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2205590.9 2022-04-14
GB2205590.9A GB2617613B (en) 2022-04-14 2022-04-14 An audio processing method and apparatus

Publications (1)

Publication Number Publication Date
CN116913296A true CN116913296A (en) 2023-10-20

Family

ID=81753229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163454.XA Pending CN116913296A (en) 2022-04-14 2023-02-16 Audio processing method and device

Country Status (3)

Country Link
JP (1) JP7551805B2 (en)
CN (1) CN116913296A (en)
GB (1) GB2617613B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4162604B2 (en) 2004-01-08 2008-10-08 株式会社東芝 Noise suppression device and noise suppression method
US9741360B1 (en) * 2016-10-09 2017-08-22 Spectimbre Inc. Speech enhancement for target speakers
CN110914899B (en) 2017-07-19 2023-10-24 日本电信电话株式会社 Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method
JP7404657B2 (en) 2019-05-28 2023-12-26 沖電気工業株式会社 Speech recognition device, speech recognition program, and speech recognition method
CN111261146B (en) 2020-01-16 2022-09-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
WO2022056226A1 (en) * 2020-09-14 2022-03-17 Pindrop Security, Inc. Speaker specific speech enhancement

Also Published As

Publication number Publication date
GB2617613B (en) 2024-10-30
JP7551805B2 (en) 2024-09-17
GB2617613A (en) 2023-10-18
JP2023157845A (en) 2023-10-26
GB202205590D0 (en) 2022-06-01

Similar Documents

Publication Publication Date Title
Shon et al. Voiceid loss: Speech enhancement for speaker verification
Kleijn et al. Generative speech coding with predictive variance regularization
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
Delcroix et al. Strategies for distant speech recognitionin reverberant environments
Delcroix et al. Compact network for speakerbeam target speaker extraction
US20230162758A1 (en) Systems and methods for speech enhancement using attention masking and end to end neural networks
Braun et al. Effect of noise suppression losses on speech distortion and ASR performance
Chuang et al. Speaker-aware deep denoising autoencoder with embedded speaker identity for speech enhancement.
Sadjadi et al. Blind spectral weighting for robust speaker identification under reverberation mismatch
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Moritz et al. Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction
López-Espejo et al. Exploring filterbank learning for keyword spotting
Delfarah et al. Deep learning for talker-dependent reverberant speaker separation: An empirical study
Zorilă et al. Speaker reinforcement using target source extraction for robust automatic speech recognition
Chen et al. Time domain speech enhancement with attentive multi-scale approach
Chao et al. Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR
Chhetri et al. Speech enhancement: A survey of approaches and applications
Chen et al. CITISEN: A deep learning-based speech signal-processing mobile application
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
Sarfjoo et al. Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model
JP7551805B2 (en) Audio processing method and apparatus
Guzewich et al. Improving Speaker Verification for Reverberant Conditions with Deep Neural Network Dereverberation Processing.
CN116013343A (en) Speech enhancement method, electronic device and storage medium
Guimarães et al. Optimizing time domain fully convolutional networks for 3D speech enhancement in a reverberant environment using perceptual losses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination