
Speech Enhancement with Convolutional-Recurrent Networks
Han Zhao1, Shuayb Zarar2, Ivan Tashev2 and Chin-Hui Lee3
Apr. 19th

1Machine Learning Department, Carnegie Mellon University


2Microsoft Research
3School of Electrical Engineering, Georgia Institute of Technology

1
Speech Enhancement — Motivation

ASR system - Training phase

(Diagram: Clean Speech → Black-box ASR → Text stream)

2
Speech Enhancement — Motivation

ASR system - Inference phase

(Diagram: Noisy Speech → Black-box ASR (fixed) → Text stream)

3
Speech Enhancement — Motivation

Distribution mismatch

(Diagram: distribution mismatch between Noisy Speech at inference and Clean Speech at training)

• Similar issues arise in rendering and perception


• Clean speech is preferred for playback

4
Speech Enhancement — Motivation

Speech enhancement: from noisy to clean

(Diagram: Noisy Speech → Speech Enhancement → Clean Speech)

5
Outline

• Background

• Data-driven Approach

• Convolutional-Recurrent Network for Speech Enhancement

• Conclusion

6
Background

Problem setup:

y[t] = x[t] + n[t]

where y[t] is the noisy signal, x[t] the clean signal, and n[t] the (unknown) noise.

Typical assumptions on noise:

• Stationarity: the statistics of n[t] are independent of t
• Noise type: n[t] has a specific distributional form (e.g., Gaussian)

Classic methods: spectral subtraction (Boll 1979), minimum mean-squared error estimator (Ephraim et al. 1984), subspace approach (Ephraim et al. 1995)
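
To make the classic baseline concrete, here is a minimal spectral-subtraction sketch in Python. It is a rough illustration of the Boll (1979) idea, assuming a numpy magnitude spectrogram and noise-only leading frames; names and parameters are illustrative, not from the slides:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_frames=10, floor=0.02):
    """Rough sketch of spectral subtraction on a (t, f) magnitude spectrogram."""
    # Estimate the noise spectrum from leading frames assumed to be noise-only.
    noise_est = noisy_mag[:noise_frames].mean(axis=0, keepdims=True)
    # Subtract the estimate and clamp to a spectral floor to avoid negatives.
    return np.maximum(noisy_mag - noise_est, floor * noisy_mag)
```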

7
Background

Classic methods are based on statistical assumptions about the noise.

Pros:
• Simple and computationally efficient
• Optimal under the right assumptions
• Interpretable

Cons:
• Limited to stationary noise
• Restricted to noise with specific characteristics

8
Data-driven Approach

What if we could collect large datasets of paired signals?

9
Data-driven Approach

What if we could collect large datasets of paired signals?


Given:
• Paired signals {(y_i, x_i)}_{i=1}^n (noisy, clean)

Goal:
• Build a function approximator f such that f(y_i) ≈ x_i

In short: a regression-based approach, usually minimizing the mean-squared error
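
A toy sketch of this regression view, with hypothetical numpy data; a linear least-squares f stands in here for the neural networks discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 256))        # clean frames (hypothetical)
Y = X + 0.3 * rng.standard_normal(X.shape)  # noisy frames = clean + noise

# Fit a linear approximator f(y) = y @ W under the MSE objective.
W, *_ = np.linalg.lstsq(Y, X, rcond=None)
print("training MSE:", np.mean((Y @ W - X) ** 2))
```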

10
Data-driven Approach

Parametric regression using Neural Networks:

• Flexible for representation learning
• Scales linearly in the number of training examples and the input dimension
• Natural paradigm for multi-task learning by sharing common representations

(Figure from Lu et al., Interspeech 2013)

11


Data-driven Approach

Related work for speech enhancement

• Recurrent network for noise reduction, Maas et al., ISCA 2012


• Deep denoising auto-encoder, Lu et al., Interspeech 2013
• Weighted denoising auto-encoder, Xia et al., Interspeech 2013
• DNN with symmetric context window, Xu et al., IEEE SPL 2014
• Hybrid of DNN and suppression-rule estimation, Mirsamadi et al., Interspeech 2016

12
Data-driven Approach

Speech Enhancement Pipeline:


• Short-term Fourier Transform (STFT) to obtain the time-frequency signal:

  Y = STFT(y)

• Build a neural network to approximate a filter function f such that f(Y) ≈ X
  (focus of this talk)

• Apply the inverse STFT (ISTFT) to reconstruct the sound wave:

  x̂ = ISTFT(f(Y))
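
A runnable sketch of this three-stage pipeline, using scipy's STFT/ISTFT, a placeholder in place of the network, and the common trick of reusing the noisy phase; all parameters are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
y = np.random.randn(fs * 5)                # stand-in for a noisy waveform

_, _, Y = stft(y, fs=fs, nperseg=512)      # waveform -> complex spectrogram

f = lambda mag: mag                        # placeholder for the learned filter
X_hat = f(np.abs(Y)) * np.exp(1j * np.angle(Y))  # enhance magnitude, keep phase

_, x_hat = istft(X_hat, fs=fs, nperseg=512)  # spectrogram -> waveform
```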
13
Convolutional-Recurrent Networks for SE

Problem setup:

Given a time-frequency signal, i.e., a spectrogram pair (Y, X), where Y, X ∈ R^{t×f} are the noisy and clean spectrograms.

For each utterance there are typically a few hundred frames t and a few hundred frequency bins f.


14
Convolutional-Recurrent Networks for SE
Observations:

Existing DNN-based approaches do not fully exploit the structure of speech signals.

• The frame-based DNN regression approach does not use the temporal locality of the spectrogram
• The fully connected DNN regression approach does not exploit the continuity of consecutive frequency bins in the spectrogram

15
Convolutional-Recurrent Networks for SE
Observations:

Existing DNN-based approaches do not fully exploit the structure of speech signals.

• The frame-based DNN regression approach does not use the temporal locality of the spectrogram

  • Use recurrent neural networks

• The fully connected DNN regression approach does not exploit the continuity of consecutive frequency bins in the spectrogram

  • Use convolutional neural networks

16
Convolutional-Recurrent Networks for SE
Proposed: Convolution + bi-LSTM + Linear Regression

Objective: minimize the mean-squared error between the enhanced and clean spectrograms
17
Convolutional-Recurrent Networks for SE
Proposed: Convolution + bi-LSTM + Linear Regression

At a high level, why will this model work?


• Continuity of signal in time and frequency domains
• Convolution kernels as linear filters to match local patterns
• bi-LSTM -> symmetric context window with adaptive window size
• End-to-end learning without additional assumptions on noise type

18
Convolutional-Recurrent Networks for SE
Convolution

Zero-padded spectrogram of size (t, f) * convolution kernel of size (b, w) = feature map of size (t, f′)

19
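
A PyTorch sketch of this step; sizes are illustrative, not the paper's exact hyperparameters. Zero-padding along time preserves all t frames:

```python
import torch
import torch.nn as nn

t, f = 100, 256                   # frames x frequency bins (illustrative)
spec = torch.randn(1, 1, t, f)    # (batch, channel, t, f) spectrogram

b, w, k = 11, 32, 64              # kernel size (b, w), k kernels
conv = nn.Conv2d(1, k, kernel_size=(b, w), padding=(b // 2, 0))
maps = conv(spec)                 # (1, k, t, f') with f' = f - w + 1
print(maps.shape)                 # torch.Size([1, 64, 100, 225])
```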
Convolutional-Recurrent Networks for SE
Concatenation of feature maps

k feature maps, each of size (t, f′) → one feature map of size (t, k·f′)
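
In code, this concatenation is just a transpose and reshape (continuing the illustrative sizes above):

```python
import torch

k, t, fp = 64, 100, 225           # k feature maps of size (t, f')
maps = torch.randn(1, k, t, fp)

# (1, k, t, f') -> (1, t, k*f'): one sequence of length t for the bi-LSTM.
seq = maps.permute(0, 2, 1, 3).reshape(1, t, k * fp)
print(seq.shape)                  # torch.Size([1, 100, 14400])
```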

20
Convolutional-Recurrent Networks for SE
bi-directional LSTM

State transition function of the LSTM cell (standard form):

  i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
  f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
  o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
  c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
  h_t = o_t ⊙ tanh(c_t)
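
A direct numpy transcription of these update equations; gate parameters are passed as dicts keyed 'i', 'f', 'o', 'c'. This is a sketch, not an optimized cell:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step following the equations above."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    h = o * np.tanh(c)
    return h, c
```

A bi-directional layer runs one such cell forward and one backward in time and concatenates their outputs, e.g. torch.nn.LSTM(..., bidirectional=True).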

21
Convolutional-Recurrent Networks for SE
Linear Regression with Projection

At each time step t:

  x̂_t = W h_t + b

where h_t is the output state of the bi-LSTM at time step t.

Objective function and Optimization

MSE: L = (1/T) Σ_t ‖x̂_t − x_t‖²

Optimization algorithm: AdaDelta

22
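
Putting the three pieces together, a hedged end-to-end PyTorch sketch; layer sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of the Conv + bi-LSTM + linear-regression pipeline."""
    def __init__(self, f=256, k=64, b=11, w=32, hidden=256):
        super().__init__()
        self.conv = nn.Conv2d(1, k, kernel_size=(b, w), padding=(b // 2, 0))
        self.lstm = nn.LSTM(k * (f - w + 1), hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, f)   # x_hat_t = W h_t + b

    def forward(self, spec):                   # spec: (batch, t, f)
        z = self.conv(spec.unsqueeze(1))       # (batch, k, t, f')
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, t, k*f')
        h, _ = self.lstm(z)                    # (batch, t, 2*hidden)
        return self.proj(h)                    # (batch, t, f)

model = CRNN()
noisy, clean = torch.randn(4, 100, 256), torch.randn(4, 100, 256)
opt = torch.optim.Adadelta(model.parameters())
loss = nn.MSELoss()(model(noisy), clean)       # the MSE objective above
loss.backward()
opt.step()
```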


Experiments
Dataset
Single channel, Microsoft-internal data
• Cortana utterances: male, female and children
• Sampling rate: 16kHz
• Storage format: 24-bit precision
• Each utterance: 5–9 seconds
• Noise: subset of MS noise collection, 377 files with 25 types
• 48 room impulse responses from MS RIR collection

             Training  Validation  Test (seen noise)  Test (unseen noise)
# utterances 7,500     1,500       1,500              1,500

23
Experiments
Evaluation Metric
• Signal-to-Noise Ratio (SNR), in dB

• Log-spectral Distance (LSD)

• Mean-squared Error in time domain (MSE)

• Word error rate (WER)

• Perceptual evaluation of speech quality P.862 (PESQ)
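
Hedged numpy sketches of the first two metrics follow; the LSD form below is one common convention, and WER and PESQ require external tooling:

```python
import numpy as np

def snr_db(clean, estimate):
    """SNR in dB between a clean reference signal and an estimate."""
    noise = clean - estimate
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def lsd(clean_mag, est_mag, eps=1e-8):
    """Log-spectral distance between (t, f) magnitude spectrograms."""
    diff = 20 * np.log10((clean_mag + eps) / (est_mag + eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))
```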

24
Experiments
Comparison with State-of-the-Art Methods
• Classic noise suppressor
• DNN-Symmetric (Xu et al. 2015)
  • Multilayer perceptron, 3 hidden layers (2048x3), 11-frame symmetric context window
• DNN-Causal (Tashev et al. 2016)
  • Multilayer perceptron, 3 hidden layers (2048x3), 7-frame causal context window
• Deep-RNN (Maas et al. 2012)
  • Recurrent autoencoders, 3 hidden layers (500x3), 3-frame context window

All models are trained using AdaDelta

25
Experiments
Comparison with State-of-the-Art Methods (seen noise)
            SNR (dB)  LSD     MSE       WER (%)  PESQ
Noisy data  15.18     23.07   0.04399   15.40    2.26
Classic NS  18.82     22.24   0.03985   14.77    2.40
DNN-s       44.51     19.89   0.03436   55.38    2.20
DNN-c       40.70     20.09   0.03485   54.92    2.17
RNN         41.08     17.49   0.03533   44.93    2.19
Ours        49.79     15.17   0.03399   14.64    2.86
Clean data  57.31      1.01   0.0000     2.19    4.48

26
Experiments
Comparison with State-of-the-Art Methods (unseen noise)
            SNR (dB)  LSD     MSE       WER (%)  PESQ
Noisy data  14.78     23.76   0.04786   18.40    2.09
Classic NS  19.73     22.82   0.04201   15.54    2.26
DNN-s       40.47     21.07   0.03741   54.77    2.16
DNN-c       38.70     21.38   0.03718   54.13    2.13
RNN         44.60     18.81   0.03665   52.05    2.06
Ours        39.70     17.06   0.04721   16.71    2.73
Clean data  58.35      1.15   0.0000     1.83    4.48

27
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; MS-Cortana)

28
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; DNN)

29
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; RNN)

30
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; Ours)

31
Conclusion
• Convolutions help capture local patterns

• Recurrence helps model sequential structure

• Our model improves SNR by about 35 dB and PESQ by 0.6

• With a fixed ASR system, it improves WER by about 1% absolute

• Good generalization to unseen noise

32
Conclusion

Thanks

33
