
Speech Enhancement with Convolutional-Recurrent Networks
Han Zhao1, Shuayb Zarar2, Ivan Tashev2 and Chin-Hui Lee3
Apr. 19th

1Machine Learning Department, Carnegie Mellon University


2Microsoft Research
3School of Electrical Engineering, Georgia Institute of Technology

1
Speech Enhancement — Motivation

ASR system - Training phase

(Diagram: Clean Speech → Black-box ASR → Text stream)

2
Speech Enhancement — Motivation

ASR system - Inference phase

(Diagram: Noisy Speech → Black-box ASR (fixed) → Text stream)

3
Speech Enhancement — Motivation

Distribution mismatch

(Diagram: distribution mismatch between Noisy Speech at inference and Clean Speech at training)

• Similar issues arise in rendering and perception


• Clean speech is preferred for playback

4
Speech Enhancement — Motivation

Speech enhancement: from noisy to clean

(Diagram: Noisy Speech → Speech Enhancement → Clean Speech)

5
Outline

• Background

• Data-driven Approach

• Convolutional-Recurrent Network for Speech Enhancement

• Conclusion

6
Background

Problem setup:

y[t] = x[t] + n[t]

where y[t] is the noisy signal, x[t] the clean signal, and n[t] the (unknown) noise.

Typical assumptions on noise:

• Stationarity: the statistics of n[t] are independent of t
• Noise type: n[t] has a specific distributional form (e.g., Gaussian)

Classic methods: spectral subtraction (Boll 1979), minimum mean-squared error estimator (Ephraim et al. 1984), subspace approach (Ephraim et al. 1995)
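
To make the classic baseline concrete, here is a minimal spectral-subtraction sketch in Python. It is a rough illustration of the Boll (1979) idea, assuming a numpy magnitude spectrogram and noise-only leading frames; names and parameters are illustrative, not from the slides:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_frames=10, floor=0.02):
    """Rough sketch of spectral subtraction on a (t, f) magnitude spectrogram."""
    # Estimate the noise spectrum from leading frames assumed to be noise-only.
    noise_est = noisy_mag[:noise_frames].mean(axis=0, keepdims=True)
    # Subtract the estimate and clamp to a spectral floor to avoid negatives.
    return np.maximum(noisy_mag - noise_est, floor * noisy_mag)
```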

7
Background

Classic methods are based on statistical assumptions about the noise.

Pros:
• Simple and computationally efficient
• Optimal under the right assumptions
• Interpretable

Cons:
• Limited to stationary noise
• Restricted to noise with specific characteristics

8
Data-driven Approach

What if we could collect large datasets of paired signals?

9
Data-driven Approach

What if we could collect large datasets of paired signals?


Given:
• Paired signals {(y_i, x_i)}_{i=1}^n (noisy, clean)

Goal:
• Build a function approximator f such that f(y_i) ≈ x_i

In short: a regression-based approach, usually minimizing the mean-squared error
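
A toy sketch of this regression view, with hypothetical numpy data; a linear least-squares f stands in here for the neural networks discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 256))        # clean frames (hypothetical)
Y = X + 0.3 * rng.standard_normal(X.shape)  # noisy frames = clean + noise

# Fit a linear approximator f(y) = y @ W under the MSE objective.
W, *_ = np.linalg.lstsq(Y, X, rcond=None)
print("training MSE:", np.mean((Y @ W - X) ** 2))
```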

10
Data-driven Approach

Parametric regression using Neural Networks:

• Flexible for representation learning
• Scales linearly in the number of training examples and the input dimension
• Natural paradigm for multi-task learning by sharing common representations

(Figure from Lu et al., Interspeech 2013)

11


Data-driven Approach

Related work for speech enhancement

• Recurrent network for noise reduction, Maas et al., ISCA 2012


• Deep denoising auto-encoder, Lu et al., Interspeech 2013
• Weighted denoising auto-encoder, Xia et al., Interspeech 2013
• DNN with symmetric context window, Xu et al., IEEE SPL 2014
• Hybrid of DNN and suppression-rule estimation, Mirsamadi et al., Interspeech 2016

12
Data-driven Approach

Speech Enhancement Pipeline:


• Short-term Fourier Transform (STFT) to obtain the time-frequency signal:

  Y = STFT(y)

• Build a neural network to approximate a filter function f such that f(Y) ≈ X
  (focus of this talk)

• Apply the inverse STFT (ISTFT) to reconstruct the sound wave:

  x̂ = ISTFT(f(Y))
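
A runnable sketch of this three-stage pipeline, using scipy's STFT/ISTFT, a placeholder in place of the network, and the common trick of reusing the noisy phase; all parameters are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
y = np.random.randn(fs * 5)                # stand-in for a noisy waveform

_, _, Y = stft(y, fs=fs, nperseg=512)      # waveform -> complex spectrogram

f = lambda mag: mag                        # placeholder for the learned filter
X_hat = f(np.abs(Y)) * np.exp(1j * np.angle(Y))  # enhance magnitude, keep phase

_, x_hat = istft(X_hat, fs=fs, nperseg=512)  # spectrogram -> waveform
```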
13
Convolutional-Recurrent Networks for SE

Problem setup:

Given a time-frequency signal, i.e., a spectrogram pair (Y, X), where Y, X ∈ R^{t×f} are the noisy and clean spectrograms.

For each utterance there are typically a few hundred frames t and a few hundred frequency bins f.


14
Convolutional-Recurrent Networks for SE
Observations:

Existing DNN-based approaches do not fully exploit the structure of speech signals.

• The frame-based DNN regression approach does not use the temporal locality of the spectrogram
• The fully connected DNN regression approach does not exploit the continuity of consecutive frequency bins in the spectrogram

15
Convolutional-Recurrent Networks for SE
Observations:

Existing DNN-based approaches do not fully exploit the structure of speech signals.

• The frame-based DNN regression approach does not use the temporal locality of the spectrogram

  • Use recurrent neural networks

• The fully connected DNN regression approach does not exploit the continuity of consecutive frequency bins in the spectrogram

  • Use convolutional neural networks

16
Convolutional-Recurrent Networks for SE
Proposed: Convolution + bi-LSTM + Linear Regression

Objective: minimize the mean-squared error between the enhanced and clean spectrograms
17
Convolutional-Recurrent Networks for SE
Proposed: Convolution + bi-LSTM + Linear Regression

At a high level, why will this model work?


• Continuity of signal in time and frequency domains
• Convolution kernels as linear filters to match local patterns
• bi-LSTM -> symmetric context window with adaptive window size
• End-to-end learning without additional assumptions on noise type

18
Convolutional-Recurrent Networks for SE
Convolution

Zero-padded spectrogram of size (t, f) * convolution kernel of size (b, w) = feature map of size (t, f′)

19
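
A PyTorch sketch of this step; sizes are illustrative, not the paper's exact hyperparameters. Zero-padding along time preserves all t frames:

```python
import torch
import torch.nn as nn

t, f = 100, 256                   # frames x frequency bins (illustrative)
spec = torch.randn(1, 1, t, f)    # (batch, channel, t, f) spectrogram

b, w, k = 11, 32, 64              # kernel size (b, w), k kernels
conv = nn.Conv2d(1, k, kernel_size=(b, w), padding=(b // 2, 0))
maps = conv(spec)                 # (1, k, t, f') with f' = f - w + 1
print(maps.shape)                 # torch.Size([1, 64, 100, 225])
```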
Convolutional-Recurrent Networks for SE
Concatenation of feature maps

k feature maps, each of size (t, f′) → one feature map of size (t, k·f′)
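
In code, this concatenation is just a transpose and reshape (continuing the illustrative sizes above):

```python
import torch

k, t, fp = 64, 100, 225           # k feature maps of size (t, f')
maps = torch.randn(1, k, t, fp)

# (1, k, t, f') -> (1, t, k*f'): one sequence of length t for the bi-LSTM.
seq = maps.permute(0, 2, 1, 3).reshape(1, t, k * fp)
print(seq.shape)                  # torch.Size([1, 100, 14400])
```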

20
Convolutional-Recurrent Networks for SE
bi-directional LSTM

State transition function of the LSTM cell (standard form):

  i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
  f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
  o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
  c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
  h_t = o_t ⊙ tanh(c_t)
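
A direct numpy transcription of these update equations; gate parameters are passed as dicts keyed 'i', 'f', 'o', 'c'. This is a sketch, not an optimized cell:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step following the equations above."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    h = o * np.tanh(c)
    return h, c
```

A bi-directional layer runs one such cell forward and one backward in time and concatenates their outputs, e.g. torch.nn.LSTM(..., bidirectional=True).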

21
Convolutional-Recurrent Networks for SE
Linear Regression with Projection

At each time step t:

  x̂_t = W h_t + b

where h_t is the output state of the bi-LSTM at time step t.

Objective function and Optimization

MSE: L = (1/T) Σ_t ‖x̂_t − x_t‖²

Optimization algorithm: AdaDelta

22
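
Putting the three pieces together, a hedged end-to-end PyTorch sketch; layer sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of the Conv + bi-LSTM + linear-regression pipeline."""
    def __init__(self, f=256, k=64, b=11, w=32, hidden=256):
        super().__init__()
        self.conv = nn.Conv2d(1, k, kernel_size=(b, w), padding=(b // 2, 0))
        self.lstm = nn.LSTM(k * (f - w + 1), hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, f)   # x_hat_t = W h_t + b

    def forward(self, spec):                   # spec: (batch, t, f)
        z = self.conv(spec.unsqueeze(1))       # (batch, k, t, f')
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, t, k*f')
        h, _ = self.lstm(z)                    # (batch, t, 2*hidden)
        return self.proj(h)                    # (batch, t, f)

model = CRNN()
noisy, clean = torch.randn(4, 100, 256), torch.randn(4, 100, 256)
opt = torch.optim.Adadelta(model.parameters())
loss = nn.MSELoss()(model(noisy), clean)       # the MSE objective above
loss.backward()
opt.step()
```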


Experiments
Dataset
Single channel, Microsoft-internal data
• Cortana utterances: male, female and children
• Sampling rate: 16kHz
• Storage format: 24-bit precision
• Each utterance: 5–9 seconds
• Noise: subset of MS noise collection, 377 files with 25 types
• 48 room impulse responses from MS RIR collection

             Training  Validation  Test (seen noise)  Test (unseen noise)
# utterances 7,500     1,500       1,500              1,500

23
Experiments
Evaluation Metric
• Signal-to-Noise Ratio (SNR), in dB

• Log-spectral Distance (LSD)

• Mean-squared Error in time domain (MSE)

• Word error rate (WER)

• Perceptual evaluation of speech quality P.862 (PESQ)
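
Hedged numpy sketches of the first two metrics follow; the LSD form below is one common convention, and WER and PESQ require external tooling:

```python
import numpy as np

def snr_db(clean, estimate):
    """SNR in dB between a clean reference signal and an estimate."""
    noise = clean - estimate
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def lsd(clean_mag, est_mag, eps=1e-8):
    """Log-spectral distance between (t, f) magnitude spectrograms."""
    diff = 20 * np.log10((clean_mag + eps) / (est_mag + eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))
```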

24
Experiments
Comparison with State-of-the-Art Methods
• Classic noise suppressor
• DNN-Symmetric (Xu et al. 2015)
  • Multilayer perceptron, 3 hidden layers (2048x3), 11-frame symmetric context window
• DNN-Causal (Tashev et al. 2016)
  • Multilayer perceptron, 3 hidden layers (2048x3), 7-frame causal context window
• Deep-RNN (Maas et al. 2012)
  • Recurrent autoencoders, 3 hidden layers (500x3), 3-frame context window

All models are trained using AdaDelta

25
Experiments
Comparison with State-of-the-Art Methods (seen noise)
            SNR (dB)  LSD     MSE       WER (%)  PESQ
Noisy data  15.18     23.07   0.04399   15.40    2.26
Classic NS  18.82     22.24   0.03985   14.77    2.40
DNN-s       44.51     19.89   0.03436   55.38    2.20
DNN-c       40.70     20.09   0.03485   54.92    2.17
RNN         41.08     17.49   0.03533   44.93    2.19
Ours        49.79     15.17   0.03399   14.64    2.86
Clean data  57.31      1.01   0.0000     2.19    4.48

26
Experiments
Comparison with State-of-the-Art Methods (unseen noise)
            SNR (dB)  LSD     MSE       WER (%)  PESQ
Noisy data  14.78     23.76   0.04786   18.40    2.09
Classic NS  19.73     22.82   0.04201   15.54    2.26
DNN-s       40.47     21.07   0.03741   54.77    2.16
DNN-c       38.70     21.38   0.03718   54.13    2.13
RNN         44.60     18.81   0.03665   52.05    2.06
Ours        39.70     17.06   0.04721   16.71    2.73
Clean data  58.35      1.15   0.0000     1.83    4.48

27
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; MS-Cortana)

28
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; DNN)

29
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; RNN)

30
Experiments
Case Study

(Spectrograms: Noisy vs. Clean; Ours)

31
Conclusion
• Convolutions help capture local patterns

• Recurrence helps model sequential structure

• Our model improves SNR by about 35 dB and PESQ by 0.6

• With a fixed ASR system, it improves WER by about 1% absolute

• Good generalization to unseen noise

32
Conclusion

Thanks

33
