Lecture 9 - Speech Recognition


Speech Recognition

Andrew Senior
(DeepMind London)
Many thanks for slides to Vincent Vanhoucke, Heiga Zen, Jun Song & Andrew Zisserman
February 21st, 2017. Oxford University
Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Speech recognition problem

Automatic speech recognition (ASR)

  [audio waveform] → “OK Google, directions home”

Text-to-speech synthesis (TTS)

  “Take the first left” → [audio waveform]



Speech problems

• Automatic speech recognition


− Spontaneous vs read speech
− Large vocabulary
− In noise
− Low resource
− Far-field
− Accent-independent
− Speaker-adaptive
• Text to speech
− Low resource
− Realistic prosody
• Speaker identification
• Speech enhancement
• Speech separation
Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
What is speech — physical realisation

• Waves of changing air pressure.


• Realised through excitation from the vocal cords
• Modulated by the vocal tract.
• Modulated by the articulators (tongue, teeth, lips).
• Vowels produced with an open vocal tract (stationary)
− Can be parameterized by position of tongue.
• Consonants are constrictions of vocal tract.
• Converted to voltage with a microphone.
• Sampled with an analogue-to-digital converter.

[Figure: waveform of amplitude vs. time, illustrating sampling and quantization]



Speech representation

• Human hearing is ~50Hz-20kHz


• Human speech is ~85Hz–8kHz
• Telephone speech has 8kHz sampling: 300Hz–4kHz bandwidth
• 1 bit per sample can be intelligible
• CD is 44.1kHz 16 bits per sample
• Contemporary speech processing mostly around 16kHz 16bits/sample



Speech representation

We want a low-dimensional representation, invariant to speaker, background noise, rate of speaking, etc.
• Fourier analysis shows energy in different frequency bands.
• Windowed short-term fast Fourier transform,
• e.g. FFT on overlapping 25ms windows (400 samples) taken every 10ms.
− Energy vs frequency [discrete] vs time [discrete]
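A minimal sketch of this windowed short-term FFT in Python/NumPy, using the 25 ms / 10 ms figures above at 16 kHz (the Hann taper and the helper name `power_spectrogram` are illustrative choices, not from the slides):

```python
import numpy as np

def power_spectrogram(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Windowed short-term FFT: overlapping 25 ms frames every 10 ms -> energy vs frequency vs time."""
    win = int(sample_rate * win_ms / 1000)        # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)        # 160 samples at 16 kHz
    window = np.hanning(win)                      # taper each frame (an assumed choice)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power per frequency bin, per frame

# one second of placeholder audio -> about 98 frames x 201 frequency bins
features = power_spectrogram(np.random.randn(16000))
```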



Mel frequency representation

• FFT is still too high-dimensional.


• Downsample by local weighted averages on the mel scale (non-linear spacing), and take a log: m = 1127 ln(1 + f/700)
• Results in log-mel features, the default for neural-network speech modelling (see the sketch below).
• 40+ dimensional features per frame
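A sketch of the mel downsampling using the formula above (the 125–7500 Hz band edges and the triangular filter shape are common defaults assumed here, not given in the slides); it consumes the `features` array from the previous sketch:

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def log_mel(power_spec, sample_rate=16000, n_fft=400, n_mels=40, fmin=125.0, fmax=7500.0):
    """Weighted local averages of the power spectrum on a mel scale, then a log."""
    # n_mels + 2 equally spaced points on the mel scale -> triangular filter edges in Hz
    mel_points = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)

    mel_energies = power_spec @ fbank.T           # (frames, n_mels)
    return np.log(mel_energies + 1e-10)           # log-mel features, ~40 dims per frame

log_mel_feats = log_mel(features)                 # `features` from the previous sketch
```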



MFCCs

• Mel Frequency Cepstral Coefficients (MFCCs) are the discrete cosine transform of the mel filterbank energies: whitened and low-dimensional (see the sketch after this list).
• Similar to Principal Components of log spectra.
• GMM speech recognition systems may use 13 MFCCs
• Perceptual Linear Prediction – a common alternative representation.
• Frame stacking: it’s common to concatenate several consecutive frames,
• e.g. 26 for a fully-connected DNN, 8 for an LSTM.
• GMMs used local differences (deltas) and second-order differences
(delta-deltas) to capture dynamics. (13 + 13 + 13 dimensional)
• Ultimately use ~39 dimensional linear discriminant analysis
(~class-aware PCA) projection of 9 stacked MFCC vectors.
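A sketch of MFCCs as the DCT of the log-mel energies, plus simple deltas and delta-deltas to give the 13 + 13 + 13 = 39-dimensional GMM-style features (the two-frame delta scheme here is a simplification of the usual regression formula):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(log_mel_feats, n_ceps=13):
    """Discrete cosine transform of the log mel filterbank energies, keeping 13 coefficients."""
    return dct(log_mel_feats, type=2, axis=1, norm='ortho')[:, :n_ceps]

def deltas(feats):
    """First-order differences between neighbouring frames (one of several delta schemes)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0

c = mfcc(log_mel_feats)                    # log_mel_feats from the previous sketch; (frames, 13)
d, dd = deltas(c), deltas(deltas(c))       # delta and delta-delta
gmm_features = np.hstack([c, d, dd])       # 13 + 13 + 13 = 39 dimensions per frame
```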



Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Speech as communication

• Speech evolved as communication to convey information.


• Consists of sentences (in ASR we usually talk about “utterances”)
• Sentences composed of words
• Minimal unit is a “phoneme”
− Minimal unit that distinguishes one word from another.
− Set of 40–60 distinct sounds.
− Vary per language.
− Universal representations:
◦ IPA: international phonetic alphabet,
◦ X-SAMPA (ASCII)
• Homophones
− distinct words with the same pronunciation: “there” vs “their”
• Prosody
− How something is said can convey meaning.
Datasets

• TIMIT
− Hand-marked phone boundaries given
− 630 speakers × 10 utterances
• Wall Street Journal (WSJ) 1986 Read speech. WSJ0 1991, 30k vocab
• Broadcast News (BN) 1996 104 hours
• Switchboard (SWB) 1992. 2000 hours spontaneous telephone speech
500 speakers
• Google voice search
− anonymized live traffic: 3M utterances (2,000 hours), hand-transcribed; 4M-word vocabulary. Constantly refreshed; synthetic reverberation + additive noise added.
• DeepSpeech 5000h read (Lombard) speech + SWB with additive
noise.
• YouTube 125,000 hours aligned captions (Soltau et al., 2016)
Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Rough History

• 1960s Dynamic Time Warping


• 1970s Hidden Markov Models
• Multi-layer perceptron 1986
• Speech recognition with neural networks 1987–1995
• Superseded by GMMs 1995–2009
• Neural network features 2002–
• Deep networks 2006– (Hinton, 2002)
• Deep networks for speech recognition
− Good results on TIMIT (Mohamed et al., 2009)
− Results on large vocabulary systems 2010 (Dahl et al., 2011)
− Google launches DNN ASR product 2011
− Dominant paradigm for ASR 2012 (Hinton et al., 2012)
• Recurrent networks for speech recognition 1990, 2012–
− New models (attention, LAS, neural transducer)
Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Probabilistic speech recognition

• Speech signal represented as an observation sequence o = {ot }.


• We want to find the most likely word sequence ŵ
• We model this with a Hidden Markov Model.
− The system has a set of discrete states,
− transitions from state to state according to transition probabilities
(Markovian: memoryless)
− The acoustic observation when making a transition is conditioned on the state alone: P(o_t|c_t)
− We seek to recover the state sequence and consequently the word
sequence.

<S> /TH/ /E/ /K/ /AE/ /T/



Fundamental equation of speech recognition

We choose the decoder output as the most likely sequence ŵ from all
possible sequences, Σ∗, for an observation sequence o:

ŵ = arg max_{w∈Σ∗} P(w|o)              (1)
  = arg max_{w∈Σ∗} P(o|w) P(w)         (2)

A product of Acoustic model and Language model scores.


P(o|w) = Σ_{d,c,p} P(o|c) P(c|p) P(p|w)        (3)

Where p is the phone sequence and c is the state sequence.



• We can model word sequences with a language model:

  P(w_1, w_2, …, w_N) = P(w_1) ∏_{i>1} P(w_i | w_1, …, w_{i−1})
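A toy illustration of this chain rule with a bigram approximation (the two-sentence corpus is made up; real language models use smoothed n-grams estimated from vastly more text):

```python
from collections import Counter
import math

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])
    bigrams.update(zip(sent[:-1], sent[1:]))

def log_p(sentence):
    """log P(w1..wN) = sum_i log P(w_i | w_{i-1}) under a bigram approximation (no smoothing)."""
    return sum(math.log(bigrams[(prev, w)] / unigrams[prev])
               for prev, w in zip(sentence[:-1], sentence[1:]))

print(log_p(["<s>", "the", "cat", "sat", "</s>"]))   # log(2/2) + log(1/2) + log(1/1) + log(2/2)
```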



Speech recognition as transduction
From signal to language.



Speech recognition as transduction – lexicon

Construct graph using Weighted Finite State Transducers (WFST)



Speech recognition as transduction

Compose Lexicon FST with Grammar FST L ◦ G



Phonetic units

• Phonemes: “cat” → /K/, /AE/, /T/


• Context independent HMM states k_1, k_2, ae_1, …
  − Model onset / middle / end separately.
• Context dependent states k_1.17, …
• Context dependent phones
• Diphones (pairs of half-phones)
• Syllables
• Word-parts cf Machine translation (Wu et al., 2016)
• Characters (graphemes)
• Whole words Sak et al. (2014a, 2015); Soltau et al. (2016)
− Hard to generalize to rare words.
Choice depends on language, size of dataset, task, resources available.



Context dependent phonetic clustering

• A phone’s realization depends on the preceding and following context


• Could improve discrimination if we model different contextual realizations separately:
  e.g. AE preceded by K, followed by T: AE+T-K
• But, if we have 42 phones and 3 states per phone, there are 42³ context-dependent phones, i.e. 3 × 42³ states.
• Most of these won’t be observed
• So cluster – group together similar distributions and train a joint
model.
• Have a “back-off” rule to determine which model to use for
unobserved contexts.
• Usually a decision tree.



Gaussian Mixture Models
• Dominant paradigm for ASR from 1990 to 2010
• Model the probability distribution of the acoustic features for each state (see the sketch after this list):
  P(o_t|c_i) = Σ_j w_ij N(o_t; μ_ij, σ_ij)
• Often use diagonal covariance Gaussians to keep number of
parameters under control.
• Train by the E-M algorithm (Dempster et al., 1977), alternating:
  − E: forced alignment, computing the maximum-likelihood state sequence for each utterance
  − M: parameter (μ, σ) estimation
• Complex training procedures to incrementally fit increasing numbers
of components per mixture.
− More components, better fit. 79 parameters / component.
• Given an alignment mapping audio frames to states, this is
parallelizable by state.
• Hard to share parameters / data across states.
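A sketch of this per-state mixture likelihood with diagonal covariances, evaluated in the log domain for stability (the weights, means and variances are random placeholders standing in for E-M-trained parameters):

```python
import numpy as np

def gmm_log_likelihood(o_t, weights, means, variances):
    """log P(o_t | state) for a diagonal-covariance mixture: log sum_j w_j N(o_t; mu_j, sigma_j)."""
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)      # Gaussian normalisers
    log_exp = -0.5 * np.sum((o_t - means) ** 2 / variances, axis=1)        # diagonal Mahalanobis terms
    log_components = np.log(weights) + log_norm + log_exp
    m = log_components.max()                                               # log-sum-exp trick
    return m + np.log(np.exp(log_components - m).sum())

# 8 components over 39-dimensional features: 39 means + 39 variances + 1 weight = 79 parameters each
rng = np.random.default_rng(0)
w = np.full(8, 1 / 8)
mu, var = rng.standard_normal((8, 39)), np.ones((8, 39))
print(gmm_log_likelihood(rng.standard_normal(39), w, mu, var))
```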
Forced alignment

• Forced alignment uses a model to compute the maximum likelihood


alignment between speech features and phonetic states.
• For each training utterance, construct the set of phonetic states for
the ground truth transcription.
• Use Viterbi algorithm to find ML monotonic state sequence
• Under constraints such as at least one frame per state.
• Results in a phonetic label for each frame.
• Can give hard or soft segmentation.
[Figure: frame-by-frame state posteriors from a forced alignment of the utterance “how cold is it outside” (phone sequence sil h aU k oU l d I z I t aU t s aI d sil), with each phone’s posterior rising and falling over time]



Forced alignment

With a transducer with states ci :

<S> /TH/ /E/ /K/ /AE/ /T/

Compute state likelihoods at time t:

  P(o_1,…,t | c_i) = Σ_j P(o_t|c_i) P(o_1,…,t−1 | c_j) P(c_i|c_j)

with transition probabilities P(c_i|c_j).

To find the best path:

  P(o_1,…,t | c_i) = max_j P(o_t|c_i) P(o_1,…,t−1 | c_j) P(c_i|c_j)
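A minimal sketch of the max version of this recursion (Viterbi) for forced alignment; the three-state left-to-right transition matrix and the toy observation likelihoods are invented for illustration:

```python
import numpy as np

def viterbi_align(obs_lik, transitions, start):
    """Best monotonic state sequence; transitions[j, i] = P(c_i | c_j), obs_lik[i, t] = P(o_t | c_i)."""
    n_states, n_frames = obs_lik.shape
    score = np.zeros((n_states, n_frames))
    back = np.zeros((n_states, n_frames), dtype=int)
    score[:, 0] = start * obs_lik[:, 0]
    for t in range(1, n_frames):
        for i in range(n_states):
            prev = score[:, t - 1] * transitions[:, i]       # P(o_1..t-1|c_j) P(c_i|c_j)
            back[i, t] = prev.argmax()
            score[i, t] = obs_lik[i, t] * prev.max()         # times P(o_t|c_i)
    path = [int(score[:, -1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[path[-1], t]))                  # walk the backpointers
    return path[::-1]                                        # one state label per frame

# toy example: 3 left-to-right states over 5 frames
obs = np.array([[0.6, 0.5, 0.1, 0.1, 0.1],
                [0.1, 0.2, 0.3, 0.2, 0.1],
                [0.1, 0.1, 0.1, 0.2, 0.5]])
trans = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
print(viterbi_align(obs, trans, start=np.array([1.0, 0.0, 0.0])))   # [0, 0, 1, 2, 2]
```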



Forced alignment t = 0

<S> /TH/ /E/ /K/ /AE/ /T/

Observation likelihoods P(o_t|c_i):

t →     1     2     3     4     5     6
/t/     0.1   0.1   0.1   0.1   0.2   0.1
/ae/    0.1   0.1   0.3   0.3   0.1   0.4
/k/     0.1   0.1   0.1   0.2   0.5   0.1
/e/     0.1   0.2   0.3   0.2   0.1   0.3
/th/    0.6   0.5   0.1   0.1   0.2   0.1

Start distribution P_{t=0}(c_i):

t →     0
/t/     0
/ae/    0
/k/     0
/e/     0
/th/    0
<s>     1.0



Forced alignment t = 1

<S> /TH/ /E/ /K/ /AE/ /T/

Observation likelihoods P(o_t|c_i):

t →     1     2     3     4     5     6
/t/     0.1   0.1   0.1   0.1   0.2   0.1
/ae/    0.1   0.1   0.3   0.3   0.1   0.4
/k/     0.1   0.1   0.1   0.2   0.5   0.1
/e/     0.1   0.2   0.3   0.2   0.1   0.3
/th/    0.6   0.5   0.1   0.1   0.2   0.1

State likelihoods P(o_1,…,t|c_i):

t →     0     1
/t/     0
/ae/    0
/k/     0
/e/     0
/th/    0     0.6
<s>     1.0



Forced alignment t = 2

<S> /TH/ /E/ /K/ /AE/ /T/

Observation likelihoods P(o_t|c_i):

t →     1     2     3     4     5     6
/t/     0.1   0.1   0.1   0.1   0.2   0.1
/ae/    0.1   0.1   0.3   0.3   0.1   0.4
/k/     0.1   0.1   0.1   0.2   0.5   0.1
/e/     0.1   0.2   0.3   0.2   0.1   0.3
/th/    0.6   0.5   0.1   0.1   0.2   0.1

State likelihoods P(o_1,…,t|c_i):

t →     0     1     2
/t/     0
/ae/    0
/k/     0
/e/     0           0.03
/th/    0     0.6   0.15
<s>     1.0



Forced alignment t = T

<S> /TH/ /E/ /K/ /AE/ /T/

Observation likelihoods P(o_t|c_i):

t →     1     2     3     4     5     6
/t/     0.1   0.1   0.1   0.1   0.2   0.1
/ae/    0.1   0.1   0.3   0.3   0.1   0.4
/k/     0.1   0.1   0.1   0.2   0.5   0.1
/e/     0.1   0.2   0.3   0.2   0.1   0.3
/th/    0.6   0.5   0.1   0.1   0.2   0.1

State likelihoods P(o_1,…,t|c_i):

t →     0     1     2
/t/     0
/ae/    0
/k/     0
/e/     0           0.03
/th/    0     0.6   0.15
<s>     1.0



Decoding

Speech recognition unfolds in much the same way.
• Now we have a graph instead of a straight-through path.
• Optional silences between words.
• Alternative pronunciation paths.
• Typically use max probability, and work in the log domain.
• Hypothesis space is huge, so we only keep a “beam” of the best paths, and can lose what would end up being the true best path.

[Figure: decoding graph over word alternatives such as “the cat”, “a dog”, “once”, “hello”]



Two main paradigms for neural networks for speech

• Use neural networks to compute nonlinear feature representations.


− “Bottleneck” or “tandem” features (Hermansky et al., 2000)
− Low-dimensional representation is modelled conventionally with
GMMs.
− Allows all the GMM machinery and tricks to be exploited.
• Use neural networks to estimate phonetic unit probabilities.



Neural network features

Train a neural network to discriminate classes.


Use output or a low-dimensional bottleneck layer representation as
features.
[Figure: feed-forward network with an input layer (x1…x4), hidden layers, a narrow bottleneck layer, and an output layer (y1…y5)]


Neural network features

• TRAP: Concatenate PLP-HLDA features and NN features.


• Bottleneck outperforms posterior features (Grezl et al., 2007)
• Generally DNN features + GMMs reach about the same performance
as hybrid DNN-HMM systems, but are much more complex.



Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Hybrid networks

• Train the network as a classifier with a softmax across the phonetic units.
• Train with cross-entropy.
• Softmax

  y(i) = exp(a(i, θ)) / Σ_{j=1}^{N} exp(a(j, θ))

  will converge to the posterior across phonetic states: P(c_i|o_t)



Hybrid Neural network decoding
Now we model P (o|c) with a Neural network instead of a Gaussian
Mixture model. Everything else stays the same.
P(o|c) = ∏_t P(o_t|c_t)                          (4)

P(o_t|c_t) = P(c_t|o_t) P(o_t) / P(c_t)           (5)
           ∝ P(c_t|o_t) / P(c_t)                  (6)

For observations o_t at time t and a CD state sequence c_t.
We can ignore P(o_t) since it is the same for all decoding paths.
The last term is called the “scaled posterior”:

log P(o_t|c_t) = log P(c_t|o_t) − α log P(c_t)    (7)
Empirically (by cross validation) we actually find better results with a
“prior smoothing” term α ≈ 0.8.
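A sketch of turning network outputs into scaled acoustic scores for decoding, following equation (7) (the state priors would normally be estimated from the training alignment; here they are uniform placeholders):

```python
import numpy as np

def scaled_log_likelihoods(logits, log_state_priors, alpha=0.8):
    """Scaled posterior: log P(o_t|c), up to a constant, = log P(c|o_t) - alpha * log P(c)."""
    m = logits.max(axis=1, keepdims=True)
    log_post = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))  # log softmax
    return log_post - alpha * log_state_priors      # subtract the smoothed log prior

# 3 frames, 4 context-dependent states, uniform prior as a placeholder
logits = np.random.randn(3, 4)
log_priors = np.log(np.full(4, 0.25))
acoustic_scores = scaled_log_likelihoods(logits, log_priors)   # fed to the decoder in place of GMM scores
```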
Input features
Neural networks can handle high-dimensional, correlated features.
Use stacked filterbank inputs: e.g. 26 stacked frames of 40-dimensional mel-spaced filterbanks.
Example filters learned in the first layer of a fully-connected network:

[Figure: 33 × 8 learned filters; each subimage is 40 frequency bins × 26 time frames]


Neural network architectures for speech recognition

• Fully connected
• Convolutional networks (CNNs)
• Recurrent neural networks (RNNs)
− LSTMs
− GRUs



Convolutional neural networks

• Time delay neural networks


− Waibel et al. (1989)
− Dilated convolutions (Peddinti et al., 2015)
• CNNs in time or frequency domain. Abdel-Hamid et al. (2014);
Sainath et al. (2013)
• Wavenet (van den Oord et al., 2016)



Recurrent neural networks

• RNNs
− RNN (Robinson and Fallside, 1991)
− LSTM Graves et al. (2013)
− Deep LSTM-P Sak et al. (2014b)
− CLDNN (Sainath et al., 2015a)
− GRU. DeepSpeech 1/2 (Amodei et al., 2015)
• Bidirectional (Schuster and Paliwal, 1997)
helps, but introduces latency.
• Dependencies not long at speech frame rates
(100Hz).
• Frame stacking and down-sampling help.



Human parity in speech recognition (Xiong et al.,
2016)

• Ensemble of BLSTMs
• i-vectors for speaker normalization
− i-vector is an embedding of audio trained to discriminate between
speakers. (Speaker ID)
• Interpolated n-gram + LSTM language model.
• 5.8% WER on SWB (vs 5.9% for human).



Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Cross Entropy Training

• GMMs were trained with Maximum Likelihood


• Conventional training uses the cross-entropy loss (see the sketch after this list):

  L_XENT(o_t, θ) = Σ_{i=1}^{N} y_t(i) log( y_t(i) / ŷ_t(i) )

• With large data we can use Viterbi (binary) targets: yt ∈ {0, 1}


− i.e. a hard alignment.
• Can also use a soft (Baum-Welch) alignment (Senior and Robinson,
1994)
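A sketch of this frame-level loss for hard (Viterbi) and soft (Baum-Welch) targets (NumPy only; a real trainer averages this over mini-batches and backpropagates through the softmax):

```python
import numpy as np

def xent(targets, posteriors, eps=1e-12):
    """L = sum_i y(i) * log(y(i) / yhat(i)); reduces to -log yhat(true) for one-hot targets."""
    y = np.clip(targets, eps, 1.0)                 # avoid log(0) for zero target entries
    return float(np.sum(targets * (np.log(y) - np.log(posteriors + eps))))

yhat = np.array([0.7, 0.2, 0.1])                   # network output for one frame
print(xent(np.array([1.0, 0.0, 0.0]), yhat))       # hard Viterbi target: -log 0.7
print(xent(np.array([0.8, 0.2, 0.0]), yhat))       # soft Baum-Welch target
```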



Connectionist Temporal Classification (Graves et al.,
2006)

• CTC is a bundle of alternatives to the conventional system:

  − CTC introduces an optional blank symbol between the “real” labels.
  − Simple to implement in the FST framework – an optional blank between each label:

    - /K/ - /AE/ - /T/ -

− Continuous realignment — no need for a bootstrap model


− Always use soft targets.
− Don’t scale by posterior.
• Similar results to conventional training.
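A sketch of the CTC collapsing rule applied to a greedy best-label-per-frame path, showing how the optional blank separates repeated labels (frame-wise argmax is a simplification of proper CTC decoding):

```python
def ctc_collapse(frame_labels, blank="-"):
    """Merge repeated frame labels, then drop blanks: '- K K - AE AE - T -' -> ['K', 'AE', 'T']."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:                 # merge consecutive repeats
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != blank]   # remove the blank symbol

frames = ["-", "K", "K", "-", "AE", "AE", "-", "T", "-"]
print(ctc_collapse(frames))               # ['K', 'AE', 'T']
```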



CTC alignments

[Figure: CTC frame-label posteriors over time for the utterance “museums in Chicago” (sil m j u z i @ m z I n S @ k A g oU sil); each phone label fires as a brief spike, with the blank symbol <b> dominating between spikes]



Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Sequence discriminative training

• Conventional training uses Cross-Entropy loss


− Tries to maximize probability of the true state sequence given the
data.
• We care about Word Error Rate of the complete system.
• Design a loss that’s differentiable and closer to what we care about.
• Applied to neural networks (Kingsbury, 2009)
• Posterior scaling gets learnt by the network.
• Improves conventional training and CTC by ~15% relative.
• bMMI, sMBR (Povey et al., 2008)

  P(S_r|X_r) = p(X_r, S_r) / Σ_S p(X_r, S) = p(X_r|S_r) P(S_r) / Σ_S p(X_r|S) P(S)

  L_MMI(θ) = − Σ_{r=1}^{R} log P(S_r|X_r)

  L_MBR(θ) = Σ_{r=1}^{R} Σ_S P(S|X_r) e(S, S_r)
Sequence discriminative training



Sequence discriminative training



Outline

Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition

Neural network speech recognition


Hybrid neural networks
Training losses
Sequence discriminative training
New architectures

Other topics
Sequence2Sequence

• Basic sequence2sequence not that good for speech


− Utterances are too long to memorize
− Monotonicity of audio (vs Machine Translation)
• Attention + seq2seq for speech (Chorowski et al., 2015)
• Listen, Attend and Spell (Chan et al., 2015)
• Output characters until EOS
• Incorporates language model of training set.
• Harder to incorporate a separately-trained language model. (e.g.
trained on trillions of tokens)



Watch, Listen, Attend and Spell (Chung et al., 2016)
Apply LAS to audio and video streams simultaneously.

Train with scheduled sampling (Bengio et al., 2015)



Watch, Listen, Attend and Spell (Chung et al., 2016)



Neural transducer (Jaitly et al., 2015)

• Seq2seq models require the whole sequence to be available.


• Introduce latency compared to unidirectional.
• Solution: Transcribe monotonic chunks at a time with attention.



Neural transducer



Raw waveform speech recognition

• We typically train on a much-reduced dimensional signal.


• Would like to train end-to-end.
• Learn filterbanks, instead of hand-crafting.
• A conventional RNN at audio sample rate can’t learn long-enough
dependencies.
− Add a convolutional filter to a conventional system e.g.
CLDNN (Sainath et al., 2015b)
− WaveNet-style architecture. [See TTS talk on Thursday]
− Clockwork RNN (Koutník et al., 2014): run a hierarchical RNN at multiple rates.



Raw waveform speech recognition

Frequency distribution of learned filters differs from hand-initialization:



Speech recognition in noise

• Multi-style training (“MTS”)


− Collect noisy data.
− Or, add realistic but randomized noise to utterances during
training.
− e.g. through a “room simulator” to add reverberation.
− Optionally add a clean-reconstruction loss in training.
• Train a denoiser.
• NB Lombard effect – voice changes in noise.
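A sketch of the noise-mixing step in multi-style training, scaling a noise signal to a randomly drawn SNR before adding it (the SNR range and the white-noise stand-ins are arbitrary choices):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)               # stand-in for one second of speech
noisy = add_noise_at_snr(utterance, rng.standard_normal(16000), snr_db=rng.uniform(5, 25))
```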



Multi-microphone speech recognition

• Multiple microphones give a richer representation


• “Closest to the speaker” has better SNR
• Beamforming
− Given geometry of microphone array and speed of sound
− Compute Time Delay of Arrival at each microphone
− Delay-and-sum: Constructive interference of signal in chosen
direction.
− Destructive interference depends on direction / frequency of noise.
• More features for a neural network to exploit.
− Important to preserve phase information to enable beam-forming
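A sketch of delay-and-sum beamforming for a two-microphone array (the 5 cm spacing, 30° steering direction and whole-sample delay rounding are simplifying assumptions):

```python
import numpy as np

def delay_and_sum(channels, delays_s, sample_rate=16000):
    """Shift each microphone signal by its time delay of arrival, then average."""
    out = np.zeros_like(channels[0])
    for signal, delay in zip(channels, delays_s):
        shift = int(round(delay * sample_rate))      # whole-sample approximation of the TDOA
        out += np.roll(signal, -shift)               # advance the later-arriving channel
    return out / len(channels)

# two mics 5 cm apart, source 30 degrees off broadside, speed of sound ~343 m/s
spacing, angle, c = 0.05, np.deg2rad(30), 343.0
tdoa = spacing * np.sin(angle) / c                   # extra travel time to the second mic
mics = [np.random.randn(16000), np.random.randn(16000)]
enhanced = delay_and_sum(mics, delays_s=[0.0, tdoa])
```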



Factored multichannel raw waveform CLDNN (Sainath
et al., 2016)

[Figure: factored multichannel raw-waveform CLDNN. Two microphone channels x1[t], x2[t] pass through time-convolution filterbank layers (tConv1, tConv2), pooling and a nonlinearity, then frequency convolution (fConv), LSTM layers and DNN layers to the output targets, with a multi-task (MTL) branch predicting clean features. Right: learned impulse responses (time in milliseconds) and beampatterns (frequency in kHz vs. direction of arrival) for look directions p = 1 … 10.]



References I

Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014). Convolutional neural networks for
speech recognition. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 22(10):1533–1545.
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A.,
Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Han, T., Hannun, A. Y., Jun, B., LeGresley, P., Lin, L.,
Narang, S., Ng, A. Y., Ozair, S., Prenger, R., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, Y.,
Wang, Z., Wang, C., Xiao, B., Yogatama, D., Zhan, J., and Zhu, Z. (2015). Deep speech 2: End-to-end speech
recognition in english and mandarin. CoRR, abs/1512.02595.
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent
neural networks. CoRR, abs/1506.03099.
Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell. CoRR, abs/1508.01211.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition.
CoRR, abs/1506.07503.
Chung, J. S., Senior, A. W., Vinyals, O., and Zisserman, A. (2016). Lip reading sentences in the wild. CoRR,
abs/1611.05358.
Dahl, G., Yu, D., Li, D., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent
dbn-hmms. In ICASSP.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of
the Royal Statistical Society, 39(B):1 – 38.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling
unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on
Machine Learning, pages 369–376. ACM.
Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.
Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.
Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems.
In ICASSP.



References II
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and
Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing
Magazine, 29(6):82–97.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.
Jaitly, N., Le, Q. V., Vinyals, O., Sutskever, I., and Bengio, S. (2015). An online sequence-to-sequence model using partial
conditioning. CoRR, abs/1511.04868.
Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3761–3764, Taipei,
Taiwan.
Koutník, J., Greff, K., Gomez, F. J., and Schmidhuber, J. (2014). A clockwork RNN. CoRR, abs/1402.3511.
Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS.
Peddinti, V., Povey, D., and Khudanpur, S. (2015). A time delay neural network architecture for efficient modeling of long
temporal contexts. In Interspeech.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K. (2008). Boosted MMI for model
and feature-space discriminative training. In Proc. ICASSP.
Robinson, A. and Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech
and Language, 5(3):259–274.
Sainath, T. N., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional neural networks for LVCSR.
In IEEE International Conference on Acoustics, Speech and Signal Processing.
Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. (2015a). Convolutional, long short-term memory, fully connected deep
neural networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Sainath, T. N., Weiss, R., Senior, A., Wilson, K., and Vinyals, O. (2015b). Raw waveform CLDNNs. In Submitted to
Interspeech.
Sainath, T. N., Weiss, R. J., Wilson, K. W., Narayanan, A., and Bacchiani, M. (2016). Factored Spatial and Spectral
Multichannel Raw Waveform CLDNNs. In to appear in Proc. ICASSP.



References III

Sak, H., Senior, A., and Beaufays, F. (2014a). Long Short-Term Memory Based Recurrent Neural Network Architectures for
Large Vocabulary Speech Recognition. ArXiv e-prints.
Sak, H., Senior, A., and Beaufays, F. (2014b). Long Short-Term Memory Recurrent Neural Network Architectures for Large
Scale Acoustic Modeling. In INTERSPEECH 2014.
Sak, H., Senior, A., Rao, K., Irsoy, O., Graves, A., Beaufays, F., and Schalkwyk, J. (2015). Learning acoustic frame labeling
for speech recognition with recurrent neural networks. In IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP).
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on,
45(11):2673–2681.
Senior, A. and Robinson, A. (1994). Forward-backward retraining of recurrent neural networks. In NIPS.
Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary
speech recognition. CoRR, abs/1610.09975.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and
Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. CoRR, abs/1609.03499.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural
networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3).
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K.,
Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K.,
Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M.,
and Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine
translation. CoRR, abs/1609.08144.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2016). Achieving human
parity in conversational speech recognition. CoRR, abs/1610.05256.

