Lecture 9 - Speech Recognition
Andrew Senior
(DeepMind London)
Many thanks to Vincent Vanhoucke, Heiga Zen, Jun Song & Andrew Zisserman for slides
February 21st, 2017. Oxford University
Outline
Speech recognition
Acoustic representation
Phonetic representation
History
Probabilistic speech recognition
Other topics
Speech recognition problem
What is speech — physical realisation
[Figure: speech waveform — amplitude vs. time]
Speech as communication
• TIMIT
− Hand-marked phone boundaries given
− 630 speakers × 10 utterances
• Wall Street Journal (WSJ): 1986 read speech; WSJ0 1991, 30k vocabulary
• Broadcast News (BN): 1996, 104 hours
• Switchboard (SWB): 1992, 2,000 hours of spontaneous telephone speech,
500 speakers
• Google voice search
− anonymized live traffic: 3M utterances, 2,000 hours, hand-transcribed,
4M-word vocabulary; constantly refreshed; synthetic reverberation +
additive noise
• DeepSpeech: 5,000 hours of read (Lombard) speech + SWB, with additive
noise
• YouTube: 125,000 hours of aligned captions (Soltau et al., 2016)
Rough History
Probabilistic speech recognition
We choose the decoder output as the most likely sequence ŵ from all
possible sequences, Σ∗, for an observation sequence o:
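The equation itself did not survive the export; the standard Bayes-rule formulation that this sentence sets up (with the acoustic model p(o | w) and the language model P(w), and P(o) dropped because it does not affect the argmax) is:

\[
\hat{w} \;=\; \operatorname*{arg\,max}_{w \in \Sigma^{*}} P(w \mid o)
\;=\; \operatorname*{arg\,max}_{w \in \Sigma^{*}} \frac{p(o \mid w)\, P(w)}{p(o)}
\;=\; \operatorname*{arg\,max}_{w \in \Sigma^{*}} p(o \mid w)\, P(w)
\]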
Hybrid networks
• Fully connected
• Convolutional networks (CNNs)
• Recurrent neural networks (RNNs)
− LSTMs
− GRUs
• RNN acoustic models:
− RNN (Robinson and Fallside, 1991)
− LSTM (Graves et al., 2013)
− Deep LSTM-P (Sak et al., 2014b)
− CLDNN (Sainath et al., 2015a)
− GRU: DeepSpeech 1/2 (Amodei et al., 2015)
• Bidirectional (Schuster and Paliwal, 1997)
helps, but introduces latency.
• Dependencies are not long at speech frame rates (100 Hz).
• Frame stacking and down-sampling help (see the sketch after this list).
• Ensemble of BLSTMs
• i-vectors for speaker normalization
− an i-vector is an embedding of the audio trained to discriminate between
speakers (speaker ID)
• Interpolated n-gram + LSTM language model.
• 5.8% WER on SWB (vs. 5.9% for humans) (Xiong et al., 2016).
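To make the frame-stacking and down-sampling bullet concrete, here is a minimal NumPy sketch; the window of 8 frames and stride of 3 are illustrative choices, not values from the lecture:

```python
import numpy as np

def stack_and_downsample(frames, stack=8, stride=3):
    """Stack `stack` consecutive feature frames into one super-frame,
    then keep every `stride`-th super-frame, lowering the frame rate."""
    num_frames, feat_dim = frames.shape
    out = [frames[t:t + stack].reshape(-1)
           for t in range(0, num_frames - stack + 1, stride)]
    return np.stack(out) if out else np.empty((0, stack * feat_dim))

# 100 frames/second of 40-dim log-mel features for a one-second utterance:
feats = np.random.randn(100, 40)
print(stack_and_downsample(feats).shape)  # (31, 320): ~33 super-frames/second
```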
Cross Entropy Training
[Figure: frame-level phone posteriors (0–1) over time for an utterance with phone sequence “sil m j u z i @ m z I n S @ k A g oU sil” (roughly “museums in Chicago”).]
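Cross-entropy training treats each frame as an independent classification of its aligned state/phone label. A minimal PyTorch sketch of one update step; the feature dimension, label inventory and network size are illustrative assumptions, not the lecture's configuration:

```python
import torch
import torch.nn as nn

FEAT_DIM, NUM_CLASSES = 40, 42            # assumed: 40-dim features, 42 phone labels

# A small frame classifier standing in for the acoustic model.
model = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),          # unnormalised per-frame scores (logits)
)
criterion = nn.CrossEntropyLoss()         # log-softmax + NLL, averaged over frames
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One utterance: T frames of features and T frame labels
# (in a real system the labels come from a forced alignment).
T = 300
features = torch.randn(T, FEAT_DIM)
labels = torch.randint(0, NUM_CLASSES, (T,))

optimizer.zero_grad()
logits = model(features)                  # (T, NUM_CLASSES)
loss = criterion(logits, labels)          # average per-frame cross-entropy
loss.backward()
optimizer.step()
```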
Sequence discriminative training
The loss is the expected error, accumulated over the R training utterances:

\[
\mathcal{L}(\theta) \;=\; \sum_{r=1}^{R} \mathcal{L}(X^{(r)}, \theta)
\;=\; \sum_{r=1}^{R} \sum_{S} P(S \mid X^{(r)})\, e(S, S^{(r)})
\]

where the inner sum runs over competing state sequences S and e(S, S^{(r)}) counts the errors of S against the reference sequence S^{(r)}.
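As a toy illustration of the expected-error idea, here is a sketch that evaluates it over an N-best list rather than a full lattice; the hypotheses, posteriors and the use of edit distance as e(·, ·) are illustrative assumptions:

```python
import numpy as np

def edit_distance(hyp, ref):
    """Levenshtein distance between two label sequences."""
    d = np.zeros((len(hyp) + 1, len(ref) + 1), dtype=int)
    d[:, 0] = np.arange(len(hyp) + 1)
    d[0, :] = np.arange(len(ref) + 1)
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + (hyp[i - 1] != ref[j - 1]))  # substitution
    return d[-1, -1]

def expected_error(nbest, posteriors, reference):
    """Expected error sum_S P(S|X) e(S, S_ref) over an N-best list."""
    posteriors = np.asarray(posteriors, dtype=float)
    posteriors /= posteriors.sum()
    return sum(p * edit_distance(hyp, reference)
               for hyp, p in zip(nbest, posteriors))

# Toy 3-best list of phone sequences for one utterance.
reference = ["m", "j", "u", "z", "i"]
nbest = [["m", "j", "u", "z", "i"],      # correct
         ["m", "u", "z", "i"],           # one deletion
         ["m", "j", "u", "s", "i"]]      # one substitution
posteriors = [0.6, 0.3, 0.1]
print(expected_error(nbest, posteriors, reference))  # ≈ 0.4 (= 0.6·0 + 0.3·1 + 0.1·1)
```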
Sequence2Sequence
[Figure: multichannel raw-waveform acoustic model — two input channels x1[t], x2[t] pass through time convolutions (tConv1, tConv2), pooling and a nonlinearity, then a frequency convolution (fConv), LSTM layers and a DNN (a CLDNN) predicting the output targets, with multi-task “clean features” outputs. Side panels show learned impulse responses (time in milliseconds) and beampatterns (frequency in kHz vs. direction of arrival, DOA) for channels 0 and 1.]
References
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014). Convolutional neural networks for
speech recognition. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 22(10):1533–1545.
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A.,
Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Han, T., Hannun, A. Y., Jun, B., LeGresley, P., Lin, L.,
Narang, S., Ng, A. Y., Ozair, S., Prenger, R., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, Y.,
Wang, Z., Wang, C., Xiao, B., Yogatama, D., Zhan, J., and Zhu, Z. (2015). Deep Speech 2: End-to-end speech
recognition in English and Mandarin. CoRR, abs/1512.02595.
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent
neural networks. CoRR, abs/1506.03099.
Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. (2015). Listen, attend and spell. CoRR, abs/1508.01211.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition.
CoRR, abs/1506.07503.
Chung, J. S., Senior, A. W., Vinyals, O., and Zisserman, A. (2016). Lip reading sentences in the wild. CoRR,
abs/1611.05358.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent
DBN-HMMs. In ICASSP.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of
the Royal Statistical Society, Series B, 39(1):1–38.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling
unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on
Machine Learning, pages 369–376. ACM.
Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.
Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.
Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems.
In ICASSP.
Sak, H., Senior, A., and Beaufays, F. (2014a). Long Short-Term Memory Based Recurrent Neural Network Architectures for
Large Vocabulary Speech Recognition. ArXiv e-prints.
Sak, H., Senior, A., and Beaufays, F. (2014b). Long Short-Term Memory Recurrent Neural Network Architectures for Large
Scale Acoustic Modeling. In INTERSPEECH 2014.
Sak, H., Senior, A., Rao, K., Irsoy, O., Graves, A., Beaufays, F., and Schalkwyk, J. (2015). Learning acoustic frame labeling
for speech recognition with recurrent neural networks. In IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP).
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on,
45(11):2673–2681.
Senior, A. and Robinson, A. (1994). Forward-backward retraining of recurrent neural networks. In NIPS.
Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary
speech recognition. CoRR, abs/1610.09975.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and
Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. CoRR, abs/1609.03499.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural
networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3).
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K.,
Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K.,
Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M.,
and Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine
translation. CoRR, abs/1609.08144.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2016). Achieving human
parity in conversational speech recognition. CoRR, abs/1610.05256.