Abstract
In this paper, we investigate recurrent deep neural networks (DNNs) combined with regularization techniques such as dropout, zoneout, and a regularization post-layer. As a benchmark, we chose the TIMIT phone recognition task for its popularity and broad availability in the community. It also simulates a low-resource scenario, which is relevant for under-resourced languages. Moreover, we prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large-vocabulary continuous speech recognition task. In recent years, recurrent DNNs have pushed down the error rates in automatic speech recognition, but no clear winner has emerged among the proposed architectures. Dropout was used as the regularization technique in most cases, but its combination with other regularization techniques and with model ensembles was not explored. In our experiments, a plain ensemble of recurrent DNNs performed best, achieving an average phone error rate (PER) of 14.84% (minimum 14.69%) over 10 experiments on the core test set, which is slightly lower than the best published PER to date, to our knowledge. Finally, in contrast to most papers, we publish open-source scripts to make the results easy to replicate and to support further development.
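The two key ingredients named above, zoneout regularization and posterior averaging over a model ensemble, can be sketched as follows. This is a minimal illustration, not the authors' published scripts; the function names and shapes are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout_step(h_prev, h_new, rate=0.1, training=True):
    """One zoneout update (Krueger et al.): during training, each hidden
    unit keeps its previous value with probability `rate` instead of
    taking the new value; at test time the two are mixed deterministically."""
    if not training:
        return rate * h_prev + (1.0 - rate) * h_new
    keep_old = rng.random(h_prev.shape) < rate  # True -> preserve old unit
    return np.where(keep_old, h_prev, h_new)

def ensemble_posteriors(posterior_list):
    """Uniformly average frame-level phone posteriors over the models.

    posterior_list: list of arrays, each (n_frames, n_phones) with rows
    summing to 1. Returns the averaged posteriors, same shape."""
    return np.stack(posterior_list).mean(axis=0)

# Toy example: two "models", 3 frames, 2 phone classes.
a = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
b = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])
avg = ensemble_posteriors([a, b])
```

In a real recognizer the averaged posteriors would then be fed into the decoder in place of a single model's outputs; uniform weights are the simplest choice, and the abstract's result suggests even this plain averaging is competitive.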
Acknowledgments
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by a grant of the University of West Bohemia, project No. SGS-2016-039. Access to the computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
© 2018 Springer Nature Switzerland AG
Cite this paper
Vaněk, J., Michálek, J., Psutka, J. (2018). Recurrent DNNs and Its Ensembles on the TIMIT Phone Recognition Task. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_74
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3