Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario

Published: 18 August 2011

Abstract

In this article, we focus on keyword detection in children's speech as it is needed in voice command systems. We use the FAU Aibo Emotion Corpus which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario and investigate various recent keyword spotting techniques. As the principle of bidirectional Long Short-Term Memory (BLSTM) is known to be well-suited for context-sensitive phoneme prediction, we incorporate a BLSTM network into a Tandem model for flexible coarticulation modeling in children's speech. Our experiments reveal that the Tandem model prevails over a triphone-based Hidden Markov Model approach.
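
The Tandem approach summarized above couples a discriminative phoneme estimator with a conventional generative decoder. As a rough, hypothetical sketch only (not the authors' implementation; PyTorch, the layer sizes, and the feature dimensions are assumptions made purely for illustration), the code below shows how a bidirectional LSTM can produce framewise phoneme posteriors that are appended to the acoustic feature vectors to form Tandem observations for an HMM-based keyword spotter.

# Hypothetical sketch of Tandem feature generation with a BLSTM phoneme
# predictor (illustration only; PyTorch, layer sizes, and feature dimensions
# are assumptions, not taken from the article).
import torch
import torch.nn as nn

class BLSTMPhonemePredictor(nn.Module):
    def __init__(self, num_features=39, num_phonemes=40, hidden_size=128):
        super().__init__()
        # Bidirectional layer: every frame sees both past and future context,
        # which is what makes BLSTMs suited to coarticulation modeling.
        self.blstm = nn.LSTM(num_features, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_phonemes)

    def forward(self, x):            # x: (batch, frames, num_features)
        h, _ = self.blstm(x)         # h: (batch, frames, 2 * hidden_size)
        return torch.softmax(self.out(h), dim=-1)  # framewise phoneme posteriors

# Tandem observations: concatenate acoustic features (e.g. MFCCs) with the
# network's phoneme posteriors; the combined vectors would then be modeled
# by a conventional Gaussian-mixture HMM or DBN keyword spotter.
model = BLSTMPhonemePredictor()
mfcc = torch.randn(1, 200, 39)       # 200 frames of dummy acoustic features
with torch.no_grad():
    posteriors = model(mfcc)
tandem_features = torch.cat([mfcc, posteriors], dim=-1)   # (1, 200, 79)

In practice, Tandem systems typically apply a logarithm and a decorrelating transform such as PCA to the posteriors before Gaussian-mixture modeling.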


Published In

ACM Transactions on Speech and Language Processing, Volume 7, Issue 4
August 2011
143 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/1998384

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 August 2011
Revised: 01 January 2011
Accepted: 01 October 2010
Received: 01 May 2010
Published in TSLP Volume 7, Issue 4

Author Tags

  1. Children's speech
  2. dynamic Bayesian networks
  3. keyword spotting
  4. long short-term memory

Qualifiers

  • Research-article
  • Research
  • Refereed


