Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario

Published: 18 August 2011

Abstract

In this article, we focus on keyword detection in children's speech as it is needed in voice command systems. We use the FAU Aibo Emotion Corpus which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario and investigate various recent keyword spotting techniques. As the principle of bidirectional Long Short-Term Memory (BLSTM) is known to be well-suited for context-sensitive phoneme prediction, we incorporate a BLSTM network into a Tandem model for flexible coarticulation modeling in children's speech. Our experiments reveal that the Tandem model prevails over a triphone-based Hidden Markov Model approach.
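
The Tandem approach summarized above couples a discriminative phoneme estimator with a conventional generative decoder. As a rough, hypothetical sketch only (not the authors' implementation; PyTorch, the layer sizes, and the feature dimensions are assumptions made purely for illustration), the code below shows how a bidirectional LSTM can produce framewise phoneme posteriors that are appended to the acoustic feature vectors to form Tandem observations for an HMM-based keyword spotter.

# Hypothetical sketch of Tandem feature generation with a BLSTM phoneme
# predictor (illustration only; PyTorch, layer sizes, and feature dimensions
# are assumptions, not taken from the article).
import torch
import torch.nn as nn

class BLSTMPhonemePredictor(nn.Module):
    def __init__(self, num_features=39, num_phonemes=40, hidden_size=128):
        super().__init__()
        # Bidirectional layer: every frame sees both past and future context,
        # which is what makes BLSTMs suited to coarticulation modeling.
        self.blstm = nn.LSTM(num_features, hidden_size,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_phonemes)

    def forward(self, x):            # x: (batch, frames, num_features)
        h, _ = self.blstm(x)         # h: (batch, frames, 2 * hidden_size)
        return torch.softmax(self.out(h), dim=-1)  # framewise phoneme posteriors

# Tandem observations: concatenate acoustic features (e.g. MFCCs) with the
# network's phoneme posteriors; the combined vectors would then be modeled
# by a conventional Gaussian-mixture HMM or DBN keyword spotter.
model = BLSTMPhonemePredictor()
mfcc = torch.randn(1, 200, 39)       # 200 frames of dummy acoustic features
with torch.no_grad():
    posteriors = model(mfcc)
tandem_features = torch.cat([mfcc, posteriors], dim=-1)   # (1, 200, 79)

In practice, Tandem systems typically apply a logarithm and a decorrelating transform such as PCA to the posteriors before Gaussian-mixture modeling.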


Published In

ACM Transactions on Speech and Language Processing, Volume 7, Issue 4
August 2011
143 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/1998384

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 August 2011
Revised: 01 January 2011
Accepted: 01 October 2010
Received: 01 May 2010
Published in TSLP Volume 7, Issue 4

Author Tags

  1. Children's speech
  2. dynamic Bayesian networks
  3. keyword spotting
  4. long short-term memory

Qualifiers

  • Research-article
  • Research
  • Refereed


