Abstract
Expressiveness and non-verbal information in speech are active research topics in speech processing. In this work, we detect emphasis at the word level as a means of identifying the focus words in a given utterance. We compare several machine learning techniques (Linear Discriminant Analysis, Support Vector Machines, Neural Networks) on this task, using SIWIS, a French speech synthesis database. Our approach first aligns the spoken words to the speech signal, then feeds a classifier with filter bank coefficients to make a binary decision at the word level: neutral or emphasized. Evaluation results show that a three-layer neural network performed best, with an accuracy of \(93\%\).
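The two-step approach described in the abstract (align words to the signal, then classify each word from its filter bank coefficients) can be illustrated with a minimal sketch. The abstract does not say how frame-level coefficients are turned into a word-level input, so the mean pooling, the 10 ms frame shift, the function names, and the toy values below are all assumptions made for illustration, not the authors' method.

```python
from statistics import fmean

FRAME_SHIFT_S = 0.010  # assumed 10 ms hop between feature frames


def word_frames(frames, start_s, end_s, frame_shift_s=FRAME_SHIFT_S):
    """Return the feature frames whose index falls inside [start_s, end_s)."""
    first = int(round(start_s / frame_shift_s))
    # Always keep at least one frame, even for very short words.
    last = max(first + 1, int(round(end_s / frame_shift_s)))
    return frames[first:last]


def pool_word(frames, start_s, end_s, frame_shift_s=FRAME_SHIFT_S):
    """Mean-pool the frames of one word into a single fixed-length vector."""
    segment = word_frames(frames, start_s, end_s, frame_shift_s)
    if not segment:
        raise ValueError("alignment falls outside the feature matrix")
    n_bands = len(segment[0])
    return [fmean(frame[b] for frame in segment) for b in range(n_bands)]


# Toy data: 6 frames of 3 filter bank coefficients each (made-up values).
fbank = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
    [7.0, 8.0, 9.0],
]
# Hypothetical forced-alignment output: (word, start, end) in seconds.
alignment = [("vraiment", 0.00, 0.02), ("important", 0.02, 0.06)]

# One fixed-length vector per word, ready to feed a binary classifier.
word_vectors = {w: pool_word(fbank, s, e) for w, s, e in alignment}
```

Each entry of `word_vectors` is one classifier input; a neutral/emphasized label per word would then be predicted by whichever model is chosen (LDA, SVM, or a neural network). Mean pooling discards the temporal shape of the word, so a real system might prefer a different fixed-length representation.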
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Heba, A., Pellegrini, T., Jorquera, T., André-Obrecht, R., Lorré, JP. (2017). Lexical Emphasis Detection in Spoken French Using F-BANKs and Neural Networks. In: Camelin, N., Estève, Y., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2017. Lecture Notes in Computer Science, vol 10583. Springer, Cham. https://doi.org/10.1007/978-3-319-68456-7_20
DOI: https://doi.org/10.1007/978-3-319-68456-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68455-0
Online ISBN: 978-3-319-68456-7
eBook Packages: Computer Science, Computer Science (R0)