Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
Figure 1. The proposed methodological framework with the main steps involved: audio-stream pre-processing, feature extraction and utterance-level aggregation, and system training with emotion metric learning and SVM training.
Figure 2. Audio signal pre-processing: voice activity detection, silence removal, and spectrogram image computation.
Figure 3. SE-ResNet CNN extension with a GhostVLAD layer for feature aggregation.
Figure 4. Emotion representation in the valence-arousal space using Mikel's wheel of emotions.
Figure 5. Difference between the triplet loss function and the emotion constraint: (a) the triplet loss function; (b) the proposed emotion metric.
Figure 6. Visualization of the feature embeddings using t-SNE on the CREMA-D dataset: (a) softmax loss; (b) triplet loss; (c) emotion metric learning.
Figure 7. Visualization of the feature embeddings using t-SNE on the RAVDESS dataset: (a) softmax loss; (b) triplet loss; (c) emotion metric learning.
Figure 8. Statistical experimental results on the considered databases: (a) RAVDESS; (b) CREMA-D.
Figure 9. Confusion matrices on the evaluation datasets: (a) RAVDESS; (b) CREMA-D. (S1) the baseline method; (S2) SE-ResNet with multi-stage training; (S3) SE-ResNet with the GhostVLAD layer; (S4) SE-ResNet with the GhostVLAD layer and the classical triplet loss function; (S5) the proposed framework, involving SE-ResNet with the GhostVLAD aggregation layer and the emotion constraint loss.
Figure 10. System performance on the RAVDESS and CREMA-D datasets for the different parameters involved: (a) the number of NetVLAD clusters (K); (b) the number of GhostVLAD clusters (G).
Figure 11. System performance for different values of the control margins α and β on the (a) RAVDESS and (b) CREMA-D datasets.
Abstract
1. Introduction
- A deep learning-based, end-to-end speech emotion recognition technique. The method exploits the recently introduced Squeeze-and-Excitation ResNet (SE-ResNet) architecture [7]. The challenge is to derive a fixed-length, discriminative representation at the utterance level for audio segments of arbitrary length.
- The introduction of a trainable GhostVLAD [6] layer within the aggregation process. In contrast to [8], where a NetVLAD [9] approach is used for feature aggregation, we additionally rely on ghost clusters. We show that they are able to absorb noisy segments and thus yield a more robust feature representation (a minimal sketch of such an aggregation layer is given after this list).
- A learnable emotion metric that enriches the traditional triplet loss function with an additional emotional constraint. By taking into account the relations between the considered emotional categories, as defined in Mikel's wheel representation [10], the method yields more discriminative embeddings, with more compact emotional clusters and increased inter-class separability.
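To make the aggregation step concrete, the following is a minimal PyTorch sketch of a GhostVLAD-style layer, not the implementation used in the paper: local descriptors from the CNN feature map are soft-assigned to K real and G ghost clusters, the ghost assignments are discarded, and the normalized per-cluster residuals are concatenated into a fixed-length utterance embedding. The descriptor dimension (512) and the values K = 8 and G = 2 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """NetVLAD-style aggregation with G ghost clusters whose residuals are discarded."""
    def __init__(self, dim, num_clusters=8, num_ghost=2):
        super().__init__()
        self.K, self.dim = num_clusters, dim
        # soft-assignment over K real + G ghost clusters (1x1 conv on the feature map)
        self.assign = nn.Conv2d(dim, num_clusters + num_ghost, kernel_size=1)
        # learnable centroids for the K real clusters only
        self.centroids = nn.Parameter(0.01 * torch.randn(num_clusters, dim))

    def forward(self, feats):                       # feats: (B, D, H, W) CNN feature map
        B, D, _, _ = feats.shape
        a = F.softmax(self.assign(feats), dim=1)    # (B, K+G, H, W) soft assignments
        a = a[:, : self.K].reshape(B, self.K, -1)   # drop ghost clusters -> (B, K, N)
        x = feats.view(B, D, -1)                    # (B, D, N) local descriptors
        # assignment-weighted residuals to each real centroid
        vlad = torch.einsum('bkn,bdn->bkd', a, x) - a.sum(-1, keepdim=True) * self.centroids
        vlad = F.normalize(vlad, dim=2)             # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=1)  # (B, K*D) fixed-length embedding

# Example: aggregate a variable-size (batch, 512, H, W) feature map into 8*512-d vectors.
layer = GhostVLAD(dim=512, num_clusters=8, num_ghost=2)
embedding = layer(torch.randn(2, 512, 9, 25))       # -> shape (2, 4096)
```

Because the spatial positions are pooled out by the soft assignment, the output size depends only on K and D, which is what allows utterances of arbitrary duration to be mapped to a fixed-length representation.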
2. Related Work
3. Proposed Approach
3.1. Audio Signal Pre-Processing
3.2. GhostVLAD Feature Aggregation
3.3. Emotion Metric Learning
4. Experimental Setup
4.1. Dataset Selection
4.2. Hardware Configuration
4.3. CNN Training Details
5. Experimental Results
- (S1) An emotion recognition approach that trains the SE-ResNet architecture [7] from scratch (i.e., with random weight initialization) on the considered emotion datasets. This approach is considered as the baseline.
- (S2) An emotion recognition method that performs multi-stage CNN training (denoted by SE-ResNet with multi-stage training). In this case, the network is pre-trained on the VoxCeleb2 [44] dataset and the resulting model is fine-tuned on the two considered emotion recognition datasets. For both the (S1) and (S2) models, ghost clusters are not considered within the aggregation phase (which is equivalent to applying a NetVLAD aggregation procedure).
- (S3) An emotion recognition method that extends the SE-ResNet architecture with the GhostVLAD feature aggregation layer. The same multi-stage training as in (S2) is employed here.
- (S4) A speech emotion identification framework that combines the SE-ResNet with the GhostVLAD layer and the classical triplet loss function, with an SVM-RBF classifier on top.
- (S5) The proposed framework, which combines multi-stage SE-ResNet training, the GhostVLAD feature aggregation layer, and emotion metric learning (cf. Section 3.3), with an SVM-RBF classifier on top (a sketch of training such a classifier on the extracted embeddings follows this list).
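For (S4) and (S5), the classification stage can be reproduced conceptually with a standard RBF-kernel SVM trained on the utterance-level embeddings. The snippet below is a minimal scikit-learn sketch using random stand-in data; the embedding dimensionality (4096), the number of emotion classes (8), and the hyperparameters C and gamma are illustrative assumptions rather than the values selected in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-ins for the utterance-level embeddings produced by the trained
# SE-ResNet + GhostVLAD network (random 4096-d vectors, 8 emotion classes).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 4096)), rng.integers(0, 8, size=800)
X_test, y_test = rng.normal(size=(200, 4096)), rng.integers(0, 8, size=200)

# RBF-kernel SVM on standardized embeddings; C and gamma are placeholder values.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```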
- The lowest accuracy scores (67% and 53% on the RAVDESS and CREMA-D datasets, respectively) are obtained by the baseline emotion recognition method, which trains the considered SE-ResNet architecture from scratch. This can be explained by the relatively small size of the datasets, which is insufficient to capture the complexity of the underlying audio emotional features.
- Multi-stage SE-ResNet training partially solves the above-mentioned problem. The use of additional speech data (VoxCeleb2) yields models with higher capacity. Thus, multi-stage training increases the overall system performance by more than 6%.
- The extension of the SE-ResNet with a GhostVLAD layer produces more effective feature representations. From the experimental results presented in Table 1, we observe that introducing the aggregation layer on top of the CNN architecture increases the emotion recognition accuracy on both considered databases (by 4.97% and 1.78% for RAVDESS and CREMA-D, respectively).
- Compared to the softmax loss, the introduction of the triplet loss function offers an average improvement of 1.2%. As we employ semi-hard sampling, where all the positive/negative spectrogram images are selected within a mini-batch, the system converges faster and yields better accuracy on both datasets.
- The best results (accuracy scores of 83% and 64% on the RAVDESS and CREMA-D datasets, respectively) are obtained by the complete method, which integrates the whole chain: SE-ResNet, the GhostVLAD aggregation layer, and the emotion metric loss. The use of the emotion constraint significantly increases the system's performance by reinforcing the emotional clustering. Remarkably, the proposed framework outperforms the recognition rate of human observers by more than 24% (a sketch of one possible form of such an emotion-constrained triplet loss follows this list).
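To illustrate the emotion constraint, the sketch below shows one plausible way of coupling the triplet loss with Mikel's wheel: the margin separating the anchor-positive and anchor-negative distances grows with the circular distance between the anchor and negative emotion categories, controlled by the two margins α and β. Both the category ordering in WHEEL_INDEX and the way the wheel distance enters the margin are assumptions made for illustration; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

# Hypothetical circular ordering of the emotion categories (illustrative only):
# neighbouring emotions get a small wheel distance, opposite ones a large one.
WHEEL_INDEX = {"happy": 0, "surprise": 1, "fear": 2, "angry": 3,
               "disgust": 4, "sad": 5, "calm": 6, "neutral": 7}

def wheel_distance(e1, e2, n=8):
    """Circular distance between two emotion categories, normalized to [0, 1]."""
    d = abs(WHEEL_INDEX[e1] - WHEEL_INDEX[e2])
    return min(d, n - d) / (n // 2)

def emotion_triplet_loss(anchor, positive, negative, emo_anchor, emo_negative,
                         alpha=0.2, beta=0.2):
    """Triplet loss whose margin grows with the wheel distance between the
    anchor and negative emotions (alpha, beta: the two control margins)."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared anchor-negative distance
    margin = alpha + beta * torch.tensor(
        [wheel_distance(a, n) for a, n in zip(emo_anchor, emo_negative)],
        dtype=d_ap.dtype, device=d_ap.device)
    return F.relu(d_ap - d_an + margin).mean()

# Example: a mini-batch of 4 triplets with 256-d L2-normalized embeddings.
emb = lambda: F.normalize(torch.randn(4, 256), dim=1)
loss = emotion_triplet_loss(emb(), emb(), emb(),
                            ["happy", "sad", "angry", "calm"],
                            ["sad", "happy", "fear", "angry"])
```

Combined with semi-hard sampling of positives and negatives inside each mini-batch, such a loss compacts same-emotion clusters while separating emotionally distant categories more aggressively than neighbouring ones.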
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Venkataramanan, K.; Rajamohan, H.R. Emotion Recognition from Speech. arXiv 2019, arXiv:1912.10458. Available online: https://arxiv.org/abs/1912.10458 (accessed on 20 June 2021).
2. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
3. Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 1971, 17, 124–129.
4. Ekman, P. Strong evidence for universals in facial expressions: A reply to Russell’s mistaken critique. Psychol. Bull. 1994, 115, 268–287.
5. Vogt, T.; Andre, E.; Wagner, J. Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization. In Affect and Emotion in Human-Computer Interaction, 1st ed.; Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2008; pp. 75–91.
6. Zhong, Y.; Arandjelovic, R.; Zisserman, A. GhostVLAD for Set-Based Face Recognition. In Proceedings of the Asian Conference on Computer Vision (ACCV), Lecture Notes in Computer Science; Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2018.
7. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
8. Huang, J.; Tao, J.; Liu, B.; Lian, Z. Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 4079–4083.
9. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1437–1451.
10. Mikels, J.A.; Fredrickson, B.L.; Larkin, G.R.; Lindberg, C.M.; Maglio, S.J.; Reuter-Lorenz, P.A. Emotional category data on images from the international affective picture system. Behav. Res. Methods 2005, 37, 626–630.
11. Jiang, W.; Wang, Z.; Jin, J.S.; Han, X.; Li, C. Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network. Sensors 2019, 19, 2730.
12. Erden, M.; Arslan, L. Automatic Detection of Anger in Human-Human Call Center Dialogs. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011; pp. 81–84.
13. Lugger, M.; Yang, B. The Relevance of Voice Quality Features in Speaker Independent Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu, HI, USA, 4 June 2007; pp. IV-17–IV-20.
14. Lee, C.M.; Narayanan, S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303.
15. Jeon, J.H.; Xia, R.; Liu, Y. Sentence level emotion recognition based on decisions from subsentence segments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 4940–4943.
16. Wu, S.; Falk, T.; Chan, W.-Y. Automatic speech emotion recognition using modulation spectral features. Speech Commun. 2011, 53, 768–785.
17. Schuller, B.; Rigoll, G. Timing levels in segment-based speech emotion recognition. In Proceedings of the Interspeech 2006, Pittsburgh, PA, USA, 17–21 September 2006; pp. 17–21.
18. Espinosa, H.P.; García, C.A.R.; Pineda, L.V. Features selection for primitives estimation on emotional speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 5138–5141.
19. Sundberg, J.; Patel, S.; Bjorkner, E.; Scherer, K.R. Interdependencies among Voice Source Parameters in Emotional Speech. IEEE Trans. Affect. Comput. 2011, 2, 162–174.
20. Sun, R.; Moore, E.; Torres, J.F. Investigating glottal parameters for differentiating emotional categories with similar prosodics. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 4509–4512.
21. Gangamohan, P.; Kadiri, S.R.; Yegnanarayana, B. Analysis of Emotional Speech—A Review. In Toward Robotic Socially Believable Behaving Systems—Volume I; Esposito, A., Jain, L., Eds.; Springer: Cham, Switzerland, 2016; Volume 105.
22. Albanie, S.; Nagrani, A.; Vedaldi, A.; Zisserman, A. Emotion Recognition in Speech using Cross-Modal Transfer in the Wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea, 22–26 October 2018; pp. 292–301.
23. Ghaleb, E.; Popa, M.; Asteriadis, S. Metric Learning Based Multimodal Audio-visual Emotion Recognition. IEEE MultiMedia 2019, 27, 19508194.
24. Yeh, L.Y.; Chi, T.-S. Spectro-temporal modulations for robust speech emotion recognition. In Proceedings of the Interspeech 2010, Makuhari, Chiba, Japan, 26–30 September 2010; pp. 789–792.
25. Bhavan, A.; Chauhan, P.; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl. Based Syst. 2019, 184, 104886.
26. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894.
27. Amer, M.R.; Siddiquie, B.; Richey, C.; Divakaran, A. Emotion detection in speech using deep networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 14 July 2014; pp. 3724–3728.
28. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213.
29. Ma, X.; Wu, Z.; Jia, J.; Xu, M.; Meng, H.; Cai, L. Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1238–1242.
30. Lian, Z.; Li, Y.; Tao, J.; Huang, J. Speech Emotion Recognition via Contrastive Loss under Siamese Networks. In Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, Seoul, Korea, 26 October 2018; pp. 21–26.
31. Farooq, M.; Hussain, F.; Baloch, N.K.; Raja, F.R.; Yu, H.; Zikria, Y.B. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network. Sensors 2020, 20, 6008.
32. Kumar, P.; Jain, S.; Raman, B.; Roy, P.P.; Iwamura, M. End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 5 May 2021; pp. 8766–8773.
33. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 19 June 2017; pp. 2227–2231.
34. Chao, L.; Tao, J.; Yang, M.; Li, Y.; Wen, Z. Long short term memory recurrent neural network based encoding method for emotion recognition in video. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2752–2756.
35. Tzinis, E.; Potamianos, A. Segment-based speech emotion recognition using recurrent neural networks. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; pp. 190–195.
36. Huang, J.; Li, Y.; Tao, J.; Lian, Z. Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3673–3677.
37. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech Emotion Recognition Using Deep Learning Techniques: A Review. IEEE Access 2019, 7, 117327–117345.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
39. Yin, R.; Bredin, H.; Barras, C. Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 3827–3831.
40. Nawab, S.H.; Quatieri, T.F. Short-Time Fourier Transform. In Advanced Topics in Signal Processing; Lim, J.S., Oppenheim, A.V., Eds.; Prentice-Hall: Upper Saddle River, NJ, USA, 1987; pp. 289–337.
41. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
42. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390.
43. Yadav, S.; Shukla, S. Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India, 27–28 February 2016; pp. 78–83.
44. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1086–1090.
45. Mocanu, B.; Tapu, R.; Zaharia, T. DEEP-SEE FACE: A Mobile Face Recognition System Dedicated to Visually Impaired People. IEEE Access 2018, 6, 51975–51985.
Actor identity assignment for the 10-fold cross-validation protocol on the CREMA-D and RAVDESS datasets (train/validation/test per fold):

| Dataset | Split | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CREMA-D | Train | 1–73 | 10–82 | 19–91 | 28–91; 1–9 | 37–91; 1–18 | 46–91; 1–27 | 55–91; 1–36 | 64–91; 1–45 | 73–91; 1–54 | 82–91; 1–63 |
| CREMA-D | Val. | 74–82 | 83–91 | 1–9 | 10–18 | 19–27 | 28–36 | 37–45 | 46–54 | 55–63 | 64–72 |
| CREMA-D | Test | 83–91 | 1–9 | 10–18 | 19–27 | 28–36 | 37–45 | 46–54 | 55–63 | 64–72 | 73–81 |
| RAVDESS | Train | 1–18 | 3–18 | 5–22 | 7–24 | 9–24; 1–2 | 11–24; 1–4 | 13–24; 1–6 | 15–24; 1–8 | 17–24; 1–10 | 19–24; 1–12 |
| RAVDESS | Val. | 19–21 | 21–23 | 23–24; 1 | 1–3 | 3–5 | 5–7 | 7–9 | 9–11 | 11–13 | 13–15 |
| RAVDESS | Test | 22–24 | 24; 1–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 |
Table 1. Emotion recognition accuracy (%) on the RAVDESS [41] and CREMA-D [42] datasets: average (AVG), minimum (MIN), maximum (MAX), and standard deviation (STD) across the cross-validation folds.

| Method | RAVDESS [41] AVG % | MIN % | MAX % | STD % | CREMA-D [42] AVG % | MIN % | MAX % | STD % |
|---|---|---|---|---|---|---|---|---|
| Human observers | Not reported | - | - | - | 40.90 | - | - | - |
| (S1) Baseline method | 67.94 | 63.95 | 70.48 | 2.33 | 53.38 | 49.25 | 56.02 | 2.12 |
| (S2) SE-ResNet with multi-stage training | 74.81 | 71.93 | 77.16 | 1.61 | 59.47 | 56.24 | 62.56 | 1.89 |
| (S3) SE-ResNet with GhostVLAD layer | 79.78 | 76.96 | 81.98 | 1.68 | 61.25 | 59.38 | 63.79 | 1.31 |
| (S4) SE-ResNet with GhostVLAD layer + triplet loss | 80.93 | 78.42 | 82.94 | 1.41 | 62.44 | 59.86 | 64.14 | 1.25 |
| (S5) Proposed framework: SE-ResNet + GhostVLAD layer + emotion constraint | 83.55 | 81.22 | 85.65 | 1.33 | 64.92 | 62.85 | 66.84 | 1.16 |
Comparison with state-of-the-art accuracy on the RAVDESS [41] and CREMA-D [42] datasets:

| Method | RAVDESS [41] | CREMA-D [42] |
|---|---|---|
| Bhavan et al. [25], 2019 | 75.69% | - |
| Issa et al. [26], 2020 | 71.61% | - |
| Kumar et al. [32], 2021 | 79.67% | 58.72% |
| Ghaleb et al. [23], 2020 | - | 59.01% |
| Huang et al. [8], 2020 | - | 61.53% |
| Proposed framework | 83.35% | 64.92% |