A Hybrid Multimodal Emotion Recognition Framework for UX Evaluation Using Generalized Mixture Functions
Figure 1. Hybrid Multimodal Emotion Recognition (H-MMER) framework.
Figure 2. Transfer learning in the proposed FER-Dual-Net Model.
Figure 3. Emotion recognition system based on voice and video.
Figure 4. Architecture diagram of the Unobtrusive Skeletal-based Emotion Recognition (UnSkEm) approach [36].
Figure 5. Workflow diagram of the Unobtrusive Skeletal-based Emotion Recognition (UnSkEm) approach.
Figure 6. Workflow diagram of the Unobtrusive GM-based Multimodal Emotion Fusion (GM-mmEF) approach.
Figure 7. Unimodal, multimodal feature fusion, and decision-level fusion.
Abstract
1. Introduction
2. Related Work
3. Proposed Multimodal Emotion Recognition Method
3.1. Multimodal Input Acquisition
3.2. Uni-Modal Emotion Recognition
3.2.1. Video-Based Emotion Recognition
3.2.2. Audio-Based Emotion Recognition
- Audio Signal Preprocessing
- Speech Text Extraction
- Sentiment Classification using CNN
- Audio Signal Feature Extraction
3.2.3. Skeletal-Based Emotion Recognition
- Skeletal Joint Acquisition
- Skeletal Frame Segmentation
- Feature Computation
3.3. Multimodal Emotion Recognition
3.3.1. Multimodal Feature Fusion
Algorithm 1: Multimodal Feature Transformation and Deep Neural Network (DNN) Training
Input: Unimodal feature vectors (video, audio, skeletal). Output: Emotion-score vector. (An illustrative sketch follows below.)
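As a rough, hypothetical illustration of Algorithm 1's interface, the sketch below concatenates the three unimodal feature vectors and trains a small dense network that emits one score per emotion. The layer sizes, hyperparameters, and helper names (fuse_features, build_fusion_dnn) are assumptions for illustration, not the architecture reported in the paper.

```python
# Illustrative sketch only: concatenate unimodal feature vectors and train a DNN
# that outputs one score per emotion. Layer sizes and names are assumptions.
import numpy as np
import tensorflow as tf

EMOTIONS = ["happiness", "neutral", "sadness", "anger"]

def fuse_features(video_feats, audio_feats, skeletal_feats):
    """Concatenate per-sample unimodal feature vectors into one fused vector."""
    return np.concatenate([video_feats, audio_feats, skeletal_feats], axis=1)

def build_fusion_dnn(input_dim, num_classes=len(EMOTIONS)):
    """Small dense network mapping the fused feature vector to emotion scores."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),  # emotion-score vector
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage with placeholder shapes:
# X = fuse_features(video, audio, skeletal)        # shape (n_samples, d_v + d_a + d_s)
# model = build_fusion_dnn(X.shape[1])
# model.fit(X, y_onehot, epochs=20, batch_size=32)
```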
3.3.2. Multimodal Decision Fusion
Decision Aggregation
Dynamic Statistical Weighting
Classifier Ensemble Using Generalized Mixture Functions
Cross-Modality Ranking Pool
Algorithm 2: Multimodal Decision-Level Fusion (GM-based combination method)
Input: A dataset of a given size, with instances from each modality classified into the target emotions with associated posterior probabilities. Output: Highest ensemble score. (A hedged sketch follows below.)
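To make the GM-based combination step concrete, the sketch below fuses per-modality posterior probability vectors with data-dependent weights and selects the emotion with the highest ensemble score. The particular weighting rule, in which a modality's weight for an emotion shrinks as its score deviates from the cross-modality mean, is only one plausible mixture-style aggregation in the spirit of generalized mixture functions; the helper name gm_fuse and the example numbers are illustrative.

```python
# Hedged sketch of GM-style decision-level fusion: combine per-modality posterior
# probability vectors with data-dependent weights and return the emotion with the
# highest ensemble score. The weighting rule below is an assumed stand-in, not the
# paper's exact generalized mixture functions.
import numpy as np

def gm_fuse(posteriors):
    """posteriors: (n_modalities, n_emotions) array; each row sums to 1."""
    P = np.asarray(posteriors, dtype=float)
    ref = P.mean(axis=0)                      # per-emotion reference aggregation
    w = 1.0 - np.abs(P - ref)                 # weight shrinks with distance from reference
    w /= w.sum(axis=0, keepdims=True)         # normalize weights per emotion
    return (w * P).sum(axis=0)                # ensemble score per emotion

# Example: video, audio, and skeletal posteriors over 4 emotions.
scores = gm_fuse([[0.70, 0.10, 0.10, 0.10],
                  [0.40, 0.30, 0.20, 0.10],
                  [0.80, 0.05, 0.05, 0.10]])
winner = int(np.argmax(scores))               # index of the highest ensemble score
```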
4. Experimental Evaluations
4.1. Dataset and Implementation
4.2. Multimodal Emotion Recognition Results
4.2.1. Performance Analysis of Video-Based Emotion Recognition
4.2.2. Performance Analysis of Audio-Based Emotion Recognition
4.2.3. Performance Analysis of Skeletal-Based Emotion Recognition
4.2.4. Performance Analysis of Multimodal Emotion Feature Fusion
4.2.5. Performance Analysis of Multimodal Decision Level Fusion with GM Functions
4.3. Comparison of Unimodal, Multimodal Feature Fusion and Decision Level Fusion Results
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
- The following abbreviations are used in this manuscript:
- UX: User Experience
- HCI: Human–Computer Interaction
- FC: Feature Concatenation
- FER: Facial Expression Recognition
- LPC: Linear Predictive Coding
- OWA: Ordered Weighted Averaging
- GMF: Generalized Mixture Functions
- GM-mmEF: GM-based Multimodal Emotion Fusion
References
- Zhao, Z.; Wang, Y.; Wang, Y. Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition. arXiv 2022, arXiv:2207.04697. [Google Scholar]
- Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar]
- Medjden, S.; Ahmed, N.; Lataifeh, M. Adaptive user interface design and analysis using emotion recognition through facial expressions and body posture from an RGB-D sensor. PLoS ONE 2020, 15, e0235908. [Google Scholar]
- Cimtay, Y.; Ekmekcioglu, E.; Caglar-Ozhan, S. Cross-subject multimodal emotion recognition based on hybrid fusion. IEEE Access 2020, 8, 168865–168878. [Google Scholar] [CrossRef]
- Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
- Radu, V.; Tong, C.; Bhattacharya, S.; Lane, N.D.; Mascolo, C.; Marina, M.K.; Kawsar, F. Multimodal deep learning for activity and context recognition. Proc. Acm Interact. Mob. Wearable Ubiquitous Technol. 2018, 1, 157. [Google Scholar]
- Liu, H.; Zhang, L. Advancing ensemble learning performance through data transformation and classifiers fusion in granular computing context. Expert Syst. Appl. 2019, 131, 20–29. [Google Scholar]
- Costa, V.S.; Farias, A.D.S.; Bedregal, B.; Santiago, R.H.; Canuto, A.M.d.P. Combining multiple algorithms in classifier ensembles using generalized mixture functions. Neurocomputing 2018, 313, 402–414. [Google Scholar]
- Hussain, J.; Khan, W.A.; Hur, T.; Bilal, H.S.M.; Bang, J.; Hassan, A.U.; Afzal, M.; Lee, S. A multimodal deep log-based user experience (UX) platform for UX evaluation. Sensors 2018, 18, 1622. [Google Scholar] [PubMed]
- Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
- Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192. [Google Scholar] [CrossRef]
- Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimed. 2017, 20, 1576–1590. [Google Scholar] [CrossRef]
- Li, S.; Zhang, T.; Chen, B.; Chen, C.P. MIA-Net: Multi-Modal Interactive Attention Network for Multi-Modal Affective Analysis. IEEE Trans. Affect. Comput. 2023, 1–15. [Google Scholar] [CrossRef]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
- Gravina, R.; Alinia, P.; Ghasemzadeh, H.; Fortino, G. Multi-sensor fusion in body sensor networks: State-of-the-art and research challenges. Inf. Fusion 2017, 35, 68–80. [Google Scholar] [CrossRef]
- Ehatisham-Ul-Haq, M.; Javed, A.; Azam, M.A.; Malik, H.M.; Irtaza, A.; Lee, I.H.; Mahmood, M.T. Robust human activity recognition using multimodal feature-level fusion. IEEE Access 2019, 7, 60736–60751. [Google Scholar] [CrossRef]
- Huang, J.; Li, Y.; Tao, J.; Lian, Z.; Wen, Z.; Yang, M.; Yi, J. Continuous multimodal emotion prediction based on long short term memory recurrent neural network. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23–27 October 2017; pp. 11–18. [Google Scholar]
- Thuseethan, S.; Rajasegarar, S.; Yearwood, J. EmoSeC: Emotion recognition from scene context. Neurocomputing 2022, 492, 174–187. [Google Scholar] [CrossRef]
- Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83–84, 19–52. [Google Scholar] [CrossRef]
- Przybyła-Kasperek, M. Practically motivated adaptive fusion method with tie analysis for multilabel dispersed data. Expert Syst. Appl. 2023, 219, 119601. [Google Scholar] [CrossRef]
- Krawczyk, B.; Woźniak, M. Untrained weighted classifier combination with embedded ensemble pruning. Neurocomputing 2016, 196, 14–22. [Google Scholar] [CrossRef]
- Liu, Z.; Pan, Q.; Dezert, J.; Martin, A. Combination of Classifiers With Optimal Weight Based on Evidential Reasoning. IEEE Trans. Fuzzy Syst. 2018, 26, 1217–1230. [Google Scholar] [CrossRef]
- Onan, A.; Korukoğlu, S.; Bulut, H. A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Syst. Appl. 2016, 62, 1–16. [Google Scholar] [CrossRef]
- Lean UX: Mixed Method Approach for UX Evaluation. Available online: https://github.com/ubiquitous-computing-lab/Lean-UX-Platform/ (accessed on 2 April 2023).
- Liu, W.; Qiu, J.L.; Zheng, W.L.; Lu, B.L. Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 715–729. [Google Scholar] [CrossRef]
- Ghoniem, R.M.; Algarni, A.D.; Shaalan, K. Multi-Modal Emotion Aware System Based on Fusion of Speech and Brain Information. Information 2019, 10, 239. [Google Scholar] [CrossRef]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Zhang, J.; Xiu, Y. Image stitching based on human visual system and SIFT algorithm. Vis. Comput. 2023, 1–13. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Shoumy, N.J.; Ang, L.M.; Seng, K.P.; Rahaman, D.M.; Zia, T. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. J. Netw. Comput. Appl. 2020, 149, 102447. [Google Scholar] [CrossRef]
- Park, E.L.; Cho, S. KoNLPy: Korean natural language processing in Python. In Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, Chuncheon, Korea, 11–14 October 2014; Volume 6, pp. 133–136. [Google Scholar]
- Chang, S.W.; Dong, W.H.; Rhee, D.Y.; Jun, H.J. Deep learning-based natural language sentiment classification model for recognizing users’ sentiments toward residential space. Archit. Sci. Rev. 2020, 64, 410–421. [Google Scholar] [CrossRef]
- Bang, J.; Hur, T.; Kim, D.; Huynh-The, T.; Lee, J.; Han, Y.; Banos, O.; Kim, J.I.; Lee, S. Adaptive Data Boosting Technique for Robust Personalized Speech Emotion in Emotionally-Imbalanced Small-Sample Environments. Sensors 2018, 18, 3744. [Google Scholar] [CrossRef]
- Wang, K.C. Time-frequency feature representation using multi-resolution texture analysis and acoustic activity detector for real-life speech emotion recognition. Sensors 2015, 15, 1458–1478. [Google Scholar] [CrossRef] [PubMed]
- Razzaq, M.A.; Bang, J.; Kang, S.S.; Lee, S. UnSkEm: Unobtrusive Skeletal-based Emotion Recognition for User Experience. In Proceedings of the 2020 International Conference on Information Networking (ICOIN), Barcelona, Spain, 7–10 January 2020; pp. 92–96. [Google Scholar]
- Du, G.; Zeng, Y.; Su, K.; Li, C.; Wang, X.; Teng, S.; Li, D.; Liu, P.X. A Novel Emotion-Aware Method Based on the Fusion of Textual Description of Speech, Body Movements, and Facial Expressions. IEEE Trans. Instrum. Meas. 2022, 71, 1–16. [Google Scholar] [CrossRef]
- Khaire, P.; Kumar, P. A semi-supervised deep learning based video anomaly detection framework using RGB-D for surveillance of real-world critical environments. Forensic Sci. Int. Digit. Investig. 2022, 40, 301346. [Google Scholar] [CrossRef]
- Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2022, 91, 1566–2535. [Google Scholar] [CrossRef]
- Shahin, I.; Nassif, A.B.; Hamsa, S. Emotion Recognition Using Hybrid Gaussian Mixture Model and Deep Neural Network. IEEE Access 2019, 7, 26777–26787. [Google Scholar] [CrossRef]
- Deep Learning Library for the Java. Available online: https://deeplearning4j.org/ (accessed on 2 April 2023).
- Amsaprabhaa, M.; Nancy Jane, Y.; Khanna Nehemiah, H. Multimodal spatiotemporal skeletal kinematic gait feature fusion for vision-based fall detection. Expert Syst. Appl. 2023, 212, 118681. [Google Scholar]
- Samadiani, N.; Huang, G.; Cai, B.; Luo, W.; Chi, C.H.; Xiang, Y.; He, J. A review on automatic facial expression recognition systems assisted by multimodal sensor data. Sensors 2019, 19, 1863. [Google Scholar] [CrossRef]
- Pereira, R.M.; Pasi, G. On non-monotonic aggregation: Mixture operators. In Proceedings of the 4th Meeting of the EURO Working Group on Fuzzy Sets (EUROFUSE’99) and 2nd International Conference on Soft and Intelligent Computing (SIC’99), Budapest, Hungary, 25–28 May 1999; pp. 513–517. [Google Scholar]
- Landowska, A. Uncertainty in emotion recognition. J. Inf. Commun. Ethics Soc. 2019, 17, 273–291. [Google Scholar] [CrossRef]
- Beliakov, G.; Sola, H.B.; Sánchez, T.C. A Practical Guide to Averaging Functions; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
- Gan, C.; Xiao, J.; Wang, Z.; Zhang, Z.; Zhu, Q. Facial expression recognition using densely connected convolutional neural network and hierarchical spatial attention. Image Vis. Comput. 2022, 117, 104342. [Google Scholar] [CrossRef]
- Hua, C.H.; Huynh-The, T.; Seo, H.; Lee, S. Convolutional network with densely backward attention for facial expression recognition. In Proceedings of the 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), Taichung, Taiwan, 3–5 January 2020; pp. 1–6. [Google Scholar]
- Singh, P.; Srivastava, R.; Rana, K.; Kumar, V. A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowl.-Based Syst. 2021, 229, 107316. [Google Scholar] [CrossRef]
- Deb, S.; Dandapat, S. Multiscale amplitude feature and significance of enhanced vocal tract information for emotion classification. IEEE Trans. Cybern. 2018, 49, 802–815. [Google Scholar] [CrossRef] [PubMed]
- Fourati, N.; Pelachaud, C. Perception of emotions and body movement in the emilya database. IEEE Trans. Affect. Comput. 2016, 9, 90–101. [Google Scholar] [CrossRef]
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
- Shi, H.; Peng, W.; Chen, H.; Liu, X.; Zhao, G. Multiscale 3D-shift graph convolution network for emotion recognition from human actions. IEEE Intell. Syst. 2022, 37, 103–110. [Google Scholar] [CrossRef]

Mean Classification Accuracy: 91.46%; Classification Error: 8.54%. Rows are ground-truth emotions; cell values are emotion recognition rates (%).

Ground Truth | Happiness | Neutral | Sadness | Anger
---|---|---|---|---
Happiness | 95.0 | 2.01 | 1.40 | 1.60
Neutral | 0.01 | 89.13 | 10.90 | 0.02
Sadness | 4.61 | 4.13 | 86.75 | 4.60
Anger | 0.40 | 4.67 | 0.12 | 94.95

State-of-the-Art Methods | Datasets | Number of Emotions | Mean Recognition Accuracy (%)
---|---|---|---|
Gan et al. [48] * | AffectNet [47] | 4 | 88.05 |
Hua et al. [49] * | AffectNet [47] | 4 | 87.27 |
Proposed Video-based ER | LeanUX [25] | 4 | 91.46 |

Mean Classification Accuracy: 66.07%; Classification Error: 33.93%. Rows are ground-truth emotions; cell values are emotion recognition rates (%).

Ground Truth | Happiness | Neutral | Sadness | Anger
---|---|---|---|---
Happiness | 68.2 | 4.7 | 8.4 | 18.8
Neutral | 13.5 | 62.5 | 17.7 | 6.2
Sadness | 17.3 | 21.01 | 58.7 | 3.1
Anger | 1.99 | 11.8 | 14.8 | 71.5

State-of-the-Art Methods | Datasets | Number of Emotions | Mean Recognition Accuracy (%) |
---|---|---|---|
Deb et al. [51] | IEMOCAP [52] | 6 | 66.80 |
Singh et al. [50] | RAVDESS [53] | 8 | 81.20 |
Proposed Audio-based ER | LeanUX [25] | 4 | 66.07 |

Mean Classification Accuracy: 97.01%; Classification Error: 2.99%. Rows are ground-truth emotions; cell values are emotion recognition rates (%).

Ground Truth | Happiness | Neutral | Sadness | Anger
---|---|---|---|---
Happiness | 97.71 | 0.18 | 0.41 | 1.71
Neutral | 0.12 | 98.67 | 0.34 | 0.85
Sadness | 0.29 | 0.44 | 96.22 | 1.50
Anger | 1.12 | 0.54 | 2.80 | 95.42

State-of-the-Art Methods | Datasets | Number of Emotions | Mean Recognition Accuracy (%) |
---|---|---|---|
Razzaq et al. [36] | UnSkEm [36] | 6 | 96.73 |
Shi et al. [54] | Emilya [52] | 8 | 95.50 |
Proposed Skeletal-based ER | LeanUX [25] | 4 | 97.01 |

Mean Classification Accuracy: 97.71%; Classification Error: 2.29%. Rows are ground-truth emotions; cell values are emotion recognition rates (%).

Ground Truth | Happiness | Neutral | Sadness | Anger
---|---|---|---|---
Happiness | 98.21 | 0.10 | 0.27 | 1.42
Neutral | 0.21 | 98.85 | 0.10 | 0.84
Sadness | 0.10 | 0.04 | 98.08 | 1.91
Anger | 1.46 | 1.01 | 1.10 | 95.68

Modality | Accuracy (%) |
---|---|
Video-based ER | 91.46
Skeletal-based ER | 97.01
Audio-based ER | 66.07
Multimodal Feature Fusion | 97.71
Modality | EnsSize | Vote | Max | Arith | Prod | HMax | HArith | HProd | Best |
---|---|---|---|---|---|---|---|---|---|
Aud_Fac | 2 | 0.808 | 0.808 | 0.800 | 0.783 | 0.841 | 0.808 | 0.799 | HMax |
Aud_BL | 2 | 0.810 | 0.810 | 0.808 | 0.810 | 0.824 | 0.816 | 0.824 | HMax & HProd |
Fac_BL | 2 | 0.941 | 0.941 | 0.941 | 0.994 | 0.943 | 0.944 | 0.944 | Prod |
Aud_Fac_BL | 3 | 0.943 | 0.943 | 0.943 | 0.949 | 0.941 | 0.949 | 0.942 | Prod & HArith |
Aud_Fac_BL_FF | 4 | 0.978 | 0.978 | 0.979 | 0.972 | 0.982 | 0.972 | 0.974 | HMax |
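
For context on the column labels in the table above, the snippet below shows generic versions of the non-GM baseline combiners (Vote, Max, Arith, Prod) applied to made-up per-modality posterior vectors; it is an illustrative sketch, not the evaluation code behind the reported scores.

```python
# Generic illustrations of the non-GM baseline combiners from the table columns
# (Vote, Max, Arith, Prod); the shapes and values here are made-up examples.
import numpy as np

P = np.array([[0.70, 0.10, 0.10, 0.10],   # e.g., audio posteriors
              [0.40, 0.30, 0.20, 0.10],   # e.g., facial posteriors
              [0.80, 0.05, 0.05, 0.10]])  # e.g., body-language posteriors

vote  = np.argmax(np.bincount(np.argmax(P, axis=1), minlength=P.shape[1]))  # majority vote
max_r = np.argmax(P.max(axis=0))    # Max rule: largest single posterior per class wins
arith = np.argmax(P.mean(axis=0))   # Arith rule: average the posteriors, then argmax
prod  = np.argmax(P.prod(axis=0))   # Prod rule: multiply the posteriors, then argmax
```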

Mean Classification Accuracy: 98.19%; Classification Error: 1.81%. Rows are ground-truth emotions; cell values are emotion recognition rates (%).

Ground Truth | Happiness | Neutral | Sadness | Anger
---|---|---|---|---
Happiness | 98.47 | 0.11 | 0.14 | 1.49
Neutral | 0.06 | 98.79 | 0.12 | 1.07
Sadness | 0.98 | 0.44 | 98.49 | 0.21
Anger | 0.11 | 0.61 | 0.93 | 97.01

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Razzaq, M.A.; Hussain, J.; Bang, J.; Hua, C.-H.; Satti, F.A.; Rehman, U.U.; Bilal, H.S.M.; Kim, S.T.; Lee, S. A Hybrid Multimodal Emotion Recognition Framework for UX Evaluation Using Generalized Mixture Functions. Sensors 2023, 23, 4373. https://doi.org/10.3390/s23094373