Abstract
Depression is one of the most prominent challenges in global mental health, affecting the quality of life and psychological well-being of hundreds of millions of people. Because of its high prevalence, frequent recurrence, and strong association with other health problems, early diagnosis and treatment are crucial. With advances in technology, audio and visual data are increasingly recognized as biomarkers for identifying depression. However, many existing studies focus on a single modality and overlook the potential complementarity between modalities. This study therefore proposes an approach that combines convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) with an attention mechanism to extract deeper features from speech data. For facial expressions, a hybrid model of temporal convolutional networks (TCN) and long short-term memory networks (LSTM) is used. To integrate the two modalities, we design a cross-attention fusion strategy that combines speech and facial information in a unified framework. Experimental results on the E-DAIC dataset confirm the effectiveness of our method: the multimodal fusion strategy detects depression with higher accuracy and reliability than either single modality.
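To make the architecture described above concrete, the following is a minimal PyTorch sketch of the three components: a CNN–BiLSTM–attention audio branch, a TCN–LSTM visual branch, and cross-attention fusion. All module choices, feature dimensions (e.g., 40 speech features, 49 facial features), and hyperparameters are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only; dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """CNN -> BiLSTM -> self-attention over speech frames (hypothetical dims)."""

    def __init__(self, in_dim=40, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bilstm(h)                    # (batch, time, 2*hidden)
        h, _ = self.attn(h, h, h)                # self-attention over time
        return h


class VisualBranch(nn.Module):
    """Dilated temporal convolutions (TCN-style) -> LSTM over facial features."""

    def __init__(self, in_dim=49, hidden=256):
        super().__init__()
        self.tcn = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)                      # (batch, time, hidden)
        return h


class CrossAttentionFusion(nn.Module):
    """Each modality attends to the other; pooled outputs feed a classifier."""

    def __init__(self, dim=256, n_heads=4, n_classes=2):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, visual):            # both: (batch, time, dim)
        a, _ = self.a2v(audio, visual, visual)   # audio queries visual
        v, _ = self.v2a(visual, audio, audio)    # visual queries audio
        pooled = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.head(pooled)


if __name__ == "__main__":
    audio = AudioBranch()(torch.randn(2, 100, 40))    # 100 speech frames
    visual = VisualBranch()(torch.randn(2, 100, 49))  # 100 facial-feature frames
    logits = CrossAttentionFusion()(audio, visual)
    print(logits.shape)                               # torch.Size([2, 2])
```

The key design point is that cross-attention lets each modality query the other's time series, so the fused representation can align, for example, a vocally flat passage with the facial expressions produced at the same moment, rather than simply concatenating independently pooled features.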
Acknowledgement
The authors acknowledge the Key Research and Development Plan of Anhui Province (202104d07020006), the Natural Science Foundation of Anhui Province (2108085MF223), the University Natural Sciences Research Project of Anhui Province (KJ2021A0991), and the Key Research and Development Plan of Hefei (2021GJ030).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xu, X., Zhang, G., Mao, X., Lu, Q. (2024). Multimodal Depression Recognition Using Audio and Visual. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2014. Springer, Singapore. https://doi.org/10.1007/978-981-97-0903-8_22
DOI: https://doi.org/10.1007/978-981-97-0903-8_22
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0902-1
Online ISBN: 978-981-97-0903-8
eBook Packages: Computer Science, Computer Science (R0)