
INTELLIGENT SIGNAL PROCESSING FOR AFFECTIVE COMPUTING

Emotion Recognition From Multiple Modalities: Fundamentals and methodologies

Sicheng Zhao, Guoli Jia, Jufeng Yang, Guiguang Ding, and Kurt Keutzer
Humans are emotional creatures. Multiple modalities are often involved when we express emotions, whether we do so explicitly (such as through facial expression and speech) or implicitly (e.g., via text or images). Enabling machines to have emotional intelligence, i.e., recognizing, interpreting, processing, and simulating emotions, is becoming increasingly important. In this tutorial, we discuss several key aspects of multimodal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities. We then summarize existing emotion annotation strategies and corresponding computational tasks, followed by a description of the main challenges in MER. Furthermore, we present some representative approaches on representation learning of each affective modality, feature fusion of different affective modalities, and classifier optimization as well as domain adaptation for MER. Finally, we outline several real-world applications and discuss some future directions.

Introduction
Emotion is present everywhere in human daily life and can influence or even determine our judgment and decision making [1]. For example, in marketing, a widely advertised brand can generate a mental representation of a product in consumers' minds and influence their preferences and actions; inducing sadness and disgust during a shopping trip would, respectively, increase and decrease consumers' willingness to pay [31]. Drivers experiencing strong emotions, such as sadness, anger, agitation, and even happiness, are much more likely to be involved in an accident [32]. In education, especially the current online classes during the COVID-19 pandemic period, students' emotional experiences and interactions with teachers have a big impact on their learning ability, interest, engagement, and even career choices [33].

The importance of emotions in artificial intelligence was recognized decades ago. Minsky, a Turing Award winner in 1970, once claimed, "The question is not whether intelligent machines can have any emotions, but whether machines can be
intelligent without emotions" [2]. Enabling machines to have emotional intelligence, i.e., recognizing, interpreting, processing, and simulating emotions, has recently become increasingly important, with wide potential applications involving human–computer interaction [3].

On the one hand, emotionally intelligent machines can provide more harmonious and personal services for human beings, especially the elderly, those with disabilities, and children. For example, companion robots that can work with emotions can better meet the psychological and emotional needs of the elderly and help them stay comfortable.

On the other hand, by recognizing humans' emotions automatically and in real time, intelligent machines can better identify humans' abnormal behaviors, send reminders to their relatives and friends, and prevent extreme behaviors toward themselves and even the rest of society. For example, an emotion-monitoring system for driving can automatically play some soothing music to relax angry individuals who might be dissatisfied with a traffic jam and can remind them to focus on driving safely.

The first step for intelligent machines to express human-like emotions is to recognize and understand humans' emotions, typically through two groups of affective modalities: explicit affective cues and implicit affective stimuli. Explicit affective cues correspond to specific physical and psychological changes in humans that can be directly observed and recorded, such as facial expressions, eye movement, speech, actions, and physiological signals. These can be either easily suppressed and masked or difficult and impractical to capture.

Meanwhile, the popularity of mobile devices and social networks enables humans to habitually share their experiences and express their opinions online using text, images, audio, and video. Implicit affective stimuli correspond to these commonly used digital media, the analysis of which provides an implicit way to infer humans' emotions [4].

Regardless of whether emotions are expressed explicitly or implicitly, there are generally multiple modalities that can contribute to the emotion recognition task, as shown in Figure 1. As compared to unimodal emotion recognition, MER has several advantages. The first is data complementarity. Cues from different modalities can augment or complement each other. For example, if we see a social media post from a good friend saying, "What great weather!" it is highly probable that our friend is expressing a positive emotion, but, if there is also an auxiliary image of a storm, we can infer that the text is actually sarcastic and that a negative emotion is intended to be expressed.

The second is model robustness. Due to the influence of many normally occurring factors in data collection, such as sensor device failure, some data modalities might be unavailable, which is especially prevalent in the wild. For example, in the CALLAS data set containing speech, facial expression, and gesture modalities, the gesture stream is missing for some momentarily motionless users [5]. In such cases, the learned MER model can still work with the help of other available modalities.

The final advantage is performance superiority. Joint consideration of the complementary information of different modalities can result in better recognition performance. A meta-analysis indicates that, as compared to the best unimodal counterparts, MER achieves a 9.83% performance improvement on average [6].

In this article, we give a comprehensive tutorial on different aspects of MER, including psychological models, affective modalities, data collections and emotion annotations, computational tasks, challenges, computational methodologies, applications, and future directions. There have been several reviews/surveys on MER-related topics [4], [6]–[9]. In particular, [7] and [9] cover different aspects of general multimodal machine learning with few efforts on emotion recognition, [6] focuses on the quantitative review and meta-analysis of existing MER systems, and [4] and [8] are survey-style MER articles with a technical emphasis on multimodal fusion. However, this tutorial-style article aims to give a quick and comprehensive MER introduction that is also suitable for nonspecialists.

FIGURE 1. The multiple modalities for emotion recognition. Explicit affective cues include (a) facial expression, (b) action and gait, (c) speech, and (d) physiological signals. Implicit affective stimuli include (e) text, (f) image, and (g) video.

Psychological models
In psychology, categorical emotion states (CES) and dimensional emotion space (DES) are two representative types of models to measure emotion [10]. CES models define emotions as being in a few basic categories, such as binary sentiments (positive and negative, sometimes including neutral), Ekman's six basic emotions [happiness and surprise (positive) as well as anger, disgust, fear, and sadness (negative)], Mikels's eight emotions [amusement, awe, contentment, and excitement (positive) as well as anger, disgust, fear, and sadness (negative)], Plutchik's emotion wheel (eight basic emotion categories, each
with three intensities), and Parrott's tree hierarchical grouping (primary, secondary, and tertiary categories). The development of psychological theories motivates CES models to become increasingly diverse and fine-grained. DES models employ continuous 2D, 3D, or higher-dimensional Cartesian spaces to represent emotions; the most widely used DES model is valence–arousal–dominance (VAD), whose dimensions represent the pleasantness, intensity, and control degree of emotions, respectively.

CES models agree better with humans' intuition, but no consensus has been reached by psychologists on how many discrete emotion categories should be included. Furthermore, emotion is complex and subtle, which cannot be well reflected by limited discrete categories. DES models can theoretically measure all emotions as different coordinate points in the continuous Cartesian space, but the absolute continuous values are beyond users' understanding. These two types of definitions of emotions are related, with a possible transformation from CES to DES. For example, anger relates to negative valence, high arousal, and high dominance.

Besides emotion, there are several other widely used concepts in affective computing, such as mood, affect, and sentiment. Emotions can be expected, induced, or perceived. We do not aim to distinguish them in this article. Please refer to [11] for more details on the differences or correlations among these concepts.
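To make the CES-to-DES relation concrete, the sketch below hard-codes approximate VAD coordinates (each dimension scaled to [-1, 1]) for Ekman's six basic emotions. The numeric values are illustrative assumptions for demonstration, not a published mapping.

```python
# Illustrative only: rough VAD coordinates for a few CES categories.
# Each tuple is (valence, arousal, dominance) in [-1, 1]; values are assumptions.
CES_TO_VAD = {
    "happiness": ( 0.8,  0.5,  0.4),
    "surprise":  ( 0.4,  0.8,  0.0),
    "anger":     (-0.6,  0.7,  0.6),
    "fear":      (-0.7,  0.7, -0.6),
    "sadness":   (-0.7, -0.4, -0.4),
    "disgust":   (-0.6,  0.3,  0.2),
}

# Anger maps to negative valence and high arousal/dominance, as in the example above.
valence, arousal, dominance = CES_TO_VAD["anger"]
```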
Affective modalities
In the area of MER, multiple modalities are employed to recognize and predict human emotions. The affective modalities in MER can be roughly divided into two groups based on whether emotions are recognized from humans' physical body changes or external digital media: explicit affective cues and implicit affective stimuli.

The former group includes facial expression, eye movement, speech, action, gait, and electroencephalography (EEG), all of which can be directly observed, recorded, or collected from an individual. Meanwhile, the latter group comprises commonly used digital media types, such as text, audio, images, and video. We use these data types to store information and knowledge as well as transfer them among digital devices. In this way, emotions may be implicitly involved and evoked. Although the efficacy of one specific modality as a reliable channel to express emotions cannot be guaranteed, jointly considering multiple modalities would significantly improve the reliability and robustness [12].

Explicit affective cues
A facial expression is an isolated motion of one or more human face regions/units or a combination of such motions. It is commonly agreed that facial expressions can carry informative affective cues, and they are recognized as one of the most natural and powerful signals to convey the emotional states and intentions of humans [12]. Facial expression is also a form of nonverbal communication conveying social information among humans.

We can deduce how an individual is feeling by observing his or her eye movement [34]. The eyes are often viewed as important cues of emotions. For example, if a person is nervous or lying, the blinking rate of his or her eyes may become slower than normal [34]. Eye movement signals can be easily collected via an eye-tracker system and have been widely used in human–computer interaction research.

Speech is a significant vocal modality to carry emotions [13], [14]. Speakers may express their intentions, like asking or declaring, by using various intonations, degrees of loudness, and tempo. Specifically, emotions can be revealed when people talk with each other or just mutter to themselves.

As an important part of human body language, action also conveys massive information about emotion. For instance, an air punch is an act of thrusting one's clenched fist up into the air, typically as a gesture of triumph or elation.

Similar to action, emotions can be perceived from a person's gait, i.e., his or her walking style. The psychology literature has proven that participants can identify the emotions of a subject by observing his or her posture, including long strides, a collapsed upper body, and so on [35]. Body movement (e.g., walking speed) also plays an important role in the perception of different emotions. High-arousal emotions, such as anger and excitement, are more associated with rapid movements than low-arousal emotions, such as sadness and contentment.

Last but not least, EEG, as one representative physiological signal, is another important method for recording the electrical and emotional activity of the brain [15]. Compared to the other aforementioned explicit cues, the collection of EEG signals is typically more difficult and unnatural, regardless of whether electrodes are placed noninvasively along the scalp or invasively using electrocorticography.

Implicit affective stimuli
Text is a form used to record the natural language of human beings, which can implicitly carry informative emotions [16], [17]. It has different levels of linguistic components, including words, sentences, paragraphs, and articles, which are well studied; many off-the-shelf algorithms have been developed to segment text into small pieces. Then, the affective attribute of each linguistic piece is recognized with the help of a publicly available dictionary like SentiWordNet, and the emotion evoked by the text can be deduced.

A digital audio signal is a representation of sound, typically stored and transferred using a series of binary numbers [12]. Audio signals may be synthesized directly or originate at a transducer, such as a microphone or musical instrument. Unlike speech, which mainly focuses on human vocal information and the content of which may be translated into natural language, audio is more general, including any sound, like music or birdsong.

An image is a distribution of colored dots over space [36]. The phrase "a picture is worth a thousand words" is well known. It has been demonstrated in psychology that emotions can be evoked in humans by images [18]. The
explosive growth of images shared online and the powerful descriptive ability of scenes have enabled images to become crucial affective stimuli, which has attracted extensive research efforts [10].

Video naturally contains multiple modalities at the same time, such as visual, audio, and textual information [19]. That means temporal, spatial, and multichannel representations can be learned and utilized to recognize the emotions in videos.

Data collections and emotion annotations
Two steps are usually involved in constructing an MER data set: data collection and emotion annotation. The collected data can be roughly divided into two categories: selecting from existing data and new recording in specific environments.

On the one hand, some data are selected from movies, reviews, videos, and TV shows in online social networks, such as YouTube and WeiBo. For example, the review videos in ICT-MMMO and MOUD are collected from YouTube; audiovisual clips are extracted from TV series in MELD; online reviews from the food and restaurant categories are crawled in Yelp; and video blogs from YouTube, typically with one speaker looking at the camera, are collected in CMU-MOSI to capture the speakers' information. Some collected data provide a transcription of speech either manually (e.g., CMU-MOSI and CH-SIMS) or automatically (such as ICT-MMMO and MELD).
On the other hand, some data are newly recorded with different sensors in specifically designed environments. For example, participants' physiological signals and frontal facial changes induced by music videos are recorded in DEAP.

There are different kinds of emotion annotation strategies. Some data sets have target emotions and do not need to be annotated. For example, in EMODB, each sentence performed by actors corresponds to a target emotion. For some data sets, the emotion annotations are obtained automatically. For example, in Multi-ZOL, the integer sentiment score for each review, ranging from 1 to 10, is regarded as the sentiment label.

For other data sets, such as VideoEmotion-8, several workers are employed to annotate the emotions. The data sets with recorded data are usually annotated by participants' self-reporting, such as MAHNOB-HCI. In addition, the emotion labels are typically obtained by majority voting.

For DES models, "FeelTrace" and "SAM" are often used for annotation. The former is based on the activation-evaluation space, which allows observers to track the emotion content of a stimulus as they perceive it over time. The latter is a tool that accomplishes emotion rating based on different Likert scales. Some commonly used data sets are summarized in Table 1.

Computational tasks
Given multimodal affective signals, we can conduct different MER tasks, including classification, regression, detection, and retrieval. In this section, we briefly introduce what these tasks do.

Emotion classification
In the emotion classification task, we assume that one instance can belong to only one or a fixed number of emotion categories, and the goal is to discover class boundaries or distributions in the data space [16]. Current works mainly focus on the manual design of multimodal features and classifiers or employing deep neural networks in an end-to-end manner.

When defined as a single-label learning problem, MER assigns a single dominant emotion label to each sample. However, the emotion may be a mixture of all components from various regions or sequences rather than a single representative emotion. Meanwhile, different people may have varying emotional reactions to the same stimulus, which is caused by a variety of elements, like personality.

Thus, multilabel learning (MLL) has been utilized to study the problem where one instance is associated with multiple emotion labels. Recently, to address the problem that MLL does not fit some real applications well where the overall distribution of different labels' importance matters, label-distribution learning is proposed to cover a certain number of labels, representing the degree to which each emotion label describes the instance [20].

Emotion regression
Emotion regression aims to learn a mapping function that can effectively associate one instance with continuous emotion values in a Cartesian space. The most common regression algorithms for MER aim to assign the average dimension values to the instance. To deal with the inherent subjectivity characteristic of emotions, researchers propose predicting the continuous probability distribution of emotions, represented in the dimensional VA space. Specifically, VA emotion labels can be represented by a Gaussian mixture model (GMM), and then the emotion distribution prediction can be formalized as a parameter learning problem [21].
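As a minimal sketch of the GMM idea just described, the following fits a mixture model to a handful of made-up valence-arousal annotations with scikit-learn; the data points, the number of components, and the library choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy valence-arousal annotations from several annotators for one stimulus
# (values are made up for illustration).
va_points = np.array([[0.6, 0.4], [0.7, 0.5], [0.2, 0.8], [0.1, 0.7]])

# Fit a two-component GMM; its parameters describe the continuous emotion
# distribution rather than a single averaged VA value.
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(va_points)
print(gmm.weights_)   # mixing coefficients
print(gmm.means_)     # component means in VA space
```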
Emotion detection
As the raw data do not ensure carrying emotions, or only part of the data can evoke emotional reactions, emotion detection aims to find out which kind of emotion lies where in the source data. For example, a restaurant review on Yelp might read, "This location is conveniently located across the street from where I work; being walkable is a huge plus for me! Foodwise, it's the same as almost every location I've visited, so there's nothing much to say there. I do have to say that the customer service is hit or miss." Meanwhile, the overall rating score is three stars out of five. This review contains different emotions and attitudes: positive in the first sentence, neutral in the second sentence, and negative in the last sentence. As such, it is crucial for the system to detect which sentence corresponds to each emotion. Another example is affective region detection in images [22].
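A toy illustration of sentence-level emotion detection in the spirit of the review example above: split the text into sentences and score each with a tiny hand-made polarity lexicon. The lexicon, the splitting rule, and the review text are illustrative assumptions, not part of any published system.

```python
import re

# Tiny illustrative polarity lexicon (an assumption, not a real resource).
POLARITY = {"plus": 1, "huge": 1, "great": 1, "miss": -1, "bad": -1}

def detect_sentence_emotions(review, lexicon=POLARITY):
    results = []
    # Naive sentence splitting on ., !, or ? followed by whitespace.
    for sent in re.split(r"(?<=[.!?])\s+", review.strip()):
        score = sum(lexicon.get(w.strip(".,!?").lower(), 0) for w in sent.split())
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        results.append((label, sent))
    return results

review = ("This location is walkable, which is a huge plus! "
          "Foodwise, it is the same as almost every location. "
          "Customer service is hit or miss.")
for label, sent in detect_sentence_emotions(review):
    print(label, "->", sent)   # positive / neutral / negative
```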

Emotion retrieval
How to search affective content based on human perception is another meaningful task. The existing framework first detects local interest patches or sequences in the query and candidate data sources. Then, it discovers all matched pairs by determining whether the distance between two patches or sequences is less than a given fixed threshold. The similarity score between the query and each candidate is calculated as the quantity of matched components, followed by ranking the candidates of this query accordingly. While an affective retrieval system is useful for obtaining online content with the desired emotions from a massive repository [10], again, the abstract and subjective characteristics make the task challenging and difficult to evaluate.
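The matching-and-ranking scheme described above can be sketched as follows, assuming each query and candidate is already represented by a set of local descriptor vectors; the descriptor dimension, the distance threshold, and the random data are placeholders.

```python
import numpy as np

def match_count(query_patches, cand_patches, thresh=0.5):
    """Number of query patches whose nearest candidate patch lies within thresh."""
    d = np.linalg.norm(query_patches[:, None, :] - cand_patches[None, :, :], axis=-1)
    return int((d.min(axis=1) < thresh).sum())

def rank_candidates(query_patches, candidates, thresh=0.5):
    # Similarity score = number of matched components; sort candidates by it.
    scores = [match_count(query_patches, c, thresh) for c in candidates]
    return np.argsort(scores)[::-1], scores

query = np.random.rand(10, 32)                                   # 10 local descriptors, 32-D
candidates = [np.random.rand(np.random.randint(5, 15), 32) for _ in range(4)]
order, scores = rank_candidates(query, candidates)
```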

Table 1. A brief summary of released data sets for MER.

Data Set | Modalities | Samples | Data Sources | Emotion Labels | Website
IEMOCAP | Face, speech, t-text, and video | 10,039 turns | Recording | ang, sad, hap, dis, fea, sur, fru, exc, and neu; VAD on 5-point ratings | https://sail.usc.edu/iemocap
YouTube | Face, eye, speech, t-text, and video | 47 videos | YouTube | pos, neg, and neu | http://multicomp.cs.cmu.edu/rsources/youtube-dataset-2
MOUD | Face, speech, t-text, and video | 412 utterances | YouTube | pos and neg | http://web.eecs.umich.edu/~mihalcea/downloads.html#MOUD
ICT-MMMO | Face, eye, speech, t-text, and video | 370 segments | YouTube and ExpoTV | pos and neg | http://multicomp.cs.cmu.edu/resources/ict-mmmo-dataset
News Rover | Face, speech, t-text, and video | 929 videos | News | pos, neg, and neu | https://www.ee.columbia.edu/ln/dvmm/newsrover/sentimentdataset
CMU-MOSI | Face, eye, speech, t-text, and video | 2,199 clips | YouTube | –3 to 3 sentiment score | http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset
CMU-MOSEI | Face, eye, speech, t-text, and video | 23,453 sentences | YouTube | hap, sad, ang, fea, dis, and sur; –3 to 3 sentiment score | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/
MELD | Face, speech, t-text, and video | 13,708 utterances | TV series Friends | hap, sad, ang, fea, dis, sur, neu, and non-neu; pos, neg, and neu | https://affective-meld.github.io
CH-SIMS | Face, eye, speech, t-text, and video | 2,281 segments | Movies, TV series, and variety shows | –1 to 1 sentiment score | https://github.com/thuiar/MMSA
eNTERFACE'05 | Face, speech, and video | 1,166 sequences | Recording | ang, fea, hap, sad, and sur | http://www.enterface.net/enterface05
SEMAINE | Face, speech, t-text, and video | 959 conversations | Recording | val, act, pow, exp, and int; bas-em, eps, ipa, and vad | https://semaine-db.eu/
EMDB | Video, SCL, and HR | 52 clips | Films | ero, hor, neg, pos, sce, and obm; VAD on 9-point ratings | EMDB@psi.uminho.pt
DEAP | Face, EEG, GSR, RA, ST, ECG, BVP, EMG, and EOG | 1,280 samples | Recording | VAD-L on 9-point ratings; F on 5-point ratings | http://www.eecs.qmul.ac.uk/mmv/datasets/deap/
MAHNOB-HCI | Face, eye, audio, EEG, ECG, GSR, ST, and RA | 532 samples | Recording | sad, joy, dis, neu, hap, amu, ang, fea, sur, and anx; VAD-P on 9-point ratings | https://mahnob-db.eu/hci-tagging
Multi-ZOL | Image and text | 28,469 aspect-review pairs | ZOL | 0–10 sentiment score | https://github.com/xunan0812/MIMN
Yelp | Image and text | 244,569 images and 44,305 reviews | Yelp | Sentiment score on 5-point ratings | https://github.com/PreferredAI/vista-net
Tourism | Image and text | 1,796 weibos | WeiBo | pos, neg, and neu | https://github.com/wlj961012/Multi-Modal-Event-awareNetwork-for-SentimentAnalysis-in-Tourism
LIRIS-ACCEDE | Video (audio and image) | 9,800 clips | Movies | Rank along valence | https://liris-accede.ec-lyon.fr
VideoEmotion-8 | Video (audio and image) | 1,101 videos | YouTube and Flickr | ang, ant, dis, fea, joy, sad, sur, and tru | http://www.yugangjiang.info/research/VideoEmotions/index.html
Ekman-6 | Video (audio and image) | 1,637 videos | YouTube and Flickr | ang, dis, fea, joy, sad, and sur | https://github.com/kittenish/Frame-Transformer-Network

Modalities: BVP: blood volume pressure; ECG: electrocardiogram; EMG: electromyogram; EOG: electro-oculogram; GSR: galvanic skin response; HR: heart rate; PPS: peripheral physiological signal; RA: respiration amplitude; SCL: skin conductance level; ST: skin temperature; t-text: transcript text.
Emotion labels: amu: amusement; ang: angry; ant: anticipation; anx: anxiety; dis: disgust; ero: erotic; exc: excited; F: familiarity; fea: fear; fru: frustration; hap: happiness; hor: horror; L: liking; neg: negative; neu: neutral; obm: object manipulation; P: predictability; pos: positive; sad: sadness; sce: scenery; sur: surprise; tru: trust; act: activation; bas-em: basic emotions; eps: epistemic states; exp: expectation; ipa: interaction process analysis; int: intensity; pow: power; val: valence; vad: validity.

Challenges
As stated in the "Introduction" section, MER has several advantages as compared to unimodal emotion recognition, but it also faces more challenges.

Affective gap
The affective gap, which measures the inconsistency between extracted features and perceived high-level emotions, is one main challenge for MER. The affective gap is even more challenging than the semantic gap in objective multimedia analysis. Even if the semantic gap is bridged, there might still exist an affective gap.

For example, a blooming and a faded rose both contain a rose, but they can evoke different emotions. For the same sentence, different voice intonations may correspond to totally different emotions. Extracting discriminative high-level features, especially those related to emotions, can help to bridge the affective gap. The main difficulty lies in how to evaluate whether the extracted features are related to emotions.

Perception subjectivity
Due to many personal, contextual, and psychological factors, such as cultural background, personality, and social context, different people might have varying emotional responses to the same stimuli [10]. Even if the emotion is the same, their physical and psychological changes can also be quite divergent.

For example, all of the 36 videos in the ASCERTAIN data set for MER are labeled with at least four out of seven different valence and arousal scales by 58 subjects [15]. This clearly indicates that some subjects have opposite emotional reactions to the same stimuli. Take a short video with a storm and thunder, for instance: some people may feel awe because they have never seen such extreme weather, others may experience fear because of the loud thunder noise, some may be excited to capture such rare scenes, still others may feel sad because they have to cancel their travel plans, and so on.

Even for the same emotion (e.g., excitement), there are different reactions, such as facial expression, gait, action, and speech. For the subjectivity challenge, one direct solution is to learn personalized MER models for each subject. From the perspective of stimuli, we can also predict the emotion distribution when a certain number of subjects are involved. Besides the content of the stimuli and direct physical and psychological changes, jointly modeling the personal, contextual, and psychological factors mentioned earlier would also contribute to the MER task.

Data incompleteness
Because of the presence of many inevitable factors in data collection, such as sensor device failure, the information in specific modalities might be corrupted, which results in missing or incomplete data. Data incompleteness is a common phenomenon in real-world MER tasks.

For example, for explicit affective cues, an EEG headset might record contaminated signals or even fail to capture any signal; at night, cameras cannot capture clear facial expressions. For implicit affective stimuli, one user might post a tweet containing only an image (without text); for some videos, the audio channel does not change much. In such cases, the simplest feature fusion method, i.e., early fusion, does not work because we cannot extract any features given no captured signal. Designing effective fusion methods that can deal with data incompleteness is therefore a widely employed strategy.

Cross-modality inconsistency
Different modalities of the same sample may conflict with each other and, thus, express varying emotions. For example, facial expressions and speech can be easily suppressed or masked to avoid being detected, but EEG signals, which are controlled by the central nervous system, can reflect humans' unconscious body changes. When people post tweets on social media, it is very common that the images are not semantically correlated to the text. In such cases, an effective MER method is expected to automatically evaluate which modalities are more reliable, such as by assigning a weight to each one.

Cross-modality imbalance
In some MER applications, different modalities may contribute unequally to the evoked emotion. For example, online news plays an important role in our daily lives, and, in addition to understanding the preferences of readers, predicting their emotional reactions is of great value in various applications, such as personalized advertising. However, a piece of online news usually includes imbalanced text and images; i.e., an article may be very long, with lots of detailed information, while only one or two illustrations are inserted into the news. Potentially more problematic, the editor of the news may select a neutral image for an article with an obvious sentiment.

Label noise and absence
Existing MER methods, especially the ones based on deep learning, require large-scale labeled data for training. However, in real-world applications, labeling emotions for ground-truth generation is not only prohibitively expensive and time-consuming but also highly inconsistent, which results in a large amount of data but with few or even no emotion labels. With the increasingly diverse and fine-grained emotion requirement, we might have enough training data for some emotion categories but not for others. One alternate solution to manual annotation is to leverage the tags or keywords of social tweets as emotion labels, but such labels are incomplete and noisy. As such, designing effective algorithms for unsupervised/weakly supervised learning and few-/zero-shot learning can provide potential solutions.

Meanwhile, we might have sufficient labeled affective data in one domain, such as synthetic facial expression and speech. The problem turns to how to effectively transfer the MER model trained on the labeled source domain to another unlabeled target domain. The presence of a domain shift causes significant performance decay when a direct transfer is used [23]. Multimodal domain adaptation and domain generalization can help to mitigate such domain gaps. Practical settings, such as multiple source domains, should also be considered.

Computational methodologies
Generally, there are three components in an MER framework with sufficient labeled training data in the target domain: representation learning, feature fusion, and classifier optimization, as shown in Figure 2. In this section, we introduce these components. Further, we describe domain adaptation for the case when there is no labeled training data in the target domain and sufficient labeled data are available in another related source domain.

FIGURE 2. A widely used MER framework, which consists of three components: representation learning to extract feature representations, feature fusion to combine features from different modalities, and classifier optimization to learn specific task models (e.g., classification, regression, detection, and retrieval); n is the number of different modalities.
Representation learning of each affective modality
To represent text in a form that can be understood by computers, the following aspects are required: first, representing the symbolic words as real numbers for the next computation; second, modeling the semantic relationships; and, finally, obtaining a unified representation for the whole text [16]. In the beginning, words are represented by one-hot vectors with the length of the vocabulary size, where, for the t-th word in the vocabulary, only position t is one and the other positions are zero. As the scale of the data increases, the dimension of this one-hot vector increases dramatically.

Later, researchers used language models to train word vectors by predicting context, obtaining word vectors of a fixed dimension. Popular word vector representation models include word2vec, GLOVE, BERT, and XLNet, among others.

The text feature extraction methods have developed from simple to complex ones as well. Text features can be obtained by simply averaging word vectors. A recurrent neural network (RNN) is used to model the sequential relations of words in the text. A convolutional neural network (CNN), which has been widely employed in the computer vision community, is also used to extract the contextual relations between words.
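A minimal sketch of the simplest of these options, averaging word vectors into a sentence representation. The random stand-in table mimics a pretrained lookup (e.g., loaded GLOVE vectors) and is an assumption for illustration.

```python
import numpy as np

EMBED_DIM = 50
# Hypothetical pretrained embedding lookup; random vectors stand in for real ones.
embedding_table = {w: np.random.randn(EMBED_DIM) for w in ["what", "an", "exciting", "day"]}

def sentence_vector(tokens, table, dim=EMBED_DIM):
    # Average the vectors of in-vocabulary words; fall back to zeros if none are found.
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

feature = sentence_vector("what an exciting day".split(), embedding_table)  # shape: (50,)
```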
To date, plenty of methods have been developed to design representative features for emotion stimuli in audios [13], [14]. It has been found that audio features, such as pitch, log energy, zero-crossing rate, spectral features, voice quality, and jitter, are useful in emotion recognition. The ComParE acoustic feature set has been commonly used as the baseline set for the ongoing Computational Paralinguistics Challenge series since 2013. However, because of possible high similarities in certain emotions, a single type of audio feature is not discriminative enough to classify emotions.

To solve this problem, some approaches propose combining different types of features. Recently, with the development of deep learning, CNNs are shown to achieve state-of-the-art
performance on large-scale tasks in many domains dealing with natural data, and audio emotion recognition is, of course, also included. Audio is typically transferred into a graphical representation, such as a spectrogram, to be fed into a CNN. Since the CNN uses shared weight filters and pooling to give the model better spectral and temporal invariant properties, it typically yields better generalized and more robust models for emotion recognition.
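A minimal sketch of producing such a spectrogram input, assuming the librosa library and a mono audio file; the sampling rate and number of mel bands are illustrative choices.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_mels=64):
    # Load and resample the waveform to a fixed rate.
    y, _ = librosa.load(path, sr=sr)
    # Mel-scaled power spectrogram, converted to decibels for CNN input.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)
```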
Researchers have designed informative representations for emotional stimuli in images. In general, images can be divided into two types, nonrestrictive and facial expression. For the former, e.g., natural images, various handcrafted features, including color, texture, shape, composition, and so on, were developed to represent image emotion in the early years [10]. These low-level features are developed with inspiration from psychology and art theory.

Later, midlevel features based on visual concepts were presented to bridge the gap between the pixels in images and emotion labels. The most representative engine is SentiBank, which is composed of 1,200 adjective–noun pairs and shows remarkable and robust recognition performance among all of the hand-engineered features. In the era of deep learning, a CNN is regarded as a strong feature extractor in an end-to-end manner. Specifically, to integrate various representations of different levels, features are extracted from multiple layers of the CNN.

Meanwhile, an attention mechanism is employed to learn better emotional representations of specific local affective regions [22]. For facial expression images, the human face is first detected and aligned, and then the face landmarks are encoded for the recognition task. Note that, for those nonrestrictive images that contain human faces by chance, facial expression can be treated as an important midlevel cue.
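One way to realize the multilayer feature extraction mentioned above is with forward hooks on a CNN backbone. The sketch below assumes a recent torchvision and uses ResNet-18 purely as an example architecture; the chosen layers and pooling are illustrative.

```python
import torch
import torchvision.models as models

# ResNet-18 as a generic backbone (pass weights=ResNet18_Weights.DEFAULT for pretrained features).
backbone = models.resnet18(weights=None).eval()

features = {}
def make_hook(name):
    def hook(module, inputs, output):
        # Global-average-pool each feature map into a vector.
        features[name] = output.mean(dim=(2, 3))
    return hook

# Collect representations from several stages to combine different levels.
for stage in ["layer2", "layer3", "layer4"]:
    getattr(backbone, stage).register_forward_hook(make_hook(stage))

with torch.no_grad():
    _ = backbone(torch.randn(1, 3, 224, 224))        # dummy RGB image

multi_level = torch.cat(list(features.values()), dim=1)  # 128 + 256 + 512 = 896-D
```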
Earlier, we mentioned how to identify emotions in isolated modalities. Here, we first focus on perceiving emotions from successive frames. Then, we introduce how to build joint representations for videos. Compared to a single image, a video contains a series of images with temporal information [19].

To build representations of videos, a wide range of methods has been proposed. Early methods mainly utilize handcrafted local representations in this field, which include color, motion, and the shot cut rate. With the advent of deep learning, recent methods extract discriminative representations by adopting a 3D CNN that captures the temporal information encoded in multiple adjacent frames. After extracting modality-specific features in videos, integrating different types of features could obtain more promising results and improve the performance.
To perceive emotions from gait, there are mainly two ways to learn the representations [24]. For one thing, we can explicitly model the posture and movement information that is related to the emotions. To do this, we first extract the skeletal structure of a person and then represent each joint of the human body using the 3D coordinate system. After getting these coordinates, the angles, distances, or areas among different joints (posture information), velocity/acceleration (movement information), their covariance descriptors, and so on can be easily extracted.

For another thing, high-level emotional representations can be modeled from gait by long short-term memory (LSTM) networks, deep CNNs, or graph convolutional networks. Some methods extract optical flow from gait videos and then extract sequence representations using these networks. Others learn the skeletal structure of the gait and then feed it into multiple networks to extract discriminative representations.

Since various types of information about emotions, such as the frequency band, electrode position, and temporal data, can be explored from the brain's response to emotional stimuli, EEG signals are widely used in emotion analysis [15]. To extract discriminative features for EEG emotion recognition, differential entropy features from the frequency band or electrode-position relationship are very popular in previous works.

In addition to handcrafted features, we can also directly apply end-to-end deep learning neural networks, such as CNNs and RNNs, on raw EEG signals to obtain powerful deep features [25]. Inspired by the learning pattern of humans, spatialwise attention mechanisms are successfully applied to extract more discriminative spatial information. Furthermore, considering that EEG signals contain multiple channels, a channelwise attention mechanism can also be integrated into a CNN to exploit the interchannel relationship among feature maps.

Feature fusion of different affective modalities
Feature fusion, as one key research topic in MER, aims to integrate the representations from multiple modalities to predict either a specific category or continuous value of emotions. Generally, there are two strategies: model-free and model-based fusion [7], [9].

Model-free fusion, which is not directly dependent on specific learning algorithms, has been widely used for decades. We can divide it into early fusion, late fusion, and hybrid fusion [5]. All of these fusion methods can be extended from existing unimodal emotion recognition classifiers.

Early fusion, also named feature-level fusion, directly concatenates the feature representations from different modalities as a single representation. It is the most intuitive method for fusing multiple representations by exploiting the interactions between various modalities at an early stage, and it only requires training a single model. However, since the representations from the modalities might significantly differ, we have to consider the time synchronization problem to transform these representations into the same format before fusion. When one or more modalities are missing, such early fusion would fail.

Late fusion, also named decision-level fusion, instead integrates the prediction results from each single modality. Some popular mechanisms include averaging, voting, and signal variance. The advantages of late fusion include 1) flexibility and superiority (the optimal classifiers can be selected for different modalities) and 2) robustness (when some modalities are missing, late fusion can still work). However, the correlations between different modalities before the decision are ignored.

Hybrid fusion combines early and late fusion to exploit their advantages in a unified framework but with higher computational cost.

Model-based fusion, which explicitly performs fusion during the construction of learning models, has received more attention [7], [9], as shown in Figure 3, since model-free fusion is based on some simple techniques that are not specifically designed for multimodal data. For shallow models, kernel- and graph-based fusion are two representative methods; for recent popular deep models, neural network-, attention-, and tensor-based fusion are often used.

Kernel-based fusion is extended from classifiers that contain kernels, such as the support vector machine (SVM). For different modalities, different kernels are used. The flexibility in kernel selection and the convexity of the loss functions make multiple-kernel learning fusion popular in many applications, including MER. However, during testing, these fusion methods rely on the support vectors in the training data, which results in large memory cost and inefficient inference.

Graph-based fusion constructs separate graphs or hypergraphs for each modality, combines these graphs into a fused one, and learns the weights of different edges and modalities by graph-based learning. It can well deal with the data incompleteness problem simply by constructing graphs based on available data. Besides the extracted feature representations, we can also incorporate prior human knowledge into the models by corresponding edges. However, the computational cost would increase exponentially when more training samples are available.

Neural network-based fusion employs a direct and intuitive strategy to fuse the feature representations or predicted results of different modalities by a neural network. Attention-based fusion uses some attention mechanisms to obtain the weighted sum of a set of feature representations with scalar weights that are dynamically learned by an attention module. Different attention mechanisms correspond to fusing different components. For example, spatial image attention measures the importance of different image regions. Image and text coattention employs symmetric attention mechanisms to generate attended visual and textual representations. Parallel and alternating coattention methods can be used to, respectively, generate attention for different modalities simultaneously and one by one.

FIGURE 3. The different model-based fusion strategies, where n is the number of different modalities: (a) kernel-, (b) graph-, and (c) neural network-based fusion.
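A minimal sketch of the attention-based fusion described above, assuming the modality features have already been projected to a common dimension; the module structure and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weighted sum of modality embeddings with dynamically learned scalar weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # scores each modality embedding

    def forward(self, feats):                   # feats: (batch, n_modalities, dim)
        alpha = torch.softmax(self.score(feats), dim=1)   # (batch, n_modalities, 1)
        return (alpha * feats).sum(dim=1)                 # (batch, dim)

fusion = AttentionFusion(dim=64)
fused = fusion(torch.randn(8, 3, 64))           # three modalities projected to 64-D
```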
Recently, a multimodal adaptation gate (MAG) was designed to enable transformer-based contextual word representations, such as BERT and XLNet, to accept multimodal nonverbal data [17]. Based on the attention conditioned on the nonverbal behaviors, MAG can essentially map the informative multiple modalities to a vector with a trajectory and magnitude.

Tensor-based fusion tries to exploit the correlations of different representations by some specific tensor operations, such as the outer product and polynomial tensor pooling. These fusion methods for deep models are capable of learning from a large amount of data in an end-to-end manner with good performance but suffer from low interpretability.

One important property of these feature fusion methods is whether they support temporal modeling for MER in videos. It is obvious that early fusion can, while late and hybrid fusion cannot, since the predicted results based on each modality are already known before late fusion. For model-based fusion, excluding kernel-based fusion, all others can be used for temporal modeling. Example methods for graph-based fusion include hidden Markov models (HMMs) and conditional random fields (CRFs), and RNN and LSTM networks can be employed for neural network-based fusion.
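In the spirit of the outer-product tensor fusion mentioned above (as in the tensor fusion network compared later), here is a minimal two-modality sketch. Appending a constant one keeps the unimodal terms in the product; the dimensions are illustrative.

```python
import torch

def outer_product_fusion(za, zb):
    # Append a constant 1 so unimodal terms are retained alongside bimodal interactions.
    ones = torch.ones(za.size(0), 1)
    za = torch.cat([za, ones], dim=1)
    zb = torch.cat([zb, ones], dim=1)
    # Batched outer product, flattened into a single fusion vector.
    return torch.einsum("bi,bj->bij", za, zb).flatten(start_dim=1)

fused = outer_product_fusion(torch.randn(8, 32), torch.randn(8, 16))  # shape: (8, 33 * 17)
```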

Classifier optimization for MER
For text represented as a sequence of word embeddings, the most popular approaches to leverage the semantics among words are RNNs and CNNs. LSTM, as a typical RNN, contains a series of cells with the same structure. Every cell takes a word embedding and the hidden state from the last cell as input, computes the output, and updates the hidden state for the following cell. The hidden state records the semantics of previous words. A CNN computes local contextual features among consecutive words through convolution operations, and average- or max-pooling layers are used to further integrate the obtained features for the following sentiment classification.

Recently, researchers have begun to use transformer-based methods, e.g., BERT and GPT-3. The transformer is implemented as a series of modules containing a multihead self-attention layer followed by a normalization layer, a feed-forward network, and another normalization layer. The order of words in the text is also represented by another position embedding layer. Compared with an RNN, the transformer does not require the sequential processing of words, which improves parallelizability, and, compared with a CNN, the transformer can model relationships between more distant words.
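A minimal PyTorch sketch of the LSTM-based text classifier just described; the vocabulary size, dimensions, and three-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMSentimentClassifier(nn.Module):
    """Word embeddings -> LSTM -> final hidden state -> emotion logits."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                  # (batch, n_classes)

model = LSTMSentimentClassifier(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 30)))
```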
The classification approaches used in audio emotion recognition generally include two options: traditional and deep learning-based methods. For traditional methods, the HMM is representative because of its capability of capturing the dynamic characteristics of sequential data. The SVM is also widely utilized in audio emotion recognition.

Deep learning-based methods have become more popular since they are not restricted by the classical independence assumptions of HMM models. Among these techniques, sequence-to-sequence models with attention have shown success in an end-to-end manner. Recently, some approaches have significantly extended the state of the art in this area by developing deep hybrid convolutional and recurrent models [14].

For images, in the early years, similar to this task in other modalities, multiple handcrafted image features were integrated and input into an SVM to train classifiers. Then, based on deep learning, the classifier and feature extractor were connected and optimized in an end-to-end manner by corresponding loss functions, like cross-entropy loss [26]. In addition, popular metric losses, such as triplet and N-pair loss, also took part in the network optimization to obtain more discriminative features.

With the described learning paradigm, each image is predicted as a single dominant emotion category. However, based on the theories of psychology, an image may evoke multiple emotions in viewers, which leads to an ambiguity problem. To address this issue, label-distribution learning is employed to predict a concrete relative degree for each emotion category, where the Kullback–Leibler (KL) divergence is the most popular loss function.
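A minimal sketch of such a KL-divergence loss for label-distribution learning, assuming the annotated distribution is available (e.g., as vote fractions over emotion categories).

```python
import torch
import torch.nn.functional as F

def ldl_kl_loss(logits, target_dist):
    """KL divergence between predicted and annotated emotion distributions.

    logits: (batch, n_emotions) raw model outputs.
    target_dist: (batch, n_emotions) rows summing to 1 (e.g., annotator vote fractions).
    """
    log_pred = F.log_softmax(logits, dim=1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

loss = ldl_kl_loss(torch.randn(4, 8), torch.softmax(torch.randn(4, 8), dim=1))
```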
Some informative and attractive regions of an image often determine its emotion. Therefore, a series of architectures with extra attention or detection branches has been constructed. With optimization for multiple tasks, including attention and the original task, a more robust and discriminative model is obtained.

Most existing methods employ a two-stage pipeline to recognize video emotion, i.e., extracting visual and/or audio features and training classifiers. For the latter, many machine learning methods have been investigated to model the mapping between video features and discrete emotion categories, including SVM, GMM, HMM, dynamic Bayesian networks, and CRF. Although these approaches contributed to the development of emotion recognition in videos, recent methods have been proposed to recognize video emotions in an end-to-end manner based on deep neural networks due to their superior capability [27].

CNN-based methods first employ 3D CNNs to extract high-level spatiotemporal features, which contain affective information, and then use fully connected layers to classify emotions. Finally, the models are followed by the loss function to optimize the whole network. Inspired by the human process of perceiving emotions, CNN-based methods employ the attention mechanism to emphasize emotionally relevant regions of frames or segments in each video. Furthermore, considering the polar-emotion hierarchy constraint, recent methods propose a polarity-consistent cross-entropy loss to guide the attention generation.

The gait of a person can be represented as a sequence of 2D or 3D joint coordinates for each frame in walking videos. To leverage the inherent affective cues in the coordinates of joints, many classifiers or architectures have been used to extract affective features in the gait. LSTM networks contain many special units, i.e., memory cells, and can store the joint coordinate information from particular time steps in a long data sequence. Thus, they were used in some early work on gait emotion recognition.

The hidden features of the LSTM can be further concatenated with handcrafted affective features and are then fed into a classifier [e.g., SVM or random forest (RF)] to predict emotions. Recently, another popular network used in gait emotion prediction is the spatial–temporal graph convolutional network (ST-GCN), which was initially proposed for action recognition from human skeletal graphs. "Spatial" represents the spatial edges in the skeletal structure, which are the limbs that connect the body joints. "Temporal" refers to temporal edges, which connect the positions of each joint across different frames. ST-GCN can be easily implemented as a spatial followed by a temporal convolution, which is similar to deep convolutional networks.

EEG-based emotion recognition usually employs various classifiers, such as SVM, decision trees, and k-nearest neighbor, to classify handcrafted features in the early stage. Later, since CNNs and RNNs are good at extracting the spatial and temporal information of EEG signals, respectively, end-to-end structures, such as cascade convolutional recurrent networks (which combine a CNN and an RNN), LSTM-RNNs, and parallel convolutional RNNs, are successfully designed and applied to emotion recognition tasks.

Quantitative comparison of representative MER methods
To give readers an impression of the performance of state-of-the-art MER methods, we conduct experiments to fairly compare some representative methods based on the released codes

Table 2. A quantitative comparison of some representative methods for MER on five widely used data sets using GLOVE as word embeddings.

Train:val:test splits: CMU-MOSI 1,284:229:686; YouTube 30:5:11; ICT-MMMO 11:2:4; MOUD 49:10:20; IEMOCAP 3:1:1.

Method | CMU-MOSI (A2↑ F1↑ A7↑ M↓ C↑) | YouTube (A3↑ F1↑) | ICT-MMMO (A2↑ F1↑) | MOUD (A2↑ F1↑) | IEMOCAP (A9↑ F1↑ MV↓ CV↑ MA↓ CA↑)
SVM | 71.6 72.3 26.5 1.1 0.559 | 42.4 37.9 | 68.8 68.7 | 60.4 45.5 | 24.1 18 0.251 0.06 0.546 0.54
RF | 56.4 56.3 21.3 — — | 49.3 49.2 | 70 69.8 | 64.2 63.3 | 27.3 25.3 — — — —
THMM | 50.7 45.4 17.8 — — | 42.4 27.9 | 53.8 53 | 58.5 52.7 | 23.5 10.8 — — — —
MV-LSTM | 73.9 74 33.2 1.019 0.601 | 45.8 43.3 | 72.5 72.3 | 57.6 48.2 | 31.3 26.7 0.257 0.02 0.513 0.62
BC-LSTM | 73.9 73.9 28.7 1.079 0.581 | 47.5 47.3 | 70 71.1 | 72.6 72.9 | 35.9 34.1 0.248 0.07 0.593 0.4
TFN | 74.6 74.5 28.7 1.04 0.587 | 47.5 41 | 72.5 72.6 | 63.2 61.7 | 36 34.5 0.251 0.04 0.521 0.55
MARN | 77.1 77 34.7 0.968 0.625 | 54.2 52.9 | 86.3 85.9 | 81.1 81.2 | 37 35.9 0.242 0.1 0.497 0.65
MFN | 77.4 77.3 34.1 0.965 0.632 | 61 60.7 | 87.5 87.1 | 81.1 80.4 | 36.5 34.9 0.236 0.111 0.482 0.645

AN and F1 are percentages. ↑ and ↓ respectively indicate that higher and lower values represent better performance for the corresponding metrics (the same below).
Evaluation metrics: AN: emotion classification accuracy, where N denotes the number of emotion classes; M: mean absolute error; C: Pearson correlation; V: valence results; A: arousal results.

of the CMU multimodal software development kit [37] and MAG [38]. Specifically, the compared nondeep methods include SVM, RF, and the trimodal HMM (THMM); the compared deep methods include the multiview LSTM (MV-LSTM), bidirectional contextual LSTM (BC-LSTM), tensor fusion network (TFN), multiattention recurrent network (MARN), memory fusion network (MFN), fine-tuning (FT), and MAG.

We conduct experiments on five data sets: CMU-MOSI, YouTube, ICT-MMMO, MOUD, and IEMOCAP. All of the data sets contain three modalities: face, speech, and transcript text. For visual features, Facet is used to extract per-frame basic and advanced emotions and facial action units as indicators of facial muscle movement. For acoustic features, COVAREP is employed to extract 12 mel-frequency cepstral coefficients, pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients. For linguistic features, three different pretrained word embeddings, i.e., GLOVE, BERT, and XLNet, are employed to obtain the word vectors. For comparison, human performance is also reported on CMU-MOSI, with results derived from [39].

The input to the nondeep methods is the early fusion of multimodal features. For emotion classification, we use accuracy and F1 as metrics; for emotion regression, we use mean absolute error and the Pearson correlation. Higher values indicate better performance for all of the metrics, except mean absolute error, where lower values denote better performance.
From the results in Tables 2 and 3, we have the following Table 3. A quantitative comparison of some representative methods
for MER on the CMU-MOSI data set using BERT or XLNet as word
observations. First, the performances of deep models are gen-
embeddings.
erally better than nondeep ones. Second, for different data
sets, the methods with the best performances are different. For Method/
example, RF achieves the best performance among nondeep Metric A2 - F1- M. C-
models except CMU-MOSI, which demonstrates its good gen- TFN 74.8/78.2 74.1/78.2 0.955/0.914 0.649/0.713
MARN 77.7/78.3 77.9/78.8 0.938/0.921 0.691/0.707
eralization ability, while the performance of SVM is much bet-
MFN 78.2/78.3 78.1/78.4 0.911/0.898 0.699/0.713
ter than that of RF or THMM on CMU-MOSI. FT 83.5/84.7 83.4/84.6 0.739/0.676 0.782/0.812
Third, multiclass classification is more difficult than binary MAG 84.2/85.7 84.1/85.6 0.712/0.675 0.796/0.821
classification, such as 77.1 versus 34.7 of MARN on CMU-MOSI. Human 85.7 87.5 0.71 0.82
Fourth, comparing the same method in Tables 2 and 3 on CMU- The numbers on the left and right sides of “/” are the MER results based on BERT
MOSI, we can conclude that BERT and XLNet can provide and XLNet, respectively.

IEEE SIGNAL
Authorized licensed use limited to: VIT University- Chennai Campus. PROCESSINGonMAGAZINE
Downloaded | November
July 24,2025 2021 |UTC from IEEE Xplore. Restrictions apply.
at 03:50:43 69
moment matching. Besides the used discrepancy loss, there are
some other differences among existing methods, such as wheth-

FIGURE 4. A generalized framework for multimodal domain adaptation with one labeled source domain and one unlabeled target domain. The grayscale rectangles with text in bold represent different alignment
er the loss is at the domain or class level, which layer the loss is
operated on, whether the backbone networks share weights or
Task Classifier

Feature-Level
Alignment
not, and if the aligned distribution is marginal or joint.
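As a concrete example of such a discrepancy loss, the sketch below computes a (biased) estimate of the squared maximum mean discrepancy between batches of source and target activations with a Gaussian kernel; the kernel choice and bandwidth are illustrative assumptions.

import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between two batches of feature vectors.
    dist2 = torch.cdist(x, y).pow(2)
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd_loss(source_feat, target_feat, sigma=1.0):
    # Biased estimate of the squared maximum mean discrepancy (MMD).
    k_ss = gaussian_kernel(source_feat, source_feat, sigma).mean()
    k_tt = gaussian_kernel(target_feat, target_feat, sigma).mean()
    k_st = gaussian_kernel(source_feat, target_feat, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# The alignment term is simply added to the supervised loss on the source domain,
# e.g., total_loss = task_loss + lambda_mmd * mmd_loss(f_source, f_target).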
Adversarial discriminative models usually align the source and target domains with a domain discriminator by adversarially making different domains indistinguishable. The input to the discriminator ranges from the original data to extracted features, and the adversarial alignment can be global or classwise. We can also consider using shared or unshared feature extractors.
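A minimal sketch of this adversarial discriminative idea is shown below, using a gradient reversal layer (as in DANN-style methods) so that the discriminator learns to separate the domains while the shared feature extractor is pushed to make them indistinguishable; the layer sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses (and scales) the gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainDiscriminator(nn.Module):
    # Predicts a domain logit (source versus target) from a feature vector.
    def __init__(self, feat_dim=128, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat):
        return self.net(GradReverse.apply(feat, self.lamb))

# Training sketch: label source features 1 and target features 0 with nn.BCEWithLogitsLoss;
# the reversed gradient drives the feature extractor to confuse the two domains.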
Adversarial generative models usually employ a generator to create fake source or target data so that the domain discriminator cannot distinguish the generated domain from the real one. The generator is typically based on a generative adversarial network (GAN) and its variants, such as CoGAN, SimGAN, and CycleGAN. The input to the generator and discriminator can be different in different methods.
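The sketch below illustrates the generative idea at the feature level with a standard GAN recipe: a generator maps target features toward the source feature distribution, and a discriminator is trained to separate real source features from the generated ones. All names and dimensions are illustrative assumptions, not a specific published model.

import torch
import torch.nn as nn

feat_dim = 128
G = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def gan_losses(source_feat, target_feat):
    fake_source = G(target_feat)                    # target features mapped toward the source domain
    ones = torch.ones(source_feat.size(0), 1)
    zeros = torch.zeros(fake_source.size(0), 1)
    # Discriminator loss: real source features versus generated ones.
    d_loss = bce(D(source_feat), ones) + bce(D(fake_source.detach()), zeros)
    # Generator loss: fool the discriminator so generated features look like source features.
    g_loss = bce(D(fake_source), torch.ones(fake_source.size(0), 1))
    return d_loss, g_loss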


Self-supervision-based methods combine some auxiliary self-supervised learning tasks, such as reconstruction, image rotation prediction, jigsaw prediction, and masking, with the original task network to bring the source and target domains closer. We can compare these four types of domain adaptation methods from the perspectives of theory guarantee, efficiency, task scalability, data scalability, data dependency, optimizability, and performance. We can also combine some of these techniques to jointly exploit their advantages.
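As one example of such an auxiliary task, the sketch below builds a four-way rotation-prediction pretext: images from either domain are rotated by a random multiple of 90 degrees, and a small head on top of the shared encoder predicts the rotation, adding a self-supervised loss that both domains can optimize. The head size and task choice are illustrative assumptions.

import torch
import torch.nn as nn

class RotationHead(nn.Module):
    # Predicts which of {0, 90, 180, 270} degrees an image was rotated by.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 4)

    def forward(self, feat):
        return self.fc(feat)

def rotate_batch(images):
    # Build the pretext task: rotate each (C, H, W) image by a random multiple of 90 degrees.
    k = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(images, k)])
    return rotated, k  # rotated images and their rotation labels

# aux_loss = nn.functional.cross_entropy(rotation_head(encoder(rotated)), k)
# is added to the supervised task loss computed on the labeled source domain.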
FIGURE 4. A generalized framework for multimodal domain adaptation with one labeled source domain and one unlabeled target domain. Each modality is fed to its own feature extractor, the per-modality features are fused, and a task classifier is trained on the source domain; the grayscale rectangles with text in bold represent different alignment strategies (raw data-level alignment via generators, feature-level alignment, and self-supervision alignment). Most existing multimodal domain adaptation methods can be obtained by employing different component details, enforcing some constraints, or slightly changing the architecture.

The main difficulty in domain adaptation for MER lies in aligning the multiple modalities between the source and target domains simultaneously. There are some simple but effective ways to extend unimodal domain adaptation to multimodal settings, as shown in Figure 4. For example, we can use a discrepancy loss or a discriminator to align the fused feature representations. The correspondence between different modalities can also be used as a self-supervised alignment signal.

Extending adversarial generative models from unimodal to multimodal settings would be more difficult. Unlike images, other generated modalities, such as text and speech, might have confused semantics even though they can fool the discriminator. Generating intermediate feature representations instead of raw data can provide a feasible solution.
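Putting the pieces together for the multimodal case, the sketch below uses one feature extractor per modality, concatenation-based fusion, a task classifier trained on the labeled source data, and a discrepancy loss (the MMD sketched earlier) on the fused source and target representations; the extractor sizes, fusion choice, and loss weight are illustrative assumptions rather than details of any published instantiation of Figure 4.

import torch
import torch.nn as nn

class MultimodalDAModel(nn.Module):
    # One extractor per modality, concatenation fusion, and a task classifier on top.
    def __init__(self, input_dims, feat_dim=64, num_emotions=8):
        super().__init__()
        self.extractors = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, feat_dim), nn.ReLU()) for d in input_dims]
        )
        self.classifier = nn.Linear(feat_dim * len(input_dims), num_emotions)

    def fuse(self, modalities):
        # modalities: list of (batch, input_dims[i]) tensors, one per modality.
        return torch.cat([ext(x) for ext, x in zip(self.extractors, modalities)], dim=1)

    def forward(self, modalities):
        return self.classifier(self.fuse(modalities))

model = MultimodalDAModel(input_dims=[35, 74, 300])   # e.g., visual, acoustic, and textual features
ce = nn.CrossEntropyLoss()

def adaptation_loss(src_modalities, src_labels, tgt_modalities, lambda_align=0.1):
    f_src, f_tgt = model.fuse(src_modalities), model.fuse(tgt_modalities)
    task_loss = ce(model.classifier(f_src), src_labels)   # supervised only on the source domain
    align_loss = mmd_loss(f_src, f_tgt)                   # fused-feature alignment (see the MMD sketch above)
    return task_loss + lambda_align * align_loss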

Applications
Recognizing emotions from multiple explicit cues and implicit stimuli is of great significance in a broad range of real-world applications. Generally speaking, emotion is the most important aspect of the quality and meaning of our existence, and it makes life worth living. The emotional impact of digital data lies in its ability to improve the user experience of existing techniques and thus strengthen the knowledge transfer between people and computers [18].

Many people tend to post texts, images, and videos on social networks to express their daily feelings about life. Inspired by this, we can mine people’s opinions and sentiments toward topics and events happening in the real world [28].

For instance, user-generated content on Facebook or Instagram can be used to derive the attitudes of people from different countries and regions when they face a pandemic like COVID-19 [29]. Researchers also try to detect sentiment in social networks and apply the results to predict political elections. Note that, when the personalized emotion of an individual is detected, we can further group these emotions, which may contribute to predicting the tendencies of society.

Another important application of MER is business intelligence, especially marketing and consumer behavior analysis [30]. Today, most apparel e-retailers use human models to present products. The model’s face presentation is proven to have a significant effect on consumer approach behavior. Specifically, for participants whose emotional receptivity is high, a smiling facial expression tends to lead to the highest approach behavior. In addition, researchers examine how online store specialization influences consumer pleasure and arousal, based on the stimulus–organism–response framework.

Emotion recognition can also be used in call centers, where the goal is to detect the emotional states of both the caller and the operator. The system recognizes the involved emotions through the intonation and tempo as well as the texts transcribed from the corresponding speech. Based on this, we can receive feedback on the quality of the service.

Meanwhile, emotion recognition plays an important role in the field of medical treatment and psychological health. With the popularity of social media, some people prefer expressing their emotions over the Internet rather than to others. If a user is observed to be sharing negative information (e.g., sadness) frequently and continuously, it is necessary to track her or his mental status to prevent the occurrence of psychological illness and even suicide.

Emotional states can also be used to monitor and predict the fatigue level of a variety of people, like drivers, pilots, workers on assembly lines, and students in classrooms. This technique both prevents dangerous situations and benefits the evaluation of work/study efficiency. Further, emotional states can be incorporated into various security applications, such as systems for monitoring public spaces (e.g., bus/train/subway stations or football stadiums) for potential aggression.

Recently, an effective auxiliary system was introduced in the diagnosis and treatment process of autism spectrum disorder (ASD) in children to assist in collecting information on the condition. To help professional clinicians make a diagnosis and give treatment to ASD patients better and faster, this system characterizes facial expressions and eye gaze attention, which are considered to be remarkable indicators for the early screening of autism.

MER is also used to improve the personal entertainment experience. For example, a recent work on the brainwave–music interface maps EEG characteristics to musical structures (note, intensity, and pitch). Similarly, efforts have been made to understand the emotion-centric correlation between different modalities that is essential for various applications. Affective image–music matching provides a good chance of appending a sequence of music to a given image such that they both evoke the same emotion. This helps generate emotion-aware music playlists from one’s personal album photos on mobile devices.

Future directions
Existing methods have achieved promising performances in various MER settings, such as visual–audio, facial–textual–speech, and textual–visual tasks. However, the summarized challenges have not all been fully addressed. For example, the issues of how to extract discriminative features that are more related to emotion, balance common and personalized emotion reactions, and emphasize the more important modalities are still open. To help improve the performances of MER methods and make them fit special requirements in the real world, we provide some potential future directions.

New methodologies for MER
■ Contextual and prior knowledge modeling: The experienced emotion of a user can be significantly influenced by contextual information, such as the conversational and social environments. The prior knowledge of users, such as personality and age, can also contribute to emotion perception. For example, an optimistic user and a pessimistic viewer are likely to see different aspects of the same stimuli. Jointly considering this important contextual information and prior knowledge is expected to improve the MER performance. Graph-related methods, such as graph convolutional networks, are possible solutions to model the relationships among factors and emotions (a minimal sketch follows this list).
■ Learning from unlabeled, unreliable, and unmatched affective signals: In the big data era, affective data might be sparsely labeled or even unlabeled, raw data or labels can be unreliable, and test and training data could be unmatched. Exploring advanced machine learning techniques, such as unsupervised representation learning, dynamic data selection and balancing, and domain adaptation, as well as embedding the special properties of emotions, can help to address these challenges.
■ Explainable, robust, and secure deep learning for MER: Due to their black-box nature, it is difficult to understand why existing deep neural networks perform well for MER, and the trained deep networks are vulnerable to adversarial attacks and inevitable noises that might cause erratic behavior. Essentially explaining the decision-making process of deep learning can help with the design of robust and secure MER systems.
■ A combination of explicit and implicit signals: Both explicit and implicit signals are demonstrated to be useful for MER, but they also suffer from some limitations. For example, explicit signals can be easily suppressed or are difficult to capture, while implicit signals might not reflect the emotions in real time. Jointly combining them to explore complementary information during a viewer–multimedia interaction would boost the MER performance.
■ The incorporation of emotion theory into MER: Different theories have been proposed in psychology, physiology, neurology, and the cognitive sciences. These theories can help us understand how humans produce emotion, but they have not been employed in the computational MER task. We believe such an incorporation would make more sense for recognizing emotions.
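As a minimal illustration of the graph-related modeling mentioned in the first item above, the sketch below implements a single graph convolution layer that propagates information among nodes that could represent, e.g., utterances, contextual factors, and user attributes; the graph construction and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    # One graph convolution: normalized neighborhood aggregation followed by a linear map.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # adj: (num_nodes, num_nodes) adjacency matrix with self-loops already added.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = adj @ node_feats / deg          # average features over each node's neighbors
        return torch.relu(self.linear(agg))

# Nodes could be utterances plus context and prior-knowledge factors; edges encode their relations.
# h = GraphConvLayer(128, 64)(node_feats, adj)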

More practical MER settings
■ MER in the wild: Current MER methods mainly focus on neat lab settings. However, MER problems in the real world are much more complex. For example, the collected data might contain much noise that is unrelated to emotion; the users in the test set could come from different cultures and languages than those in the training set, resulting in varying ways of expressing emotion; different emotion label spaces might be employed across various settings; or training data may be incrementally available. Designing an effective MER model that is generalizable to these practical settings is worth investigating.
■ MER on the edge: When deploying MER models on edge devices, such as mobile phones and security cameras, we have to consider the computing limitations and data privacy. Techniques like autopruning, neural architecture search, invertible neural networks, and software–hardware co-design are believed to be beneficial for efficient on-device training.
■ Personalized and group MER: Because of the subjectivity of emotions, simply recognizing the dominant emotion of different individuals is insufficient. It is ideal but impractical to collect enough data for each individual to train personalized MER models. Adapting the well-trained MER models for dominant emotions to each individual with a small amount of labeled data is a possible alternative solution. On the other hand, it would make more sense to predict emotions for groups of individuals who share similar tastes or interests and have a similar background. Group emotion recognition is essential in many applications, such as recommendation systems, but how to classify users into different groups is still challenging.

Real applications based on MER
■ The implementation of MER in real-world applications: Although emotion recognition has been emphasized as important for decades, it has rarely been applied to real scenarios due to relatively low performance. With the recent rapid progress of MER, we can begin incorporating emotion into different applications in the marketing, education, health care, and service sectors. The feedback from the applications can, in turn, promote the development of MER. Together with emotion generation, we believe an age of artificial emotional intelligence is coming.
■ Wearable, simple, and accurate affective data collection: To conduct MER tasks, the first step is to collect accurate affective data. Developing wearable, simple, and even contactless sensors to capture such data would make it more acceptable to users.
■ Security, privacy, ethics, and fairness of MER: During data collection, it is possible to extract users’ confidential information, such as identity and age. Protecting the security and privacy of users and avoiding any chance of misuse must be taken into consideration. Emotion recognition in real applications might have a negative and even dangerous impact on a person, such as emotional pressure. Methods to eliminate such an effect should also be considered from the perspectives of ethics and fairness.

Conclusions
In this article, we provided a comprehensive tutorial on MER. We briefly introduced emotion representation models, both explicit and implicit affective modalities, emotion annotations, and corresponding computational tasks. We summarized the main challenges of MER in detail, and then we emphatically introduced different computational methodologies, including the representation learning of each affective modality, feature fusion of different affective modalities, and classifier optimization as well as domain adaptation for MER. We ended this tutorial with discussions of real-world applications and future directions. We hope this tutorial can motivate novel techniques to facilitate the development of MER, and we believe that this area will continue to attract significant research efforts.

Acknowledgments
This work was supported by the National Key Research and Development Program of China (grant 2018AAA0100403), the National Natural Science Foundation of China (grants 61701273, 61876094, U1933114, 61925107, and U1936202), the Natural Science Foundation of Tianjin, China (grants 20JCJQJC00020, 18JCYBJC15400, and 18ZXZNGX00110), and Berkeley DeepDrive. Jufeng Yang is the corresponding author of this article.

Authors
Sicheng Zhao (schzhao@gmail.com) received his Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2016. He is a postdoctoral research scientist at Columbia University, New York, New York, 10032, USA. He was a visiting scholar at the National University of Singapore from July 2013 to June 2014, a research fellow at Tsinghua University from September 2016 to September 2017, and a research fellow at the University of California, Berkeley from September 2017 to September 2020. His research interests include affective computing, multimedia, and computer vision. He is a Senior Member of IEEE.
Guoli Jia (exped1230@gmail.com) is working toward his master’s degree at the College of Computer Science, Nankai University, Tianjin, 300350, China. His research interests include computer vision and pattern recognition.
Jufeng Yang (yangjufeng@nankai.edu.cn) received his Ph.D. degree from Nankai University, Tianjin, China, in 2009. He is a full professor in the College of Computer Science, Nankai University, Tianjin, 300350, China, and was a visiting scholar with the Vision and Learning Lab, University of California, Merced, USA, from 2015 to 2016. His recent interests include computer vision, machine learning, and multimedia. He is a Member of IEEE.

Guiguang Ding (dinggg@tsinghua.edu.cn) received his Ph.D. degree from Xidian University, China, in 2004. He is a full professor with the School of Software, Tsinghua University, Beijing, 100084, China. Before joining the School of Software in 2006, he was a postdoctoral research fellow in the Department of Automation, Tsinghua University. He was a leading guest editor of Neural Processing Letters and Multimedia Tools and Applications. He served as a special session chair of the 2021 IEEE ICASSP, the 2019 and 2020 IEEE ICME, and the 2017 Pacific Rim Conference on Multimedia, and as a reviewer for more than 20 prestigious international journals and conferences. His research interests include multimedia information retrieval, computer vision, and machine learning. He is a Member of IEEE.
Kurt Keutzer (keutzer@berkeley.edu) received his Ph.D. degree in computer science from Indiana University in 1984. He then joined the research division of AT&T Bell Laboratories. In 1991, he joined Synopsys, Inc., where he ultimately became CTO and senior vice-president of research. In 1998, he became a professor of electrical engineering and computer science at the University of California, Berkeley, Berkeley, California, 94720, USA. He has published six books and more than 250 refereed articles, and he is among the most highly cited authors in hardware and design automation. His research interests include using parallelism to accelerate the training and deployment of deep neural networks for applications in computer vision, speech recognition, multimedia analysis, and computational finance. He is a Life Fellow of IEEE.

References
[1] D. Kahneman, Thinking, Fast and Slow. New York: Macmillan, 2011.
[2] M. Minsky, The Society of Mind. New York: Simon and Schuster, 1986.
[3] D. Schuller and B. W. Schuller, “The age of artificial emotional intelligence,” Computer, vol. 51, no. 9, pp. 38–46, 2018. doi: 10.1109/MC.2018.3620963.
[4] M. Soleymani, D. Garcia, B. Jou, B. Schuller, S.-F. Chang, and M. Pantic, “A survey of multimodal sentiment analysis,” Image Vis. Comput., vol. 65, pp. 3–14, Sept. 2017. doi: 10.1016/j.imavis.2017.08.003.
[5] J. Wagner, E. Andre, F. Lingenfelser, and J. Kim, “Exploring fusion methods for multimodal emotion recognition with missing data,” IEEE Trans. Affective Comput., vol. 2, no. 4, pp. 206–218, 2011. doi: 10.1109/T-AFFC.2011.12.
[6] S. K. D’mello and J. Kory, “A review and meta-analysis of multimodal affect detection systems,” ACM Comput. Surv., vol. 47, no. 3, p. 43, 2015. doi: 10.1145/2682899.
[7] D. Ramachandram and G. W. Taylor, “Deep multimodal learning: A survey on recent advances and trends,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 96–108, 2017. doi: 10.1109/MSP.2017.2738401.
[8] S. K. D’Mello, N. Bosch, and H. Chen, “Multimodal-multisensor affect detection,” in Handbook of Multimodal-Multisensor Interfaces: Signal Processing, Architectures, Detection Emotion Cognition, vol. 2. New York: Association for Computing Machinery and Morgan & Claypool, 2018, pp. 167–202.
[9] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, 2019. doi: 10.1109/TPAMI.2018.2798607.
[10] S. Zhao, X. Yao, J. Yang, G. Jia, G. Ding, T.-S. Chua, B. W. Schuller, and K. Keutzer, “Affective image content analysis: Two decades review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., 2021. doi: 10.1109/TPAMI.2021.3094362.
[11] M. D. Munezero, C. S. Montero, E. Sutinen, and J. Pajunen, “Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text,” IEEE Trans. Affective Comput., vol. 5, no. 2, pp. 101–111, 2014. doi: 10.1109/TAFFC.2014.2317187.
[12] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, 2009. doi: 10.1109/TPAMI.2008.52.
[13] Z. Zhang, N. Cummins, and B. Schuller, “Advanced data exploitation in speech analysis: An overview,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 107–129, 2017. doi: 10.1109/MSP.2017.2699358.
[14] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Commun., vol. 116, pp. 56–76, Jan. 2020. doi: 10.1016/j.specom.2019.12.001.
[15] R. Subramanian, J. Wache, M. K. Abadi, R. L. Vieriu, S. Winkler, and N. Sebe, “Ascertain: Emotion and personality recognition using commercial sensors,” IEEE Trans. Affective Comput., vol. 9, no. 2, pp. 147–160, 2018. doi: 10.1109/TAFFC.2016.2625250.
[16] A. Giachanou and F. Crestani, “Like it or not: A survey of Twitter sentiment analysis methods,” ACM Comput. Surv., vol. 49, no. 2, pp. 1–41, 2016. doi: 10.1145/2938640.
[17] W. Rahman, M. K. Hasan, S. Lee, A. Zadeh, C. Mao, L.-P. Morency, and E. Hoque, “Integrating multimodal information in large pretrained transformers,” in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 2359–2369.
[18] D. Joshi, R. Datta, E. Fedorovskaya, Q.-T. Luong, J. Z. Wang, J. Li, and J. Luo, “Aesthetics and emotions in images,” IEEE Signal Process. Mag., vol. 28, no. 5, pp. 94–115, 2011. doi: 10.1109/MSP.2011.941851.
[19] S. Wang and Q. Ji, “Video affective content analysis: A survey of state-of-the-art methods,” IEEE Trans. Affective Comput., vol. 6, no. 4, pp. 410–430, 2015. doi: 10.1109/TAFFC.2015.2432791.
[20] J. Yang, M. Sun, and S. Xiaoxiao, “Learning visual sentiment distributions via augmented conditional probability neural network,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 224–230.
[21] S. Zhao, H. Yao, Y. Gao, R. Ji, and G. Ding, “Continuous probability distribution prediction of image emotions via multitask shared sparse regression,” IEEE Trans. Multimedia, vol. 19, no. 3, pp. 632–645, 2017. doi: 10.1109/TMM.2016.2617741.
[22] D. She, J. Yang, M.-M. Cheng, Y.-K. Lai, P. L. Rosin, and L. Wang, “WSCNet: Weakly supervised coupled networks for visual sentiment classification and detection,” IEEE Trans. Multimedia, vol. 22, no. 5, pp. 1358–1371, 2020. doi: 10.1109/TMM.2019.2939744.
[23] S. Zhao, X. Yue, S. Zhang, B. Li, H. Zhao, B. Wu, R. Krishna, J. E. Gonzalez et al., “A review of single-source deep unsupervised visual domain adaptation,” IEEE Trans. Neural Netw. Learn. Syst., 2020. doi: 10.1109/TNNLS.2020.3028503.
[24] U. Bhattacharya, T. Mittal, R. Chandra, T. Randhavane, A. Bera, and D. Manocha, “Step: Spatial temporal graph convolutional networks for emotion perception from gaits,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 1342–1350. doi: 10.1609/aaai.v34i02.5490.
[25] Y.-J. Liu, M. Yu, G. Zhao, J. Song, Y. Ge, and Y. Shi, “Real-time movie-induced discrete emotion recognition from EEG signals,” IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 550–562, 2018. doi: 10.1109/TAFFC.2017.2660485.
[26] A. Hu and S. Flaxman, “Multimodal sentiment analysis to explore the structure of emotions,” in Proc. ACM Int. Conf. Knowledge Discovery Data Mining, 2018, pp. 350–358. doi: 10.1145/3219819.3219853.
[27] S. E. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in Proc. ACM Int. Conf. Multimodal Interaction, 2015, pp. 467–474. doi: 10.1145/2818346.2830596.
[28] R. Ji, F. Chen, L. Cao, and Y. Gao, “Cross-modality microblog sentiment prediction via bi-layer multimodal hypergraph learning,” IEEE Trans. Multimedia, vol. 21, no. 4, pp. 1062–1075, 2019. doi: 10.1109/TMM.2018.2867718.
[29] H. Lyu, L. Chen, Y. Wang, and J. Luo, “Sense and sensibility: Characterizing social media users regarding the use of controversial terms for COVID-19,” IEEE Trans. Big Data, to be published. doi: 10.1109/TBDATA.2020.2996401.
[30] R. Wu and C. L. Wang, “The asymmetric impact of other-blame regret versus self-blame regret on negative word of mouth: Empirical evidence from China,” European J. Marketing, vol. 51, no. 11/12, pp. 1799–1816, 2017. doi: 10.1108/EJM-06-2015-0322.
[31] K. Diehl, A. C. Morales, G. J. Fitzsimmons, and D. Simester, “Shopping interdependencies: How emotions affect consumer search and shopping behavior.” [Online]. Available: https://msbfile03.usc.edu/digitalmeasures/kdiehl/intellcont/Shopping%20Interdependencies%20WP-1.pdf
[32] T. A. Dingus, F. Guo, S. Lee, J. F. Antin, M. Perez, M. Buchanan-King, and J. Hankey, “Driver crash risk factors and prevalence evaluation using naturalistic driving data,” Proc. Natl. Acad. Sci. USA, vol. 113, no. 10, pp. 2636–2641, 2016. [Online]. Available: https://www.pnas.org/content/113/10/2636
[33] K. Trezise, A. Bourgeois, and C. Luck, “Emotions in classrooms: The need to understand how emotions affect learning and education,” npj Science of Learning Community, July 13, 2017. [Online]. Available: https://npjscilearncommunity.nature.com/posts/18507
[34] F. M. Marchak, “Detecting false intent using eye blink measures,” Front. Psychol., vol. 4, p. 736, Oct. 2013. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00736/full
[35] H. K. M. Meeren, C. C. R. J. van Heijnsbergen, and B. de Gelder, “Rapid perceptual integration of facial expression and emotional body language,” Proc. Natl. Acad. Sci. USA, vol. 102, no. 45, pp. 16,518–16,523, 2016. [Online]. Available: https://www.pnas.org/content/102/45/16518.short
[36] P. Chakravorty, “What Is a Signal? [Lecture Notes],” IEEE Signal Process. Mag., vol. 35, no. 5, pp. 175–177, Sept. 2018. doi: 10.1109/MSP.2018.2832195.
[37] “A2Zadeh/CMU-MultimodalSDK,” GitHub. https://github.com/A2Zadeh/CMU-MultimodalSDK (accessed Sept. 3, 2021).
[38] “WasifurRahman/BERT_multimodal_transformer,” GitHub. https://github.com/WasifurRahman/BERT_multimodal_transformer
[39] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, “Multi-attention recurrent network for human communication comprehension,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5642–5649.
