CN117708754A - A multimodal emotion recognition method and device - Google Patents
A multimodal emotion recognition method and device
- Publication number
- CN117708754A CN117708754A CN202311702397.4A CN202311702397A CN117708754A CN 117708754 A CN117708754 A CN 117708754A CN 202311702397 A CN202311702397 A CN 202311702397A CN 117708754 A CN117708754 A CN 117708754A
- Authority
- CN
- China
- Prior art keywords
- modal
- neuron
- neurons
- modality
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a multimodal emotion recognition method and device. The method includes: preprocessing raw multimodal data samples to obtain the primary features of each single modality; feeding the primary features of each modality into the corresponding single-modality feature extraction network to extract high-level features; flattening the high-level features and feeding them into the corresponding single-modality recognition networks, and training the single-modality models formed by each feature extraction network and recognition network; obtaining the weights of each single-modality feature extraction network and extracting the activation degree of every neuron in each single-modality recognition network; determining the type of each neuron from its activation degree and deriving distinctive connection constraints between different neurons; and, based on the obtained feature-extraction weights and the derived connection constraints, building and training a complete multimodal emotion recognition model for recognizing multimodal emotions. The invention realizes cross-modal interaction and can effectively recognize multimodal emotions.
Description
Technical Field
The present invention relates to the technical field of multimodal recognition, and in particular to a multimodal emotion recognition method and device.
Background Art
Multimodal recognition can effectively exploit information from multiple modalities. Because this information is mutually complementary, it can alleviate the problems of noisy data and poor generality that single-modality recognition faces. Multimodal approaches have therefore made broad progress and achieved good performance in a variety of recognition tasks, such as brain-inspired multimodal emotion recognition.
For multimodal recognition, how to effectively exploit information from multiple modalities is a key issue, and multimodal fusion is the key means of addressing it. With the development of deep learning, an increasing number of fusion strategies have been used, such as data-level fusion, feature-level fusion, and decision-level fusion. Data-level fusion combines the input data from different modalities at the data level into a single, more informative representation, from which features are extracted and classified. Feature-level fusion builds a combined feature from the features extracted from each modality and then classifies that combined feature with a single classifier. In decision-level fusion, a separate classification system classifies the data from each modality, and the scores of the individual systems are then combined to produce the final recognition result. However, most of these deep learning methods do not account for the interplay between modalities, known as cross-modal interaction. Cross-modal interaction allows the information in each modality to influence and complement the others, and it is essential for effective multimodal fusion. Because traditional deep learning methods ignore cross-modal interaction, their recognition accuracy on multimodal recognition tasks still needs to be improved.
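As a rough illustration of the difference between these fusion levels (not part of the patented method), the following Python sketch contrasts feature-level and decision-level fusion; the feature vectors and classifiers are hypothetical placeholders.

```python
import numpy as np

# Toy sketch: feature-level vs. decision-level fusion with placeholder data.
visual_feat = np.random.rand(128)   # features from some visual model
audio_feat = np.random.rand(64)     # features from some audio model

# Feature-level fusion: concatenate features, then classify once.
fused_feat = np.concatenate([visual_feat, audio_feat])
# logits = classifier(fused_feat)   # a single classifier on the joint feature

# Decision-level fusion: classify each modality separately, then combine scores.
visual_scores = np.random.rand(6)   # per-class scores from a visual classifier
audio_scores = np.random.rand(6)    # per-class scores from an audio classifier
final_scores = 0.5 * visual_scores + 0.5 * audio_scores
predicted_emotion = int(np.argmax(final_scores))
```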
Summary of the Invention
The present invention provides a multimodal emotion recognition method and device to solve the problems existing in the prior art. The technical solutions provided by the present invention are as follows:
In one aspect, a multimodal emotion recognition method is provided, the method comprising:
S1. Preprocess the raw multimodal data samples and obtain the primary features of each single modality included in the raw multimodal data samples;
S2. Input the primary features of each single modality into the corresponding single-modality feature extraction network and extract the high-level features of each single modality;
S3. Flatten the high-level features of each single modality and input them into the corresponding single-modality recognition network, and train the single-modality model formed by each single-modality feature extraction network and single-modality recognition network;
S4. Obtain the weights of each single-modality feature extraction network from the trained single-modality models, and extract the activation degree of all neurons in each single-modality recognition network;
S5. Determine the type of each neuron from its activation degree and, according to the type of each neuron, derive distinctive connection constraints between different neurons;
S6. Based on the obtained weights of each single-modality feature extraction network and the derived connection constraints, build a complete multimodal emotion recognition model and train it to obtain the trained multimodal emotion recognition model, which is used to recognize multimodal emotions.
Optionally, S1 specifically includes:
The raw multimodal data sample is a video clip with sound. Several frames are extracted at equal intervals, and the person's facial contour is extracted from each frame as the primary feature of the visual modality;
The Mel-frequency cepstral coefficient features of the clip's audio are extracted as the primary features of the auditory modality.
Optionally, each single-modality feature extraction network in S2 is a spiking neural network that uses the classic LIF model as its neuron model. Each network consists of convolutional layers and pooling layers and further extracts the high-level features of the visual or auditory modality from the primary features of that modality.
Optionally, each single-modality recognition network in S3 is a spiking neural network that uses the classic LIF model as its neuron model. Each network consists of two fully connected layers and one output layer and is used to recognize the emotion expressed in the data of each single modality;
The neuron model of each single-modality recognition network is augmented with an adaptive activation threshold, which is governed by an ordinary differential equation;
In the first fully connected layer of the visual-modality recognition network, the membrane potential of the i-th neuron is updated as follows:
In the first fully connected layer of the auditory-modality recognition network, the membrane potential of the i-th neuron is updated as follows:
where C denotes the capacitance, g the conductance, V_i(t) the membrane potential of the i-th neuron at time t, S_i(t) the activation flag of this neuron, V_1 the resting potential, V_2 the reset potential, and V_th the activation threshold; a further term denotes the connection weight from the j-th neuron of the input high-level visual or auditory features to the i-th neuron of the first fully connected layer. The dynamic threshold a_i(t) accumulates during the interval between reset and firing: the threshold rises as the firing rate increases and falls as the firing rate decreases. α, β, and γ are hyperparameters.
Optionally, extracting the activation degree of all neurons in each single-modality recognition network in S4 specifically includes:
In each single-modality recognition network, for each sample, the membrane potentials of all spiking neurons in layer l at all time steps are recorded, and this membrane potential matrix is used to represent the activation degree of the neurons, where i denotes the i-th sample and l denotes the l-th layer.
Optionally, determining the type of each neuron from its activation degree in S5 specifically includes:
The spike pattern of a neuron is obtained by averaging its activation degree over the time scale and over the samples; its formula is:
where Mean_t denotes averaging over the time scale and the second operator denotes averaging over the samples;
Multisensory neurons are obtained from the spike pattern and are represented by the set M_l:
where Top denotes selecting, according to the neurons' spike patterns, the neurons with the strongest spike levels; ρ is a hyperparameter indicating that the neurons whose activation degree ranks in the top ρ are selected as the neurons with the strongest firing levels. The multisensory neurons of layer l, denoted M_l, are the intersection of the most strongly activated visual neurons and auditory neurons;
The unisensory neurons are the set difference between all neurons of layer l and the multisensory neurons:
where the former sets denote all neurons of layer l in the visual and auditory recognition networks respectively, and the latter denote the unisensory neurons of layer l.
Optionally, obtaining the distinctive connection constraints between different neurons according to the type of each neuron in S5 specifically includes:
For the connection weights W1 between the high-level features of each single modality and the neurons of the first fully connected layer of each single-modality recognition network, and the connection weights W2 between the neurons of the first and second fully connected layers, connection constraints mask1 and mask2 are established according to the following rules. mask1 and mask2 are matrices filled with 1 or 0, where 1 and 0 respectively indicate whether a connection between neurons exists. The connection weight i→j is determined by the following rules:
a) If neuron i is a unisensory neuron and neuron j is a unisensory neuron of the same modality, the connection weight i→j exists;
b) If neuron i is a unisensory neuron and neuron j is a multisensory neuron, the connection weight i→j exists;
c) If neuron i is a multisensory neuron and neuron j is a unisensory neuron, the connection weight i→j exists;
In all other cases, the connection weight i→j does not exist.
Optionally, the complete multimodal emotion recognition model in S6 includes each single-modality feature extraction network and a multimodal recognition network;
The multimodal recognition network has an architecture similar to each single-modality recognition network, including two fully connected layers and one output layer, except that:
a) Each fully connected layer in a single-modality recognition network has a fixed number of neurons, whereas the number of neurons in a fully connected layer of the multimodal recognition network varies: it is the sum of the numbers of the two kinds of extracted unisensory neurons and the multisensory neurons, computed as:
b) The layers of the multimodal recognition network are sparsely connected using the extracted connection constraints mask1 and mask2. For the connection weights W1 between the high-level features of each single modality and the neurons of the first fully connected layer, and the connection weights W2 between the neurons of the first and second fully connected layers, the Hadamard products of mask1 with W1 and of mask2 with W2 are taken, imposing a regular restriction on the layer-to-layer connections, computed as:
where * denotes the Hadamard product, mask1,i,j denotes the element in row i and column j of the connection constraint matrix mask1, W1,i,j denotes the element in row i and column j of the connection weight matrix W1, mask2,i,j denotes the element in row i and column j of mask2, and W2,i,j denotes the element in row i and column j of W2.
Optionally, training the complete multimodal emotion recognition model in S6 specifically includes:
The raw multimodal data samples undergo single-modality data preprocessing and primary feature extraction; the single-modality primary features are input into the corresponding single-modality feature extraction networks to obtain single-modality high-level features; the high-level features are flattened and concatenated, and the concatenated features are input into the multimodal recognition network. The whole model is trained with the DFA algorithm; during training, the weights of each single-modality feature extraction network are fixed to the weights obtained in S3, and only the weights in the multimodal recognition network are trained.
In another aspect, a multimodal emotion recognition device is provided, the device comprising:
a first acquisition module, configured to preprocess raw multimodal data samples and obtain the primary features of each single modality included in the raw multimodal data samples;
a first extraction module, configured to input the primary features of each single modality into the corresponding single-modality feature extraction network and extract the high-level features of each single modality;
a first training module, configured to flatten the high-level features of each single modality, input them into the corresponding single-modality recognition network, and train the single-modality model formed by each single-modality feature extraction network and single-modality recognition network;
a second acquisition module, configured to obtain the weights of each single-modality feature extraction network from the trained single-modality models and extract the activation degree of all neurons in each single-modality recognition network;
a third acquisition module, configured to determine the type of each neuron from its activation degree and, according to the type of each neuron, derive distinctive connection constraints between different neurons;
a second training module, configured to build a complete multimodal emotion recognition model based on the obtained weights of each single-modality feature extraction network and the derived connection constraints, train it, and obtain the trained multimodal emotion recognition model for recognizing multimodal emotions.
In another aspect, an electronic device is provided. The electronic device includes a processor and a memory; the memory stores instructions that are loaded and executed by the processor to implement the above multimodal emotion recognition method.
In another aspect, a computer-readable storage medium is provided. The storage medium stores instructions that are loaded and executed by a processor to implement the above multimodal emotion recognition method.
Compared with the prior art, the above technical solution has at least the following beneficial effects:
Inspired by the theory of neuronal diversity in the human brain's perception of multimodal information, the present invention realizes cross-modal interaction and achieves higher multimodal emotion recognition accuracy than existing brain-inspired methods.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a multimodal emotion recognition method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the recognition method of a single-modality model, using the auditory modality as an example, provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of determining the type of each neuron in a single-modality recognition network from its activation degree, provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of designing distinctive connection constraints between neurons according to the type of each neuron and applying them to the multimodal recognition network, provided by an embodiment of the present invention;
Fig. 5 is a flow chart of a method for recognizing multimodal emotions to be identified, provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the described embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a multimodal emotion recognition method, which includes:
S1. Preprocess the raw multimodal data samples and obtain the primary features of each single modality included in the raw multimodal data samples;
S2. Input the primary features of each single modality into the corresponding single-modality feature extraction network and extract the high-level features of each single modality;
S3. Flatten the high-level features of each single modality and input them into the corresponding single-modality recognition network, and train the single-modality model formed by each single-modality feature extraction network and single-modality recognition network;
S4. Obtain the weights of each single-modality feature extraction network from the trained single-modality models, and extract the activation degree of all neurons in each single-modality recognition network;
S5. Determine the type of each neuron from its activation degree and, according to the type of each neuron, derive distinctive connection constraints between different neurons;
S6. Based on the obtained weights of each single-modality feature extraction network and the derived connection constraints, build a complete multimodal emotion recognition model and train it to obtain the trained multimodal emotion recognition model, which is used to recognize multimodal emotions.
The human brain can integrate information from multiple modalities such as vision and hearing to achieve accurate and efficient perception and recognition. Neuroscience research shows that this ability stems from multiple kinds of neurons, including unisensory neurons and multisensory neurons, which are characterized by their different responses to different sensory stimuli. Specifically, unisensory neurons respond only to a single sensory stimulus, while multisensory neurons respond to stimuli from multiple senses. Because these neurons have different response characteristics, the human brain can effectively capture cross-modal interactions and thus exhibits superior multimodal recognition ability. By drawing on the neuronal diversity behind the human brain's advantages in multimodal recognition, the embodiments of the present invention propose a multimodal emotion recognition method inspired by neuronal diversity. The method provided by the embodiments is described in detail below with reference to Figs. 2-5 and includes:
S1. Preprocess the raw multimodal data samples and obtain the primary features of each single modality included in the raw multimodal data samples;
Optionally, S1 specifically includes:
The raw multimodal data sample is a video clip with sound. Several frames (for example, 15 frames) are extracted at equal intervals, and the person's facial contour is extracted from each frame (the image may be compressed to a size of 28 × 28) as the primary feature of the visual modality;
In this embodiment, the dimension of each frame is 784, so the final dimension of the initial visual-modality features of each sample is 15 × 784.
The Mel-scale Frequency Cepstral Coefficient (MFCC) features are extracted from the audio of the video clip as the primary features of the auditory modality.
In this embodiment, the average number of frames of all audio signals is 280, and the feature dimension of each frame is 12; therefore, the final dimension of the initial auditory-modality features of each sample is 280 × 12.
The specific methods for extracting the primary features of the visual modality and the auditory modality are existing techniques and are not described further here.
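As a hedged illustration only (not the exact preprocessing used in the patent), the following Python sketch shows one common way to obtain the two kinds of primary features described above; the face-contour step is left as a placeholder because the patent does not name a particular detector, and the OpenCV/librosa usage is an assumption.

```python
import cv2          # assumed available for video/frame handling
import librosa      # assumed available for MFCC extraction
import numpy as np

def extract_primary_features(video_path, audio_path, n_frames=15, n_mfcc=12):
    # Visual modality: sample n_frames frames at equal intervals.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Placeholder: a real pipeline would first crop the facial contour here.
        face = cv2.resize(gray, (28, 28))
        frames.append(face.flatten())          # 784-dim vector per frame
    cap.release()
    visual_primary = np.stack(frames)          # shape ~ (15, 784)

    # Auditory modality: frame-wise MFCC features.
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (12, n_audio_frames)
    auditory_primary = mfcc.T                  # shape ~ (n_audio_frames, 12)
    return visual_primary, auditory_primary
```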
The embodiments of the present invention take the visual modality and the auditory modality as examples to illustrate the method; however, the multimodal data of the embodiments may also include data of other modalities such as text. The embodiments do not restrict the multimodal data, and all such cases fall within the protection scope of the embodiments of the present invention.
S2. Input the primary features of each single modality into the corresponding single-modality feature extraction network and extract the high-level features of each single modality;
Optionally, each single-modality feature extraction network in S2 (the visual-modality extraction network and the auditory-modality extraction network) is a Spiking Neural Network (SNN) that uses the classic LIF model as its neuron model. Each network consists of convolutional layers and pooling layers and further extracts the high-level features of the visual or auditory modality from the primary features of that modality.
S3. Flatten the high-level features of each single modality (a Flatten operation may be used) and input them into the corresponding single-modality recognition network (the visual-modality recognition network and the auditory-modality recognition network), and train the single-modality model formed by each single-modality feature extraction network and single-modality recognition network;
Optionally, each single-modality recognition network in S3 is a spiking neural network that uses the classic LIF model as its neuron model. Each network consists of two fully connected layers and one output layer and is used to recognize the emotion expressed in the data of each single modality;
In each single-modality recognition network of this embodiment, each fully connected layer has a fixed number of neurons (for example, 200), and the number of neurons in the output layer equals the number of emotion labels in the training dataset.
The embodiments of the present invention take neuronal plasticity into account in the single-modality recognition networks: the neuron model of each single-modality recognition network is augmented with an adaptive activation threshold, which is governed by an ordinary differential equation;
In the first fully connected layer of the visual-modality recognition network, the membrane potential of the i-th neuron is updated as follows:
In the first fully connected layer of the auditory-modality recognition network, the membrane potential of the i-th neuron is updated as follows:
where C denotes the capacitance, g the conductance, V_i(t) the membrane potential of the i-th neuron at time t, S_i(t) the activation flag of this neuron, V_1 the resting potential, V_2 the reset potential, and V_th the activation threshold; a further term denotes the connection weight from the j-th neuron of the input high-level visual or auditory features to the i-th neuron of the first fully connected layer. The dynamic threshold a_i(t) accumulates during the interval between reset and firing: the threshold rises as the firing rate increases and falls as the firing rate decreases. α, β, and γ are hyperparameters (with no physical meaning; in this embodiment α = 0.9, β = 0.1, γ = 1);
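Because the update equations themselves are not reproduced in this text, the following Python sketch shows a generic discrete-time LIF neuron with an adaptive threshold of the kind described above; the specific update rule and the way α, β, γ enter it are assumptions for illustration, not the patent's exact formulas.

```python
import numpy as np

def adaptive_lif_step(V, a, S, inp, dt=1.0,
                      C=1.0, g=0.1, V1=0.0, V2=0.0, Vth=1.0,
                      alpha=0.9, beta=0.1, gamma=1.0):
    """One illustrative update step for a layer of adaptive-threshold LIF neurons.

    V   : membrane potentials V_i(t)
    a   : dynamic threshold increments a_i(t)
    S   : spike flags S_i(t) from the previous step
    inp : weighted input current (e.g. W1 @ spikes of the previous layer)
    """
    # Leaky integration toward the resting potential V1 (illustrative dynamics).
    dV = (-g * (V - V1) + inp) * dt / C
    V = np.where(S > 0, V2, V + dV)            # reset to V2 after a spike

    # Adaptive threshold: decays over time and grows when the neuron fires,
    # so a higher firing rate raises the effective threshold.
    a = alpha * a + beta * S
    effective_threshold = Vth + gamma * a

    # Spike when the membrane potential crosses the effective threshold.
    S = (V >= effective_threshold).astype(float)
    return V, a, S
```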
For the single-modality models, supervised training is performed with Direct Feedback Alignment (DFA) using the samples in the training set and their emotion labels (the manually annotated emotion category of each sample), yielding the trained single-modality models.
The recognition method of the single-modality model of this embodiment (using the auditory modality as an example) is shown in Fig. 2.
S4. Obtain the weights of each single-modality feature extraction network from the trained single-modality models, and extract the activation degree of all neurons in each single-modality recognition network;
For the trained single-modality models, the weights of each single-modality feature extraction network used to extract the high-level features of the visual and auditory modalities are saved, denoted En(v) and En(a) respectively.
Optionally, extracting the activation degree of all neurons in each single-modality recognition network in S4 specifically includes:
In each single-modality recognition network, for each sample, the membrane potentials of all spiking neurons in layer l at all time steps are recorded, and this membrane potential matrix is used to represent the activation degree of the neurons, where i denotes the i-th sample and l denotes the l-th layer.
S5. Determine the type of each neuron from its activation degree and, according to the type of each neuron, derive distinctive connection constraints between different neurons;
Inspired by the fact that unisensory and multisensory neurons in the human brain respond differently to different stimuli, the embodiments of the present invention determine the type of each neuron in a single-modality recognition network from its activation degree. Specifically:
The response characteristics of neuronal diversity in the human brain are as follows:
In the human brain, multisensory information can be processed and recognized effectively, which depends on different neurons exhibiting different responses to sensory input. In the brain regions responsible for integrating information from multiple modalities, a variety of neurons, including multisensory neurons and unisensory neurons, are ubiquitous. Multisensory neurons are defined as neurons that respond to stimuli from more than one sense, whereas unisensory neurons respond to stimulation from only one sense; unisensory neurons can further be divided into visual and auditory unisensory neurons. When the external stimuli received by a multisensory neuron share a common source, its response clearly exceeds its response to any single sensory stimulus, whereas a unisensory neuron's response to multisensory stimuli does not differ significantly from its response to a single sensory stimulus.
Optionally, as shown in Fig. 3, determining the type of each neuron from its activation degree in S5 specifically includes:
The spike pattern of a neuron is obtained by averaging its activation degree over the time scale and over the samples; its formula is:
where Mean_t denotes averaging over the time scale and the second operator denotes averaging over the samples;
Multisensory neurons are obtained from the spike pattern and are represented by the set M_l:
where Top denotes selecting, according to the neurons' spike patterns, the neurons with the strongest spike levels; ρ is a hyperparameter (ranging from 0 to 1) indicating that the neurons whose activation degree ranks in the top ρ are selected as the neurons with the strongest firing levels (for example, if ρ is 0.5, the top 50% of neurons are selected). The multisensory neurons of layer l, denoted M_l, are the intersection of the most strongly activated visual neurons and auditory neurons;
The unisensory neurons are the set difference between all neurons of layer l and the multisensory neurons:
where the former sets denote all neurons of layer l in the visual and auditory recognition networks respectively, and the latter denote the unisensory neurons of layer l.
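A hedged sketch of this neuron-typing step is shown below, assuming the recorded membrane potentials are stored as arrays of shape (samples, time, neurons); the variable names are illustrative, not taken from the patent.

```python
import numpy as np

def identify_neuron_types(visual_potentials, auditory_potentials, rho=0.5):
    """Split the neurons of one layer into multisensory and unisensory index sets.

    visual_potentials / auditory_potentials: membrane potentials recorded in the
    same layer of the visual and auditory recognition networks,
    shape (n_samples, n_timesteps, n_neurons).
    """
    n_neurons = visual_potentials.shape[-1]
    k = max(1, int(rho * n_neurons))

    # Spike pattern: average the activation over time steps and over samples.
    visual_pattern = visual_potentials.mean(axis=(0, 1))
    auditory_pattern = auditory_potentials.mean(axis=(0, 1))

    # Top-rho most strongly activated neuron indices for each modality.
    top_visual = set(np.argsort(visual_pattern)[-k:].tolist())
    top_auditory = set(np.argsort(auditory_pattern)[-k:].tolist())

    # Multisensory neurons: intersection of the two strongest-activation sets.
    multisensory = top_visual & top_auditory

    # Unisensory neurons: all neurons of the layer minus the multisensory ones.
    # (The same index set is computed per network; the indices refer to neurons
    # of the visual and auditory networks respectively.)
    all_indices = set(range(n_neurons))
    unisensory_visual = all_indices - multisensory
    unisensory_auditory = all_indices - multisensory
    return multisensory, unisensory_visual, unisensory_auditory
```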
Optionally, obtaining the distinctive connection constraints between different neurons according to the type of each neuron in S5 (inspired by the response characteristics of unisensory and multisensory neurons in the human brain) specifically includes:
For the connection weights W1 between the high-level features of each single modality and the neurons of the first fully connected layer of each single-modality recognition network, and the connection weights W2 between the neurons of the first and second fully connected layers, connection constraints mask1 and mask2 are established according to the following rules. mask1 and mask2 are matrices filled with 1 or 0, where 1 and 0 respectively indicate whether a connection between neurons exists. The connection weight i→j is determined by the following rules:
a) If neuron i is a unisensory neuron (of the visual or auditory modality) and neuron j is a unisensory neuron of the same modality, the connection weight i→j exists;
b) If neuron i is a unisensory neuron (of the visual or auditory modality) and neuron j is a multisensory neuron, the connection weight i→j exists;
c) If neuron i is a multisensory neuron and neuron j is a unisensory neuron (of the visual or auditory modality), the connection weight i→j exists;
In all other cases, the connection weight i→j does not exist.
Through these constraints, cross-modal interaction is achieved. For example, an auditory feature connects to a multisensory neuron M1 in the first fully connected layer and then to a visual unisensory neuron in the second fully connected layer, realizing one-way transmission from hearing to vision. The same holds for one-way transmission from vision to hearing; cross-modal interaction is thereby achieved.
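The following sketch, offered only as an illustration under the assumption that each neuron carries a modality-type label, builds such a 0/1 connection constraint matrix from rules a)-c) above.

```python
import numpy as np

# Illustrative neuron-type labels: 'v' = visual unisensory,
# 'a' = auditory unisensory, 'm' = multisensory.
def build_connection_mask(pre_types, post_types):
    """Return a 0/1 matrix where mask[i, j] == 1 iff the connection i -> j is allowed."""
    mask = np.zeros((len(pre_types), len(post_types)), dtype=np.float32)
    for i, ti in enumerate(pre_types):
        for j, tj in enumerate(post_types):
            same_modality_unisensory = ti in ('v', 'a') and tj == ti    # rule a)
            uni_to_multi = ti in ('v', 'a') and tj == 'm'               # rule b)
            multi_to_uni = ti == 'm' and tj in ('v', 'a')               # rule c)
            if same_modality_unisensory or uni_to_multi or multi_to_uni:
                mask[i, j] = 1.0
    return mask

# Example: a tiny layer with two visual, two auditory and one multisensory neuron.
mask1 = build_connection_mask(['v', 'v', 'a', 'a', 'm'], ['v', 'a', 'm'])
```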
S6. Based on the obtained weights of each single-modality feature extraction network and the derived connection constraints, build a complete multimodal emotion recognition model and train it to obtain the trained multimodal emotion recognition model, which is used to recognize multimodal emotions.
Optionally, the complete multimodal emotion recognition model in S6 includes each single-modality feature extraction network and a multimodal recognition network;
The multimodal recognition network has an architecture similar to each single-modality recognition network, including two fully connected layers and one output layer, except that:
a) Each fully connected layer in a single-modality recognition network has a fixed number of neurons (for example, 200), whereas the number of neurons in a fully connected layer of the multimodal recognition network varies: it is the sum of the numbers of the two kinds of extracted unisensory neurons and the multisensory neurons, computed as:
b) The layers of the multimodal recognition network are sparsely connected using the extracted connection constraints mask1 and mask2. For the connection weights W1 between the high-level features of each single modality and the neurons of the first fully connected layer, and the connection weights W2 between the neurons of the first and second fully connected layers, the Hadamard products of mask1 with W1 and of mask2 with W2 are taken, imposing a regular restriction on the layer-to-layer connections, computed as:
where * denotes the Hadamard product, mask1,i,j denotes the element in row i and column j of the connection constraint matrix mask1, W1,i,j denotes the element in row i and column j of the connection weight matrix W1, mask2,i,j denotes the element in row i and column j of mask2, and W2,i,j denotes the element in row i and column j of W2.
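A minimal sketch of applying these masks is shown below, assuming dense weight matrices W1 and W2 of matching shapes; this is generic element-wise masking, with the spiking dynamics omitted for brevity, and not the patent's training code.

```python
import numpy as np

def masked_forward(features, W1, W2, mask1, mask2):
    """Sparse forward pass through the two fully connected layers, where the
    allowed connections are selected by the Hadamard (element-wise) product mask * W."""
    h1 = features @ (mask1 * W1)   # only connections with mask1[i, j] == 1 contribute
    h2 = h1 @ (mask2 * W2)
    return h2
```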
According to the type of each neuron, the embodiments of the present invention design distinctive connection constraints between neurons and apply them to the multimodal recognition network, as shown in Fig. 4.
Optionally, training the complete multimodal emotion recognition model in S6 specifically includes:
The raw multimodal data samples undergo single-modality data preprocessing and primary feature extraction; the single-modality primary features are input into the corresponding single-modality feature extraction networks to obtain single-modality high-level features; the high-level features are flattened and concatenated, and the concatenated features are input into the multimodal recognition network. The whole model is trained with the DFA algorithm; during training, the weights of each single-modality feature extraction network are fixed to the weights obtained in S3, and only the weights in the multimodal recognition network are trained.
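Sketched below, under the assumption of a PyTorch-style interface, is the overall shape of this training stage: the single-modality feature extractors are frozen and only the multimodal recognition network is updated. The gradient-based update shown here stands in for the DFA training rule used in the patent, which is not reproduced.

```python
import torch

def train_multimodal(visual_encoder, auditory_encoder, multimodal_net,
                     loader, epochs=10, lr=1e-3):
    # Freeze the single-modality feature extraction networks (weights from S3).
    for enc in (visual_encoder, auditory_encoder):
        for p in enc.parameters():
            p.requires_grad = False

    optimizer = torch.optim.Adam(multimodal_net.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for visual_x, audio_x, label in loader:
            with torch.no_grad():
                v_feat = visual_encoder(visual_x).flatten(start_dim=1)
                a_feat = auditory_encoder(audio_x).flatten(start_dim=1)
            fused = torch.cat([v_feat, a_feat], dim=1)   # concatenate flattened features
            logits = multimodal_net(fused)               # masked, sparsely connected net
            loss = loss_fn(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```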
As shown in Fig. 5, when the method of the embodiments of the present invention is used to recognize multimodal emotions to be identified, it includes:
preprocessing the raw multimodal data to be identified to obtain the primary features of each single modality (visual modality and auditory modality) included in the raw multimodal data to be identified;
inputting the primary features of each single modality into the corresponding single-modality feature extraction network to extract the high-level features of each single modality;
flattening and concatenating the high-level features of each single modality and inputting them into the multimodal recognition network to recognize the multimodal emotion expressed by the raw multimodal data to be identified.
An embodiment of the present invention also provides a multimodal emotion recognition device, which includes:
a first acquisition module, configured to preprocess raw multimodal data samples and obtain the primary features of each single modality included in the raw multimodal data samples;
a first extraction module, configured to input the primary features of each single modality into the corresponding single-modality feature extraction network and extract the high-level features of each single modality;
a first training module, configured to flatten the high-level features of each single modality, input them into the corresponding single-modality recognition network, and train the single-modality model formed by each single-modality feature extraction network and single-modality recognition network;
a second acquisition module, configured to obtain the weights of each single-modality feature extraction network from the trained single-modality models and extract the activation degree of all neurons in each single-modality recognition network;
a third acquisition module, configured to determine the type of each neuron from its activation degree and, according to the type of each neuron, derive distinctive connection constraints between different neurons;
a second training module, configured to build a complete multimodal emotion recognition model based on the obtained weights of each single-modality feature extraction network and the derived connection constraints, train it, and obtain the trained multimodal emotion recognition model for recognizing multimodal emotions.
The functional structure of the multimodal emotion recognition device provided by the embodiment of the present invention corresponds to the multimodal emotion recognition method provided by the embodiment of the present invention, and is not described again here.
Fig. 6 is a schematic structural diagram of an electronic device 600 provided by an embodiment of the present invention. The electronic device 600 may vary greatly depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 601 and one or more memories 602, where the memory 602 stores instructions that are loaded and executed by the processor 601 to implement the steps of the above multimodal emotion recognition method.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions that can be executed by a processor in a terminal to complete the above multimodal emotion recognition method. For example, the computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by instructing the relevant hardware through a program. The program may be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311702397.4A CN117708754B (en) | 2023-12-12 | 2023-12-12 | A multimodal emotion recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311702397.4A CN117708754B (en) | 2023-12-12 | 2023-12-12 | A multimodal emotion recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117708754A true CN117708754A (en) | 2024-03-15 |
CN117708754B CN117708754B (en) | 2025-01-28 |
Family
ID=90158224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311702397.4A Active CN117708754B (en) | 2023-12-12 | 2023-12-12 | A multimodal emotion recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117708754B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200057965A1 (en) * | 2018-08-20 | 2020-02-20 | Newton Howard | System and method for automated detection of situational awareness |
US20200134425A1 (en) * | 2018-10-31 | 2020-04-30 | Sony Interactive Entertainment Inc. | Systems and methods for domain adaptation in neural networks using cross-domain batch normalization |
CN114287937A (en) * | 2021-11-24 | 2022-04-08 | 杭州电子科技大学 | Emotion recognition method based on multi-mode convolutional neural network |
CN114386515A (en) * | 2022-01-13 | 2022-04-22 | 合肥工业大学 | Single-mode label generation and multi-mode emotion distinguishing method based on Transformer algorithm |
CN114947852A (en) * | 2022-06-14 | 2022-08-30 | 华南师范大学 | Multi-mode emotion recognition method, device, equipment and storage medium |
CN115482582A (en) * | 2022-09-08 | 2022-12-16 | 中国人民解放军国防科技大学 | Pulse neural network multi-mode lip reading method and system based on attention mechanism |
CN115358375A (en) * | 2022-10-19 | 2022-11-18 | 之江实验室 | A method and device for constructing a computing model of a spiking neural network reserve pool |
Non-Patent Citations (3)
Title |
---|
QIXIN WANG et al.: "ND-MRM: Neuronal Diversity Inspired Multisensory Recognition Model", 《PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》, vol. 38, no. 14, 24 March 2024 (2024-03-24), pages 1 - 9 *
周渤玉: "基于音频和视频的多模态情绪识别", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 2023, 15 February 2023 (2023-02-15), pages 136 - 356 * |
殷梦馨 等: "基于多模态情绪识别的研究进展", 《生物医学工程研究》, vol. 42, no. 03, 15 September 2023 (2023-09-15), pages 285 - 291 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119046668A (en) * | 2024-10-30 | 2024-11-29 | 珠海凌烟阁芯片科技有限公司 | Emotion recognition method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN117708754B (en) | 2025-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106803069B (en) | Crowd happiness degree identification method based on deep learning | |
CN106250855B (en) | Multi-core learning based multi-modal emotion recognition method | |
CN112784798A (en) | Multi-modal emotion recognition method based on feature-time attention mechanism | |
CN112818861B (en) | A sentiment classification method and system based on multimodal contextual semantic features | |
CN109685819B (en) | A 3D Medical Image Segmentation Method Based on Feature Enhancement | |
CN105069400B (en) | Facial image gender identifying system based on the sparse own coding of stack | |
CN106951473B (en) | The construction method of deep vision question answering system towards dysopia personage | |
CN114287878A (en) | Diabetic retinopathy focus image identification method based on attention model | |
CN111128242B (en) | Multi-mode emotion information fusion and identification method based on double-depth network | |
CN110390363A (en) | An image description method | |
CN108491858A (en) | Method for detecting fatigue driving based on convolutional neural networks and system | |
CN112784763A (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
WO2017166137A1 (en) | Method for multi-task deep learning-based aesthetic quality assessment on natural image | |
CN107220635A (en) | Human face in-vivo detection method based on many fraud modes | |
CN112766355B (en) | A method for EEG emotion recognition under label noise | |
CN114842343B (en) | ViT-based aerial image recognition method | |
CN104700089A (en) | Face identification method based on Gabor wavelet and SB2DLPP | |
CN112766376A (en) | Multi-label eye fundus image identification method based on GACNN | |
CN110472693B (en) | Image processing and classifying method and system | |
CN115424108B (en) | Cognitive dysfunction evaluation method based on audio-visual fusion perception | |
CN114550057A (en) | A Video Emotion Recognition Method Based on Multimodal Representation Learning | |
CN117708754A (en) | A multimodal emotion recognition method and device | |
CN116508076A (en) | Character characteristic normalization using an automatic encoder | |
CN103258186A (en) | Integrated face recognition method based on image segmentation | |
CN111144296A (en) | Retina fundus picture classification method based on improved CNN model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |