This research was supported by a National Natural Science Foundation of China project (No. 61906051), a Guangxi Natural Science Foundation project (No. 2018GXNSFBA050029), and a doctoral scientific research start-up fund of Guilin University (GUTQDJJ2005015).
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the defects of the prior art, the invention provides an emotion recognition method based on FECNN-LSTM with multimodal fusion of eye movement and PPG.
The technical scheme of the invention comprises the following steps:
Step 1: Build a self-collected eye movement and PPG multimodal database, acquiring eye movement and PPG data with learning videos as the stimulus material.
Step 2: Label the collected eye movement data and physiological signal data using a discrete emotion labeling model, dividing the emotion labels into four emotional states: interest, happiness, confusion, and boredom. Preprocess the acquired eye movement data; the preprocessing comprises high-quality data screening, data cleaning, and data denoising. Because the raw PPG data are corrupted during acquisition by electromagnetic interference, illumination changes, motion artifacts, and the like, and the effective passband of the PPG signal lies between 0.8 Hz and 10 Hz, the high-pass filter cutoff is set to 1 Hz to remove low-frequency drift and the low-pass filter cutoff is set to 10 Hz to remove noise above 10 Hz.
Step 3: Divide the denoised data into a training set and a validation set at a ratio of 8:2.
Step 4: Convert the eye movement and PPG data of the preprocessed data set into UTF-8 text and construct data sets with different time window lengths, namely 5-second, 10-second, and 15-second windows.
Step 5: Calculate the eye movement time-domain features and the PPG time-frequency-domain features within the 5 s time window.
Step 6: a total of 72 eye movements and PPG features most relevant to the emotional state were selected using principal component analysis.
Step 7: Perform feature-layer fusion on the 72 features selected in Step 6 to generate the shallow features; after normalization, design a convolutional neural network (FECNN) and use it to extract deep features. Select the 57-dimensional subset of deep features most relevant to emotional state using principal component analysis. Fuse the deep features extracted by the FECNN with the shallow features at the feature layer to obtain a 129-dimensional feature vector as the input of the emotion classifier.
Step 8: Design an LSTM network model to classify emotions from the shallow and deep features. After repeated trials, train the data in the training set in batches over multiple epochs to adjust the network parameters until the maximum number of iterations is reached or the early stopping condition is met, and select the optimal LSTM network structure and parameters. Use the 129-dimensional feature vector obtained in Step 7 as the input to train the LSTM model, evaluate the model with the test set data, and output one of the four emotional states of interest, happiness, confusion, and boredom; finally, evaluate model performance using accuracy and the loss value.
Step 9: Run the LSTM network model trained in Step 8 on the test set to obtain the final classification accuracy.
Step 10: the effectiveness of the machine learning model was measured using accuracy (Precision), Recall (Recall) and F1 score (F1-score). Several basic concepts need to be defined, N
TP: the classifier judges the positive samples as the number of the positive samples, N
FP: the classifier judges the negative samples as the number of positive samples, N
TN: the classifier judges the negative samples as the number of the negative samples, N
FN: the classifier judges the positive samples as the number of the negative samples. The accuracy is defined as the proportion of the number of correctly classified samples in the positive samples to the number of all the classified positive samples, and the formula is as follows:
the recall ratio is defined as the proportion of the number of correctly classified samples in the positive samples to the number of all actually classified positive samples, and the recall ratio measures the capacity of correctly classifying the positive samples by classification, and the formula is as follows:
the F1 score is defined as twice the accuracy and recall ratio and the mean value, the F1 score comprehensively considers the accuracy and recall ratio capability of the classifier, and the formula is as follows:
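For illustration only, the sketch below computes these three metrics with scikit-learn for the four emotion classes; macro averaging over classes is an assumption, since the description above does not state how per-class scores are aggregated.

```python
# Minimal sketch: Precision, Recall and F1 for the four emotion classes.
# Macro averaging across classes is an assumption, not taken from the text.
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_emotion_classifier(y_true, y_pred):
    """y_true, y_pred: lists of labels in {interest, happiness, confusion, boredom}."""
    prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
    rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    return prec, rec, f1

# Example usage with illustrative labels
prec, rec, f1 = evaluate_emotion_classifier(
    ["interest", "happiness", "confusion", "boredom", "interest"],
    ["interest", "confusion", "confusion", "boredom", "interest"],
)
print(f"Precision={prec:.3f}, Recall={rec:.3f}, F1={f1:.3f}")
```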
Detailed Description
The invention will be further described with reference to examples and figures, but the embodiments of the invention are not limited thereto.
As shown in fig. 3, the present embodiment provides an emotion recognition method based on FECNN-LSTM with multimodal fusion of eye movement and PPG, comprising the following steps:
1. Clean the data, divide it into data sets with different time windows, and denoise the data to obtain the processed eye movement and PPG signals.
2. Calculate the eye-movement-related features, comprising features of the gaze, saccade, blink, and pupil categories, and the PPG-related features, comprising HR, HRV, and RPeaks.
3. Perform principal component analysis to select the eye movement and PPG (photoplethysmography) features highly correlated with emotional state, fuse them at the feature layer to form the shallow features, and normalize the shallow features.
4. Perform feature learning on the shallow features with the FECNN to extract deep features, select the deep features highly correlated with emotional state using principal component analysis, and normalize them.
5. Fuse the shallow and deep features at the feature layer as the input of the emotion classifier. As a comparison experiment, classify emotional state from the eye movement and PPG single-modality shallow features and from the fused eye movement and PPG features using four machine learning algorithms: support vector machine, random forest, K-nearest neighbours, and multilayer perceptron. Evaluate the resulting models with different evaluation indexes.
6. Design a long short-term memory network to classify emotions from the deep and shallow features, and evaluate the model using accuracy and the loss value.
More specifically, a multimodal emotion recognition data set is constructed, using learning videos on five different subjects as the stimulus material. The whole multimodal data acquisition procedure is shown in fig. 1 and is described in the following steps:
and S1, before the experiment is carried out, the physiological signal acquisition equipment is worn on the tested object, and then the eye calibration is carried out on the tested object so as to check whether the tested object is qualified.
S2: Before formally entering the experiment, the subject must watch a fixation point, namely a crosshair appearing in the center of the screen, for 60 s; the baseline values of the eye movement and PPG data are obtained from this fixation period.
S3: During the experiment, four 2-min video segments are played first, followed by one 10-min video segment; the four 2-min segments are played in random order. Before each segment is played, the subject completes a knowledge questionnaire test, whose content is related to the experimental material, to measure the subject's prior knowledge. The subject then watches the segment on the computer screen; after playback finishes, the subject marks by key press the emotions experienced while watching and completes a post-test. After the post-test, the knowledge questionnaire test, viewing, and post-test of the next video segment follow.
S4: The last video played is a 10-min video intended to induce distraction. A reminder pops up during viewing; if the subject was distracted during the period before the reminder appeared, the distraction can be marked by pressing a key.
S5: After the whole experiment is finished, the experimenter explains the labeling model to the subject verbally and ensures that the subject fully understands it. The subject then reviews the recordings, including the video segments, the synchronously recorded video of the subject watching them, and the subject's eye movement trajectory while watching; the recording is divided into events according to the emotional states the subject recalls, and the subject labels the five videos by selecting an emotional state from the classified emotion words and an intensity from the different arousal levels in the emotion model. The emotional states in the data acquisition experiment are thus obtained by a "cued review" method combined with the subject's subjective report: after a video segment has been watched, it is played back together with the synchronously recorded facial expression video and eye movement trajectory to stimulate recall of the emotional state at the time, the synchronized video is divided into event segments, and the subject selects his or her emotional state and intensity from the emotion words and arousal levels of the emotion classification model. The emotional states include happiness, interest, boredom, confusion, distraction, and others. The A (arousal) dimension of the PAD dimensional model is used for retrospective labeling of the arousal intensity of the subject in a given emotional state, with values from 1 to 5 representing increasing intensity, where 1 is the lowest and 5 the highest.
For preprocessing of the eye movement signals, outliers in the subjects' eye movement data recorded during the experiment are removed and noise generated during acquisition is eliminated. For preprocessing of the PPG signal, noise produced during acquisition by electromagnetic interference, illumination changes, motion artifacts, and similar disturbances is removed.
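For illustration only, the sketch below applies the 1 Hz high-pass / 10 Hz low-pass denoising described above as a single band-pass; the Butterworth filter type, the 4th order, and the 100 Hz sampling rate are assumptions not specified in the description.

```python
# Minimal sketch of PPG denoising: high-pass at 1 Hz to remove baseline drift,
# low-pass at 10 Hz to remove high-frequency noise (implemented as a band-pass).
# Filter type (Butterworth), order and sampling rate are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def denoise_ppg(ppg: np.ndarray, fs: float = 100.0) -> np.ndarray:
    """Apply a 1-10 Hz band-pass to a raw PPG signal sampled at fs Hz."""
    b, a = butter(N=4, Wn=[1.0, 10.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, ppg)  # zero-phase filtering avoids waveform shift

# Example with synthetic data: 30 s of fake PPG at 100 Hz
raw = np.random.randn(3000)
clean = denoise_ppg(raw, fs=100.0)
```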
Selecting appropriate and effective eye movement and PPG indexes according to the research purpose is important; otherwise valuable data information is lost during the research. In emotion recognition research, the indexes of a single modality have certain limitations, while signals from multiple modalities are correlated and complementary, so indexes from the two modalities of eye movement and PPG are selected for analysis. The eye movement indexes selected in this experiment fall into four main categories: gaze, saccade, blink, and pupil diameter; the selected PPG indexes fall into three categories: HR, HRV, and RPeaks.
In step 2, after the eye movement and PPG data are preprocessed, the statistical features of the eye movement data and the time-frequency-domain features of the PPG are calculated. The PPG frequency-domain features are calculated by formulas (4)-(9).
The pulse sequence signal in each time window is sampled at equal intervals, and N points are selected to form a discrete sequence x(n). The discrete Fourier transform is then applied to obtain the frequency-domain sequence X(k), where k is the discrete frequency variable, W_N is the forward transform kernel, and j is the imaginary unit. The calculation formula is as follows:
X(k) = Σ_(n=0)^(N-1) x(n)·W_N^(nk),  k = 0, 1, …, N-1
with the transform kernel
W_N^(nk) = exp(-j2πnk/N) = cos(2πnk/N) - j·sin(2πnk/N) (6)
X(k) is a complex number,
X(k) = R(k) + jI(k) (7)
where R(k) is the real part and I(k) is the imaginary part. The phase of each point of the frequency-domain sequence is then
φ(k) = arctan(I(k)/R(k)) (8)
and the amplitude spectrum is
|X(k)| = sqrt(R(k)^2 + I(k)^2) (9)
Because the discrete Fourier transform is computationally expensive, the acquired data are processed with the fast Fourier transform, the amplitude and phase are expressed as functions of frequency, and the corresponding frequency components HF, LF, VLF, LF/HF, and total power are extracted from the power spectral density as the frequency-domain features of HRV.
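The sketch below illustrates one way to obtain these frequency-domain features from an HRV (RR-interval) sequence: resample to an evenly spaced series, estimate the power spectral density with an FFT-based method, and integrate the frequency bands. The band limits and the 4 Hz resampling rate are conventional assumptions, not values taken from the description above.

```python
# Sketch: HRV frequency-domain features (VLF, LF, HF, LF/HF, total power)
# from RR intervals. Band limits and 4 Hz resampling are common conventions
# assumed here; they are not listed explicitly in the text.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def hrv_frequency_features(rr_s, fs_interp=4.0):
    """rr_s: RR intervals in seconds, one value per heartbeat."""
    rr_s = np.asarray(rr_s, dtype=float)
    t = np.cumsum(rr_s)                                  # beat times
    t_even = np.arange(t[0], t[-1], 1.0 / fs_interp)     # evenly spaced time grid
    rr_even = interp1d(t, rr_s, kind="cubic")(t_even)
    freqs, psd = welch(rr_even - rr_even.mean(), fs=fs_interp,
                       nperseg=min(256, len(rr_even)))

    def band_power(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return np.trapz(psd[mask], freqs[mask])

    vlf = band_power(0.003, 0.04)
    lf = band_power(0.04, 0.15)
    hf = band_power(0.15, 0.40)
    total = vlf + lf + hf
    return {"VLF": vlf, "LF": lf, "HF": hf,
            "LF/HF": lf / hf if hf > 0 else np.nan, "TP": total}
```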
Let the HRV sequence be R = [R_1, R_2, …, R_N], where R_i denotes the HRV value at time i and N is the sequence length. The HRV time-domain features are calculated by formulas (10)-(13). The root mean square of successive RR-interval differences, RMSSD, is calculated by formula (10), where RR_i = R_(i+1) - R_i:
RMSSD = sqrt( (1/(N-1)) · Σ_(i=1)^(N-1) (R_(i+1) - R_i)^2 ) (10)
The standard deviation SDNN is given by
SDNN = sqrt( (1/(N-1)) · Σ_(i=1)^N (R_i - R̄)^2 ) (11)
where
R̄ = (1/N) · Σ_(i=1)^N R_i (12)
PNN50, the percentage of successive peak-interval differences greater than 50 ms, is
PNN50 = (NN50 / (N-1)) × 100% (13)
where NN50 is the number of successive differences |R_(i+1) - R_i| exceeding 50 ms.
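A minimal sketch of the time-domain features in formulas (10)-(13) is given below; treating the RR intervals as being in milliseconds is an assumption.

```python
# Sketch: HRV time-domain features RMSSD, SDNN and PNN50 from an RR-interval
# sequence (assumed to be in milliseconds), following formulas (10)-(13).
import numpy as np

def hrv_time_features(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)                                        # RR_i = R_(i+1) - R_i
    rmssd = np.sqrt(np.mean(diff ** 2))                       # formula (10)
    sdnn = np.std(rr, ddof=1)                                 # formula (11), sample std
    pnn50 = 100.0 * np.sum(np.abs(diff) > 50.0) / len(diff)   # formula (13)
    return {"RMSSD": rmssd, "SDNN": sdnn, "PNN50": pnn50}
```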
and screening out eye movement and PPG characteristics which are obviously related to the emotional state by adopting a Principal Component Analysis (PCA) method according to the characteristics obtained by each mode. The experimental analysis finally picked 32 eye movement features and 40 PPG features.
In step S3, the eye movement statistical features and the PPG time-frequency-domain features within the synchronized time window are fused at the feature layer to form the shallow features, yielding a 72-dimensional combined feature vector. Because of individual differences, the physiological baseline values of different people differ, so the individual baseline must be removed: each eye movement and PPG emotional feature is normalized with the corresponding feature value recorded in the calm state, and the feature values are mapped into the [0, 1] interval by min-max normalization, given by formula (14):
X* = (X - X_min) / (X_max - X_min) (14)
where X* is the normalized value, X is the sample value, X_min is the minimum value in the sample, and X_max is the maximum value in the sample.
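A sketch of the baseline removal and min-max normalization of formula (14) follows; subtracting the calm-state baseline before scaling is an assumption about how the calm-state value is combined with each feature.

```python
# Sketch: remove the individual calm-state baseline from each feature and
# map the result into [0, 1] with min-max normalization (formula (14)).
# Subtracting the baseline before scaling is an assumption.
import numpy as np

def normalize_features(X, baseline):
    """X: (samples, features); baseline: (features,) calm-state values per subject."""
    Xb = X - baseline                                        # remove individual baseline
    x_min, x_max = Xb.min(axis=0), Xb.max(axis=0)
    rng = np.where(x_max - x_min == 0, 1.0, x_max - x_min)   # guard constant columns
    return (Xb - x_min) / rng                                # X* = (X - Xmin)/(Xmax - Xmin)
```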
In step S4, the FECNN network structure is designed as follows. The feature extraction part of the FECNN consists of convolutional layers and pooling layers: the convolutional layers extract deep information from the input data, and the pooling layers down-sample the resulting feature maps to reduce overfitting. The input of the FECNN is a 72×1 vector, and the network has six convolutional layers, Conv1 through Conv6. Each convolutional block contains a one-dimensional 3×1 convolution kernel, a max-pooling layer with a 2×1 filter, and a regularizing Dropout layer. The Dropout layer deactivates a portion of the neurons with probability 0.5 to prevent overfitting of the model; a deactivated neuron does not take part in error back-propagation, but its weights are preserved, so the network adopts a different structure each time a sample is input. The stride of each convolutional layer is set to 1, and ReLU is used as the activation function. Conv6 is followed by a Flatten layer, and a Dense layer then compresses the flattened output into 64×1-dimensional deep features. The deep features related to emotional state are selected using the Pearson correlation coefficient, giving 57 dimensions in total.
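For illustration only, the sketch below assembles an FECNN-like feature extractor matching the structure described above in Keras; the number of filters in each convolutional layer is not specified above and is chosen purely for illustration, and the use of Keras itself is an assumption.

```python
# Sketch of an FECNN-like extractor: six blocks of Conv1D (kernel 3, stride 1,
# ReLU) + MaxPooling1D(2) + Dropout(0.5), then Flatten and a Dense layer
# producing a 64-dimensional deep feature vector. Filter counts per layer
# are illustrative assumptions.
from tensorflow.keras import layers, models

def build_fecnn(input_dim=72, filters=(16, 16, 32, 32, 64, 64)):
    inputs = layers.Input(shape=(input_dim, 1))              # 72 x 1 shallow feature vector
    x = inputs
    for f in filters:                                        # Conv1 ... Conv6
        x = layers.Conv1D(f, kernel_size=3, strides=1,
                          padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    deep = layers.Dense(64, activation="relu", name="deep_features")(x)
    return models.Model(inputs, deep)

fecnn = build_fecnn()
fecnn.summary()
```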
In step S5, the deep features extracted in step S4 and the shallow features are fused at the feature layer, and a 129-dimensional feature vector is output as the input of the emotion classifier. Four machine learning algorithms are used to classify emotions from the eye movement and PPG single-modality data, with grid search used for parameter optimization; the finally obtained parameters are shown in the following table:
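A sketch of the grid-search optimization for the four comparison classifiers is shown below; the parameter grids are illustrative only and do not reproduce the parameter table referred to above.

```python
# Sketch: grid-search parameter optimization for the four comparison
# classifiers (SVM, random forest, K-nearest neighbours, multilayer
# perceptron). The parameter grids are illustrative assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

search_spaces = {
    "svm": (SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "mlp": (MLPClassifier(max_iter=1000), {"hidden_layer_sizes": [(64,), (64, 32)]}),
}

def tune_classifiers(X_train, y_train):
    best = {}
    for name, (estimator, grid) in search_spaces.items():
        gs = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        gs.fit(X_train, y_train)
        best[name] = (gs.best_estimator_, gs.best_params_)
    return best
```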
The four machine learning algorithms were evaluated using precision, recall, and the F1 score, with the following results:
In step S6, the designed long short-term memory network (LSTM) structure is described as follows.
The LSTM is suited to analyzing time-series data. It consists of an input gate, a forget gate, an output gate, and an internal memory cell, and, by making effective use of memory, it determines when the network forgets the previous hidden state and when it updates the hidden state, which alleviates the vanishing- and exploding-gradient problems that an RNN encounters during back-propagation when processing sequences of limited length. The LSTM network structure unit is shown in fig. 4, where i_t denotes the output of the input gate, f_t the output of the forget gate, o_t the output of the output gate, c'_t the internal memory cell, and h_t the output of the hidden unit; σ denotes the sigmoid activation function. Let x_t be the input of the LSTM cell at time t, let W and U denote weights, and let h_(t-1) be the output of the hidden unit at the previous time step. The specific definitions are given in formulas (15)-(20).
i_t = σ(λ_Wi(W_i x_t) + λ_Ui(U_i h_(t-1))) (15)
f_t = σ(λ_Wf(W_f x_t) + λ_Uf(U_f h_(t-1))) (16)
o_t = σ(λ_Wo(W_o x_t) + λ_Uo(U_o h_(t-1))) (17)
c'_t = tanh(λ_Wc(W_c x_t) + λ_Uc(U_c h_(t-1))) (18)
c_t = f_t c_(t-1) + i_t c'_t (19)
h_t = o_t tanh(c_t) (20)
From equations (15)-(20), the final output h_t of the hidden unit at time t is jointly determined by the output h_(t-1) of the hidden unit at the previous time step and the current input x_t, which realizes the memory function. Through the design of the three gating units, the LSTM memory cell can selectively store and update information over long ranges, which helps it learn the sequential feature information of the PPG signal and the eye movements.
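For illustration, a minimal NumPy sketch of a single LSTM cell step following equations (15)-(20) is given below; the λ mask terms of (15)-(18) are omitted (equivalently set to 1), and the weight layout is an assumption.

```python
# Sketch: one LSTM cell step following equations (15)-(20). The lambda mask
# terms are omitted (set to 1) for brevity; matrix shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """W, U: dicts with keys 'i', 'f', 'o', 'c' holding weight matrices."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)        # input gate, eq. (15)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)        # forget gate, eq. (16)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)        # output gate, eq. (17)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)    # candidate memory, eq. (18)
    c_t = f_t * c_prev + i_t * c_tilde                   # memory update, eq. (19)
    h_t = o_t * np.tanh(c_t)                             # hidden output, eq. (20)
    return h_t, c_t
```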
The LSTM network designed here has three hidden layers, with 32, 64, and 72 hidden units. The shallow features and the deep features extracted by the FECNN are taken as the input of the LSTM, and the network weights are updated by gradient back-propagation during the training stage. To damp excessive oscillation of the loss function and accelerate convergence, an adaptive learning-rate adjustment algorithm is chosen as the optimizer. A multi-class cross-entropy loss function is used to evaluate the difference between the probability distribution obtained from the current training step and the true distribution. The cross-entropy loss is given by formula (22), in which ŷ is the desired output and y is the actual output of the neuron, computed as in formula (21):
y = σ(Σ_j w_j x_j + b) (21)
loss = -Σ ŷ log y (22)
When the desired output equals the actual output, the loss value is 0. Dropout is applied after each LSTM layer to reduce interactions between features and prevent overfitting during training. The output layer uses the softmax activation function for classification and outputs an array of four probabilities representing the probability that the sample data belong to each emotion; the LSTM ultimately outputs one of the four emotional states of interest, happiness, confusion, and boredom. After 600 iterations, the accuracy and loss value of the LSTM model gradually stabilize, with an accuracy of 84.68% and a loss value of 0.43 on the test set.
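For illustration only, the sketch below assembles an LSTM classifier matching the structure described above (three LSTM layers with 32, 64, and 72 units, dropout after each layer, a 4-way softmax output, and categorical cross-entropy). Feeding sequences of fused 129-dimensional feature vectors, the 0.5 dropout rate, and the choice of Adam as the adaptive learning-rate optimizer are assumptions, as is the use of Keras.

```python
# Sketch of the emotion-classification LSTM: three LSTM layers (32, 64, 72
# units), dropout after each layer, softmax over the four emotions, trained
# with categorical cross-entropy. Input layout, dropout rate and optimizer
# choice are assumptions.
from tensorflow.keras import layers, models

def build_emotion_lstm(feature_dim=129):
    model = models.Sequential([
        layers.Input(shape=(None, feature_dim)),     # sequence of fused feature vectors
        layers.LSTM(32, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(72),
        layers.Dropout(0.5),
        layers.Dense(4, activation="softmax"),       # interest, happiness, confusion, boredom
    ])
    model.compile(optimizer="adam",                  # adaptive learning-rate optimizer (assumed)
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

lstm_model = build_emotion_lstm()
```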