This research was supported by a National Natural Science Foundation of China project (No. 61906051), a Guangxi Natural Science Foundation project (No. 2018GXNSFBA050029), and a doctoral scientific research start-up fund of Guilin University (GUTQDJJ2005015).
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the defects of the prior art, the invention provides an emotion recognition method based on FECNN-LSTM with multimodal fusion of eye movement and PPG.
The technical scheme of the invention comprises the following steps:
Step 1: Build a self-collected eye movement and PPG multimodal database, acquiring eye movement and PPG data with learning videos as the stimulus material.
Step 2: Label the collected eye movement data and physiological signal data using a discrete emotion labeling model, dividing the emotion labels into four emotional states: interest, happiness, confusion, and boredom. Preprocess the acquired eye movement data; the preprocessing comprises high-quality data screening, data cleaning, and data denoising. Because the raw PPG data are corrupted during acquisition by electromagnetic interference, illumination changes, motion artifacts, and the like, and the effective passband of the PPG signal lies between 0.8 Hz and 10 Hz, the high-pass filter cutoff is set to 1 Hz to remove low-frequency drift and the low-pass filter cutoff is set to 10 Hz to remove noise above 10 Hz.
Step 3: Divide the denoised data into a training set and a validation set at a ratio of 8:2.
Step 4: Convert the eye movement and PPG data of the preprocessed data set into UTF-8 text and construct data sets with different time window lengths, namely 5-second, 10-second, and 15-second windows.
Step 5: Calculate the eye movement time-domain features and the PPG time-frequency-domain features within the 5 s time window.
Step 6: a total of 72 eye movements and PPG features most relevant to the emotional state were selected using principal component analysis.
Step 7: Perform feature-layer fusion on the 72 features selected in Step 6 to generate the shallow features; after normalization, design a convolutional neural network (FECNN) and use it to extract deep features. Select the 57-dimensional subset of deep features most relevant to emotional state using principal component analysis. Fuse the deep features extracted by the FECNN with the shallow features at the feature layer to obtain a 129-dimensional feature vector as the input of the emotion classifier.
Step 8: Design an LSTM network model to classify emotions from the shallow and deep features. After repeated trials, train the data in the training set in batches over multiple epochs to adjust the network parameters until the maximum number of iterations is reached or the early stopping condition is met, and select the optimal LSTM network structure and parameters. Use the 129-dimensional feature vector obtained in Step 7 as the input to train the LSTM model, evaluate the model with the test set data, and output one of the four emotional states of interest, happiness, confusion, and boredom; finally, evaluate model performance using accuracy and the loss value.
Step 9: Run the LSTM network model trained in Step 8 on the test set to obtain the final classification accuracy.
Step 10: the effectiveness of the machine learning model was measured using accuracy (Precision), Recall (Recall) and F1 score (F1-score). Several basic concepts need to be defined, N
TP: the classifier judges the positive samples as the number of the positive samples, N
FP: the classifier judges the negative samples as the number of positive samples, N
TN: the classifier judges the negative samples as the number of the negative samples, N
FN: the classifier judges the positive samples as the number of the negative samples. The accuracy is defined as the proportion of the number of correctly classified samples in the positive samples to the number of all the classified positive samples, and the formula is as follows:
the recall ratio is defined as the proportion of the number of correctly classified samples in the positive samples to the number of all actually classified positive samples, and the recall ratio measures the capacity of correctly classifying the positive samples by classification, and the formula is as follows:
the F1 score is defined as twice the accuracy and recall ratio and the mean value, the F1 score comprehensively considers the accuracy and recall ratio capability of the classifier, and the formula is as follows:
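For illustration only, the sketch below computes these three metrics with scikit-learn for the four emotion classes; macro averaging over classes is an assumption, since the description above does not state how per-class scores are aggregated.

```python
# Minimal sketch: Precision, Recall and F1 for the four emotion classes.
# Macro averaging across classes is an assumption, not taken from the text.
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_emotion_classifier(y_true, y_pred):
    """y_true, y_pred: lists of labels in {interest, happiness, confusion, boredom}."""
    prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
    rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    return prec, rec, f1

# Example usage with illustrative labels
prec, rec, f1 = evaluate_emotion_classifier(
    ["interest", "happiness", "confusion", "boredom", "interest"],
    ["interest", "confusion", "confusion", "boredom", "interest"],
)
print(f"Precision={prec:.3f}, Recall={rec:.3f}, F1={f1:.3f}")
```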
Detailed Description
The invention will be further described with reference to examples and figures, but the embodiments of the invention are not limited thereto.
As shown in fig. 3, the present embodiment provides an emotion recognition method based on FECNN-LSTM with multimodal fusion of eye movement and PPG, comprising the following steps:
1. Clean the data, divide it into data sets with different time windows, and denoise the data to obtain the processed eye movement and PPG signals.
2. Calculate the eye-movement-related features, comprising features of the gaze, saccade, blink, and pupil categories, and the PPG-related features, comprising HR, HRV, and RPeaks.
3. Perform principal component analysis to select the eye movement and PPG (photoplethysmography) features highly correlated with emotional state, fuse them at the feature layer to form the shallow features, and normalize the shallow features.
4. Perform feature learning on the shallow features with the FECNN to extract deep features, select the deep features highly correlated with emotional state using principal component analysis, and normalize them.
5. Fuse the shallow and deep features at the feature layer as the input of the emotion classifier. As a comparison experiment, classify emotional state from the eye movement and PPG single-modality shallow features and from the fused eye movement and PPG features using four machine learning algorithms: support vector machine, random forest, K-nearest neighbours, and multilayer perceptron. Evaluate the resulting models with different evaluation indexes.
6. Design a long short-term memory network to classify emotions from the deep and shallow features, and evaluate the model using accuracy and the loss value.
More specifically, a multimodal emotion recognition data set is constructed, using learning videos on five different subjects as the stimulus material. The whole multimodal data acquisition procedure is shown in fig. 1 and is described in the following steps:
and S1, before the experiment is carried out, the physiological signal acquisition equipment is worn on the tested object, and then the eye calibration is carried out on the tested object so as to check whether the tested object is qualified.
S2: Before formally entering the experiment, the subject must watch a fixation point, namely a crosshair appearing in the center of the screen, for 60 s; the baseline values of the eye movement and PPG data are obtained from this fixation period.
S3: During the experiment, four 2-min video segments are played first, followed by one 10-min video segment; the four 2-min segments are played in random order. Before each segment is played, the subject completes a knowledge questionnaire test, whose content is related to the experimental material, to measure the subject's prior knowledge. The subject then watches the segment on the computer screen; after playback finishes, the subject marks by key press the emotions experienced while watching and completes a post-test. After the post-test, the knowledge questionnaire test, viewing, and post-test of the next video segment follow.
S4: The last video played is a 10-min video intended to induce distraction. A reminder pops up during viewing; if the subject was distracted during the period before the reminder appeared, the distraction can be marked by pressing a key.
S5: After the whole experiment is finished, the experimenter explains the labeling model to the subject verbally and ensures that the subject fully understands it. The subject then reviews the recordings, including the video segments, the synchronously recorded video of the subject watching them, and the subject's eye movement trajectory while watching; the recording is divided into events according to the emotional states the subject recalls, and the subject labels the five videos by selecting an emotional state from the classified emotion words and an intensity from the different arousal levels in the emotion model. The emotional states in the data acquisition experiment are thus obtained by a "cued review" method combined with the subject's subjective report: after a video segment has been watched, it is played back together with the synchronously recorded facial expression video and eye movement trajectory to stimulate recall of the emotional state at the time, the synchronized video is divided into event segments, and the subject selects his or her emotional state and intensity from the emotion words and arousal levels of the emotion classification model. The emotional states include happiness, interest, boredom, confusion, distraction, and others. The A (arousal) dimension of the PAD dimensional model is used for retrospective labeling of the arousal intensity of the subject in a given emotional state, with values from 1 to 5 representing increasing intensity, where 1 is the lowest and 5 the highest.
For preprocessing of the eye movement signals, outliers in the subjects' eye movement data recorded during the experiment are removed and noise generated during acquisition is eliminated. For preprocessing of the PPG signal, noise produced during acquisition by electromagnetic interference, illumination changes, motion artifacts, and similar disturbances is removed.
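For illustration only, the sketch below applies the 1 Hz high-pass / 10 Hz low-pass denoising described above as a single band-pass; the Butterworth filter type, the 4th order, and the 100 Hz sampling rate are assumptions not specified in the description.

```python
# Minimal sketch of PPG denoising: high-pass at 1 Hz to remove baseline drift,
# low-pass at 10 Hz to remove high-frequency noise (implemented as a band-pass).
# Filter type (Butterworth), order and sampling rate are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def denoise_ppg(ppg: np.ndarray, fs: float = 100.0) -> np.ndarray:
    """Apply a 1-10 Hz band-pass to a raw PPG signal sampled at fs Hz."""
    b, a = butter(N=4, Wn=[1.0, 10.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, ppg)  # zero-phase filtering avoids waveform shift

# Example with synthetic data: 30 s of fake PPG at 100 Hz
raw = np.random.randn(3000)
clean = denoise_ppg(raw, fs=100.0)
```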
Selecting appropriate and effective eye movement and PPG indexes according to the research purpose is important; otherwise valuable data information is lost during the research. In emotion recognition research, the indexes of a single modality have certain limitations, while signals from multiple modalities are correlated and complementary, so indexes from the two modalities of eye movement and PPG are selected for analysis. The eye movement indexes selected in this experiment fall into four main categories: gaze, saccade, blink, and pupil diameter; the selected PPG indexes fall into three categories: HR, HRV, and RPeaks.
In step 2, after the eye movement and PPG data are preprocessed, the statistical features of the eye movement data and the time-frequency-domain features of the PPG are calculated. The PPG frequency-domain features are calculated by formulas (4)-(9).
The pulse sequence signal in each time window is sampled at equal intervals, and N points are selected to form a discrete sequence x(n). The discrete Fourier transform is then applied to obtain the frequency-domain sequence X(k), where k is the discrete frequency variable, W_N is the forward transform kernel, and j is the imaginary unit. The calculation formula is as follows:
X(k) = Σ_(n=0)^(N-1) x(n)·W_N^(nk),  k = 0, 1, …, N-1
with the transform kernel
W_N^(nk) = exp(-j2πnk/N) = cos(2πnk/N) - j·sin(2πnk/N) (6)
X(k) is a complex number,
X(k) = R(k) + jI(k) (7)
where R(k) is the real part and I(k) is the imaginary part. The phase of each point of the frequency-domain sequence is then
φ(k) = arctan(I(k)/R(k)) (8)
and the amplitude spectrum is
|X(k)| = sqrt(R(k)^2 + I(k)^2) (9)
Because the discrete Fourier transform is computationally expensive, the acquired data are processed with the fast Fourier transform, the amplitude and phase are expressed as functions of frequency, and the corresponding frequency components HF, LF, VLF, LF/HF, and total power are extracted from the power spectral density as the frequency-domain features of HRV.
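The sketch below illustrates one way to obtain these frequency-domain features from an HRV (RR-interval) sequence: resample to an evenly spaced series, estimate the power spectral density with an FFT-based method, and integrate the frequency bands. The band limits and the 4 Hz resampling rate are conventional assumptions, not values taken from the description above.

```python
# Sketch: HRV frequency-domain features (VLF, LF, HF, LF/HF, total power)
# from RR intervals. Band limits and 4 Hz resampling are common conventions
# assumed here; they are not listed explicitly in the text.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch

def hrv_frequency_features(rr_s, fs_interp=4.0):
    """rr_s: RR intervals in seconds, one value per heartbeat."""
    rr_s = np.asarray(rr_s, dtype=float)
    t = np.cumsum(rr_s)                                  # beat times
    t_even = np.arange(t[0], t[-1], 1.0 / fs_interp)     # evenly spaced time grid
    rr_even = interp1d(t, rr_s, kind="cubic")(t_even)
    freqs, psd = welch(rr_even - rr_even.mean(), fs=fs_interp,
                       nperseg=min(256, len(rr_even)))

    def band_power(lo, hi):
        mask = (freqs >= lo) & (freqs < hi)
        return np.trapz(psd[mask], freqs[mask])

    vlf = band_power(0.003, 0.04)
    lf = band_power(0.04, 0.15)
    hf = band_power(0.15, 0.40)
    total = vlf + lf + hf
    return {"VLF": vlf, "LF": lf, "HF": hf,
            "LF/HF": lf / hf if hf > 0 else np.nan, "TP": total}
```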
Let the HRV sequence be R = [R_1, R_2, …, R_N], where R_i denotes the HRV value at time i and N is the sequence length. The HRV time-domain features are calculated by formulas (10)-(13). The root mean square of successive RR-interval differences, RMSSD, is calculated by formula (10), where RR_i = R_(i+1) - R_i:
RMSSD = sqrt( (1/(N-1)) · Σ_(i=1)^(N-1) (R_(i+1) - R_i)^2 ) (10)
The standard deviation SDNN is given by
SDNN = sqrt( (1/(N-1)) · Σ_(i=1)^N (R_i - R̄)^2 ) (11)
where
R̄ = (1/N) · Σ_(i=1)^N R_i (12)
PNN50, the percentage of successive peak-interval differences greater than 50 ms, is
PNN50 = (NN50 / (N-1)) × 100% (13)
where NN50 is the number of successive differences |R_(i+1) - R_i| exceeding 50 ms.
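A minimal sketch of the time-domain features in formulas (10)-(13) is given below; treating the RR intervals as being in milliseconds is an assumption.

```python
# Sketch: HRV time-domain features RMSSD, SDNN and PNN50 from an RR-interval
# sequence (assumed to be in milliseconds), following formulas (10)-(13).
import numpy as np

def hrv_time_features(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)                                        # RR_i = R_(i+1) - R_i
    rmssd = np.sqrt(np.mean(diff ** 2))                       # formula (10)
    sdnn = np.std(rr, ddof=1)                                 # formula (11), sample std
    pnn50 = 100.0 * np.sum(np.abs(diff) > 50.0) / len(diff)   # formula (13)
    return {"RMSSD": rmssd, "SDNN": sdnn, "PNN50": pnn50}
```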
and screening out eye movement and PPG characteristics which are obviously related to the emotional state by adopting a Principal Component Analysis (PCA) method according to the characteristics obtained by each mode. The experimental analysis finally picked 32 eye movement features and 40 PPG features.
In step S3, the eye movement statistical features and the PPG time-frequency-domain features within the synchronized time window are fused at the feature layer to form the shallow features, yielding a 72-dimensional combined feature vector. Because of individual differences, the physiological baseline values of different people differ, so the individual baseline must be removed: each eye movement and PPG emotional feature is normalized with the corresponding feature value recorded in the calm state, and the feature values are mapped into the [0, 1] interval by min-max normalization, given by formula (14):
X* = (X - X_min) / (X_max - X_min) (14)
where X* is the normalized value, X is the sample value, X_min is the minimum value in the sample, and X_max is the maximum value in the sample.
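A sketch of the baseline removal and min-max normalization of formula (14) follows; subtracting the calm-state baseline before scaling is an assumption about how the calm-state value is combined with each feature.

```python
# Sketch: remove the individual calm-state baseline from each feature and
# map the result into [0, 1] with min-max normalization (formula (14)).
# Subtracting the baseline before scaling is an assumption.
import numpy as np

def normalize_features(X, baseline):
    """X: (samples, features); baseline: (features,) calm-state values per subject."""
    Xb = X - baseline                                        # remove individual baseline
    x_min, x_max = Xb.min(axis=0), Xb.max(axis=0)
    rng = np.where(x_max - x_min == 0, 1.0, x_max - x_min)   # guard constant columns
    return (Xb - x_min) / rng                                # X* = (X - Xmin)/(Xmax - Xmin)
```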
In step S4, the FECNN network structure is designed as follows. The feature extraction part of the FECNN consists of convolutional layers and pooling layers: the convolutional layers extract deep information from the input data, and the pooling layers down-sample the resulting feature maps to reduce overfitting. The input of the FECNN is a 72×1 vector, and the network has six convolutional layers, Conv1 through Conv6. Each convolutional block contains a one-dimensional 3×1 convolution kernel, a max-pooling layer with a 2×1 filter, and a regularizing Dropout layer. The Dropout layer deactivates a portion of the neurons with probability 0.5 to prevent overfitting of the model; a deactivated neuron does not take part in error back-propagation, but its weights are preserved, so the network adopts a different structure each time a sample is input. The stride of each convolutional layer is set to 1, and ReLU is used as the activation function. Conv6 is followed by a Flatten layer, and a Dense layer then compresses the flattened output into 64×1-dimensional deep features. The deep features related to emotional state are selected using the Pearson correlation coefficient, giving 57 dimensions in total.
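For illustration only, the sketch below assembles an FECNN-like feature extractor matching the structure described above in Keras; the number of filters in each convolutional layer is not specified above and is chosen purely for illustration, and the use of Keras itself is an assumption.

```python
# Sketch of an FECNN-like extractor: six blocks of Conv1D (kernel 3, stride 1,
# ReLU) + MaxPooling1D(2) + Dropout(0.5), then Flatten and a Dense layer
# producing a 64-dimensional deep feature vector. Filter counts per layer
# are illustrative assumptions.
from tensorflow.keras import layers, models

def build_fecnn(input_dim=72, filters=(16, 16, 32, 32, 64, 64)):
    inputs = layers.Input(shape=(input_dim, 1))              # 72 x 1 shallow feature vector
    x = inputs
    for f in filters:                                        # Conv1 ... Conv6
        x = layers.Conv1D(f, kernel_size=3, strides=1,
                          padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    deep = layers.Dense(64, activation="relu", name="deep_features")(x)
    return models.Model(inputs, deep)

fecnn = build_fecnn()
fecnn.summary()
```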
In step S5, the deep features extracted in step S4 and the shallow features are fused at the feature layer, and a 129-dimensional feature vector is output as the input of the emotion classifier. Four machine learning algorithms are used to classify emotions from the eye movement and PPG single-modality data, with grid search used for parameter optimization; the finally obtained parameters are shown in the following table:
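A sketch of the grid-search optimization for the four comparison classifiers is shown below; the parameter grids are illustrative only and do not reproduce the parameter table referred to above.

```python
# Sketch: grid-search parameter optimization for the four comparison
# classifiers (SVM, random forest, K-nearest neighbours, multilayer
# perceptron). The parameter grids are illustrative assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

search_spaces = {
    "svm": (SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "mlp": (MLPClassifier(max_iter=1000), {"hidden_layer_sizes": [(64,), (64, 32)]}),
}

def tune_classifiers(X_train, y_train):
    best = {}
    for name, (estimator, grid) in search_spaces.items():
        gs = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        gs.fit(X_train, y_train)
        best[name] = (gs.best_estimator_, gs.best_params_)
    return best
```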
The four machine learning algorithms were evaluated using precision, recall, and the F1 score, with the following results:
In step S6, the designed long short-term memory network (LSTM) structure is described as follows.
The LSTM is suited to analyzing time-series data. It consists of an input gate, a forget gate, an output gate, and an internal memory cell, and, by making effective use of memory, it determines when the network forgets the previous hidden state and when it updates the hidden state, which alleviates the vanishing- and exploding-gradient problems that an RNN encounters during back-propagation when processing sequences of limited length. The LSTM network structure unit is shown in fig. 4, where i_t denotes the output of the input gate, f_t the output of the forget gate, o_t the output of the output gate, c'_t the internal memory cell, and h_t the output of the hidden unit; σ denotes the sigmoid activation function. Let x_t be the input of the LSTM cell at time t, let W and U denote weights, and let h_(t-1) be the output of the hidden unit at the previous time step. The specific definitions are given in formulas (15)-(20).
i_t = σ(λ_Wi(W_i x_t) + λ_Ui(U_i h_(t-1))) (15)
f_t = σ(λ_Wf(W_f x_t) + λ_Uf(U_f h_(t-1))) (16)
o_t = σ(λ_Wo(W_o x_t) + λ_Uo(U_o h_(t-1))) (17)
c'_t = tanh(λ_Wc(W_c x_t) + λ_Uc(U_c h_(t-1))) (18)
c_t = f_t c_(t-1) + i_t c'_t (19)
h_t = o_t tanh(c_t) (20)
From equations (15)-(20), the final output h_t of the hidden unit at time t is jointly determined by the output h_(t-1) of the hidden unit at the previous time step and the current input x_t, which realizes the memory function. Through the design of the three gating units, the LSTM memory cell can selectively store and update information over long ranges, which helps it learn the sequential feature information of the PPG signal and the eye movements.
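For illustration, a minimal NumPy sketch of a single LSTM cell step following equations (15)-(20) is given below; the λ mask terms of (15)-(18) are omitted (equivalently set to 1), and the weight layout is an assumption.

```python
# Sketch: one LSTM cell step following equations (15)-(20). The lambda mask
# terms are omitted (set to 1) for brevity; matrix shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """W, U: dicts with keys 'i', 'f', 'o', 'c' holding weight matrices."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)        # input gate, eq. (15)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)        # forget gate, eq. (16)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)        # output gate, eq. (17)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)    # candidate memory, eq. (18)
    c_t = f_t * c_prev + i_t * c_tilde                   # memory update, eq. (19)
    h_t = o_t * np.tanh(c_t)                             # hidden output, eq. (20)
    return h_t, c_t
```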
The LSTM network designed here has three hidden layers, with 32, 64, and 72 hidden units. The shallow features and the deep features extracted by the FECNN are taken as the input of the LSTM, and the network weights are updated by gradient back-propagation during the training stage. To damp excessive oscillation of the loss function and accelerate convergence, an adaptive learning-rate adjustment algorithm is chosen as the optimizer. A multi-class cross-entropy loss function is used to evaluate the difference between the probability distribution obtained from the current training step and the true distribution. The cross-entropy loss is given by formula (22), in which ŷ is the desired output and y is the actual output of the neuron, computed as in formula (21):
y = σ(Σ_j w_j x_j + b) (21)
loss = -Σ ŷ log y (22)
When the desired output equals the actual output, the loss value is 0. Dropout is applied after each LSTM layer to reduce interactions between features and prevent overfitting during training. The output layer uses the softmax activation function for classification and outputs an array of four probabilities representing the probability that the sample data belong to each emotion; the LSTM ultimately outputs one of the four emotional states of interest, happiness, confusion, and boredom. After 600 iterations, the accuracy and loss value of the LSTM model gradually stabilize, with an accuracy of 84.68% and a loss value of 0.43 on the test set.
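For illustration only, the sketch below assembles an LSTM classifier matching the structure described above (three LSTM layers with 32, 64, and 72 units, dropout after each layer, a 4-way softmax output, and categorical cross-entropy). Feeding sequences of fused 129-dimensional feature vectors, the 0.5 dropout rate, and the choice of Adam as the adaptive learning-rate optimizer are assumptions, as is the use of Keras.

```python
# Sketch of the emotion-classification LSTM: three LSTM layers (32, 64, 72
# units), dropout after each layer, softmax over the four emotions, trained
# with categorical cross-entropy. Input layout, dropout rate and optimizer
# choice are assumptions.
from tensorflow.keras import layers, models

def build_emotion_lstm(feature_dim=129):
    model = models.Sequential([
        layers.Input(shape=(None, feature_dim)),     # sequence of fused feature vectors
        layers.LSTM(32, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(0.5),
        layers.LSTM(72),
        layers.Dropout(0.5),
        layers.Dense(4, activation="softmax"),       # interest, happiness, confusion, boredom
    ])
    model.compile(optimizer="adam",                  # adaptive learning-rate optimizer (assumed)
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

lstm_model = build_emotion_lstm()
```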