Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a student emotion analysis method and system based on a multi-modal dynamic memory large model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the invention provides a student emotion analysis method based on a multi-modal dynamic memory large model, which comprises the following steps:
S1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels;
S2, inputting the facial expression images into a multi-modal dynamic memory large model for training; in the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into an image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained; a description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation; the visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax; the visual aggregate representation is input into an output module together with the historical features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature; the visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model;
And S3, inputting the facial expression images to be classified into the trained multi-modal dynamic memory large model; after the dynamic feature space reaches a preset capacity threshold, calculating the comprehensive score corresponding to each facial expression image to be classified, judging whether the dynamic feature space needs to be updated based on the comprehensive score, and finally outputting the emotion state category prediction result of the facial expression image.
On the basis of the scheme, each step can be realized in the following preferred specific mode.
As a preferable aspect of the first aspect, in step S1, the true emotion state category label includes five categories, namely happy, confused, anxious, attentive, and depressed.
As a preferred aspect of the first aspect, in step S2, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows:
S21, obtaining the classification token representation learned by the (q-1)-th coding layer of the image encoder; multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation; performing layer normalization on the projected classification token representation and then applying a multi-head self-attention mechanism to obtain a processed classification token representation; and adding the processed classification token representation and the projected classification token representation to obtain the summary token of the q-th coding layer;
S22, adding the classification token representation and a randomly initialized learnable vector to obtain a first intermediate token representation; performing layer normalization on the first intermediate token representation and then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation; and adding the processed first intermediate token representation and the first intermediate token representation to obtain the local prompt token of the q-th coding layer;
S23, randomly initializing the global prompt token of the q-th coding layer; obtaining the image features output by the (q-1)-th coding layer of the image encoder, and splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation; performing layer normalization on the original visual input representation and then applying the pre-trained self-attention mechanism to obtain a processed visual input representation; adding the processed visual input representation and the original visual input representation to obtain an original visual output representation; deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation; performing layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features; and adding the initial image features and the processed visual output representation as the image features output by the q-th coding layer of the image encoder.
As a preferred embodiment of the first aspect, in step S2, the specific process of obtaining the visual aggregate representation is as follows: the image feature output by the last coding layer of the image encoder is multiplied by a linear projection matrix to obtain the projected image feature, and the projected image features corresponding to each facial expression image are spliced and then subjected to an average pooling operation to output the visual aggregate representation.
As a preference of the first aspect, in step S2, the specific process of generating the visual classification representation by the output module is as follows: the visual aggregate representation corresponding to one facial expression image is used as the query representation; the feature corresponding to the real emotion state category label y of the facial expression image is retrieved in the dynamic feature space and used as the key representation and the value representation; the query representation is processed by a first projection function to obtain a first projection feature; the key representation is processed by a second projection function and transposed to obtain a second projection feature; the value representation is processed by a third projection function to obtain a third projection feature; cosine similarity is calculated between the first projection feature and the second projection feature to obtain a first similarity feature; the first similarity feature is used as the input variable of a sharpness function, and a sharpness feature weight is obtained after processing by the sharpness function; the third projection feature is weighted by multiplication with the sharpness feature weight to obtain a second similarity feature; and the second similarity feature is processed by a fourth projection function to obtain the element of row y in the visual classification representation, finally forming the visual classification representation.
The specific processing procedure in each projection function is as follows: the feature input into the projection function is passed through a fully connected layer whose parameters are all randomly initialized to obtain a first intermediate feature; the first intermediate feature is added to the feature input into the projection function to form a residual connection, obtaining a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the feature output by the projection function.
As a preference of the above first aspect, the sharpness function Rd(·) has the functional form:
Rd(x) = exp(-w_r(1-x))
where x represents the variable input to the sharpness function and w_r represents the hyper-parameter that adjusts the sharpness.
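As an illustration, the sharpness function can be sketched in a few lines of Python; the default value of w_r below is an arbitrary choice for demonstration, not a value specified by the method:

```python
import math

def sharpness(x: float, w_r: float = 8.0) -> float:
    """Sharpness function Rd(x) = exp(-w_r * (1 - x)).

    x is typically a cosine similarity in [-1, 1]; w_r is the
    hyper-parameter that adjusts sharpness (8.0 is illustrative).
    Larger w_r suppresses low-similarity matches more strongly.
    """
    return math.exp(-w_r * (1.0 - x))
```

Note that Rd(1) = 1 regardless of w_r, so a perfect match keeps its full weight while weaker matches decay exponentially.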
As a preferred aspect of the above first aspect, in step S3, a specific process of determining whether the dynamic feature space needs to be updated based on the composite score is as follows:
S31, calculating an entropy value based on the first prediction probability corresponding to the visual aggregate representation of the facial expression image to be classified, as the current prediction entropy; calculating entropy values based on the first prediction probabilities corresponding to each feature in the dynamic feature space, as the historical prediction entropies; averaging all the historical prediction entropies to obtain the average historical entropy; adding the current prediction entropy, the average historical entropy and a preset constant to obtain a confidence total score; and taking the ratio of a preset confidence score to the confidence total score as the comprehensive score of the facial expression image;
S32, when the dynamic feature space reaches the preset capacity threshold: if the comprehensive score of the facial expression image to be classified is greater than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space; and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
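Steps S31 and S32 can be sketched as follows. This is a minimal Python illustration; the constant `c` and confidence score `s` are hypothetical stand-ins for the preset values described above:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (lower = more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def composite_score(current_probs, history_probs_list, c=1.0, s=1.0):
    """S31 sketch: s / (current entropy + mean historical entropy + c).

    c and s stand in for the preset constant and preset confidence
    score; confident (low-entropy) predictions get higher scores.
    """
    h_cur = entropy(current_probs)
    h_hist = sum(entropy(p) for p in history_probs_list) / len(history_probs_list)
    return s / (h_cur + h_hist + c)

def maybe_update(feature_space, new_feature, new_score, capacity):
    """S32 sketch: once the space is full, replace the lowest-scoring
    feature iff the new score is >= that minimum. Returns True if the
    space was modified."""
    if len(feature_space) < capacity:
        feature_space.append((new_feature, new_score))
        return True
    idx = min(range(len(feature_space)), key=lambda i: feature_space[i][1])
    if new_score >= feature_space[idx][1]:
        feature_space[idx] = (new_feature, new_score)
        return True
    return False
```

The design keeps the memory bounded while biasing it toward confidently predicted samples, since a high-entropy candidate scores below the current minimum and is discarded.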
In a second aspect, the present invention provides a student emotion analysis system based on a multimodal dynamic memory big model, comprising:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory large model for training. In the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into the image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained; a description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation; the visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax; the visual aggregate representation is input into the output module together with the historical features stored in the dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature; the visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression images to be classified into the trained multi-modal dynamic memory large model; after the dynamic feature space reaches a preset capacity threshold, the comprehensive score corresponding to each facial expression image to be classified is calculated, whether the dynamic feature space needs to be updated is judged based on the comprehensive score, and the emotion state category prediction result of the facial expression image is finally output.
In a third aspect, the present invention provides a computer electronic device comprising a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to implement a student emotion analysis method based on a multi-modal dynamic memory big model according to any one of the first aspect when executing the computer program.
Compared with the traditional student emotion analysis method, the invention has the following beneficial effects:
Compared with traditional visual learning strategies, the visual prompt learning method can capture and understand visual information more effectively, and can be widely applied to various downstream computer vision tasks, such as student emotion analysis. In terms of text prompting, the present invention introduces a learnable text representation rather than manually designed prompt words. In addition, the invention designs a dynamic feature space that preserves the features of historical test data during testing. Finally, the multi-modal dynamic memory large model can further mine the latent information of facial expression images, thereby further improving model performance on the emotion analysis task of student facial expression recognition.
Detailed Description
In order that the above objects, features and advantages of the invention may be readily understood, a more particular description of the invention is rendered below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may, however, be embodied in many other forms than those described herein, and similar modifications may be made by those skilled in the art without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below. The technical features of the embodiments of the invention can be combined with one another provided there is no mutual conflict.
In the description of the present invention, it should be understood that the terms "first" and "second" are used solely for the purpose of distinguishing between the descriptions and not necessarily for the purpose of indicating or implying a relative importance or implicitly indicating the number of features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
The multi-modal large model has notable advantages in addressing the accuracy of emotion recognition and the computing capacity and response speed required to process large amounts of data in real time, and it provides a novel visual representation learning approach. It can capture and understand visual information more effectively than traditional visual learning strategies. First, the multi-modal large model combines image and text data, so information can be extracted from multi-modal data and the accuracy of emotion recognition is improved. For example, the multi-modal large model can simultaneously analyze students' facial expressions, body language and text content in classroom interactions, providing more comprehensive emotion recognition. Such multi-modal analysis captures student emotion states more accurately than a single modality because it takes more contextual information and behavioral characteristics into account. In terms of computing power and response speed, the multi-modal large model relies on an advanced neural network architecture and efficient parallel processing techniques. Existing multi-modal large models are generally based on the Transformer architecture, which has strong parallel processing capability and can process large amounts of data simultaneously. This enables the multi-modal large model to analyze student emotion states in real time and provide feedback in a short time. The multi-modal visual-language large model also has adaptive learning capability: it can continuously learn and optimize from new data, improving both the accuracy of emotion recognition and its real-time processing capability. Through continuous interaction with teachers and students, the multi-modal large model can gradually learn and adapt to the emotion expression patterns of different students, improving the personalization and accuracy of emotion recognition.
Therefore, through multi-modal data fusion, efficient parallel processing and adaptive learning, the multi-modal large model can significantly improve the accuracy of emotion analysis for students in class. The latest multi-modal large models, such as the contrastive language-image pre-training model (Contrastive Language-Image Pre-training, CLIP) and the contrastive-learning-based dual-encoder model ALIGN, collect 400 million and 1.8 billion image-text pairs, respectively, from the Internet. These multi-modal large models have then been widely applied to a variety of downstream computer vision tasks and exhibit excellent performance.
In this embodiment, the contrastive language-image pre-training model is specifically described as an example. CLIP uses a contrastive learning approach to understand and correlate visual and linguistic information. When performing emotion prediction, CLIP first converts the input image and text data into high-dimensional feature vectors, extracting key features through the visual and language encoders. These feature vectors are then projected into the same embedding space, where the cosine similarity of the image and text is calculated. By optimizing a contrastive loss function, CLIP brings correctly matched image-text pairs closer in the embedding space and pushes incorrectly matched pairs farther apart. As previously mentioned, the present invention contemplates the use of the contrastive language-image pre-training model in multi-modal emotion analysis. In order to effectively apply the contrastive language-image pre-training model to the processing of student facial expression data, the present invention takes two important aspects into account. In the first aspect, the generalization capability of the backbone of the contrastive language-image pre-training model needs to be preserved, namely, the visual representation and text representation obtained from the original image encoder and text encoder in CLIP are preserved; in the second aspect, the method must be able to adapt effectively to the field of student emotion analysis. For the first aspect, the invention provides a novel text prompt learning method for fine-tuning the text encoder, and adopts the visual prompt learning method for fine-tuning the image encoder, achieving competitive strongly supervised performance so as to better adapt to the emotion analysis task of student facial expression recognition. By performing prompt fine-tuning on the CLIP model, its adaptability and accuracy in student emotion analysis can be significantly improved.
The fine-tuned CLIP model can more accurately capture students' emotion changes, such as happiness, confusion, anxiety, concentration and depression, and provide timely feedback. In this way, teachers can flexibly adjust their teaching methods according to students' instantaneous emotion changes, thereby providing more customized support and assistance. Fine-tuning also enables CLIP to better suit student emotion expression under different cultural and personality backgrounds, further improving the accuracy and universality of emotion recognition. For the second aspect, the invention additionally introduces a learnable output module to adapt to the fine-tuned CLIP model.
As shown in FIG. 1, in a preferred implementation manner of the present invention, the student emotion analysis method based on the multi-modal dynamic memory big model includes the following steps S1-S3. The following describes the specific implementation procedure.
S1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels.
In this embodiment, facial expression videos of students watching online education courses are acquired from an online education platform. Each facial expression video of length N is sampled frame by frame at fixed intervals, and each extracted frame is a static facial expression image of size Y×U, where Y and U are the height and the width of the facial expression image, respectively.
In this embodiment, the emotion states of students watching online education courses are described by several category labels; these labels help teachers better understand the emotions and responses of students, so as to effectively adjust teaching strategies and provide support. The real emotion state category labels can be summarized as follows: happy, meaning the student feels satisfied and pleased in learning, expressed as smiling, active participation and confidence; confused, meaning the student faces comprehension obstacles with a concept or task, expressed as frowning, frequent questioning or seeking additional guidance; anxious, reflecting stress and worry, expressed as restlessness, wandering eyes or hurried speech; attentive, indicating the student is focused, presenting a still posture and a thoughtful expression; and depressed, arising from frustration, expressed as low mood, lack of motivation and reduced participation. The real emotion state category labels not only help teachers capture students' emotion states in time, but also help formulate personalized teaching support measures, so as to improve students' learning experience and outcomes and make online education more attentive and humane.
S2, inputting the facial expression images into the multi-modal dynamic memory large model for training. In the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into the image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained (as shown in FIG. 2). A description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation (as shown in FIG. 3). The visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax. The visual aggregate representation is input into the output module together with the historical features stored in the dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature. The visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax. The weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model.
In step S2 of this embodiment, non-overlapping segmentation is performed on each facial expression image, that is, each facial expression image is divided into a plurality of square patches with a size of p×p. After the square patches are obtained, random scale cropping, horizontal flipping, random rotation and color jittering are applied to avoid over-fitting. The processed square patches are flattened into a set of vectors, which are then projected using a linear projection layer to form the token embedding sequence.
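The patch splitting and linear projection described above can be sketched in NumPy as follows; the random projection matrix here merely stands in for the learned linear projection layer, and the augmentations are omitted:

```python
import numpy as np

def to_token_sequence(image, p, d_model, rng=None):
    """Split an H x W x C image into non-overlapping p x p patches,
    flatten each patch, and apply a linear projection to obtain the
    token embedding sequence (random weights stand in for the learned
    projection layer)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image must tile evenly into p x p patches"
    # Reorder axes so each patch becomes one flattened row vector.
    patches = (
        image.reshape(H // p, p, W // p, p, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, p * p * C)
    )
    W_proj = rng.standard_normal((p * p * C, d_model)) * 0.02
    return patches @ W_proj  # shape: (num_patches, d_model)
```

For a Y×U frame this yields (Y/p)·(U/p) tokens, to which the spatial position encoding and classification token are then added.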
In step S2 of this embodiment, as shown in fig. 4, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows:
S21, obtaining the classification token representation learned by the (q-1)-th coding layer of the image encoder; multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation; performing layer normalization on the projected classification token representation and then applying a multi-head self-attention mechanism to obtain a processed classification token representation; and adding the processed classification token representation and the projected classification token representation to obtain the summary token of the q-th coding layer;
S22, adding the classification token representation and a randomly initialized learnable vector to obtain a first intermediate token representation; performing layer normalization on the first intermediate token representation and then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation; and adding the processed first intermediate token representation and the first intermediate token representation to obtain the local prompt token of the q-th coding layer;
S23, randomly initializing the global prompt token of the q-th coding layer; obtaining the image features output by the (q-1)-th coding layer of the image encoder, and splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation; performing layer normalization on the original visual input representation and then applying the pre-trained self-attention mechanism to obtain a processed visual input representation; adding the processed visual input representation and the original visual input representation to obtain an original visual output representation; deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation; performing layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features; and adding the initial image features and the processed visual output representation as the image features output by the q-th coding layer of the image encoder.
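The three-step prompt fine-tuning above can be sketched roughly as follows. This is a deliberately simplified NumPy illustration: single-head attention with identity query/key/value projections stands in for the multi-head and pre-trained attention blocks, and small random matrices stand in for all learned parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (feature) dimension."""
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections,
    a stand-in for the multi-head / pre-trained attention blocks."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ x

def encoding_layer(features, cls_token, d, rng):
    """Hypothetical sketch of one prompt-tuned coding layer."""
    # Step 1: summary token from the previous layer's classification token.
    proj = rng.standard_normal((d, d)) * 0.02
    cls_p = cls_token @ proj
    summary = self_attention(layer_norm(cls_p[None]))[0] + cls_p
    # Step 2: local prompt token = cls token + randomly initialized
    # learnable vector, normalized, attended, with residual add.
    inter = cls_token + rng.standard_normal(d) * 0.02
    local = self_attention(layer_norm(inter[None]))[0] + inter
    # Step 3: splice features with the prompt tokens, attend with
    # residual, then delete the prompt tokens again.
    glob = rng.standard_normal(d) * 0.02  # global prompt token
    x = np.vstack([features, local, glob, summary])
    x = self_attention(layer_norm(x)) + x
    x = x[: len(features)]  # drop local / global / summary tokens
    # Layer norm + toy feedforward, then residual add.
    ffn = layer_norm(x) @ (rng.standard_normal((d, d)) * 0.02)
    return ffn + x
```

The key structural point the sketch preserves is that the prompt tokens influence the image features only through the attention mixing and are stripped before the feedforward stage, so the layer's output shape matches its input.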
In step S2 of this embodiment, the specific process of obtaining the visual aggregate representation is as follows: the image feature output by the last coding layer of the image encoder is multiplied by a linear projection matrix to obtain the projected image feature, and the projected image features corresponding to each facial expression image are spliced and then subjected to an average pooling operation to output the visual aggregate representation.
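The aggregation step just described can be sketched as follows; for simplicity each per-image feature is assumed to already be a single vector, and the projection matrix is passed in explicitly:

```python
import numpy as np

def visual_aggregate(last_layer_features, proj):
    """Project each image's feature vector with the linear projection
    matrix, splice (stack) the results, and average-pool across images
    to obtain the visual aggregate representation."""
    projected = [f @ proj for f in last_layer_features]  # per-image projection
    stacked = np.stack(projected)                        # splice the frames
    return stacked.mean(axis=0)                          # average pooling
```

Average pooling across frames gives a single representation per video clip, which is what is later matched against the text aggregate representation.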
It should be noted that in step S2 of this embodiment, a learnable text representation is used as the context of each category descriptor on the text encoder side; this does not require experts to design context words and allows the multi-modal dynamic memory large model to learn the relevant context information of each expression during training. Specifically, the input to the text encoder contains two types of data. The first type of input data is the result of a facial-behavior-related description processed by a text segmentation tool (Tokenizer), which segments long text into smaller units and then converts the segmented text into an index sequence. The second type of input data is a class-specific learnable text representation that is independent of each description. By inputting both types of data as prompts to the text encoder, a text aggregate representation representing the visual concept is obtained.
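A minimal sketch of how these two kinds of text input can be combined before entering the text encoder; the embedding table and context vectors here are random stand-ins for the learned parameters, and the function name is illustrative:

```python
import numpy as np

def build_text_prompt(token_ids, embed_table, context, n_ctx=4):
    """Prepend n_ctx learnable context embeddings (the class-specific
    learnable text representation) to the embedded index sequence
    produced by the tokenizer for a facial-behavior description."""
    desc = embed_table[token_ids]           # embed the word-segmented text
    return np.vstack([context[:n_ctx], desc])  # (n_ctx + len(ids), d)
```

The joint sequence is then encoded as a whole, so gradients reaching the context rows learn the per-class context that would otherwise have to be hand-written as prompt words.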
In general, in step S2 of this embodiment, there are two main objectives for prompt learning on the image encoder: 1) to utilize temporal information by introducing inter-frame information exchange; and 2) to provide additional parameters that adapt the image encoder to the distribution of facial expression images. To this end, the invention introduces three additional tokens into each layer of the image encoder: the discriminative information across all frames is summarized by the summary token, the discriminative information of each facial expression image is conveyed to the remaining facial expression images by the local prompt token, and learning capacity is provided by the global prompt token, so that the multi-modal dynamic memory large model adapts to the distribution of facial expression images. In the text encoder part, the description related to facial behavior is processed by the text segmentation tool and then input to the text encoder together with a learnable text representation; compared with simply inputting a facial expression class name, such a description can provide more detailed and accurate information about the specific movements or positions of the muscles involved in each expression, given that different facial expressions have both common and unique or specific properties at the level of local behavior.
In step S2 of this embodiment, as shown in fig. 5, for the facial expression recognition task, the visual aggregate representation extracted by the image encoder and the text aggregate representation extracted by the text encoder are L2-normalized; the visual aggregate representation is then multiplied by the transposed text aggregate representation, and the first prediction probability is output through Softmax.
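This matching step can be sketched as follows; the temperature `tau` is an assumed detail (CLIP-style models typically use one) rather than something specified in the text:

```python
import numpy as np

def first_prediction(visual_agg, text_aggs, tau=1.0):
    """L2-normalize the visual aggregate representation and each
    class's text aggregate representation, multiply by the transposed
    text matrix, and apply Softmax to get the first prediction
    probability over the emotion classes."""
    v = visual_agg / np.linalg.norm(visual_agg)
    t = text_aggs / np.linalg.norm(text_aggs, axis=1, keepdims=True)
    logits = v @ t.T / tau          # cosine similarities as logits
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With the five labels of this embodiment, `text_aggs` would be a 5×d matrix and the output a 5-way probability vector.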
In step S2 of this embodiment, the specific process of generating the visual classification representation by the output module is as follows: the visual aggregate representation v corresponding to one facial expression image is used as the query representation; the feature F_y corresponding to the real emotion state category label y of the facial expression image is retrieved in the dynamic feature space and used as the key representation and the value representation; the query representation is processed by a first projection function to obtain a first projection feature; the key representation is processed by a second projection function and transposed to obtain a second projection feature; the value representation is processed by a third projection function to obtain a third projection feature; cosine similarity is calculated between the first projection feature and the second projection feature to obtain a first similarity feature; the first similarity feature is used as the input variable of the sharpness function, and a sharpness feature weight is obtained after processing by the sharpness function; the third projection feature is weighted by multiplication with the sharpness feature weight to obtain a second similarity feature; and the second similarity feature is processed by a fourth projection function to obtain the element of row y in the visual classification representation M:
M[y] = g4( Rd( cos( g1(v), g2(F_y)^T ) ) · g3(F_y) )
where g1(·), g2(·), g3(·) and g4(·) represent the first, second, third and fourth projection functions, respectively; cos(·,·) represents the cosine similarity calculation; Rd(·) represents the sharpness function; and (·)^T represents the matrix transpose.
The specific processing procedure in each projection function is as follows: the feature input into the projection function passes through a fully-connected layer whose parameters are all randomly initialized, yielding a first intermediate feature; the first intermediate feature is added to the input feature to form a residual connection, yielding a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the output of the projection function.
Further, in this embodiment, the function form of the projection function is:

φ(x) = Norm( x + FC(x) )

wherein x represents the feature input to the projection function; FC(·) represents a fully-connected layer whose parameters are all randomly initialized; and Norm(·) represents the L2 normalization operation along the feature dimension.
Further, in the present embodiment, the function form of the sharpness function is:

δ(w) = exp( −β(1 − w) )

wherein w represents the variable input to the sharpness function, and β represents the hyperparameter that adjusts the sharpness.
After the above process, the visual classification representation Z can be obtained. The visual aggregate representation is then converted into the required classification prediction probability: the visual aggregate representation is multiplied by the transposed visual classification representation, and the second prediction probability is output through Softmax. Finally, as shown in FIG. 6, the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability:

p = λ1 · p(1) + λ2 · p(2)

wherein λ1 and λ2 both represent preset weight hyperparameters.
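The fusion of the two prediction probabilities and the cross-entropy training loss can be sketched as follows. The weight values and function names are illustrative assumptions; the patent only states that λ1 and λ2 are preset hyperparameters.

```python
import numpy as np

def fuse_predictions(p_text, p_vis, lam1=0.5, lam2=0.5):
    """Weighted sum of the first (text-matching) and second (memory-based)
    prediction probabilities; lam1/lam2 are preset weight hyperparameters."""
    return lam1 * p_text + lam2 * p_vis

def cross_entropy(p, y, eps=1e-12):
    """Loss against the true emotion state category labels y."""
    return float(-np.log(p[np.arange(len(y)), y] + eps).mean())

p = fuse_predictions(np.array([[0.7, 0.3]]), np.array([[0.5, 0.5]]))
loss = cross_entropy(p, np.array([0]))   # -log(0.6)
```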
And S3, inputting the facial expression images to be classified into a trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting emotion state type prediction results of the facial expression images.
In step S3 of this embodiment, inspired by knowledge accumulation and recall in the human brain, prior studies have proposed introducing an external memory component that allows historical knowledge to be stored and retrieved to facilitate decision making. Recently, the idea of a memory network has been introduced into the adaptation of CLIP. Therefore, the invention introduces a dynamic feature space to support feature reading and writing operations, thereby optimizing the student facial expression recognition process. During training of the multi-modal dynamic memory large model, the size of the dynamic feature space is consistent with the number of facial expression images used for training; the space stores the visual aggregate representation generated in each iteration round and applies it as a historical feature in the calculation of the next iteration round. After the multi-modal dynamic memory large model is trained, the dynamic feature space at this stage is set to a preset size with a limited length, so that when the visual aggregate representations stored therein reach the limit, the dynamic feature space needs to be updated.
The specific procedure for determining whether the dynamic feature space needs to be updated based on the composite score mentioned in step S3 will be described in detail below.
S31, an entropy value is calculated based on the first prediction probability corresponding to the visual aggregate representation of the facial expression image to be classified, giving the current prediction entropy E_cur; entropy values are likewise calculated based on the first prediction probabilities corresponding to the features in the dynamic feature space, giving historical prediction entropies, and all historical prediction entropies are averaged to give the average historical entropy Ē_hist. The current prediction entropy E_cur, the average historical entropy Ē_hist and a preset constant ε are added to give a confidence total score, and the ratio of a preset confidence score C to the confidence total score is taken as the composite score S of the facial expression image:

S = C / ( E_cur + Ē_hist + ε )
S32, when the dynamic feature space reaches a preset capacity threshold, if the comprehensive score of the facial expression image to be classified is larger than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space, and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
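Steps S31-S32 can be sketched as the following minimal update policy. The class name `DynamicFeatureSpace` and the stored-feature layout are illustrative assumptions; only the composite-score formula and the replace-the-lowest rule come from the description above.

```python
import numpy as np

def composite_score(cur_entropy, hist_entropies, conf=1.0, eps=1e-6):
    """S = C / (E_cur + mean(E_hist) + eps), per step S31."""
    return conf / (cur_entropy + np.mean(hist_entropies) + eps)

class DynamicFeatureSpace:
    """Fixed-capacity memory updated by composite score, per step S32."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.feats, self.scores = [], []

    def maybe_insert(self, feat, score):
        if len(self.feats) < self.capacity:   # below capacity threshold: store
            self.feats.append(feat)
            self.scores.append(score)
            return True
        i = int(np.argmin(self.scores))       # feature with lowest score
        if score >= self.scores[i]:           # new sample outscores it: replace
            self.feats[i], self.scores[i] = feat, score
            return True
        return False                          # otherwise keep space unchanged

mem = DynamicFeatureSpace(capacity=2)
mem.maybe_insert("feat_a", composite_score(1.0, [1.0, 3.0], eps=0.0))  # 1/3
mem.maybe_insert("feat_b", 2.0)
```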
The application effect of the student emotion analysis method based on the multi-modal dynamic memory big model described in S1-S3 is shown below on a specific data set through a specific example, so that the essence of the invention can be understood more conveniently.
Examples
The specific implementation process of the student emotion analysis method based on the multi-mode dynamic memory big model adopted in the embodiment is as described above, and will not be described again.
In order to show the technical effect of the method provided by the invention, the method is verified in an actual teaching scene. The data set selected by the invention is an educational data set focused on online classrooms in an educational environment, aimed at facial expression recognition and emotion analysis of students. The data set contains facial expression videos and facial expression images from different students watching online lessons, capturing various emotional states exhibited by the students during learning, such as anxiety, concentration, confusion, happiness, and depression. Each video and image is annotated to indicate the student's emotional performance at a particular moment, thereby providing training data for subsequent emotion analysis and recognition tasks. In addition, the data set contains background information of the students, such as grade, sex and learning results, for studying the relationship between emotion and learning effect.
The evaluation indexes adopted are UAR and WAR, the two most commonly used in the facial expression recognition field. UAR is the average recall rate calculated over all categories, with each category contributing equally, making it suitable for cases of imbalanced category distribution. A larger UAR value indicates higher accuracy of the model in identifying each emotion category, especially when dealing with categories having few samples. WAR, on the other hand, considers the occurrence frequency of each category in the data set and calculates the recall rate through a weighted average, ensuring that more common categories have a larger influence on the total score. The higher the WAR value, the better the overall performance of the model on the entire data set, especially in accurately identifying high-frequency categories. In general, the larger these two index values, the better the model performs on emotion recognition tasks, and the more effectively it captures and classifies different facial expressions. The comparison of the proposed method with previous dynamic expression recognition methods is shown in Table 1. In Table 1, the implementations of the three baseline models, namely the Transformer-based dynamic facial expression recognition model Former-DFER, the dynamic facial expression recognition model EST based on the convolutional neural network ResNet-18, and the dynamic facial expression recognition model EmoCLIP based on a visual-language big model, all belong to the prior art and are not repeated here.
TABLE 1 facial expression recognition results on educational data sets for different methods
| Method | UAR | WAR |
| --- | --- | --- |
| Former-DFER | 53.69 | 65.70 |
| EST | 53.94 | 65.85 |
| EmoCLIP | 58.04 | 62.12 |
| The invention | 66.89 | 75.64 |
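The UAR and WAR metrics described above can be computed as below. This is a straightforward sketch; the function name `uar_war` is an assumption, and WAR here reduces to frequency-weighted overall accuracy.

```python
import numpy as np

def uar_war(y_true, y_pred):
    """UAR: unweighted mean of per-class recall (each class counts equally).
    WAR: recall weighted by class frequency, i.e. overall accuracy."""
    classes = np.unique(y_true)
    recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in classes]
    uar = float(np.mean(recalls))
    war = float(np.mean(y_pred == y_true))
    return uar, war

# Small example: class 0 recall = 2/3, class 1 recall = 1 -> UAR = 5/6
u, w = uar_war(np.array([0, 0, 0, 1]), np.array([0, 0, 1, 1]))
```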
In addition, the embodiment of the invention also carries out an ablation experiment to better examine the contribution of each module or method proposed by the invention. In Table 2, Backbone represents the original image encoder and text encoder in the contrastive language-image pre-training (CLIP) model; W/Prompting represents adding visual prompt learning to the image encoder and introducing a learnable text representation on the text encoder, on the basis of the Backbone; and W/Memory is the complete multi-modal dynamic memory large model corresponding to the invention.
TABLE 2 ablation experiment results on educational data set of the invention
| Method | UAR | WAR |
| --- | --- | --- |
| Backbone | 23.34 | 20.07 |
| W/Prompting | 60.61 | 72.65 |
| W/Memory | 66.89 | 75.64 |
The ablation experiment results show that the method provided by the invention achieves a significant improvement over prior methods. The main reason is that the proposed method models both the frame-level information and the inter-frame relations of the facial expression video, and utilizes the features of historical test samples retained by the dynamic memory network during testing, enabling the multi-modal dynamic memory large model to further mine potential information in data beyond the training data set.
In addition, the student emotion analysis method based on the multi-mode dynamic memory big model in the above embodiment can be essentially executed by a computer program or a module. Therefore, based on the same inventive concept, another preferred embodiment of the present invention further provides a student emotion analysis system based on a multi-modal dynamic memory big model corresponding to the student emotion analysis method based on a multi-modal dynamic memory big model provided in the above embodiment, as shown in fig. 7, which includes:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory big model for training. In the training process of the multi-modal dynamic memory big model, each facial expression image is first projected by a linear projection layer so that each facial expression image correspondingly obtains a token embedded sequence; the token embedded sequence is added to the spatial position coding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into an image encoder; a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, finally obtaining a visual aggregate representation. A description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence together with a learnable text representation is input into a text encoder to obtain a text aggregate representation. The visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax. The visual aggregate representation and the historical features stored in a dynamic feature space are input together into an output module to obtain a visual classification representation, and the visual aggregate representation is stored as a new feature in the dynamic feature space. The visual aggregate representation is multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax. The weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression images to be classified into the trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting the emotion state type prediction result of the facial expression images.
Similarly, based on the same inventive concept, another preferred embodiment of the present invention further provides a computer electronic device corresponding to the student emotion analysis method based on the multi-modal dynamic memory big model provided in the above embodiment, which includes a memory and a processor;
the memory is used for storing a computer program;
the processor is used for realizing the student emotion analysis method based on the multi-mode dynamic memory big model in the embodiment when executing the computer program.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
It is to be appreciated that the processor described above may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; a Digital Signal Processor (DSP); an Application Specific Integrated Circuit (ASIC); a Field-Programmable Gate Array (FPGA) or other programmable logic device; a discrete gate or transistor logic device; or discrete hardware components.
It should be further noted that, for convenience and brevity of description, specific working processes of the system described above may refer to corresponding processes in the foregoing method embodiments, which are not described herein again. In the embodiments of the present application, the division of steps or modules in the system and the method is only one logic function division, and other division manners may be implemented in actual implementation, for example, multiple modules or steps may be combined or may be integrated together, and one module or step may also be split.
In addition, the data related to the invention is fully authorized to be acquired, and the collection, the use and the processing of the related information are required to comply with the related laws and regulations and standards of the related countries and regions.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.