Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a student emotion analysis method and system based on a multi-modal dynamic memory large model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the invention provides a student emotion analysis method based on a multi-modal dynamic memory large model, which comprises the following steps:
S1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels;
S2, inputting the facial expression images into a multi-modal dynamic memory large model for training; in the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into an image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained; a description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation; the visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax; the visual aggregate representation is input into an output module together with the historical features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature; the visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model;
And S3, inputting the facial expression images to be classified into the trained multi-modal dynamic memory large model; after the dynamic feature space reaches a preset capacity threshold, calculating the comprehensive score corresponding to each facial expression image to be classified, judging whether the dynamic feature space needs to be updated based on the comprehensive score, and finally outputting the emotion state category prediction result of the facial expression image.
On the basis of the scheme, each step can be realized in the following preferred specific mode.
As a preferable aspect of the first aspect, in step S1, the true emotion state category label includes five categories, namely happy, confused, anxious, attentive, and depressed.
As a preferred aspect of the first aspect, in step S2, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows:
S21, obtaining the classification token representation learned by the (q-1)-th coding layer of the image encoder; multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation; performing layer normalization on the projected classification token representation and then applying a multi-head self-attention mechanism to obtain a processed classification token representation; and adding the processed classification token representation and the projected classification token representation to obtain the summary token of the q-th coding layer;
S22, adding the classification token representation and a randomly initialized learnable vector to obtain a first intermediate token representation; performing layer normalization on the first intermediate token representation and then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation; and adding the processed first intermediate token representation and the first intermediate token representation to obtain the local prompt token of the q-th coding layer;
S23, randomly initializing the global prompt token of the q-th coding layer; obtaining the image features output by the (q-1)-th coding layer of the image encoder, and splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation; performing layer normalization on the original visual input representation and then applying the pre-trained self-attention mechanism to obtain a processed visual input representation; adding the processed visual input representation and the original visual input representation to obtain an original visual output representation; deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation; performing layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features; and adding the initial image features and the processed visual output representation as the image features output by the q-th coding layer of the image encoder.
As a preferred embodiment of the first aspect, in step S2, the specific process of obtaining the visual aggregate representation is as follows: the image feature output by the last coding layer of the image encoder is multiplied by a linear projection matrix to obtain the projected image feature, and the projected image features corresponding to each facial expression image are spliced and then subjected to an average pooling operation to output the visual aggregate representation.
As a preference of the first aspect, in step S2, the specific process of generating the visual classification representation by the output module is as follows: the visual aggregate representation corresponding to one facial expression image is used as the query representation; the feature corresponding to the real emotion state category label y of the facial expression image is retrieved in the dynamic feature space and used as the key representation and the value representation; the query representation is processed by a first projection function to obtain a first projection feature; the key representation is processed by a second projection function and transposed to obtain a second projection feature; the value representation is processed by a third projection function to obtain a third projection feature; cosine similarity is calculated between the first projection feature and the second projection feature to obtain a first similarity feature; the first similarity feature is used as the input variable of a sharpness function, and a sharpness feature weight is obtained after processing by the sharpness function; the third projection feature is weighted by multiplication with the sharpness feature weight to obtain a second similarity feature; and the second similarity feature is processed by a fourth projection function to obtain the element of row y in the visual classification representation, finally forming the visual classification representation.
The specific processing procedure in each projection function is as follows: the feature input into the projection function is passed through a fully connected layer whose parameters are all randomly initialized to obtain a first intermediate feature; the first intermediate feature is added to the feature input into the projection function to form a residual connection, obtaining a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the feature output by the projection function.
As a preference of the above first aspect, the sharpness function Rd(·) has the functional form:
Rd(x) = exp(-w_r(1-x))
where x represents the variable input to the sharpness function and w_r represents the hyper-parameter that adjusts the sharpness.
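As an illustration, the sharpness function can be sketched in a few lines of Python; the default value of w_r below is an arbitrary choice for demonstration, not a value specified by the method:

```python
import math

def sharpness(x: float, w_r: float = 8.0) -> float:
    """Sharpness function Rd(x) = exp(-w_r * (1 - x)).

    x is typically a cosine similarity in [-1, 1]; w_r is the
    hyper-parameter that adjusts sharpness (8.0 is illustrative).
    Larger w_r suppresses low-similarity matches more strongly.
    """
    return math.exp(-w_r * (1.0 - x))
```

Note that Rd(1) = 1 regardless of w_r, so a perfect match keeps its full weight while weaker matches decay exponentially.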
As a preferred aspect of the above first aspect, in step S3, a specific process of determining whether the dynamic feature space needs to be updated based on the composite score is as follows:
S31, calculating an entropy value based on the first prediction probability corresponding to the visual aggregate representation of the facial expression image to be classified, as the current prediction entropy; calculating entropy values based on the first prediction probabilities corresponding to each feature in the dynamic feature space, as the historical prediction entropies; averaging all the historical prediction entropies to obtain the average historical entropy; adding the current prediction entropy, the average historical entropy and a preset constant to obtain a confidence total score; and taking the ratio of a preset confidence score to the confidence total score as the comprehensive score of the facial expression image;
S32, when the dynamic feature space reaches the preset capacity threshold: if the comprehensive score of the facial expression image to be classified is greater than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space; and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
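Steps S31 and S32 can be sketched as follows. This is a minimal Python illustration; the constant `c` and confidence score `s` are hypothetical stand-ins for the preset values described above:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (lower = more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def composite_score(current_probs, history_probs_list, c=1.0, s=1.0):
    """S31 sketch: s / (current entropy + mean historical entropy + c).

    c and s stand in for the preset constant and preset confidence
    score; confident (low-entropy) predictions get higher scores.
    """
    h_cur = entropy(current_probs)
    h_hist = sum(entropy(p) for p in history_probs_list) / len(history_probs_list)
    return s / (h_cur + h_hist + c)

def maybe_update(feature_space, new_feature, new_score, capacity):
    """S32 sketch: once the space is full, replace the lowest-scoring
    feature iff the new score is >= that minimum. Returns True if the
    space was modified."""
    if len(feature_space) < capacity:
        feature_space.append((new_feature, new_score))
        return True
    idx = min(range(len(feature_space)), key=lambda i: feature_space[i][1])
    if new_score >= feature_space[idx][1]:
        feature_space[idx] = (new_feature, new_score)
        return True
    return False
```

The design keeps the memory bounded while biasing it toward confidently predicted samples, since a high-entropy candidate scores below the current minimum and is discarded.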
In a second aspect, the present invention provides a student emotion analysis system based on a multimodal dynamic memory big model, comprising:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory large model for training. In the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into the image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained; a description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation; the visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax; the visual aggregate representation is input into the output module together with the historical features stored in the dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature; the visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression images to be classified into the trained multi-modal dynamic memory large model; after the dynamic feature space reaches a preset capacity threshold, the comprehensive score corresponding to each facial expression image to be classified is calculated, whether the dynamic feature space needs to be updated is judged based on the comprehensive score, and the emotion state category prediction result of the facial expression image is finally output.
In a third aspect, the present invention provides a computer electronic device comprising a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to implement a student emotion analysis method based on a multi-modal dynamic memory big model according to any one of the first aspect when executing the computer program.
Compared with the traditional student emotion analysis method, the invention has the following beneficial effects:
Compared with traditional visual learning strategies, the visual prompt learning method can capture and understand visual information more effectively, and can be widely applied to various downstream computer vision tasks, such as student emotion analysis. In terms of text prompting, the present invention introduces a learnable text representation rather than manually designed prompt words. In addition, the invention designs a dynamic feature space that preserves the features of historical test data during testing. Finally, the multi-modal dynamic memory large model can further mine the latent information of facial expression images, thereby further improving model performance on the emotion analysis task of student facial expression recognition.
Detailed Description
In order that the above objects, features and advantages of the invention may be readily understood, a more particular description of the invention is rendered below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may, however, be embodied in many other forms than those described herein, and similar modifications may be made by those skilled in the art without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below. The technical features of the embodiments of the invention can be combined with one another provided there is no mutual conflict.
In the description of the present invention, it should be understood that the terms "first" and "second" are used solely for the purpose of distinguishing between the descriptions and not necessarily for the purpose of indicating or implying a relative importance or implicitly indicating the number of features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
The multi-modal large model has notable advantages in addressing the accuracy of emotion recognition and the computing capacity and response speed required to process large amounts of data in real time, and it provides a novel visual representation learning approach. It can capture and understand visual information more effectively than traditional visual learning strategies. First, the multi-modal large model combines image and text data, so information can be extracted from multi-modal data and the accuracy of emotion recognition is improved. For example, the multi-modal large model can simultaneously analyze students' facial expressions, body language and text content in classroom interactions, providing more comprehensive emotion recognition. Such multi-modal analysis captures student emotion states more accurately than a single modality because it takes more contextual information and behavioral characteristics into account. In terms of computing power and response speed, the multi-modal large model relies on an advanced neural network architecture and efficient parallel processing techniques. Existing multi-modal large models are generally based on the Transformer architecture, which has strong parallel processing capability and can process large amounts of data simultaneously. This enables the multi-modal large model to analyze student emotion states in real time and provide feedback in a short time. The multi-modal visual-language large model also has adaptive learning capability: it can continuously learn and optimize from new data, improving both the accuracy of emotion recognition and its real-time processing capability. Through continuous interaction with teachers and students, the multi-modal large model can gradually learn and adapt to the emotion expression patterns of different students, improving the personalization and accuracy of emotion recognition.
Therefore, through multi-modal data fusion, efficient parallel processing and adaptive learning, the multi-modal large model can significantly improve the accuracy of emotion analysis for students in class. The latest multi-modal large models, such as the contrastive language-image pre-training model (Contrastive Language-Image Pre-training, CLIP) and the contrastive-learning-based dual-encoder model ALIGN, collect 400 million and 1.8 billion image-text pairs, respectively, from the Internet. These multi-modal large models have then been widely applied to a variety of downstream computer vision tasks and exhibit excellent performance.
In this embodiment, the contrastive language-image pre-training model is specifically described as an example. CLIP uses a contrastive learning approach to understand and correlate visual and linguistic information. When performing emotion prediction, CLIP first converts the input image and text data into high-dimensional feature vectors, extracting key features through the visual and language encoders. These feature vectors are then projected into the same embedding space, where the cosine similarity of the image and text is calculated. By optimizing a contrastive loss function, CLIP brings correctly matched image-text pairs closer in the embedding space and pushes incorrectly matched pairs farther apart. As previously mentioned, the present invention contemplates the use of the contrastive language-image pre-training model in multi-modal emotion analysis. In order to effectively apply the contrastive language-image pre-training model to the processing of student facial expression data, the present invention takes two important aspects into account. In the first aspect, the generalization capability of the backbone of the contrastive language-image pre-training model needs to be preserved, namely, the visual representation and text representation obtained from the original image encoder and text encoder in CLIP are preserved; in the second aspect, the method must be able to adapt effectively to the field of student emotion analysis. For the first aspect, the invention provides a novel text prompt learning method for fine-tuning the text encoder, and adopts the visual prompt learning method for fine-tuning the image encoder, achieving competitive strongly supervised performance so as to better adapt to the emotion analysis task of student facial expression recognition. By performing prompt fine-tuning on the CLIP model, its adaptability and accuracy in student emotion analysis can be significantly improved.
The fine-tuned CLIP model can more accurately capture students' emotion changes, such as happiness, confusion, anxiety, concentration and depression, and provide timely feedback. In this way, teachers can flexibly adjust their teaching methods according to students' instantaneous emotion changes, thereby providing more customized support and assistance. Fine-tuning also enables CLIP to better suit student emotion expression under different cultural and personality backgrounds, further improving the accuracy and universality of emotion recognition. For the second aspect, the invention additionally introduces a learnable output module to adapt to the fine-tuned CLIP model.
As shown in FIG. 1, in a preferred implementation manner of the present invention, the student emotion analysis method based on the multi-modal dynamic memory big model includes the following steps S1-S3. The following describes the specific implementation procedure.
S1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels.
In this embodiment, facial expression videos of students watching online education courses are acquired from an online education platform. Each facial expression video of length N is sampled frame by frame at fixed intervals, and each extracted frame is a static facial expression image of size Y×U, where Y and U are the height and the width of the facial expression image, respectively.
In this embodiment, the emotion states of students watching online education courses are described by several category labels; these labels help teachers better understand the emotions and responses of students, so as to effectively adjust teaching strategies and provide support. The real emotion state category labels can be summarized as follows: happy, meaning the student feels satisfied and pleased in learning, expressed as smiling, active participation and confidence; confused, meaning the student faces comprehension obstacles with a concept or task, expressed as frowning, frequent questioning or seeking additional guidance; anxious, reflecting stress and worry, expressed as restlessness, wandering eyes or hurried speech; attentive, indicating the student is focused, presenting a still posture and a thoughtful expression; and depressed, arising from frustration, expressed as low mood, lack of motivation and reduced participation. The real emotion state category labels not only help teachers capture students' emotion states in time, but also help formulate personalized teaching support measures, so as to improve students' learning experience and outcomes and make online education more attentive and humane.
S2, inputting the facial expression images into the multi-modal dynamic memory large model for training. In the training process of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence; the input token sequence is input into the image encoder, a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, and a visual aggregate representation is finally obtained (as shown in FIG. 2). A description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregate representation (as shown in FIG. 3). The visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax. The visual aggregate representation is input into the output module together with the historical features stored in the dynamic feature space to obtain a visual classification representation, and the visual aggregate representation is stored in the dynamic feature space as a new feature. The visual aggregate representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax. The weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the real emotion state category label to update the parameters of the multi-modal dynamic memory large model.
In step S2 of this embodiment, non-overlapping segmentation is performed on each facial expression image, that is, each facial expression image is divided into a plurality of square patches with a size of p×p. After the square patches are obtained, random scale cropping, horizontal flipping, random rotation and color jittering are applied to avoid over-fitting. The processed square patches are flattened into a set of vectors, which are then projected using a linear projection layer to form the token embedding sequence.
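The patch splitting and linear projection described above can be sketched in NumPy as follows; the random projection matrix here merely stands in for the learned linear projection layer, and the augmentations are omitted:

```python
import numpy as np

def to_token_sequence(image, p, d_model, rng=None):
    """Split an H x W x C image into non-overlapping p x p patches,
    flatten each patch, and apply a linear projection to obtain the
    token embedding sequence (random weights stand in for the learned
    projection layer)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image must tile evenly into p x p patches"
    # Reorder axes so each patch becomes one flattened row vector.
    patches = (
        image.reshape(H // p, p, W // p, p, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, p * p * C)
    )
    W_proj = rng.standard_normal((p * p * C, d_model)) * 0.02
    return patches @ W_proj  # shape: (num_patches, d_model)
```

For a Y×U frame this yields (Y/p)·(U/p) tokens, to which the spatial position encoding and classification token are then added.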
In step S2 of this embodiment, as shown in fig. 4, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows:
S21, obtaining the classification token representation learned by the (q-1)-th coding layer of the image encoder; multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation; performing layer normalization on the projected classification token representation and then applying a multi-head self-attention mechanism to obtain a processed classification token representation; and adding the processed classification token representation and the projected classification token representation to obtain the summary token of the q-th coding layer;
S22, adding the classification token representation and a randomly initialized learnable vector to obtain a first intermediate token representation; performing layer normalization on the first intermediate token representation and then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation; and adding the processed first intermediate token representation and the first intermediate token representation to obtain the local prompt token of the q-th coding layer;
S23, randomly initializing the global prompt token of the q-th coding layer; obtaining the image features output by the (q-1)-th coding layer of the image encoder, and splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation; performing layer normalization on the original visual input representation and then applying the pre-trained self-attention mechanism to obtain a processed visual input representation; adding the processed visual input representation and the original visual input representation to obtain an original visual output representation; deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation; performing layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features; and adding the initial image features and the processed visual output representation as the image features output by the q-th coding layer of the image encoder.
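The three-step prompt fine-tuning above can be sketched roughly as follows. This is a deliberately simplified NumPy illustration: single-head attention with identity query/key/value projections stands in for the multi-head and pre-trained attention blocks, and small random matrices stand in for all learned parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (feature) dimension."""
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections,
    a stand-in for the multi-head / pre-trained attention blocks."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ x

def encoding_layer(features, cls_token, d, rng):
    """Hypothetical sketch of one prompt-tuned coding layer."""
    # Step 1: summary token from the previous layer's classification token.
    proj = rng.standard_normal((d, d)) * 0.02
    cls_p = cls_token @ proj
    summary = self_attention(layer_norm(cls_p[None]))[0] + cls_p
    # Step 2: local prompt token = cls token + randomly initialized
    # learnable vector, normalized, attended, with residual add.
    inter = cls_token + rng.standard_normal(d) * 0.02
    local = self_attention(layer_norm(inter[None]))[0] + inter
    # Step 3: splice features with the prompt tokens, attend with
    # residual, then delete the prompt tokens again.
    glob = rng.standard_normal(d) * 0.02  # global prompt token
    x = np.vstack([features, local, glob, summary])
    x = self_attention(layer_norm(x)) + x
    x = x[: len(features)]  # drop local / global / summary tokens
    # Layer norm + toy feedforward, then residual add.
    ffn = layer_norm(x) @ (rng.standard_normal((d, d)) * 0.02)
    return ffn + x
```

The key structural point the sketch preserves is that the prompt tokens influence the image features only through the attention mixing and are stripped before the feedforward stage, so the layer's output shape matches its input.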
In step S2 of this embodiment, the specific process of obtaining the visual aggregate representation is as follows: the image feature output by the last coding layer of the image encoder is multiplied by a linear projection matrix to obtain the projected image feature, and the projected image features corresponding to each facial expression image are spliced and then subjected to an average pooling operation to output the visual aggregate representation.
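The aggregation step just described can be sketched as follows; for simplicity each per-image feature is assumed to already be a single vector, and the projection matrix is passed in explicitly:

```python
import numpy as np

def visual_aggregate(last_layer_features, proj):
    """Project each image's feature vector with the linear projection
    matrix, splice (stack) the results, and average-pool across images
    to obtain the visual aggregate representation."""
    projected = [f @ proj for f in last_layer_features]  # per-image projection
    stacked = np.stack(projected)                        # splice the frames
    return stacked.mean(axis=0)                          # average pooling
```

Average pooling across frames gives a single representation per video clip, which is what is later matched against the text aggregate representation.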
It should be noted that in step S2 of this embodiment, a learnable text representation is used as the context of each category descriptor on the text encoder side; this does not require experts to design context words and allows the multi-modal dynamic memory large model to learn the relevant context information of each expression during training. Specifically, the input to the text encoder contains two types of data. The first type of input data is the result of a facial-behavior-related description processed by a text segmentation tool (Tokenizer), which segments long text into smaller units and then converts the segmented text into an index sequence. The second type of input data is a class-specific learnable text representation that is independent of each description. By inputting both types of data as prompts to the text encoder, a text aggregate representation representing the visual concept is obtained.
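A minimal sketch of how these two kinds of text input can be combined before entering the text encoder; the embedding table and context vectors here are random stand-ins for the learned parameters, and the function name is illustrative:

```python
import numpy as np

def build_text_prompt(token_ids, embed_table, context, n_ctx=4):
    """Prepend n_ctx learnable context embeddings (the class-specific
    learnable text representation) to the embedded index sequence
    produced by the tokenizer for a facial-behavior description."""
    desc = embed_table[token_ids]           # embed the word-segmented text
    return np.vstack([context[:n_ctx], desc])  # (n_ctx + len(ids), d)
```

The joint sequence is then encoded as a whole, so gradients reaching the context rows learn the per-class context that would otherwise have to be hand-written as prompt words.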
In general, in step S2 of this embodiment, there are two main objectives for prompt learning on the image encoder: 1) to utilize temporal information by introducing inter-frame information exchange; and 2) to provide additional parameters that adapt the image encoder to the distribution of facial expression images. To this end, the invention introduces three additional tokens into each layer of the image encoder: the discriminative information across all frames is summarized by the summary token, the discriminative information of each facial expression image is conveyed to the remaining facial expression images by the local prompt token, and learning capacity is provided by the global prompt token, so that the multi-modal dynamic memory large model adapts to the distribution of facial expression images. In the text encoder part, the description related to facial behavior is processed by the text segmentation tool and then input to the text encoder together with a learnable text representation; compared with simply inputting a facial expression class name, such a description can provide more detailed and accurate information about the specific movements or positions of the muscles involved in each expression, given that different facial expressions have both common and unique or specific properties at the level of local behavior.
In step S2 of this embodiment, as shown in fig. 5, for the facial expression recognition task, the visual aggregate representation extracted by the image encoder and the text aggregate representation extracted by the text encoder are L2-normalized; the visual aggregate representation is then multiplied by the transposed text aggregate representation, and the first prediction probability is output through Softmax.
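This matching step can be sketched as follows; the temperature `tau` is an assumed detail (CLIP-style models typically use one) rather than something specified in the text:

```python
import numpy as np

def first_prediction(visual_agg, text_aggs, tau=1.0):
    """L2-normalize the visual aggregate representation and each
    class's text aggregate representation, multiply by the transposed
    text matrix, and apply Softmax to get the first prediction
    probability over the emotion classes."""
    v = visual_agg / np.linalg.norm(visual_agg)
    t = text_aggs / np.linalg.norm(text_aggs, axis=1, keepdims=True)
    logits = v @ t.T / tau          # cosine similarities as logits
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With the five labels of this embodiment, `text_aggs` would be a 5×d matrix and the output a 5-way probability vector.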
In step S2 of this embodiment, the specific process of generating the visual classification representation by the output module is as follows: the visual aggregate representation v corresponding to one facial expression image is used as the query representation; the feature F_y corresponding to the real emotion state category label y of the facial expression image is retrieved in the dynamic feature space and used as the key representation and the value representation; the query representation is processed by a first projection function to obtain a first projection feature; the key representation is processed by a second projection function and transposed to obtain a second projection feature; the value representation is processed by a third projection function to obtain a third projection feature; cosine similarity is calculated between the first projection feature and the second projection feature to obtain a first similarity feature; the first similarity feature is used as the input variable of the sharpness function, and a sharpness feature weight is obtained after processing by the sharpness function; the third projection feature is weighted by multiplication with the sharpness feature weight to obtain a second similarity feature; and the second similarity feature is processed by a fourth projection function to obtain the element of row y in the visual classification representation M:
M[y] = g4( Rd( cos( g1(v), g2(F_y)^T ) ) · g3(F_y) )
where g1(·), g2(·), g3(·) and g4(·) represent the first, second, third and fourth projection functions, respectively; cos(·,·) represents the cosine similarity calculation; Rd(·) represents the sharpness function; and (·)^T represents the matrix transpose.
The specific processing procedure in each projection function is as follows: the feature input into the projection function passes through a fully-connected layer whose parameters are all randomly initialized, yielding a first intermediate feature; the first intermediate feature is added to the input feature to form a residual connection, yielding a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the output of the projection function.
Further, in this embodiment, the function form of the projection function is:

φ(x) = Norm( x + FC(x) )

wherein x represents the feature input to the projection function; FC(·) represents a fully-connected layer whose parameters are all randomly initialized; and Norm(·) represents the L2 normalization operation along the feature dimension.
Further, in the present embodiment, the function form of the sharpness function is:

δ(w) = exp( −β(1 − w) )

wherein w represents the variable input to the sharpness function, and β represents the hyperparameter that adjusts the sharpness.
After the above process, the visual classification representation Z can be obtained. The visual aggregate representation is then converted into the required classification prediction probability: the visual aggregate representation is multiplied by the transposed visual classification representation, and the second prediction probability is output through Softmax. Finally, as shown in FIG. 6, the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability:

p = λ1 · p(1) + λ2 · p(2)

wherein λ1 and λ2 both represent preset weight hyperparameters.
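The fusion of the two prediction probabilities and the cross-entropy training loss can be sketched as follows. The weight values and function names are illustrative assumptions; the patent only states that λ1 and λ2 are preset hyperparameters.

```python
import numpy as np

def fuse_predictions(p_text, p_vis, lam1=0.5, lam2=0.5):
    """Weighted sum of the first (text-matching) and second (memory-based)
    prediction probabilities; lam1/lam2 are preset weight hyperparameters."""
    return lam1 * p_text + lam2 * p_vis

def cross_entropy(p, y, eps=1e-12):
    """Loss against the true emotion state category labels y."""
    return float(-np.log(p[np.arange(len(y)), y] + eps).mean())

p = fuse_predictions(np.array([[0.7, 0.3]]), np.array([[0.5, 0.5]]))
loss = cross_entropy(p, np.array([0]))   # -log(0.6)
```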
And S3, inputting the facial expression images to be classified into a trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting emotion state type prediction results of the facial expression images.
In step S3 of this embodiment, inspired by knowledge accumulation and recall in the human brain, prior studies have proposed introducing an external memory component that allows historical knowledge to be stored and retrieved to facilitate decision making. Recently, the idea of a memory network has been introduced into the adaptation of CLIP. Therefore, the invention introduces a dynamic feature space to support feature reading and writing operations, thereby optimizing the student facial expression recognition process. During training of the multi-modal dynamic memory large model, the size of the dynamic feature space is consistent with the number of facial expression images used for training; the space stores the visual aggregate representation generated in each iteration round and applies it as a historical feature in the calculation of the next iteration round. After the multi-modal dynamic memory large model is trained, the dynamic feature space at this stage is set to a preset size with a limited length, so that when the visual aggregate representations stored therein reach the limit, the dynamic feature space needs to be updated.
The specific procedure for determining whether the dynamic feature space needs to be updated based on the composite score mentioned in step S3 will be described in detail below.
S31, an entropy value is calculated based on the first prediction probability corresponding to the visual aggregate representation of the facial expression image to be classified, giving the current prediction entropy E_cur; entropy values are likewise calculated based on the first prediction probabilities corresponding to the features in the dynamic feature space, giving historical prediction entropies, and all historical prediction entropies are averaged to give the average historical entropy Ē_hist. The current prediction entropy E_cur, the average historical entropy Ē_hist and a preset constant ε are added to give a confidence total score, and the ratio of a preset confidence score C to the confidence total score is taken as the composite score S of the facial expression image:

S = C / ( E_cur + Ē_hist + ε )
S32, when the dynamic feature space reaches a preset capacity threshold, if the comprehensive score of the facial expression image to be classified is larger than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space, and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
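Steps S31-S32 can be sketched as the following minimal update policy. The class name `DynamicFeatureSpace` and the stored-feature layout are illustrative assumptions; only the composite-score formula and the replace-the-lowest rule come from the description above.

```python
import numpy as np

def composite_score(cur_entropy, hist_entropies, conf=1.0, eps=1e-6):
    """S = C / (E_cur + mean(E_hist) + eps), per step S31."""
    return conf / (cur_entropy + np.mean(hist_entropies) + eps)

class DynamicFeatureSpace:
    """Fixed-capacity memory updated by composite score, per step S32."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.feats, self.scores = [], []

    def maybe_insert(self, feat, score):
        if len(self.feats) < self.capacity:   # below capacity threshold: store
            self.feats.append(feat)
            self.scores.append(score)
            return True
        i = int(np.argmin(self.scores))       # feature with lowest score
        if score >= self.scores[i]:           # new sample outscores it: replace
            self.feats[i], self.scores[i] = feat, score
            return True
        return False                          # otherwise keep space unchanged

mem = DynamicFeatureSpace(capacity=2)
mem.maybe_insert("feat_a", composite_score(1.0, [1.0, 3.0], eps=0.0))  # 1/3
mem.maybe_insert("feat_b", 2.0)
```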
The application effect of the student emotion analysis method based on the multi-modal dynamic memory big model described in S1-S3 is shown below on a specific data set through a specific example, so that the essence of the invention can be understood more conveniently.
Examples
The specific implementation process of the student emotion analysis method based on the multi-mode dynamic memory big model adopted in the embodiment is as described above, and will not be described again.
In order to show the technical effect of the method provided by the invention, the method is verified in an actual teaching scene. The data set selected by the invention is an educational data set focused on online classrooms in an educational environment, aimed at facial expression recognition and emotion analysis of students. The data set contains facial expression videos and facial expression images from different students watching online lessons, capturing various emotional states exhibited by the students during learning, such as anxiety, concentration, confusion, happiness, and depression. Each video and image is annotated to indicate the student's emotional performance at a particular moment, thereby providing training data for subsequent emotion analysis and recognition tasks. In addition, the data set contains background information of the students, such as grade, sex and learning results, for studying the relationship between emotion and learning effect.
The evaluation indexes adopted are UAR and WAR, the two most commonly used in the facial expression recognition field. UAR is the average recall rate calculated over all categories, with each category contributing equally, making it suitable for cases of imbalanced category distribution. A larger UAR value indicates higher accuracy of the model in identifying each emotion category, especially when dealing with categories having few samples. WAR, on the other hand, considers the occurrence frequency of each category in the data set and calculates the recall rate through a weighted average, ensuring that more common categories have a larger influence on the total score. The higher the WAR value, the better the overall performance of the model on the entire data set, especially in accurately identifying high-frequency categories. In general, the larger these two index values, the better the model performs on emotion recognition tasks, and the more effectively it captures and classifies different facial expressions. The comparison of the proposed method with previous dynamic expression recognition methods is shown in Table 1. In Table 1, the implementations of the three baseline models, namely the Transformer-based dynamic facial expression recognition model Former-DFER, the dynamic facial expression recognition model EST based on the convolutional neural network ResNet-18, and the dynamic facial expression recognition model EmoCLIP based on a visual-language big model, all belong to the prior art and are not repeated here.
TABLE 1 facial expression recognition results on educational data sets for different methods
| Method | UAR | WAR |
| --- | --- | --- |
| Former-DFER | 53.69 | 65.70 |
| EST | 53.94 | 65.85 |
| EmoCLIP | 58.04 | 62.12 |
| The invention | 66.89 | 75.64 |
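The UAR and WAR metrics described above can be computed as below. This is a straightforward sketch; the function name `uar_war` is an assumption, and WAR here reduces to frequency-weighted overall accuracy.

```python
import numpy as np

def uar_war(y_true, y_pred):
    """UAR: unweighted mean of per-class recall (each class counts equally).
    WAR: recall weighted by class frequency, i.e. overall accuracy."""
    classes = np.unique(y_true)
    recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in classes]
    uar = float(np.mean(recalls))
    war = float(np.mean(y_pred == y_true))
    return uar, war

# Small example: class 0 recall = 2/3, class 1 recall = 1 -> UAR = 5/6
u, w = uar_war(np.array([0, 0, 0, 1]), np.array([0, 0, 1, 1]))
```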
In addition, the embodiment of the invention also carries out an ablation experiment to better examine the contribution of each module or method proposed by the invention. In Table 2, Backbone represents the original image encoder and text encoder in the contrastive language-image pre-training (CLIP) model; W/Prompting represents adding visual prompt learning to the image encoder and introducing a learnable text representation on the text encoder, on the basis of the Backbone; and W/Memory is the complete multi-modal dynamic memory large model corresponding to the invention.
TABLE 2 ablation experiment results on educational data set of the invention
| Method | UAR | WAR |
| --- | --- | --- |
| Backbone | 23.34 | 20.07 |
| W/Prompting | 60.61 | 72.65 |
| W/Memory | 66.89 | 75.64 |
The ablation experiment results show that the method provided by the invention achieves a significant improvement over prior methods. The main reason is that the proposed method models both the frame-level information and the inter-frame relations of the facial expression video, and utilizes the features of historical test samples retained by the dynamic memory network during testing, enabling the multi-modal dynamic memory large model to further mine potential information in data beyond the training data set.
In addition, the student emotion analysis method based on the multi-mode dynamic memory big model in the above embodiment can be essentially executed by a computer program or a module. Therefore, based on the same inventive concept, another preferred embodiment of the present invention further provides a student emotion analysis system based on a multi-modal dynamic memory big model corresponding to the student emotion analysis method based on a multi-modal dynamic memory big model provided in the above embodiment, as shown in fig. 7, which includes:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory big model for training. In the training process of the multi-modal dynamic memory big model, each facial expression image is first projected by a linear projection layer so that each facial expression image correspondingly obtains a token embedded sequence; the token embedded sequence is added to the spatial position coding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into an image encoder; a visual prompt learning method is adopted to perform prompt fine-tuning on each coding layer of the image encoder, finally obtaining a visual aggregate representation. A description related to facial behaviors is processed by a text segmentation tool to form a word-segmented text sequence, and the word-segmented text sequence together with a learnable text representation is input into a text encoder to obtain a text aggregate representation. The visual aggregate representation is multiplied by the transposed text aggregate representation, and a first prediction probability is output through Softmax. The visual aggregate representation and the historical features stored in a dynamic feature space are input together into an output module to obtain a visual classification representation, and the visual aggregate representation is stored as a new feature in the dynamic feature space. The visual aggregate representation is multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax. The weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression images to be classified into the trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting the emotion state type prediction result of the facial expression images.
Similarly, based on the same inventive concept, another preferred embodiment of the present invention further provides a computer electronic device corresponding to the student emotion analysis method based on the multi-modal dynamic memory big model provided in the above embodiment, which includes a memory and a processor;
the memory is used for storing a computer program;
the processor is used for realizing the student emotion analysis method based on the multi-mode dynamic memory big model in the embodiment when executing the computer program.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
It is to be appreciated that the processor described above may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; a Digital Signal Processor (DSP); an Application Specific Integrated Circuit (ASIC); a Field-Programmable Gate Array (FPGA) or other programmable logic device; a discrete gate or transistor logic device; or discrete hardware components.
It should be further noted that, for convenience and brevity of description, specific working processes of the system described above may refer to corresponding processes in the foregoing method embodiments, which are not described herein again. In the embodiments of the present application, the division of steps or modules in the system and the method is only one logic function division, and other division manners may be implemented in actual implementation, for example, multiple modules or steps may be combined or may be integrated together, and one module or step may also be split.
In addition, the data related to the invention is fully authorized to be acquired, and the collection, the use and the processing of the related information are required to comply with the related laws and regulations and standards of the related countries and regions.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.