
CN119323818A - Student emotion analysis method and system based on multi-mode dynamic memory big model - Google Patents

Student emotion analysis method and system based on multi-mode dynamic memory big model Download PDF

Info

Publication number
CN119323818A
Authority
CN
China
Prior art keywords
representation, visual, facial expression, token, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411878512.8A
Other languages
Chinese (zh)
Other versions
CN119323818B (en)
Inventor
黄昌勤
苏一飞
蒋云良
朱雁来
蒋凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Zhejiang Normal University CJNU
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Zhejiang Normal University CJNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd, Zhejiang Normal University CJNU filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202411878512.8A priority Critical patent/CN119323818B/en
Publication of CN119323818A publication Critical patent/CN119323818A/en
Application granted granted Critical
Publication of CN119323818B publication Critical patent/CN119323818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/175Static expression
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses a student emotion analysis method and system based on a multimodal dynamic memory large model, belonging to the field of multimodal emotion analysis. The method first obtains facial expression images of different students watching online education courses, together with their corresponding true emotion state category labels, to form a training data set, and trains the multimodal dynamic memory large model on that data set. During training, prompt fine-tuning is applied to the image encoder and the text encoder, the visual aggregation representations produced during historical training are stored in a dynamic feature space, and the model parameters are optimized with a cross-entropy loss. Finally, the facial expression image to be classified is input into the trained multimodal dynamic memory large model, which outputs the predicted emotion state category of the image. The method can be effectively applied in education, helping teachers understand students' emotional states and thereby improving student engagement and learning outcomes.

Description

Student emotion analysis method and system based on multi-mode dynamic memory big model
Technical Field
The invention belongs to the field of multi-mode emotion analysis, and particularly relates to a student emotion analysis method and system based on a multi-mode dynamic memory big model.
Background
Student emotion analysis is a method for evaluating and understanding the emotions and emotional states of students in a virtual learning environment by technical means. It uses artificial intelligence and machine learning techniques to acquire information from various data sources, such as students' facial expressions, voice intonation, text input, mouse clicks and browsing habits. These data are processed by complex algorithms that can reveal students' emotional reactions during learning, such as happiness, confusion, anxiety, concentration and depression. Emotion analysis aims to help teachers and educational institutions better understand students' emotional dynamics so that they can intervene and offer support in time. For example, when the system detects that a student appears confused or frustrated, the teacher can provide additional explanation or resources to help them overcome the difficulty. Through emotion analysis, an online classroom can not only meet students' individual needs more accurately, but also enhance their learning experience and motivation, making distance education more humane and effective and offering great potential for improving the quality of online education.
Student emotion analysis faces many challenges and technical problems. First, the accuracy of emotion recognition is a major challenge. Students' emotional expressions differ because of cultural, personality and environmental differences, and recognizing emotion from facial expressions or voice features alone may lead to misjudgment. Second, processing and analyzing large amounts of data in real time places high demands on the computing capacity and response speed of the system, which must provide accurate emotion analysis results within a short time. Another important challenge is how to effectively apply emotion analysis methods in the educational field: teachers need to know students' emotional states and also need concrete suggestions and tools to adjust teaching strategies, so as to improve student engagement and learning outcomes.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a student emotion analysis method and system based on a multi-mode dynamic memory big model.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, the invention provides a student emotion analysis method based on a multi-modal dynamic memory big model, which comprises the following steps:
s1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels;
S2, inputting the facial expression images into a multi-modal dynamic memory large model for training; during training of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into an image encoder; a visual prompt learning method is used to perform prompt fine-tuning on each coding layer of the image encoder, finally yielding a visual aggregation representation; a description related to facial behaviors is processed by a text segmentation tool to form a segmented text sequence, which is input into a text encoder together with a learnable text representation to obtain a text aggregation representation; the visual aggregation representation is multiplied by the transposed text aggregation representation and a first prediction probability is output through Softmax; the visual aggregation representation is input into an output module together with the history features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature; the visual aggregation representation is multiplied by the transposed visual classification representation and a second prediction probability is output through Softmax; a weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model;
And S3, inputting the facial expression images to be classified into a trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting emotion state type prediction results of the facial expression images.
On the basis of the scheme, each step can be realized in the following preferred specific mode.
As a preferable aspect of the first aspect, in step S1, the true emotion state category label includes five categories, namely happy, confused, anxious, attentive, and depressed.
As a preferred aspect of the first aspect, in step S2, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows:
AS21, obtaining a classification token representation learned by the (q-1) th coding layer of the image encoder, multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation, carrying out layer normalization on the projected classification token representation, then applying a multi-head self-attention mechanism to obtain a processed classification token representation, and adding the processed classification token representation and the projected classification token representation to obtain a summary token of the q-th coding layer;
AS22, adding the classification token representation and a randomly initialized learnable vector to form a first intermediate token representation, carrying out layer normalization on the first intermediate token representation, then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation, and adding the processed first intermediate token representation and the first intermediate token representation to form the local prompt token of the q-th coding layer;
AS23, randomly initializing a global prompt token of the q-th coding layer, obtaining the image features output by the (q-1)-th coding layer of the image encoder, splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation, carrying out layer normalization on the original visual input representation, then applying a pre-trained self-attention mechanism to obtain a processed visual input representation, adding the processed visual input representation and the original visual input representation to obtain an original visual output representation, deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation, carrying out layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features, and adding the initial image features and the processed visual output representation to serve as the image features output by the q-th coding layer of the image encoder.
As a preferred aspect of the first aspect, in step S2, the specific process of obtaining the visual aggregation representation is as follows: the image features output by the last coding layer of the image encoder are multiplied by a linear projection matrix to obtain the projected image features, and the projected image features corresponding to each facial expression image are concatenated and then subjected to an average pooling operation to output the visual aggregation representation.
As a preferred aspect of the first aspect, in step S2, the specific process by which the output module generates the visual classification representation is as follows: the visual aggregation representation corresponding to one facial expression image is used as the query representation, and the feature corresponding to the true emotion state category label y of the facial expression image is retrieved from the dynamic feature space and used as the key representation and the value representation; the query representation is processed by a first projection function to obtain a first projection feature, the key representation is processed by a second projection function and transposed to obtain a second projection feature, and the value representation is processed by a third projection function to obtain a third projection feature; cosine similarity between the first projection feature and the second projection feature is computed to obtain a first similar feature, the first similar feature is fed into a sharpness function to obtain a sharpness feature weight, the third projection feature is weighted by the sharpness feature weight through multiplication to obtain a second similar feature, and the second similar feature is processed by a fourth projection function to obtain the element of row y of the visual classification representation, from which the visual classification representation is finally formed.
The specific processing procedure in a projection function is as follows: the feature input into the projection function is passed through a fully connected layer whose parameters are all randomly initialized to obtain a first intermediate feature, the first intermediate feature is added to the feature input into the projection function to form a residual connection, yielding a second intermediate feature, and the second intermediate feature is L2-normalized along the feature dimension to obtain the feature output by the projection function.
As a preference of the above first aspect, the sharpness function Rd (·) has a functional form:
Rd(x) = exp(-w_r(1 - x))
where x represents the variable input to the sharpness function and w_r represents the hyperparameter that adjusts the sharpness.
As a preferred aspect of the above first aspect, in step S3, the specific process of determining whether the dynamic feature space needs to be updated based on the comprehensive score is as follows:
S31, calculating an entropy value based on the first prediction probability corresponding to the visual aggregation representation of the facial expression image to be classified as the current prediction entropy; calculating entropy values based on the first prediction probabilities corresponding to the features in the dynamic feature space as historical prediction entropies, and averaging all the historical prediction entropies as the average historical entropy; adding the current prediction entropy, the average historical entropy and a preset constant to obtain a confidence total score, and taking the ratio of a preset confidence score to the confidence total score as the comprehensive score of the facial expression image;
S32, when the dynamic feature space reaches a preset capacity threshold, if the comprehensive score of the facial expression image to be classified is larger than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space, and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
In a second aspect, the present invention provides a student emotion analysis system based on a multimodal dynamic memory big model, comprising:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory large model for training; during training of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into an image encoder; a visual prompt learning method is used to perform prompt fine-tuning on each coding layer of the image encoder, finally yielding a visual aggregation representation; a description related to facial behaviors is processed by a text segmentation tool to form a segmented text sequence, which is input into a text encoder together with a learnable text representation to obtain a text aggregation representation; the visual aggregation representation is multiplied by the transposed text aggregation representation and a first prediction probability is output through Softmax; the visual aggregation representation is input into an output module together with the history features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature; the visual aggregation representation is multiplied by the transposed visual classification representation and a second prediction probability is output through Softmax; a weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression images to be classified into the trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting the emotion state type prediction result of the facial expression images.
In a third aspect, the present invention provides a computer electronic device comprising a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to implement a student emotion analysis method based on a multi-modal dynamic memory big model according to any one of the first aspect when executing the computer program.
Compared with the traditional student emotion analysis method, the invention has the following beneficial effects:
Compared with the traditional visual learning strategy, the visual prompt learning method can capture and understand visual information more effectively, and can be widely applied to various downstream computer visual tasks, such as student emotion analysis. In terms of text prompting, the present invention introduces a learnable text representation, rather than manually designed prompting words. In addition, the invention designs a dynamic feature space that preserves the features of historical test data during testing. Finally, the multi-mode dynamic memory large model can further mine potential information of the facial expression image, and further improve model performance under emotion analysis tasks of student facial expression recognition.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a schematic representation of the process of the present invention for obtaining a visual aggregate representation;
FIG. 3 is a schematic diagram of a process for obtaining a text aggregation representation in accordance with the method of the present invention;
FIG. 4 is a schematic diagram of the present invention for performing hint fine tuning of an image encoder using a visual hint learning method;
FIG. 5 is a schematic diagram of a process for obtaining a first prediction probability and a second prediction probability according to the present invention;
FIG. 6 is a schematic diagram of a process for obtaining a final predicted probability according to the present invention;
FIG. 7 is a system block diagram of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the invention, whereby the invention is not limited to the specific embodiments disclosed below. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
In the description of the present invention, it should be understood that the terms "first" and "second" are used solely for the purpose of distinguishing between the descriptions and not necessarily for the purpose of indicating or implying a relative importance or implicitly indicating the number of features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
The multimodal large model has remarkable advantages in addressing the accuracy of emotion recognition and the computing capacity and response speed required to process large amounts of data in real time, and it provides a novel visual representation learning approach. It can capture and understand visual information more effectively than traditional visual learning strategies. First, the multimodal large model combines image and text data, so information can be extracted from multimodal data and emotion recognition accuracy improved. For example, a multimodal large model can simultaneously analyze students' facial expressions, body language and the textual content of classroom interactions, providing more comprehensive emotion recognition. This multimodal analysis captures student emotional states more accurately than a single modality because it takes more contextual information and behavioral characteristics into account. In terms of computing power and response speed, multimodal large models rely on advanced neural network architectures and efficient parallel processing techniques. Existing multimodal large models are generally based on the Transformer architecture, have strong parallel processing capability, and can process large amounts of data simultaneously. This enables a multimodal large model to analyze student emotional states in real time and provide feedback within a short time. Multimodal visual-language large models also have adaptive learning capability: they can continually learn and optimize from new data, improving emotion recognition accuracy and real-time processing capability. Through continuous interaction with teachers and students, a multimodal large model can gradually learn and adapt to the emotion expression patterns of different students, improving the personalization and accuracy of emotion recognition. Therefore, through multimodal data fusion, efficient parallel processing and adaptive learning, the multimodal large model significantly improves the accuracy of emotion analysis for students in class. The latest multimodal large models, such as the contrastive language-image pre-training model (Contrastive Language-Image Pretraining, CLIP) and the contrastive-learning-based dual-encoder model ALIGN, collect 400 million and 1.8 billion image-text pairs, respectively, from the Internet. These multimodal large models have then been widely applied to a variety of downstream computer vision tasks and exhibit excellent performance.
In this embodiment, the contrastive language-image pre-training model is taken as a specific example. CLIP uses a contrastive learning approach to understand and correlate visual and linguistic information. When predicting emotion, CLIP first converts the input image and text data into high-dimensional feature vectors and extracts key features through the visual and language encoders. These feature vectors are then projected into the same embedding space, where the cosine similarity of the image and text is computed. By optimizing a contrastive loss function, CLIP brings correctly matched image-text pairs closer in the embedding space and pushes incorrectly matched pairs farther apart. As previously mentioned, the present invention contemplates the use of the contrastive language-image pre-training model in multimodal emotion analysis. In order to effectively apply the contrastive language-image pre-training model to the processing of student facial expression data, the present invention takes two important aspects into account. First, the generalization capability of the backbone of the contrastive language-image pre-training model needs to be preserved, i.e. the video representation and text representation obtained from the original image encoder and text encoder in CLIP are retained; second, the method must be able to adapt effectively to the field of student emotion analysis. For the first aspect, the invention provides a novel text prompt learning method for fine-tuning the text encoder, and adopts the visual prompt learning method to fine-tune the image encoder, achieving performance competitive with strong supervision so as to better adapt to the emotion analysis task of student facial expression recognition. By applying prompt fine-tuning to the CLIP model, its adaptability and accuracy in student emotion analysis can be remarkably improved. The fine-tuned CLIP model can capture students' emotional changes, such as happiness, confusion, anxiety, concentration and depression, more accurately and provide timely feedback. In this way, teachers can flexibly adjust their teaching methods according to students' instantaneous emotional changes, thereby providing more customized support and assistance. Fine-tuning also allows CLIP to better accommodate students' emotional expression under different cultural and personality backgrounds, further improving the accuracy and generality of emotion recognition. For the second aspect, the invention additionally introduces a learnable output module to adapt the fine-tuned CLIP model.
As shown in FIG. 1, in a preferred implementation manner of the present invention, the student emotion analysis method based on the multi-modal dynamic memory big model includes the following steps S1-S3. The following describes the specific implementation procedure.
S1, obtaining facial expression images of different students when watching online education courses and corresponding real emotion state category labels.
In this embodiment, facial expression videos of students watching online education courses are acquired from an online education platform. Frames are extracted from each facial expression video of length N at fixed intervals, and each frame is a static facial expression image of size Y × U, where Y and U are the height and width of the facial expression image, respectively.
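The frame extraction described above can be sketched as follows. This is only an illustrative snippet that assumes the video is read with OpenCV; the function name and the sampling interval are not specified in the patent.

```python
import cv2

def extract_frames(video_path, interval=10):
    """Sample one static Y x U facial expression image every `interval` frames."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                  # end of the N-frame video
            break
        if idx % interval == 0:     # fixed-interval sampling
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```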
In this embodiment, the emotional states of students watching online education courses are described by several category labels, which help teachers better understand students' emotions and reactions so that teaching strategies can be adjusted effectively and support can be provided. In this embodiment, the true emotion state category labels can be summarized as follows: happy, meaning that the student feels satisfied and pleased with learning, shown as smiling, active participation and confidence; confused, meaning that the student encounters barriers to understanding a concept or task, shown as frowning, frequent questions or seeking additional guidance; anxious, reflecting stress and worry, shown as restlessness and wandering eyes or speech; attentive, meaning that the student is focused, shown as a still posture and a thinking expression; and depressed, arising from frustration, shown as low mood, lack of motivation and reduced participation. These true emotion state category labels not only help teachers capture students' emotional states in time, but also help formulate personalized teaching support measures, so as to improve students' learning experience and outcomes and make online education more attentive and humane.
S2, inputting the facial expression images into a multi-modal dynamic memory large model for training. During training of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer, so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into the image encoder; a visual prompt learning method is used to perform prompt fine-tuning on each coding layer of the image encoder, finally yielding a visual aggregation representation (as shown in FIG. 2). A description related to facial behaviors is processed by a text segmentation tool to form a segmented text sequence, which is input into the text encoder together with a learnable text representation to obtain a text aggregation representation (as shown in FIG. 3). The visual aggregation representation is multiplied by the transposed text aggregation representation, and a first prediction probability is output through Softmax. The visual aggregation representation is input into an output module together with the history features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature. The visual aggregation representation is then multiplied by the transposed visual classification representation, and a second prediction probability is output through Softmax. The weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model.
In step S2 of this embodiment, non-overlapping segmentation is performed on each facial expression image, that is, each facial expression image is divided into a number of square patches of size p×p; after the square patches are obtained, random scale cropping, horizontal flipping, random rotation and color jittering are applied to avoid over-fitting. The processed square patches are flattened into a set of vectors, which are then projected by a linear projection layer to form the token embedding sequence.
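As an illustration of the patching and linear projection just described, a minimal PyTorch sketch is given below; the embedding dimension, patch size and module name are assumptions rather than values given in the patent, and the augmentations are omitted.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping p x p patches and project them."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.p = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_ch, dim)  # linear projection layer

    def forward(self, img):
        # img: (B, C, Y, U) with Y and U divisible by the patch size
        B, C, Y, U = img.shape
        patches = img.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * self.p * self.p)
        return self.proj(patches)   # (B, num_patches, dim) token embedding sequence
```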
In step S2 of this embodiment, as shown in FIG. 4, the specific process of performing prompt fine-tuning on the q-th coding layer of the image encoder by using the visual prompt learning method is as follows (a simplified code sketch is given after step AS23):
AS21, obtaining a classification token representation learned by the (q-1) th coding layer of the image encoder, multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation, carrying out layer normalization on the projected classification token representation, then applying a multi-head self-attention mechanism to obtain a processed classification token representation, and adding the processed classification token representation and the projected classification token representation to obtain a summary token of the q-th coding layer;
AS22, adding the classification token representation and a randomly initialized learnable vector to form a first intermediate token representation, carrying out layer normalization on the first intermediate token representation, then applying a multi-head self-attention mechanism to obtain a processed first intermediate token representation, and adding the processed first intermediate token representation and the first intermediate token representation to form the local prompt token of the q-th coding layer;
AS23, randomly initializing a global prompt token of the q-th coding layer, acquiring the image features output by the (q-1)-th coding layer of the image encoder, splicing the image features with the local prompt token, the global prompt token and the summary token to form an original visual input representation, carrying out layer normalization on the original visual input representation, then applying a pre-trained self-attention mechanism (the pretrained self-attention) to obtain a processed visual input representation, adding the processed visual input representation and the original visual input representation to obtain an original visual output representation, deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain a processed visual output representation, carrying out layer normalization and feedforward neural network processing on the processed visual output representation to obtain initial image features, and adding the initial image features and the processed visual output representation to serve as the image features output by the q-th coding layer of the image encoder.
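The three sub-steps AS21–AS23 can be condensed into the following PyTorch-style sketch. It is a simplified illustration under several assumptions: the hidden width and head count are placeholders, one attention module is shared for the summary and local prompt tokens, and `pretrained_attn` stands in for the frozen CLIP attention of the q-th layer.

```python
import torch
import torch.nn as nn

class PromptedEncoderLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                           # linear projection matrix (AS21)
        self.prompt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.local_vec = nn.Parameter(torch.randn(1, 1, dim))     # randomly initialized learnable vector (AS22)
        self.global_prompt = nn.Parameter(torch.randn(1, 1, dim)) # global prompt token (AS23)
        self.pretrained_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats, cls_tok):
        # feats: (B, L, D) image features from layer q-1; cls_tok: (B, 1, D) classification token
        # AS21: summary token
        proj_cls = self.proj(cls_tok)
        s, _ = self.prompt_attn(self.ln1(proj_cls), self.ln1(proj_cls), self.ln1(proj_cls))
        summary = s + proj_cls
        # AS22: local prompt token
        inter = cls_tok + self.local_vec
        l, _ = self.prompt_attn(self.ln1(inter), self.ln1(inter), self.ln1(inter))
        local = l + inter
        # AS23: splice, pretrained attention, drop extra tokens, feedforward
        g = self.global_prompt.expand(feats.size(0), -1, -1)
        x = torch.cat([feats, local, g, summary], dim=1)          # original visual input representation
        a, _ = self.pretrained_attn(self.ln2(x), self.ln2(x), self.ln2(x))
        x = (a + x)[:, : feats.size(1)]                           # processed visual output representation
        return self.ffn(self.ln3(x)) + x                          # image features output by layer q
```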
In step S2 of this embodiment, the specific process of obtaining the visual aggregation representation is as follows: the image features output by the last coding layer of the image encoder are multiplied by a linear projection matrix to obtain the projected image features, and the projected image features corresponding to each facial expression image are concatenated and then subjected to an average pooling operation to output the visual aggregation representation.
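A short sketch of this aggregation step follows, assuming per-frame features of dimension 768 projected to 512 (both dimensions are illustrative):

```python
import torch
import torch.nn as nn

proj = nn.Linear(768, 512)          # linear projection matrix

def visual_aggregate(frame_feats):
    # frame_feats: (N, 768) image features from the last coding layer, one row per frame
    projected = proj(frame_feats)   # projected image features, (N, 512)
    return projected.mean(dim=0)    # average pooling -> visual aggregation representation, (512,)
```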
It should be noted that in step S2 of this embodiment, a learnable text representation is used as the context of each category description on the text encoder side; this removes the need for expert-designed context words and allows the multi-modal dynamic memory large model to learn the relevant context information of each expression during training. Specifically, the input to the text encoder contains two types of data. The first type is the result of processing a facial-behavior-related description with a text segmentation tool (Tokenizer), which splits long text into smaller units and then converts the segmented text into an index sequence. The second type is a class-specific learnable text representation that is independent of each description. By feeding both types of data to the text encoder as prompts, a text aggregation representation representing the visual concept is obtained.
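The text-side input can be illustrated with the following sketch, in which learnable class-specific context embeddings are prepended to the tokenised facial-behaviour descriptions; the context length, dimension and module name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    def __init__(self, num_classes=5, ctx_len=4, dim=512):
        super().__init__()
        # class-specific learnable text representation, independent of each description
        self.ctx = nn.Parameter(torch.randn(num_classes, ctx_len, dim))

    def forward(self, desc_emb):
        # desc_emb: (num_classes, L, dim) embeddings of the segmented descriptions
        return torch.cat([self.ctx, desc_emb], dim=1)   # prompt sequence fed to the text encoder
```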
In general, in step S2 of the present embodiment, prompt learning on the image encoder has two main objectives: 1) to exploit temporal information by introducing inter-frame information exchange, and 2) to provide additional parameters so that the image encoder adapts to the distribution of facial expression images. To this end, the present invention introduces three additional tokens into each layer of the image encoder: the discriminative information across all frames is summarized by the summary token, the discriminative information of each facial expression image is conveyed to the remaining facial expression images by the local prompt token, and learning capacity is provided by the global prompt token, so that the multi-modal dynamic memory large model adapts to the distribution of the facial expression images. In the text encoder part, the description related to facial behavior is processed by the text segmentation tool and then input to the text encoder together with a learnable text representation; compared with simply inputting a facial expression class name, this can provide more detailed and accurate information about the specific movements or positions of the muscles involved in each expression, given that different facial expressions have both common and unique or specific properties at the local behavioral level.
In step S2 of this embodiment, as shown in FIG. 5, for the facial expression recognition task, the visual aggregation representation extracted by the image encoder and the text aggregation representation extracted by the text encoder are L2-normalized; the visual aggregation representation is then multiplied by the transposed text aggregation representation, and the first prediction probability is output through Softmax.
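In code, this step amounts to the following sketch (the absence of a temperature factor is a simplifying assumption):

```python
import torch
import torch.nn.functional as F

def first_prediction(visual_agg, text_agg):
    # visual_agg: (D,) visual aggregation representation
    # text_agg: (C, D) text aggregation representation, one row per emotion category
    v = F.normalize(visual_agg, dim=-1)             # L2 normalization
    t = F.normalize(text_agg, dim=-1)
    return F.softmax(v @ t.t(), dim=-1)             # first prediction probability, (C,)
```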
In step S2 of the present embodiment, the specific process by which the output module generates the visual classification representation is as follows: the visual aggregation representation v corresponding to one facial expression image is used as the query representation, and the feature M_y corresponding to the true emotion state category label y of the facial expression image is retrieved from the dynamic feature space and used as the key representation and the value representation. The query representation is processed by a first projection function to obtain the first projection feature, the key representation is processed by a second projection function and transposed to obtain the second projection feature, and the value representation is processed by a third projection function to obtain the third projection feature. Cosine similarity between the first projection feature and the second projection feature gives the first similar feature, which is fed into the sharpness function to obtain a sharpness feature weight; the third projection feature is weighted by the sharpness feature weight through multiplication to obtain the second similar feature, and the second similar feature is processed by a fourth projection function to obtain the element of row y of the visual classification representation Z:

Z_y = f4( Rd( sim(f1(v), f2(M_y)^T) ) · f3(M_y) )

where f1, f2, f3 and f4 denote the first, second, third and fourth projection functions, respectively; sim(·,·) denotes the cosine similarity calculation; Rd(·) denotes the sharpness function; and (·)^T denotes the matrix transpose.
The specific processing procedure in a projection function is as follows: the feature input into the projection function is passed through a fully connected layer whose parameters are all randomly initialized to obtain a first intermediate feature; the first intermediate feature is added to the feature input into the projection function to form a residual connection, yielding a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the feature output by the projection function.
Further, in this embodiment, the projection function has the form:

f_i(x) = Norm( x + FC(x) )

where x represents the feature input into the projection function, FC(·) represents a fully connected layer whose parameters are all randomly initialized, and Norm(·) represents the L2 normalization operation along the feature dimension.
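A minimal PyTorch sketch of this projection function, with an assumed feature width of 512:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionFunction(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, dim)       # fully connected layer, randomly initialized

    def forward(self, x):
        inter = self.fc(x)                  # first intermediate feature
        out = inter + x                     # residual connection -> second intermediate feature
        return F.normalize(out, dim=-1)     # L2 normalization along the feature dimension
```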
Further, in the present embodiment, the sharpness function has the form:

Rd(x) = exp(-w_r(1 - x))

where x represents the variable input to the sharpness function and w_r represents the hyperparameter that adjusts the sharpness.
After the above process, the visual classification representation Z can be obtained. Finally, the visual aggregation representation is converted into the required classification prediction probability: the visual aggregation representation is multiplied by the transposed visual classification representation, and the second prediction probability p2 is output through Softmax. Finally, as shown in FIG. 6, the weighted sum of the first prediction probability p1 and the second prediction probability p2 is taken as the final prediction probability p:

p = α·p1 + β·p2

where α and β both represent preset weight hyperparameters.
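Putting the pieces above together, the output module and the probability fusion can be sketched as follows. The per-label loop, the summation over stored features and the default weight values are assumptions made for illustration; f1–f4 are instances of the projection function above.

```python
import torch
import torch.nn.functional as F

def sharpness(x, w_r=10.0):
    return torch.exp(-w_r * (1.0 - x))                        # Rd(x) = exp(-w_r(1 - x))

def visual_classification(v, memory, f1, f2, f3, f4, w_r=10.0):
    # v: (D,) visual aggregation representation used as the query representation
    # memory: dict mapping label y -> (K_y, D) features stored in the dynamic feature space
    rows = []
    for y in sorted(memory):
        key_val = memory[y]
        q, k, val = f1(v), f2(key_val), f3(key_val)           # first / second / third projection features
        sim = F.cosine_similarity(q.unsqueeze(0), k, dim=-1)  # first similar feature, (K_y,)
        w = sharpness(sim, w_r)                               # sharpness feature weight
        rows.append(f4((w.unsqueeze(-1) * val).sum(dim=0)))   # row y of the visual classification repr.
    return torch.stack(rows)                                  # (C, D)

def final_prediction(p1, v, cls_repr, alpha=0.5, beta=0.5):
    p2 = F.softmax(v @ cls_repr.t(), dim=-1)                  # second prediction probability
    return alpha * p1 + beta * p2                             # p = alpha * p1 + beta * p2
```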
And S3, inputting the facial expression images to be classified into a trained multi-mode dynamic memory large model, calculating comprehensive scores corresponding to the facial expression images to be classified after the dynamic feature space reaches a preset capacity threshold, judging whether the dynamic feature space needs to be updated or not based on the comprehensive scores, and finally outputting emotion state type prediction results of the facial expression images.
In step S3 of this embodiment, inspired by studies of knowledge accumulation and recall in the human brain, an external memory component is introduced to allow the storage and retrieval of historical knowledge and thereby facilitate decision making. Recently, the idea of memory networks has been introduced into the adaptation of CLIP. Therefore, the invention introduces a dynamic feature space to support read and write operations on features, thereby optimizing the student facial expression recognition process. During training of the multi-modal dynamic memory large model, the size of the dynamic feature space is consistent with the number of facial expression images used for training, and the dynamic feature space stores the visual aggregation representations generated in each iteration round and applies them as historical features in the calculation of the next iteration round. After the multi-modal dynamic memory large model has been trained, the dynamic feature space at this stage is set to a preset size with a limited length, so when the stored visual aggregation representations reach this limit, the dynamic feature space needs to be updated.
The specific process, mentioned in step S3, of determining whether the dynamic feature space needs to be updated based on the comprehensive score is described in detail below.
S31, an entropy value is calculated from the first prediction probability corresponding to the visual aggregation representation of the facial expression image to be classified, as the current prediction entropy E_cur; entropy values are calculated from the first prediction probabilities corresponding to the features in the dynamic feature space as historical prediction entropies, and all historical prediction entropies are averaged to give the average historical entropy E_his; the current prediction entropy E_cur, the average historical entropy E_his and a preset constant ε are added to form the confidence total score, and the ratio of a preset confidence score s_0 to the confidence total score is taken as the comprehensive score S of the facial expression image:

S = s_0 / (E_cur + E_his + ε)
S32, when the dynamic feature space reaches a preset capacity threshold, if the comprehensive score of the facial expression image to be classified is larger than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregate representation of the facial expression image to be classified so as to update the dynamic feature space, and if the comprehensive score of the facial expression image to be classified is smaller than the lowest comprehensive score in the dynamic feature space, keeping the dynamic feature space unchanged.
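The test-time maintenance of the dynamic feature space (steps S31–S32) can be sketched as follows; the constant values and the container layout are illustrative assumptions.

```python
import math

def prediction_entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def comprehensive_score(p1_new, p1_history, conf_score=1.0, eps=1e-6):
    cur = prediction_entropy(p1_new)                       # current prediction entropy
    hist = [prediction_entropy(p) for p in p1_history]     # historical prediction entropies
    avg_hist = sum(hist) / max(len(hist), 1)               # average historical entropy
    return conf_score / (cur + avg_hist + eps)             # preset score / confidence total score

def maybe_update(space, new_feature, new_score, capacity):
    # space: list of (score, feature) entries forming the dynamic feature space
    if len(space) < capacity:
        space.append((new_score, new_feature))
    else:
        low = min(range(len(space)), key=lambda i: space[i][0])
        if new_score >= space[low][0]:                     # replace the lowest comprehensive score
            space[low] = (new_score, new_feature)
    return space
```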
The application effect of the student emotion analysis method based on the multi-mode dynamic memory big model described by S1-S3 in the embodiment on a specific data set is shown by a specific example, so that the essence of the invention can be understood conveniently.
Examples
The specific implementation process of the student emotion analysis method based on the multi-mode dynamic memory big model adopted in the embodiment is as described above, and will not be described again.
In order to show the technical effect of the method provided by the invention, the method is verified in an actual teaching scene. The data set selected by the invention is an educational data set, and the data set is focused on an online classroom in an educational environment and aims at carrying out facial expression recognition and emotion analysis on students. The dataset contains facial expression videos and facial expression images from different students while watching online lessons that capture various emotional states exhibited by the students during learning, such as anxiety, concentration, confusion, happiness, and depression. Each video and image is annotated to indicate the student's emotional performance at a particular moment, thereby providing training data for subsequent emotion analysis and recognition tasks. In addition, the data set contains background information of students, such as grade, sex and learning results, so as to study the relationship between emotion and learning effect.
The evaluation indices adopted are UAR and WAR, which are the most commonly used in the field of facial expression recognition. UAR refers to the average recall calculated over all categories, with each category contributing equally, so it is suitable for cases of imbalanced category distribution. A larger UAR value indicates higher accuracy of the model in identifying each emotion category, especially for categories with few samples. WAR, on the other hand, considers the frequency of each category in the data set and computes the recall through a weighted average, ensuring that more common categories have a larger influence on the total score. The higher the WAR value, the better the overall performance of the model on the entire data set, especially in accurately identifying high-frequency categories. In general, the larger these two index values are, the better the model performs on the emotion recognition task, capturing and classifying different facial expressions more effectively. A comparison of the proposed method with previous dynamic expression recognition methods is shown in Table 1. In Table 1, the implementations of the three baseline models, namely the Transformer-based dynamic facial expression recognition model Former-DFER, the ResNet-18-based dynamic facial expression recognition model EST, and the visual-language large-model-based dynamic facial expression recognition model EmoCLIP, all belong to the prior art and are not repeated here.
TABLE 1 facial expression recognition results on educational data sets for different methods
Method                  UAR      WAR
Former-DFER             53.69    65.70
EST                     53.94    65.85
EmoCLIP                 58.04    62.12
The present invention   66.89    75.64
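For reference, the UAR and WAR values reported above are typically computed from a confusion matrix as sketched below; this reflects standard practice rather than code from the patent.

```python
import numpy as np

def uar_war(conf):
    # conf[i, j]: number of samples of true class i predicted as class j
    per_class_recall = np.diag(conf) / conf.sum(axis=1)
    uar = per_class_recall.mean()                  # every class contributes equally
    war = np.diag(conf).sum() / conf.sum()         # recall weighted by class frequency
    return uar, war
```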
In addition, the embodiment of the invention also carries out an ablation experiment to better examine the contribution of each module or method proposed by the invention. In Table 2, Backbone represents the original image encoder and text encoder in the contrastive language-image pre-training model; W/Prompting represents adding visual prompt learning to the image encoder and introducing a learnable text representation on the text encoder on top of the Backbone; and W/Memory is the complete multi-modal dynamic memory large model corresponding to the invention.
TABLE 2 ablation experiment results on educational data set of the invention
Method         UAR      WAR
Backbone       23.34    20.07
W/Prompting    60.61    72.65
W/Memory       66.89    75.64
The ablation experiment results show that the method provided by the invention achieves a significant improvement over previous methods. The main reason is that the proposed method models both the frame-level information and the inter-frame relations of the facial expression video, and uses the features of historical test samples retained by the dynamic memory network during testing, enabling the multi-modal dynamic memory large model to further mine potential information in the data beyond the training data set.
In addition, the student emotion analysis method based on the multi-mode dynamic memory big model in the above embodiment can be essentially executed by a computer program or a module. Therefore, based on the same inventive concept, another preferred embodiment of the present invention further provides a student emotion analysis system based on a multi-modal dynamic memory big model corresponding to the student emotion analysis method based on a multi-modal dynamic memory big model provided in the above embodiment, as shown in fig. 7, which includes:
The data acquisition module is used for acquiring facial expression images and corresponding real emotion state category labels of the facial expression images when different students watch the online education courses;
The model acquisition module is used for inputting the facial expression images into the multi-modal dynamic memory large model for training; during training of the multi-modal dynamic memory large model, each facial expression image is first projected by a linear projection layer so that each facial expression image yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form an input token sequence, which is input into an image encoder; a visual prompt learning method is used to perform prompt fine-tuning on each coding layer of the image encoder, finally yielding a visual aggregation representation; a description related to facial behaviors is processed by a text segmentation tool to form a segmented text sequence, which is input into a text encoder together with a learnable text representation to obtain a text aggregation representation; the visual aggregation representation is multiplied by the transposed text aggregation representation and a first prediction probability is output through Softmax; the visual aggregation representation is input into an output module together with the history features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature; the visual aggregation representation is multiplied by the transposed visual classification representation and a second prediction probability is output through Softmax; a weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and a cross-entropy loss is calculated based on the final prediction probability and the true emotion state category label to update the parameters of the multi-modal dynamic memory large model;
The result acquisition module is used for inputting the facial expression image to be classified into the trained multi-modal dynamic memory big model; after the dynamic feature space reaches a preset capacity threshold, it calculates the comprehensive score corresponding to the facial expression image to be classified, judges whether the dynamic feature space needs to be updated based on this score, and finally outputs the emotional state category prediction result of the facial expression image. A simplified sketch of this test-time scoring and update is also given below.
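For illustration, the training step handled by the model acquisition module can be sketched as follows. This is a minimal, non-limiting sketch assuming a PyTorch-style implementation: the objects image_encoder and text_encoder, the dictionary projs of four projection layers, the memory layout (a dict mapping each class index to a list of stored features), and the values of alpha and w_r are illustrative assumptions rather than the exact implementation of the invention.

```python
import torch
import torch.nn.functional as F

def project(x, fc):
    """Projection function: fully connected layer + residual connection + L2 norm."""
    return F.normalize(x + fc(x), p=2, dim=-1)

def sharpness(x, w_r=8.0):
    """Sharpness function Rd(x) = exp(-w_r * (1 - x)); w_r is an assumed value."""
    return torch.exp(-w_r * (1.0 - x))

def output_module(v, memory, projs, num_classes, w_r=8.0):
    """Builds the visual classification representation row by row.

    v:       visual aggregation representation of the current sample, shape (d,)
    memory:  dynamic feature space, dict {class index: [stored features of shape (d,)]}
    projs:   dict of four nn.Linear layers {"q", "k", "v", "o"} (assumed layout)
    """
    rows = []
    for y in range(num_classes):
        stored = memory.get(y, [])
        if not stored:                                        # no historical features yet
            rows.append(torch.zeros_like(v))
            continue
        feats = torch.stack(stored)                           # (n_y, d) key/value features
        q = project(v, projs["q"])                            # query representation
        k = project(feats, projs["k"])
        val = project(feats, projs["v"])
        sim = F.cosine_similarity(q.unsqueeze(0), k, dim=-1)  # first similarity feature
        w = sharpness(sim, w_r).unsqueeze(-1)                 # sharpness feature weights
        rows.append(project((w * val).sum(dim=0), projs["o"]))
    return torch.stack(rows)                                  # (num_classes, d)

def training_step(frames, class_texts, label, image_encoder, text_encoder,
                  memory, projs, alpha=0.5):
    """One training step of the multi-modal dynamic memory big model (sketch)."""
    v = image_encoder(frames)                 # visual aggregation representation, (d,)
    t = text_encoder(class_texts)             # text aggregation representation, (C, d)

    p1 = F.softmax(v @ t.T, dim=-1)           # first prediction probability

    c = output_module(v, memory, projs, t.shape[0])
    p2 = F.softmax(v @ c.T, dim=-1)           # second prediction probability

    memory.setdefault(int(label), []).append(v.detach())   # store as a new feature

    p = alpha * p1 + (1.0 - alpha) * p2       # final prediction probability
    return F.nll_loss(torch.log(p + 1e-8).unsqueeze(0), label.view(1))
```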
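Similarly, the entropy-based comprehensive score and the replace-the-lowest update of the dynamic feature space used by the result acquisition module can be sketched as follows; the preset confidence score s0, the small constant const, and the flat list layout of the feature space are placeholder assumptions, not the literal values used by the invention.

```python
import math

def entropy(probs):
    """Shannon entropy of a prediction probability vector."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def comprehensive_score(p1_current, p1_history, s0=1.0, const=1e-3):
    """Comprehensive score of an image to be classified (sketch).

    p1_current: first prediction probability of the current test image
    p1_history: first prediction probabilities of the features already stored
                in the dynamic feature space
    s0, const:  placeholders for the preset confidence score and preset constant
    """
    current_entropy = entropy(p1_current)
    avg_hist_entropy = (sum(entropy(p) for p in p1_history) / len(p1_history)
                        if p1_history else 0.0)
    total_confidence = current_entropy + avg_hist_entropy + const
    return s0 / total_confidence

def maybe_update_feature_space(feature_space, new_feature, new_score, capacity):
    """Replace the lowest-scoring stored feature once the capacity threshold is reached.

    feature_space: list of (comprehensive score, feature) pairs, a flat view of
                   the dynamic feature space used only for this sketch
    """
    if len(feature_space) < capacity:
        feature_space.append((new_score, new_feature))
        return True
    lowest = min(range(len(feature_space)), key=lambda i: feature_space[i][0])
    if new_score >= feature_space[lowest][0]:
        feature_space[lowest] = (new_score, new_feature)
        return True
    return False
```

Because the total confidence score grows with the prediction entropy, an uncertain test sample receives a low comprehensive score and is therefore less likely to displace the features already retained in the dynamic feature space.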
Similarly, based on the same inventive concept, another preferred embodiment of the present invention further provides a computer electronic device corresponding to the student emotion analysis method based on the multi-modal dynamic memory big model provided in the above embodiment, which includes a memory and a processor;
the memory is used for storing a computer program;
the processor is used for implementing the student emotion analysis method based on the multi-modal dynamic memory big model of the above embodiment when executing the computer program.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
It is to be appreciated that the processor described above may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; a Digital Signal Processor (DSP); an Application Specific Integrated Circuit (ASIC); a Field-Programmable Gate Array (FPGA) or other programmable logic device; a discrete gate or transistor logic device; or discrete hardware components.
It should be further noted that, for convenience and brevity of description, the specific working processes of the system described above may refer to the corresponding processes in the foregoing method embodiments, which are not described herein again. In the embodiments of the present application, the division of steps or modules in the system and the method is only one logical functional division, and other divisions may be adopted in actual implementations; for example, multiple modules or steps may be combined or integrated together, and one module or step may also be split.
In addition, all data involved in the present invention was acquired with full authorization, and the collection, use, and processing of the relevant information comply with the applicable laws, regulations, and standards of the relevant countries and regions.
The above embodiment is only a preferred embodiment of the present invention and is not intended to limit it. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all technical solutions obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims (10)

1. A student emotion analysis method based on a multi-modal dynamic memory big model, characterized by comprising the following steps:
S1: acquiring facial expression images of different students watching online education courses and their corresponding true emotional state category labels;
S2: inputting the facial expression images into the multi-modal dynamic memory big model for training; in the training process of the multi-modal dynamic memory big model, each facial expression image is first projected by a linear projection layer, and each facial expression image correspondingly yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form the input token sequence; the input token sequence is input into the image encoder, and a visual prompt learning method is used to prompt-tune each encoding layer of the image encoder, finally obtaining a visual aggregation representation; descriptions related to facial behaviors are processed by a text segmentation tool to form a word-segmented text sequence; the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregation representation; the visual aggregation representation is multiplied by the transposed text aggregation representation and passed through Softmax to output a first prediction probability; the visual aggregation representation is input into the output module together with the historical features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature; the visual aggregation representation is multiplied by the transposed visual classification representation and passed through Softmax to output a second prediction probability; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and the cross-entropy loss is calculated based on the final prediction probability and the true emotional state category label to update the parameters of the multi-modal dynamic memory big model;
S3: inputting the facial expression image to be classified into the trained multi-modal dynamic memory big model; after the dynamic feature space reaches a preset capacity threshold, calculating the comprehensive score corresponding to the facial expression image to be classified, judging whether the dynamic feature space needs to be updated based on the comprehensive score, and finally outputting the emotional state category prediction result of the facial expression image.
2. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 1, characterized in that in step S1 the true emotional state category labels comprise five categories: happiness, confusion, anxiety, concentration, and frustration.
3. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 1, characterized in that in step S2 the specific process of prompt-tuning the q-th encoding layer of the image encoder with the visual prompt learning method is as follows:
AS21: obtaining the classification token representation learned by the (q-1)-th encoding layer of the image encoder; multiplying the classification token representation by a linear projection matrix to obtain a projected classification token representation; applying layer normalization and then a multi-head self-attention mechanism to the projected classification token representation to obtain a processed classification token representation; adding the processed classification token representation and the projected classification token representation to form the summary token of the q-th encoding layer;
AS22: adding the classification token representation and a randomly initialized learnable vector to form a first intermediate token representation; applying layer normalization and then a multi-head self-attention mechanism to the first intermediate token representation to obtain a processed first intermediate token representation; adding the processed first intermediate token representation and the first intermediate token representation to form the local prompt token of the q-th encoding layer;
AS23: randomly initializing the global prompt token of the q-th encoding layer; obtaining the image features output by the (q-1)-th encoding layer of the image encoder; concatenating the image features with the local prompt token, the global prompt token and the summary token to form the original visual input representation; applying layer normalization and then the pre-trained self-attention mechanism to the original visual input representation to obtain the processed visual input representation; adding the processed visual input representation and the original visual input representation to obtain the original visual output representation; deleting the local prompt token, the global prompt token and the summary token from the original visual output representation to obtain the processed visual output representation; passing the processed visual output representation through layer normalization and a feed-forward neural network to obtain initial image features; and adding the initial image features and the processed visual output representation to form the image features output by the q-th encoding layer of the image encoder.
4. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 3, characterized in that in step S2 the specific process of obtaining the visual aggregation representation is as follows: multiplying the image features output by the last encoding layer of the image encoder by a linear projection matrix to obtain projected image features; concatenating the projected image features corresponding to each facial expression image and then applying an average pooling operation to output the visual aggregation representation.
5. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 1, characterized in that in step S2 the specific process by which the output module generates the visual classification representation is as follows: taking the visual aggregation representation corresponding to one facial expression image as the query representation; looking up, in the dynamic feature space, the features corresponding to the true emotional state category label y of that facial expression image and using them as the key representation and the value representation; processing the query representation with a first projection function to obtain a first projected feature; processing the key representation with a second projection function and transposing the result to obtain a second projected feature; processing the value representation with a third projection function to obtain a third projected feature; computing the cosine similarity between the first projected feature and the second projected feature to obtain a first similarity feature; taking the first similarity feature as the input variable of a sharpness function and obtaining sharpness feature weights after processing by the sharpness function; weighting the third projected feature by the sharpness feature weights through multiplication to obtain a second similarity feature; processing the second similarity feature with a fourth projection function to obtain the elements of the y-th row of the visual classification representation; and processing all facial expression images in this way to finally form the visual classification representation.
6. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 5, characterized in that the specific processing in the projection function is as follows: the feature input into the projection function passes through a fully connected layer whose parameters are all randomly initialized to obtain a first intermediate feature; the first intermediate feature is added to the feature input into the projection function to form a residual connection, obtaining a second intermediate feature; and the second intermediate feature is L2-normalized along the feature dimension to obtain the feature output by the projection function.
7. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 5, characterized in that the sharpness function Rd(·) has the form:
Rd(x) = exp(-w_r(1 - x))
where x denotes the variable input to the sharpness function and w_r denotes a hyperparameter that adjusts the sharpness.
8. The student emotion analysis method based on a multi-modal dynamic memory big model according to claim 1, characterized in that in step S3 the specific process of judging, based on the comprehensive score, whether the dynamic feature space needs to be updated is as follows:
S31: calculating an entropy value based on the first prediction probability corresponding to the visual aggregation representation of the facial expression image to be classified as the current prediction entropy; calculating entropy values based on the first prediction probabilities corresponding to the features in the dynamic feature space as historical prediction entropies; averaging all historical prediction entropies to obtain the average historical entropy; adding the current prediction entropy, the average historical entropy and a preset constant to obtain the total confidence score; and taking the ratio of a preset confidence score to the total confidence score as the comprehensive score of the facial expression image;
S32: after the dynamic feature space reaches the preset capacity threshold, if the comprehensive score of the facial expression image to be classified is greater than or equal to the lowest comprehensive score in the dynamic feature space, replacing the feature corresponding to the lowest comprehensive score in the dynamic feature space with the visual aggregation representation of the facial expression image to be classified, so as to update the dynamic feature space; if the comprehensive score of the facial expression image to be classified is less than the lowest comprehensive score in the dynamic feature space, the dynamic feature space remains unchanged.
9. A student emotion analysis system based on a multi-modal dynamic memory big model, characterized by comprising:
a data acquisition module, configured to acquire facial expression images of different students watching online education courses and their corresponding true emotional state category labels;
a model acquisition module, configured to input the facial expression images into the multi-modal dynamic memory big model for training, wherein in the training process of the multi-modal dynamic memory big model, each facial expression image is first projected by a linear projection layer and correspondingly yields a token embedding sequence; the token embedding sequence is added to the spatial position encoding result of the facial expression image and an initialized classification token representation to form the input token sequence; the input token sequence is input into the image encoder, and a visual prompt learning method is used to prompt-tune each encoding layer of the image encoder, finally obtaining a visual aggregation representation; descriptions related to facial behaviors are processed by a text segmentation tool to form a word-segmented text sequence; the word-segmented text sequence is input into the text encoder together with a learnable text representation to obtain a text aggregation representation; the visual aggregation representation is multiplied by the transposed text aggregation representation and passed through Softmax to output a first prediction probability; the visual aggregation representation is input into the output module together with the historical features stored in a dynamic feature space to obtain a visual classification representation, and the visual aggregation representation is stored in the dynamic feature space as a new feature; the visual aggregation representation is multiplied by the transposed visual classification representation and passed through Softmax to output a second prediction probability; the weighted sum of the first prediction probability and the second prediction probability is taken as the final prediction probability, and the cross-entropy loss is calculated based on the final prediction probability and the true emotional state category label to update the parameters of the multi-modal dynamic memory big model;
a result acquisition module, configured to input the facial expression image to be classified into the trained multi-modal dynamic memory big model, calculate, after the dynamic feature space reaches a preset capacity threshold, the comprehensive score corresponding to the facial expression image to be classified, judge whether the dynamic feature space needs to be updated based on the comprehensive score, and finally output the emotional state category prediction result of the facial expression image.
10. A computer electronic device, characterized by comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to implement, when executing the computer program, the student emotion analysis method based on a multi-modal dynamic memory big model according to any one of claims 1 to 8.
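As a non-limiting illustration of the per-layer visual prompt tuning defined in claim 3 (steps AS21 to AS23), a single prompt-tuned encoding layer can be sketched roughly as follows. The module names, the number of global prompt tokens, and the PyTorch-style API are assumptions, and the propagation of the classification token between layers is omitted for brevity.

```python
import torch
import torch.nn as nn

class PromptTunedLayer(nn.Module):
    """Rough sketch of the q-th prompt-tuned encoding layer (claim 3, AS21-AS23)."""

    def __init__(self, dim, num_heads=8, num_global_prompts=4):
        super().__init__()
        self.proj_cls = nn.Linear(dim, dim)                      # projection of the class token (AS21)
        self.norm_sum = nn.LayerNorm(dim)
        self.attn_sum = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_init = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable vector (AS22)
        self.norm_local = nn.LayerNorm(dim)
        self.attn_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_prompts = nn.Parameter(torch.randn(1, num_global_prompts, dim) * 0.02)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # pre-trained attention
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cls_prev):
        """x: image features of the previous layer (B, N, dim); cls_prev: class token (B, 1, dim)."""
        # AS21: summary token built from the previous layer's classification token.
        cls_proj = self.proj_cls(cls_prev)
        n = self.norm_sum(cls_proj)
        h, _ = self.attn_sum(n, n, n)
        summary = h + cls_proj

        # AS22: local prompt token from the classification token plus a learnable vector.
        local = cls_prev + self.local_init
        n = self.norm_local(local)
        h, _ = self.attn_local(n, n, n)
        local = h + local

        # AS23: concatenate image features with prompts, apply the pre-trained attention,
        # then drop the prompt and summary tokens before the feed-forward block.
        g = self.global_prompts.expand(x.shape[0], -1, -1)
        n_extra = local.shape[1] + g.shape[1] + summary.shape[1]
        z = torch.cat([x, local, g, summary], dim=1)             # original visual input representation
        n = self.norm1(z)
        h, _ = self.attn(n, n, n)
        z = h + z                                                # original visual output representation
        z = z[:, :-n_extra, :]                                   # processed visual output representation
        return self.ffn(self.norm2(z)) + z                       # image features of the q-th layer
```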

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411878512.8A CN119323818B (en) 2024-12-19 2024-12-19 Student emotion analysis method and system based on multi-mode dynamic memory big model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411878512.8A CN119323818B (en) 2024-12-19 2024-12-19 Student emotion analysis method and system based on multi-mode dynamic memory big model

Publications (2)

Publication Number Publication Date
CN119323818A true CN119323818A (en) 2025-01-17
CN119323818B CN119323818B (en) 2025-04-08

Family

ID=94228827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411878512.8A Active CN119323818B (en) 2024-12-19 2024-12-19 Student emotion analysis method and system based on multi-mode dynamic memory big model

Country Status (1)

Country Link
CN (1) CN119323818B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119760362A (en) * 2025-03-04 2025-04-04 浙江师范大学 An empathetic response method for intelligent agents based on brain-computer coupling under the guidance of learner emotions
CN120611274A (en) * 2025-08-04 2025-09-09 中国科学技术大学 A method and system for generating subjective visual emotion interpretation
CN121281118A (en) * 2025-12-08 2026-01-06 深圳市联合光学技术有限公司 Expression recognition and interaction method based on edge vision and multi-mode model


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240220722A1 (en) * 2022-12-28 2024-07-04 Mohamed bin Zayed University of Artificial Intelligence Multi-modal prompt learning for representation transfer on image recognition tasks
CN116244474A (en) * 2023-03-27 2023-06-09 武汉工商学院 A learner learning state acquisition method based on multi-modal emotional feature fusion
CN117115505A (en) * 2023-06-15 2023-11-24 北京工业大学 An emotion-enhanced continuous training method that combines knowledge distillation and contrastive learning
CN117112778A (en) * 2023-08-22 2023-11-24 北京交通大学 A knowledge-based method for generating multi-modal conference abstracts
CN117421591A (en) * 2023-10-16 2024-01-19 长春理工大学 A multimodal representation learning method based on text-guided image patch screening
CN117746503A (en) * 2023-12-20 2024-03-22 大湾区大学(筹) Face action unit detection method, electronic equipment and storage medium
CN118093914A (en) * 2024-02-04 2024-05-28 南开大学 A dialogue image retrieval method based on cross-modal emotional interaction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUANGHAO YIN: "Token-desentangling mutual Transformer for multimodal emotion recognition", Engineering Applications of Artificial Intelligence, 6 April 2024 (2024-04-06) *
LIU Jingjing; WU Xiaofeng: "Multimodal emotion recognition and spatial annotation based on long short-term memory networks", Journal of Fudan University (Natural Science), no. 05, 15 October 2020 (2020-10-15) *
PENG Xiaojiang: "A survey of affective computing based on multimodal information", Journal of Hengyang Normal University, no. 03, 15 June 2018 (2018-06-15) *
HUANG Changqin: "Research on learning recommendation based on sentiment analysis in a learning cloud space", China Educational Technology, 31 October 2018 (2018-10-31) *


Also Published As

Publication number Publication date
CN119323818B (en) 2025-04-08

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
CN113361396B (en) Multimodal knowledge distillation methods and systems
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN119323818B (en) Student emotion analysis method and system based on multi-mode dynamic memory big model
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN114020871B (en) Multimodal social media sentiment analysis method based on feature fusion
CN112508334B (en) Personalized paper grouping method and system integrating cognition characteristics and test question text information
Yu et al. Compositional attention networks with two-stream fusion for video question answering
CN115329779A (en) Multi-person conversation emotion recognition method
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN118467709B (en) Evaluation method, device, medium and computer program product for visual question answering tasks
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN115424108A (en) A cognitive impairment evaluation method based on audio-visual fusion perception
CN112990301A (en) Emotion data annotation method and device, computer equipment and storage medium
CN113515935A (en) Title generation method, device, terminal and medium
CN118233706A (en) Live broadcast room scene interactive application method, device, equipment and storage medium
AU2019101138A4 (en) Voice interaction system for race games
CN118918510A (en) Space-time transducer-based participation evaluation method for gating hybrid expert network
CN115129935A (en) Video cover determining method, device, equipment, storage medium and product
CN114357964A (en) Subjective question scoring method, model training method, computer device, and storage medium
CN118245602B (en) Training method, device, equipment and storage medium for emotion recognition model
CN120236331A (en) Human behavior recognition method and device based on multimodal knowledge graph reasoning enhancement
CN117313943A (en) Prediction methods, systems, equipment and storage media for test question accuracy
CN117725547A (en) Emotional and cognitive evolution pattern recognition method based on cross-modal feature fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant