
CN114663910A - Multi-mode learning state analysis system - Google Patents

Multi-mode learning state analysis system

Info

Publication number
CN114663910A
CN114663910A
Authority
CN
China
Prior art keywords
information
teacher
acquisition module
unit
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210027041.4A
Other languages
Chinese (zh)
Inventor
朱世宇
孙令翠
杨红艳
何桢
田菊艳
余玉清
卢政旭
冉程好
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Engineering
Original Assignee
Chongqing Institute of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Engineering filed Critical Chongqing Institute of Engineering
Priority to CN202210027041.4A priority Critical patent/CN114663910A/en
Publication of CN114663910A publication Critical patent/CN114663910A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of digital data processing, and in particular to a multimodal learning state analysis system comprising an image acquisition module, a teacher position information acquisition module, a teacher posture information acquisition module, a student face information acquisition module, and an analysis module. The image acquisition module collects image data in a classroom and preprocesses it to obtain processed images; the teacher position information acquisition module obtains the teacher's position information based on the processed images and a Faster R-CNN object detection model; the teacher posture information acquisition module obtains the teacher's posture information based on the processed images and the Faster R-CNN object detection model; the student face information acquisition module obtains student face information based on the processed images and the ERT facial feature point detection method; and the analysis module analyzes the learning state based on the position information, the posture information, and the student face information. Students' in-class behavior can thus be better monitored, improving their learning efficiency.

Description

Multimodal Learning State Analysis System

Technical Field

The present invention relates to the field of digital data processing, and in particular to a multimodal learning state analysis system.

Background Art

In traditional classroom education, the teacher judges students' learning states from their facial expressions and head postures. However, because a teacher's attention is limited, it is impossible to observe every student's in-class learning state in time, and therefore impossible to adjust teaching strategies to each student's situation.

Smart teaching is a buzzword in current research on educational informatization in China. Some scholars describe it as a new form, new realm, and new stage in the development of educational informatization, raising smart-teaching research to a considerable height. Existing approaches to detecting and analyzing students' in-class learning states can be divided into methods based on face recognition, methods based on micro-expression recognition, and methods based on brain-wave detection.

Because each of these methods relies on only a single variable, its detection accuracy is low.

Summary of the Invention

The object of the present invention is to provide a multimodal learning state analysis system that analyzes data combining students' facial expressions, the teacher's position, and the teacher's in-class speech, and presents visual analysis results. It can provide an analysis of students' concentration, which helps teachers formulate appropriate teaching plans and promotes classroom interaction between teachers and students.

To achieve the above object, the present invention provides a multimodal learning state analysis system comprising an image acquisition module, a teacher position information acquisition module, a teacher posture information acquisition module, a student face information acquisition module, and an analysis module.

The image acquisition module collects image data from inside the classroom and preprocesses it to obtain processed images;

the teacher position information acquisition module obtains the teacher's position information based on the processed images and a Faster R-CNN object detection model;

the teacher posture information acquisition module obtains the teacher's posture information based on the processed images and the Faster R-CNN object detection model;

the student face information acquisition module obtains student face information based on the processed images and the ERT facial feature point detection method;

the analysis module analyzes the learning state based on the position information, the posture information, and the student face information.

The image acquisition module includes an acquisition unit and a processing unit. The acquisition unit collects image data from the camera;

the processing unit standardizes the size of the image data and normalizes it to obtain processed images.

Specifically, the camera's image data is collected by capturing one frame every 5 seconds.

The teacher position information acquisition module includes a feature map extraction unit, a candidate unit, a region feature map generation unit, and a teacher position acquisition unit.

The feature map extraction unit extracts the original feature map from the processed image;

the candidate unit inputs the original feature map into the candidate box extraction network to generate region candidate boxes;

the region feature map generation unit maps the region candidate boxes onto the original feature map and pools them into region feature maps;

the teacher position acquisition unit inputs the region feature maps into the Faster R-CNN object detection model to obtain the teacher's position information.

The teacher posture information acquisition module includes a keypoint acquisition unit and a normalization unit. The keypoint acquisition unit acquires human-pose keypoint information based on the Faster R-CNN object detection model;

the normalization unit performs pose normalization based on the human-pose keypoint information.

The specific steps of analyzing the learning state based on the position information, posture information, and student face information are:

constructing and training a multimodal feature fusion network structure based on fully connected layers;

mapping the position information, posture information, and student face information into the feature fusion space for learning-state analysis.

Alternatively, the specific steps of analyzing the learning state based on the position information, posture information, and student face information are:

fusing the position information, posture information, and student face information by a weighted fusion method to obtain weighted fusion features;

inputting the weighted fusion features into the fully connected layers;

obtaining the classification probability distribution of the weighted fusion features.

In the multimodal learning state analysis system of the present invention, cameras can be placed at the front and back of the classroom. Image data is acquired through the image acquisition module; the images are then processed by the Faster R-CNN object detection model to obtain the teacher's position information and posture information, and the students' face information is extracted from the image data by the ERT facial feature point detection method. The students' learning states can then be evaluated comprehensively, so that their in-class behavior can be better monitored and their learning efficiency improved.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.

FIG. 1 is a pooling flowchart of the present invention.

FIG. 2 is a flowchart of the teacher posture information calculation of the present invention.

FIG. 3 is a diagram of the multimodal fusion analysis of students' classroom learning states according to the present invention.

FIG. 4 is a flowchart of obtaining three 3D-model features based on different modalities through joint learning of a three-modality convolutional neural network according to the present invention.

FIG. 5 is a structural diagram of a multimodal learning state analysis system of the present invention.

FIG. 6 is a structural diagram of the image acquisition module of the present invention.

FIG. 7 is a structural diagram of the teacher position information acquisition module of the present invention.

FIG. 8 is a structural diagram of the teacher posture information acquisition module of the present invention.

1 - image acquisition module, 2 - teacher position information acquisition module, 3 - teacher posture information acquisition module, 4 - student face information acquisition module, 5 - analysis module, 11 - acquisition unit, 12 - processing unit, 21 - feature map extraction unit, 22 - candidate unit, 23 - region feature map generation unit, 24 - teacher position acquisition unit, 31 - keypoint acquisition unit, 32 - normalization unit.

Detailed Description of the Embodiments

Embodiments of the present invention are described in detail below; examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and should not be construed as limiting it.

Embodiment 1

Referring to FIGS. 1 to 8, the present invention provides a multimodal learning state analysis system comprising an image acquisition module 1, a teacher position information acquisition module 2, a teacher posture information acquisition module 3, a student face information acquisition module 4, and an analysis module 5.

The image acquisition module 1 collects image data from inside the classroom and preprocesses it to obtain processed images;

the teacher position information acquisition module 2 obtains the teacher's position information based on the processed images and the Faster R-CNN object detection model;

the teacher posture information acquisition module 3 obtains the teacher's posture information based on the processed images and the Faster R-CNN object detection model;

the student face information acquisition module 4 obtains student face information based on the processed images and the ERT facial feature point detection method;

the analysis module 5 analyzes the learning state based on the position information, the posture information, and the student face information.

In this embodiment, cameras can be placed at the front and back of the classroom. Image data is acquired through the image acquisition module 1; the images are then processed by the Faster R-CNN object detection model to obtain the teacher's position information and posture information, and the students' face information is extracted from the image data by the ERT facial feature point detection method. The students' learning states can then be evaluated comprehensively, so that their in-class behavior can be better monitored and their learning efficiency improved. The formula of the ERT facial feature point detection method is as follows:

ξ^(t+1) = ξ^(t) + r_t(I, ξ^(t))

where t is the cascade index and r_t(·,·) is the regressor of the current stage. The regressor takes as input the face image I and the facial feature point coordinates updated by the previous stage; the features used may be gray values or other features. When an image is obtained, the algorithm generates an initial position, which is the estimated position of the facial feature points. A gradient boosting algorithm is used to minimize the error, yielding each cascaded regressor.
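
As an illustrative sketch, dlib's shape predictor is a widely used implementation of exactly this ERT cascade (Kazemi and Sullivan's ensemble of regression trees); the patent does not mandate a particular implementation, and the model file name below is an assumption:

```python
import cv2
import dlib

# dlib's 68-point shape predictor implements the cascaded ERT regression.
# The .dat model path is an assumed, pre-trained model file.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def student_face_landmarks(bgr_image):
    """Return a list of 68 (x, y) landmark tuples per detected student face."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    landmarks = []
    for face in detector(gray):
        # Each call runs the cascade xi^(t+1) = xi^(t) + r_t(I, xi^(t)).
        shape = predictor(gray, face)
        landmarks.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
    return landmarks
```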

Further, the image acquisition module 1 includes an acquisition unit 11 and a processing unit 12. The acquisition unit 11 collects image data from the camera;

the processing unit 12 standardizes the size of the image data and normalizes it to obtain processed images.

The camera's image data is collected by capturing one frame every 5 seconds. This makes the sampling frequency more appropriate: it satisfies the real-time requirement while improving image-processing efficiency.
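
A minimal sketch of this acquisition loop, assuming an OpenCV-accessible classroom camera (the device index, the 256×256 target size, and the normalization to [0, 1] are illustrative choices):

```python
import time
import cv2
import numpy as np

def capture_processed_frames(device_index: int = 0, interval_s: float = 5.0, size: int = 256):
    """Grab one frame every `interval_s` seconds, then resize and normalize it."""
    cap = cv2.VideoCapture(device_index)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # Standardize the size, then normalize pixel values to [0, 1].
            frame = cv2.resize(frame, (size, size))
            yield frame.astype(np.float32) / 255.0
            time.sleep(interval_s)
    finally:
        cap.release()
```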

Further, the teacher position information acquisition module 2 includes a feature map extraction unit 21, a candidate unit 22, a region feature map generation unit 23, and a teacher position acquisition unit 24.

The feature map extraction unit 21 extracts the original feature map from the processed image;

the candidate unit 22 inputs the original feature map into the candidate box extraction network to generate region candidate boxes;

the region feature map generation unit 23 maps the region candidate boxes onto the original feature map and pools them into region feature maps;

the teacher position acquisition unit 24 inputs the region feature maps into the Faster R-CNN object detection model to obtain the teacher's position information.

In this embodiment, the original feature map is extracted as follows. First, an image of the teacher's behavior in the class scene, of arbitrary size, is rescaled to a fixed size of 256×256; the 256×256 image is then fed into a base network such as a CNN (convolutional layers) to extract the feature map of the original image. The subsequent RPN layer and fully connected layers share this feature map.

Step 2: the feature map is input into the RPN to generate region candidate boxes. A sliding window is applied over the feature map to localize and classify target regions, producing region candidate boxes. The RPN first applies a convolution to the feature map from the shared convolutional layers to obtain feature vectors, which are then fed into two fully connected layers: a bounding-box regression layer and a bounding-box classification layer.

Step 3: ROI pooling is performed on each region. Given the feature map and the candidate regions, the candidate regions are mapped onto the feature map and pooled into region feature maps of uniform size; these are then passed through fully connected layers to obtain classification vectors and position coordinates.

Step 4: finally, the region feature maps are input into Faster R-CNN, which, based on the candidate boxes proposed by the RPN, further refines their positions and regresses their categories.
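
As a minimal sketch of this detection stage, torchvision's off-the-shelf Faster R-CNN bundles the backbone, RPN, ROI pooling, and refinement heads described above; detecting the teacher via the COCO "person" class and the 0.8 score threshold are illustrative assumptions, not details given in the patent:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_teacher_boxes(image_tensor: torch.Tensor, score_thresh: float = 0.8) -> torch.Tensor:
    """image_tensor: float tensor of shape (3, H, W) with values in [0, 1].
    Returns (x1, y1, x2, y2) boxes of detected persons (teacher candidates)."""
    output = model([image_tensor])[0]  # backbone -> RPN -> ROI pooling -> heads
    keep = (output["labels"] == 1) & (output["scores"] > score_thresh)  # COCO class 1 = person
    return output["boxes"][keep]
```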

For each selected region i in the image of the teacher's in-class behavior, f_i is defined as the mean-pooled convolutional feature of that region, so the image feature vector has dimension 2048. A fully connected layer converts f_i into an h-dimensional vector:

v_i = W_v f_i + b_v

The complete representation of the teacher's in-class behavior image is therefore a set of embedding vectors:

V = {v_1, ..., v_k}, v_i ∈ R^h

where v_i encodes a salient region i and k is the number of regions.
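
A sketch of this region-embedding step as a single linear layer (the 2048-dimensional pooled features follow the text; h = 512 is an illustrative choice):

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Project each mean-pooled region feature f_i (2048-d) to an h-d vector v_i."""
    def __init__(self, in_dim: int = 2048, h: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, h)  # v_i = W_v f_i + b_v

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (k, 2048) -> V = {v_1, ..., v_k}: (k, h)
        return self.proj(region_feats)

# Usage: embed k = 10 region features.
V = RegionEmbedding()(torch.randn(10, 2048))  # V.shape == (10, 512)
```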

Further, the teacher posture information acquisition module 3 includes a keypoint acquisition unit 31 and a normalization unit 32. The keypoint acquisition unit 31 acquires human-pose keypoint information based on the Faster R-CNN object detection model;

the normalization unit 32 performs pose normalization based on the human-pose keypoint information.

In this embodiment, the Faster R-CNN object detection model acquires the human-pose keypoint information, which is then processed by the pose normalization module.

During pose normalization, the keypoint coordinates are estimated and used to compute the transformation matrix.

The process is shown in FIG. 2. The keypoint coordinates are first extracted from the output of the preceding keypoint detection network; the maximum-value point is then selected as the keypoint estimate, yielding the keypoint coordinates.
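
A sketch of extracting keypoint coordinates as the per-channel maxima of the detection network's heatmaps, followed by a least-squares estimate of the normalizing transform (reading the "maximum-value point" as a heatmap argmax, and the reference-pose layout, are assumptions; the patent does not specify them):

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """heatmaps: (num_keypoints, H, W). Take each channel's argmax as the keypoint."""
    num_kp, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(num_kp, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1).astype(np.float64)  # (num_kp, 2)

def normalization_transform(keypoints: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Least-squares affine transform mapping detected keypoints onto a reference pose."""
    n = keypoints.shape[0]
    A = np.hstack([keypoints, np.ones((n, 1))])        # (n, 3) homogeneous coordinates
    T, *_ = np.linalg.lstsq(A, reference, rcond=None)  # (3, 2) transformation matrix
    return T
```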

Further, the specific steps of analyzing the learning state based on the position information, posture information, and student face information are:

constructing and training a multimodal feature fusion network structure based on fully connected layers;

mapping the position information, posture information, and student face information into the feature fusion space for learning-state analysis.

To overcome the limitations of evaluating students' classroom learning states from single-modality data, a multimodal fusion method is adopted here, analyzing the students' classroom learning state from three modalities. The overall flow is shown in FIG. 3.

By constructing and training a multimodal feature fusion network structure based on fully connected layers, multi-dimensional, multi-scale features are mapped into the feature fusion space. The main advantage of this fusion approach is that the model can learn the respective feature parameters of the two parallel networks during training and autonomously coordinate their feedback, achieving end-to-end training of the model.

The Softmax function is chosen here to map the feature vector into a probability sequence, preserving more of the features' original information. Softmax computes the output category y^(i) as follows (formula reconstructed from the surrounding definitions):

P(y^(i)) = e^(η_i) / Σ_{j=1}^{k} e^(η_j)

where η_i is the fused feature value, k is the number of categories, and P(y^(i)) is the probability value of the corresponding category.
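
A minimal sketch of such a fully connected fusion head (the per-modality input dimensions, e.g. a 4-value teacher box, 17 pose keypoints flattened to 34 values, and 68 facial landmarks flattened to 136 values, are illustrative assumptions; the four learning-state classes follow Embodiment 2):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Map position, posture, and face features into a shared fusion space,
    then output a softmax probability over learning-state categories."""
    def __init__(self, dims=(4, 34, 136), fusion_dim: int = 256, num_classes: int = 4):
        super().__init__()
        # One projection per modality, trained jointly end to end.
        self.proj = nn.ModuleList([nn.Linear(d, fusion_dim) for d in dims])
        self.classifier = nn.Linear(fusion_dim * len(dims), num_classes)

    def forward(self, position, posture, face):
        fused = torch.cat([p(x) for p, x in zip(self.proj, (position, posture, face))], dim=-1)
        eta = self.classifier(torch.relu(fused))  # fused feature values eta_i
        return torch.softmax(eta, dim=-1)         # P(y^(i)) = e^(eta_i) / sum_j e^(eta_j)
```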

Stage 1: attend to the teacher-position features and the students' eye offsets in each teacher image, to determine whether a student is looking at the teacher.

Stage 2: analyze the feature regions of the teacher behavior image to determine whether the blackboard or the teacher should be watched at this moment; then compare with the students' eye offsets to determine where each student should be looking.

Stage 3: analyze the students' facial feature maps together with the teacher's behavior and position to derive a similarity measure.

Embodiment 2

Embodiment 2 differs from Embodiment 1 only in that the specific steps of analyzing the learning state based on the position information, posture information, and student face information are:

fusing the position information, posture information, and student face information by a weighted fusion method to obtain weighted fusion features;

inputting the weighted fusion features into the fully connected layers;

obtaining the classification probability distribution of the weighted fusion features.

Through joint learning of the three-modality convolutional neural network, three 3D-model features based on different modalities are obtained. Unlike traditional feature fusion methods that use pooling operations, a statistically based weighted fusion method is adopted here to fuse the three feature vectors. The framework of the method is shown in FIG. 4.

The specific formulas (reconstructed from the surrounding definitions) are:

F = Σ_{i=1}^{3} α_i f_{M_i}

Σ_{i=1}^{3} α_i = 1

where f_{M_i} is the feature vector extracted from modality i and α_i is the weight of that modality. The weighted fusion feature vector is input into fully connected (FC) layers whose dimensions are, in turn, 512, 256, and C, where C is the number of dataset categories. Finally, the classification probability distribution is obtained through a softmax layer. The learning state is divided into composed, average, distracted, and struggling.
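
A sketch of this weighted fusion and classification head; representing the α_i as learnable parameters normalized by a softmax (so they sum to 1) is an assumed realization, while the 512/256/C layer sizes follow the text:

```python
import torch
import torch.nn as nn

class WeightedFusionClassifier(nn.Module):
    """Fuse three same-dimensional modality features with weights alpha_i,
    then classify through FC layers of size 512, 256, and C."""
    def __init__(self, feat_dim: int = 512, num_classes: int = 4):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(3))  # softmax -> weights summing to 1
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, f1, f2, f3):
        w = torch.softmax(self.alpha, dim=0)
        fused = w[0] * f1 + w[1] * f2 + w[2] * f3  # F = sum_i alpha_i * f_Mi
        return torch.softmax(self.mlp(fused), dim=-1)
```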

A correlation loss function is used here to ensure that the modalities guide one another during training, increasing the learning speed of network training and improving the robustness of the final feature vectors.

The specific formulas are as follows:

L_C(M_i, M_j) = ‖ξ(f_{M_i}) − ξ(f_{M_j})‖²

where the squared norm ‖·‖² measures the correlation between two different feature vectors; f denotes the feature vector extracted from a given modality; the subscript of M indexes the data of the first, second, and third modalities; and ξ = sigmoid(log(abs(·))) is a normalizing excitation function. During training, the value of the correlation loss gradually decreases, indicating that the different modal features guide one another; this speeds up training convergence and yields more robust feature vectors. Taking modality M_1 as an example, with this correlation loss design the final loss function of each single-modality network is (reconstructed from the surrounding definitions):

L(M_1) = L_CE(M_1) + L_C(M_1, M_2) + L_C(M_1, M_3)

where L_CE(M_1) denotes the single-modality cross-entropy loss, and L_C(M_1, M_2) and L_C(M_1, M_3) denote the correlation losses between modality M_1 and modalities M_2 and M_3, respectively. Finally, the three single-modality networks are optimized by backpropagation with stochastic gradient descent.
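
A minimal sketch of this objective, assuming ξ is applied elementwise and the squared norm is summed over feature dimensions (the small eps guard against log(0) is an implementation detail added here):

```python
import torch
import torch.nn.functional as F

def xi(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalizing excitation xi(x) = sigmoid(log(|x|))."""
    return torch.sigmoid(torch.log(x.abs() + eps))

def correlation_loss(f_mi: torch.Tensor, f_mj: torch.Tensor) -> torch.Tensor:
    """L_C(M_i, M_j) = || xi(f_Mi) - xi(f_Mj) ||^2"""
    return ((xi(f_mi) - xi(f_mj)) ** 2).sum()

def modality_loss(logits_m1, targets, f_m1, f_m2, f_m3):
    """Final loss for modality M_1: cross-entropy plus correlation to M_2 and M_3."""
    return (F.cross_entropy(logits_m1, targets)
            + correlation_loss(f_m1, f_m2)
            + correlation_loss(f_m1, f_m3))
```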

The above discloses only a preferred embodiment of the present invention, which of course cannot be used to limit its scope of rights. Those of ordinary skill in the art will understand that all or part of the processes for implementing the above embodiment, and equivalent changes made according to the claims of the present invention, still fall within the scope covered by the invention.

Claims (7)

1. A multi-modal learning state analysis system is characterized in that,
comprises an image acquisition module, a teacher position information acquisition module, a teacher posture information acquisition module, a student face information acquisition module and an analysis module,
the image acquisition module is used for collecting image data in a classroom and carrying out preprocessing to obtain a processed image;
the teacher position information acquisition module is used for acquiring the position information of the teacher based on the processed image and the Faster R-CNN target detection model;
the teacher posture information acquisition module is used for acquiring the posture information of the teacher based on the processed image and the Faster R-CNN target detection model;
the student face information acquisition module is used for acquiring student face information based on a processed image and an ERT facial feature point detection method;
the analysis module is used for analyzing the learning state based on the position information, the posture information and the facial information of the student.
2. The multi-modal learning state analysis-based system of claim 1,
the image acquisition module comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for collecting image data of the camera;
and the processing unit is used for standardizing the size of the image data and carrying out normalization processing to obtain a processed image.
3. The multi-modal-based learning state analysis system of claim 2, wherein the image data of the camera is collected by capturing one frame every 5 seconds.
4. The multi-modal learning state analysis-based system of claim 1,
the teacher position information acquisition module comprises a feature map extraction unit, a candidate unit, a region feature map generation unit and a teacher position acquisition unit,
the feature map extraction unit is used for extracting an original feature map based on the processed image;
the candidate unit is used for inputting the original feature map into a candidate frame extraction network to generate a region candidate frame;
the region feature map generation unit is used for mapping the region candidate frame onto the original feature map and pooling it into a region feature map;
and the teacher position acquisition unit is used for inputting the regional characteristic diagram into a Faster R-CNN target detection model to acquire teacher position information.
5. The multi-modal learning state analysis-based system of claim 4,
the teacher posture information acquisition module comprises a key point acquisition unit and a normalization unit, wherein the key point acquisition unit is used for acquiring key point information of the posture of the human body based on a Faster R-CNN target detection model;
and the normalization unit is used for carrying out posture normalization module processing based on the human posture key point information.
6. The multi-modal learning state analysis-based system of claim 1,
the specific steps of analyzing the learning state based on the position information, the posture information and the student face information are as follows:
constructing and training a multi-modal feature fusion network structure based on a full connection layer;
and mapping the position information, the posture information and the student face information to a feature fusion space for learning state analysis.
7. The multi-modal learning state analysis-based system of claim 1,
the specific steps of analyzing the learning state based on the position information, the posture information and the student face information are as follows:
fusing position information, posture information and student face information by adopting a weighted fusion method to obtain weighted fusion characteristics;
inputting the weighted fusion features into the full connection layer;
and obtaining the classification probability distribution of the weighted fusion characteristics.
CN202210027041.4A 2022-01-11 2022-01-11 Multi-mode learning state analysis system Pending CN114663910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210027041.4A CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210027041.4A CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Publications (1)

Publication Number Publication Date
CN114663910A true CN114663910A (en) 2022-06-24

Family

ID=82025652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210027041.4A Pending CN114663910A (en) 2022-01-11 2022-01-11 Multi-mode learning state analysis system

Country Status (1)

Country Link
CN (1) CN114663910A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109165552A (en) * 2018-07-14 2019-01-08 深圳神目信息技术有限公司 A kind of gesture recognition method based on human body key point, system and memory
CN111652045A (en) * 2020-04-17 2020-09-11 西北工业大学太仓长三角研究院 Classroom teaching quality assessment method and system
US20210060787A1 (en) * 2019-08-29 2021-03-04 Wuyi University Education assisting robot and control method thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596347A (en) * 2023-07-17 2023-08-15 泰山职业技术学院 Multi-disciplinary interaction teaching system and teaching method based on cloud platform
CN116596347B (en) * 2023-07-17 2023-09-29 泰山职业技术学院 Multi-disciplinary interaction teaching system and teaching method based on cloud platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination