
CN113255616B - Video behavior identification method based on deep learning - Google Patents


Info

Publication number: CN113255616B
Authority: CN (China)
Prior art keywords: video, feature, module, feature extraction, inter
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202110764936.1A
Other languages: Chinese (zh)
Other versions: CN113255616A (en)
Inventors: 胡谋法, 王珏, 卢焕章, 张瑶, 张路平, 沈杏林, 肖山竹, 陶华敏, 赵菲, 邓秋群
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Application filed by: National University of Defense Technology
Priority: CN202110764936.1A
Publication of application: CN113255616A
Application granted; publication of grant: CN113255616B

Classifications

    • G06V20/40: Scenes; scene-specific elements in video content
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a video behavior recognition method based on deep learning, in which an ordinary 2D network serves as the backbone of the recognition network, inter-frame features are extracted with a bilinear operation, and intra-frame and inter-frame information are then fused into highly discriminative spatiotemporal features for behavior classification. With only a small number of added parameters, the 2D model gains the ability to process three-dimensional video information; compared with a traditional 3D convolutional network, it reduces the computational load while further improving the accuracy of behavior recognition. The method is particularly suitable for settings that require real-time video analysis with limited resources, and has broad application prospects in fields such as intelligent security and autonomous driving.

Description

A Video Behavior Recognition Method Based on Deep Learning

Technical Field

The present application relates to the technical field of video information processing, and in particular to a video behavior recognition method based on deep learning.

Background

In recent years, with the development and popularization of multimedia technology, high-speed Internet technology, and large-capacity storage devices, video and image resources on the Internet have grown explosively. Compared with static pictures, videos carry a larger and richer volume of information and have become an important information carrier in modern society. At present, most video content analysis tasks rely on human labor; for massive data, however, manual processing is time-consuming, labor-intensive, and expensive, and omissions are inevitable, so intelligent video analysis technology is urgently needed. Since AlexNet came to prominence in 2012, deep convolutional neural networks have dominated the field of computer vision, achieving breakthroughs in many visual tasks including image classification and object detection, and have been successfully commercialized, changing people's way of life. Compared with these great achievements in image analysis, however, deep neural networks have shown good potential in video analysis but have not yet reached satisfactory results. The essential reasons are the high spatiotemporal complexity of video signals and the accompanying huge computational cost; how to design a reasonable and efficient network structure is still under research and exploration.

A video has one more dimension, time, than an image signal. It is generally believed that the motion information between frames plays a decisive role in video behavior recognition, but how to extract effective inter-frame motion information has not been well resolved. A popular and effective approach at present is to use 3D convolution kernels in deep neural networks, a natural extension of the 2D convolution used in image recognition; the resulting model is also trainable end to end. Relatively advanced video behavior recognition models such as I3D perform recognition with deep convolutional networks built this way and, by training on large datasets and then fine-tuning on small ones, have achieved leading results on multiple benchmark sets.

A 3D convolution kernel extracts spatiotemporal features by directly fitting local neighboring data across adjacent frames. Although the effect is good, it suffers from a large number of parameters and complicated computation, and is prone to overfitting. Some simplified techniques, such as P3D and R3D, replace 3D convolution with a 2D+1D form and have also achieved good results. In general, however, inter-frame feature extraction is still deficient and recognition performance remains to be improved.

Summary of the Invention

In view of the above technical problems, it is necessary to provide a video behavior recognition method based on deep learning.

A video behavior recognition method based on deep learning, the method comprising:

Acquiring video data, and preprocessing the video data to obtain training samples.

Constructing a video behavior recognition network. The video behavior recognition network is a convolutional neural network that uses the two-dimensional convolutional neural network ResNet as its backbone and inserts inter-frame temporal information extraction modules into the backbone. The ResNet backbone extracts static features of targets in the video; the inter-frame temporal information extraction module augments the backbone, using a bilinear operation to extract inter-frame features.

Training the video behavior recognition network with the training samples and optimizing its parameters to obtain a trained video behavior recognition network model.

Acquiring a video to be recognized, preprocessing it, and inputting the preprocessed video into the video behavior recognition network model to obtain a video behavior classification result.

In one embodiment, acquiring video data and preprocessing it to obtain training samples includes:

Acquiring video data.

Using dense sampling to randomly extract several consecutive frames from the video data to form a video block.

Scaling the images in the video block to 120×160 pixels, and randomly cropping a 112×112-pixel region from them.

Dividing the gray values of the cropped images by 255 to map them into the range [0, 1].

Performing mean-subtraction normalization on each of the three RGB channels of the cropped images.

Randomly flipping the video block horizontally with 50% probability to obtain the training samples.
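The preprocessing steps above can be sketched as follows. This is an illustrative sketch only: the per-channel mean and standard deviation values are assumptions (ImageNet-style statistics, since the text only specifies mean-subtraction normalization), and the nearest-neighbour resize is a simple stand-in for a real interpolator.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_resize(img, out_h, out_w):
    """Naive nearest-neighbour resize (stand-in for a real interpolator)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def preprocess_clip(frames, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Resize each frame to 120x160, take one random 112x112 crop shared by
    all frames, scale to [0, 1], normalize per RGB channel, and flip the
    whole clip horizontally with 50% probability."""
    resized = np.stack([nearest_resize(f, 120, 160) for f in frames])
    top = rng.integers(0, 120 - 112 + 1)
    left = rng.integers(0, 160 - 112 + 1)
    crop = resized[:, top:top + 112, left:left + 112, :].astype(np.float32)
    crop /= 255.0
    crop = (crop - np.array(mean)) / np.array(std)   # assumed channel stats
    if rng.random() < 0.5:
        crop = crop[:, :, ::-1, :]                   # flip the width axis
    return crop

clip = rng.integers(0, 256, size=(8, 240, 320, 3), dtype=np.uint8)  # 8 frames
sample = preprocess_clip(clip)
print(sample.shape)  # (8, 112, 112, 3)
```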

In one embodiment, training the video behavior recognition network with the training samples and optimizing its parameters to obtain the trained model includes:

Splitting the training samples into a training set and a test set.

Inputting the training set into the video behavior recognition network for training to obtain video behavior prediction results.

According to the prediction results and the test set, optimizing the network parameters with stochastic gradient descent with momentum based on the cross-entropy loss, obtaining the trained video behavior recognition network model.
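The optimization step, cross-entropy loss minimized by momentum SGD, can be sketched on a toy linear classifier; every size and hyperparameter below is illustrative, not a value from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: 64 pooled feature vectors of dimension 16, 5 action classes.
X = rng.normal(size=(64, 16))
y = rng.integers(0, 5, size=64)
W = rng.normal(scale=0.1, size=(16, 5))
v = np.zeros_like(W)          # momentum buffer
lr, momentum = 0.1, 0.9       # illustrative hyperparameters

losses = []
for _ in range(200):
    p = softmax(X @ W)
    loss = -np.log(p[np.arange(64), y] + 1e-12).mean()  # cross-entropy
    losses.append(loss)
    grad = X.T @ (p - np.eye(5)[y]) / 64                # dL/dW
    v = momentum * v - lr * grad                        # momentum update
    W += v

print(round(losses[0], 3), round(losses[-1], 3))
```

Because the loss is convex in W, the loss decreases steadily under this update; in the real network the same update is applied to all layer parameters via backpropagation.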

In one embodiment, the video behavior recognition network consists of one first feature extraction submodule, three second feature extraction submodules, one third feature extraction submodule, and one fully connected layer. The first feature extraction submodule consists of a convolutional layer and a max-pooling layer; the second feature extraction submodule consists of a spatiotemporal feature extraction module and a max-pooling layer; the third feature extraction submodule consists of a spatiotemporal feature extraction module and a global pooling layer.

Inputting the training set into the video behavior recognition network for training to obtain the video behavior prediction results includes:

Inputting the training set into the convolutional layer of the first feature extraction submodule to obtain the first convolutional features, and inputting these into the max-pooling layer of the first feature extraction submodule for spatial max pooling to obtain the first max-pooled features.

Inputting the first max-pooled features into the spatiotemporal feature extraction module of the first second-feature-extraction submodule to obtain the first spatiotemporal fusion features.

Inputting the first spatiotemporal fusion features into the max-pooling layer of the first second-feature-extraction submodule to obtain the second max-pooled features.

Inputting the second max-pooled features into the second second-feature-extraction submodule to obtain the third max-pooled features.

Inputting the third max-pooled features into the third second-feature-extraction submodule to obtain the fourth max-pooled features.

Inputting the fourth max-pooled features into the spatiotemporal feature extraction module of the third feature extraction submodule to obtain spatiotemporal fusion features, and inputting these into the global pooling layer of the third feature extraction submodule to obtain global pooled features.

Inputting the global pooled features into the fully connected layer, with softmax as the activation function, to obtain the video behavior prediction results.
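The final stage, global pooling followed by a fully connected layer with softmax, might look like the minimal sketch below; the feature-map shape and the number of classes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def classification_head(feature_map, W, b):
    """Global average pooling over (T, H, W), then a fully connected
    layer with a softmax activation, as in the final stage above."""
    pooled = feature_map.mean(axis=(0, 1, 2))        # (C,)
    logits = pooled @ W + b                          # (num_classes,)
    e = np.exp(logits - logits.max())                # stable softmax
    return e / e.sum()

feat = rng.normal(size=(8, 7, 7, 512))   # hypothetical: T=8, 7x7 spatial, 512 channels
W = rng.normal(scale=0.01, size=(512, 10))  # 10 hypothetical behavior classes
b = np.zeros(10)
probs = classification_head(feat, W, b)
print(probs.shape, float(probs.sum()))
```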

In one embodiment, the spatiotemporal feature extraction module consists of several residual modules and inter-frame temporal information extraction modules connected alternately in series. The residual module is the basic building block of the ResNet network. The inter-frame temporal information extraction module includes an inter-frame temporal feature extraction unit and a feature fusion unit; the former contains a bilinear-operation convolutional layer for extracting temporal features, and the latter contains a convolutional layer for feature fusion.

Inputting the first max-pooled features into the spatiotemporal feature extraction module of the first second-feature-extraction submodule to obtain the first spatiotemporal fusion features includes:

Inputting the first max-pooled features into the first residual module of that spatiotemporal feature extraction module to obtain deep spatial features.

Inputting the deep spatial features into the first inter-frame temporal information extraction module of that spatiotemporal feature extraction module to obtain fused features.

Inputting the fused features into the second residual module and inter-frame temporal information extraction module, and repeating until the features have passed through all residual modules and inter-frame temporal information extraction modules in the first second-feature-extraction submodule, obtaining the first fusion features.
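The alternation of residual modules and inter-frame modules can be sketched abstractly as below. The bodies are placeholders standing in for the real convolutional units, not the patent's layers; zero-initializing the fusion weights (per the initialization scheme described later) makes each inter-frame module start out as an identity mapping.

```python
import numpy as np

rng = np.random.default_rng(3)

def residual_block(x, W):
    """Placeholder 2D residual unit: y = x + g(x) (the real one uses convs)."""
    return x + np.tanh(x @ W)

def interframe_module(x, U, V, Wf):
    """Placeholder inter-frame unit: a bilinear interaction between each
    frame and its successor, fused back into the intra-frame features."""
    nxt = np.roll(x, -1, axis=0)          # pair frame t with frame t+1
    inter = (x @ U) * (nxt @ V)           # factorised bilinear term
    return x + inter @ Wf                 # feature fusion back into x

T, C, p = 8, 32, 4                        # illustrative sizes
x = rng.normal(size=(T, C))
for _ in range(3):                        # three alternating stages
    x = residual_block(x, rng.normal(scale=0.1, size=(C, C)))
    x = interframe_module(x,
                          rng.normal(scale=0.1, size=(C, p)),
                          rng.normal(scale=0.1, size=(C, p)),
                          np.zeros((p, C)))  # zero-init fusion: identity at start
print(x.shape)
```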

In one embodiment, before inputting the training set into the video behavior recognition network for training, the method further includes:

Initializing the backbone parameters of the video behavior recognition network with the parameters of a TSN model pretrained on the Kinetics-400 dataset.

Initializing the parameters of the inter-frame temporal feature extraction unit in the inter-frame temporal information extraction module to random numbers, and the parameters of the feature fusion unit to zero.

Initializing the parameters of the fully connected layer to random numbers.

In one embodiment, acquiring the video to be recognized, preprocessing it, and inputting the preprocessed video into the video behavior recognition network model to obtain the classification result includes:

Acquiring the video to be recognized and sampling it uniformly to obtain several video sequences of equal length.

Scaling the images in the video sequences to 120×160 pixels, cropping the central 112×112-pixel region, dividing the gray values of the cropped images by 255 to map them into the range [0, 1], and performing mean-subtraction normalization on each of the three RGB channels.

Inputting the processed video sequences into the video behavior recognition network model to obtain classification prediction scores.

Averaging the prediction scores and taking the class with the highest average score as the video behavior classification result.
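The inference procedure, uniform clip sampling, per-clip scoring, and score averaging, can be sketched as follows; the toy model standing in for the trained network is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def predict_video(frames, clip_len, model):
    """Split the video into equal-length clips, score each clip with the
    model, average the class scores, and return the argmax class."""
    n_clips = len(frames) // clip_len
    scores = [model(frames[i * clip_len:(i + 1) * clip_len])
              for i in range(n_clips)]
    mean_score = np.mean(scores, axis=0)
    return int(mean_score.argmax()), mean_score

def toy_model(clip):
    """Stand-in for the trained network: favours class 2 regardless of input."""
    return np.array([0.1, 0.2, 0.6, 0.1]) + rng.normal(scale=0.01, size=4)

video = rng.normal(size=(64, 112, 112, 3))   # 64-frame dummy video
label, avg = predict_video(video, 16, toy_model)
print(label)  # 2
```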

In the above deep-learning-based video behavior recognition method, the recognition network uses an ordinary 2D network as its backbone and a bilinear operation to extract inter-frame features, then fuses intra-frame and inter-frame information into highly discriminative spatiotemporal features for behavior classification. Adding only a small number of parameters gives the 2D model the ability to process three-dimensional video information; compared with a traditional 3D convolutional network, it reduces the computational load while further improving recognition accuracy. The invention is particularly suitable for settings that require real-time video analysis with limited resources, and has broad application prospects in fields such as intelligent security and autonomous driving.

Brief Description of the Drawings

Fig. 1 is a schematic flowchart of a deep-learning-based video behavior recognition method in one embodiment;

Fig. 2 is a schematic structural diagram of the inter-frame temporal information extraction module in one embodiment;

Fig. 3 is a structural diagram of a video behavior recognition network with ResNet-34 as the backbone in one embodiment.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it.

In one embodiment, as shown in Fig. 1, a deep-learning-based video behavior recognition method is provided, comprising the following steps:

Step 100: Acquire video data and preprocess it to obtain training samples.

A training sample is a sample in image format obtained by sampling the video data and then performing image processing.

Step 102: Construct the video behavior recognition network.

The video behavior recognition network is a convolutional neural network that uses the two-dimensional convolutional neural network ResNet as its backbone, with inter-frame temporal information extraction modules inserted into the backbone.

The ResNet backbone extracts static features of targets in the video.

The inter-frame temporal information extraction module augments the backbone, using a bilinear operation to extract inter-frame features.

The inter-frame temporal feature extraction module includes a bilinear-operation convolutional layer for temporal feature extraction and a convolutional layer for feature fusion.

Step 104: Train the video behavior recognition network with the training samples and optimize its parameters to obtain the trained model.

Step 106: Acquire the video to be recognized, preprocess it, and input the preprocessed video into the video behavior recognition network model to obtain the video behavior classification result.

In the above deep-learning-based video behavior recognition method, the recognition network uses an ordinary 2D network as its backbone and a bilinear operation to extract inter-frame features, then fuses intra-frame and inter-frame information into highly discriminative spatiotemporal features for behavior classification. Adding only a small number of parameters gives the 2D model the ability to process three-dimensional video information; compared with a traditional 3D convolutional network, it reduces the computational load while further improving recognition accuracy. The invention is particularly suitable for settings that require real-time video analysis with limited resources, and has broad application prospects in fields such as intelligent security and autonomous driving.

在其中一个实施例中,步骤100还包括:获取视频数据;采用密集采样法在视频数据中随机抽取连续若干帧图像组成视频块;将视频块中的图像缩放为120像素×160像素大小,并从中随机裁剪112像素×112像素大小的图像;将剪裁后图像的灰度除以255,映射到[0,1]的数值区间范围;对裁剪后图像的RGB三个通道分别进行去均值归一化操作;对视频块在水平方向以50%概率随机翻转,得到训练样本。In one embodiment, step 100 further includes: acquiring video data; randomly extracting several consecutive frames of images from the video data by using a dense sampling method to form a video block; scaling the images in the video block to a size of 120 pixels×160 pixels, and Randomly crop an image with a size of 112 pixels × 112 pixels; divide the grayscale of the cropped image by 255, and map it to the numerical range of [0, 1]; de-mean and normalize the three RGB channels of the cropped image respectively operation; randomly flip the video blocks in the horizontal direction with a probability of 50% to obtain training samples.

在其中一个实施例中,步骤104还包括:将训练样本进行分类,得到训练集和测试集;将训练集输入到视频行为识别网络中进行网络训练,得到视频行为预测分类结果;根据视频行为预测分类结果和测试集,采用基于交叉熵损失的带动量的随机梯度下降法对视频行为识别网络进行参数优化,得到训练好的视频行为识别网络模型。In one embodiment, step 104 further includes: classifying the training samples to obtain a training set and a test set; inputting the training set into a video behavior recognition network for network training to obtain a video behavior prediction classification result; The classification results and the test set are used to optimize the parameters of the video action recognition network using the stochastic gradient descent method with momentum based on the cross entropy loss, and the trained video action recognition network model is obtained.

在其中一个实施例中,视频行为识别网络由1个第一特征提取子模块、3个第二特征提取子模块、1个第三特征提取子模块以及1个全连接层组成;第一特征提取子模块由1个卷积层和1个最大池化层组成;第二特征提取子模块由1个时空特征提取模块和最大池化层组成;第三特征提取子模块由1个时空特征提取模块以及全局池化层组成。步骤104还包括:将训练集输入到第一特征提取子模块的卷积层中,得到第一卷积特征,将第一卷积特征输入到第一特征提取子模块的最大池化层进行空域最大值池化,得到第一最大值池化特征;将第一最大值池化特征输入到第一个第二特征提取子模块的时空特征提取模块中,得到第一时空融合特征;将第一时空融合特征输入到第一个第二特征提取子模块的最大池化层中,得到第二最大值池化特征;将第二最大值池化特征输入到第二个第二特征提取子模块中,得到第三最大值池化特征;将第三最大值池化特征输入到第三个第二特征提取子模块中,得到第四最大值池化特征;将第四最大值池化特征输入到第三特征提取子模块的时空特征提取模块中,得到时空融合特征;并将时空融合特征输入到第三特征提取子模块的全局池化层,得到全局池化特征;将全局池化特征输入到全连接层,采用softmax作为激活函数,得到视频行为预测分类结果。In one of the embodiments, the video behavior recognition network consists of a first feature extraction sub-module, three second feature extraction sub-modules, a third feature extraction sub-module and a fully connected layer; the first feature extraction sub-module The submodule consists of a convolutional layer and a max pooling layer; the second feature extraction submodule consists of a spatiotemporal feature extraction module and a max pooling layer; the third feature extraction submodule consists of a spatiotemporal feature extraction module And the global pooling layer composition. Step 104 further includes: inputting the training set into the convolutional layer of the first feature extraction submodule to obtain the first convolutional feature, and inputting the first convolutional feature into the maximum pooling layer of the first feature extraction submodule for spatial domain analysis. 
Maximum pooling to obtain the first maximum pooling feature; input the first maximum pooling feature into the spatiotemporal feature extraction module of the first second feature extraction sub-module to obtain the first spatiotemporal fusion feature; The spatiotemporal fusion feature is input into the maximum pooling layer of the first second feature extraction sub-module to obtain the second maximum pooling feature; the second maximum pooling feature is input into the second second feature extraction sub-module , obtain the third maximum pooling feature; input the third maximum pooling feature into the third second feature extraction sub-module to obtain the fourth maximum pooling feature; input the fourth maximum pooling feature into In the spatio-temporal feature extraction module of the third feature extraction sub-module, the spatio-temporal fusion features are obtained; the spatio-temporal fusion features are input into the global pooling layer of the third feature extraction sub-module to obtain the global pooling features; the global pooling features are input into The fully connected layer uses softmax as the activation function to obtain the video behavior prediction classification results.

残差模块是Resnet系列卷积神经网络中的基本组成单元。The residual module is the basic unit in the Resnet series of convolutional neural networks.

在其中一个实施例中,时空特征提取模块是由若干个残差模块和帧间时域信息提取模块交替串联组成;残差模块为Resnet网络的基本组成单元;帧间时域信息提取模块包括:帧间时域特征提取单元和特征融合单元;帧间时域特征提取单元包括用于提取时域特征的双线性操作卷积层;特征融合单元包括用于特征融合的卷积层。步骤104还包括:将第一最大值池化特征输入到第一个第二特征提取子模块的时空特征提取模块中的第一个残差模块得到深层空域特征;将深层空域特征输入到第一个第二特征提取子模块的时空特征提取模块中的第一个帧间时域信息提取模块,得到融合特征;将融合特征输入到第一个第二特征提取子模块的第二个残差模块和帧间时域信息提取模块,如此重复,直到特征信息通过第一个第二特征提取子模块中的所有的残差模块和帧间时域信息提取模块为止,得到第一融合特征。In one embodiment, the spatiotemporal feature extraction module is composed of several residual modules and inter-frame time-domain information extraction modules alternately connected in series; the residual module is a basic component unit of the Resnet network; the inter-frame time-domain information extraction module includes: The inter-frame time-domain feature extraction unit and the feature fusion unit; the inter-frame time-domain feature extraction unit includes a bilinear operation convolution layer for extracting time-domain features; the feature fusion unit includes a convolution layer for feature fusion. Step 104 further includes: inputting the first maximum pooling feature into the first residual module in the spatiotemporal feature extraction module of the first second feature extraction sub-module to obtain deep spatial features; The first inter-frame time domain information extraction module in the spatiotemporal feature extraction module of the second feature extraction sub-module to obtain the fusion feature; input the fusion feature to the second residual module of the first second feature extraction sub-module and the inter-frame time domain information extraction module, and so on, until the feature information passes through all the residual modules and the inter-frame time domain information extraction module in the first and second feature extraction sub-modules to obtain the first fusion feature.

帧间时域特征提取单元采用双线性操作提取帧间信息特征。The inter-frame temporal feature extraction unit adopts bilinear operation to extract inter-frame information features.

在另一个实施例中,帧间时域信息提取模块的设计思路如下:In another embodiment, the design idea of the inter-frame time domain information extraction module is as follows:

The inter-frame temporal information extraction module has two parts: an inter-frame temporal feature extraction unit that extracts inter-frame features with a bilinear operation, and a feature fusion unit that fuses the inter-frame features with the intra-frame features.

The traditional 3D-decomposition approach extracts inter-frame features with a 1D convolution along the temporal axis. Although computationally simple, it is essentially a linear fit, with limited modeling capacity and weak feature extraction performance. The present invention instead uses a bilinear operation to extract temporal features at corresponding positions of consecutive frames. The bilinear operation is essentially a second-order fit that has been widely used in fine-grained image recognition and better captures the changes between adjacent frame images. It is computed as follows:

Y_k = \sum_{i=1}^{C} \sum_{j=1}^{C} w_{ijk} \, x_i \, z_j    (1)

where Y_k is the k-th component of the output feature vector Y; x and z denote the feature vectors at the corresponding position in the previous and subsequent frames; C is the dimension of the spatial features extracted by the two-dimensional convolution, i.e., the dimension of x and z, with x_i and z_j their i-th and j-th components. If the output feature vector Y also has dimension C, then w ∈ R^{C×C×C} is the bilinear fitting parameter tensor, whose parameter count is clearly far larger than that of an ordinary one-dimensional convolution. To simplify the computation, w can be decomposed as w_{ijk} = \sum_{m=1}^{p} a_{km} u_{mi} v_{mj}, where p, a model hyperparameter, determines the complexity of the decomposition; formula (1) can then be expanded approximately as follows:

Y_k ≈ \sum_{m=1}^{p} a_{km} \left( \sum_{i=1}^{C} u_{mi} x_i + \sum_{j=1}^{C} v_{mj} z_j \right)^2    (2)

Inside the parentheses of formula (2) is a conventional 1D temporal convolution; the squaring operation introduces the quadratic term, and the computation outside the parentheses is linear and can be implemented with a 1×1×1 convolution. The bilinear operation can therefore be approximated by a two-layer convolution with a squaring step in between, where the hyperparameter p is the number of output channels of the first convolutional layer. Because features in the same channel of adjacent frames are more strongly correlated, grouped convolution is used in place of regular convolution, which further reduces the parameter count. With the number of groups set to 4, a first-layer temporal receptive field of 3, and a first-layer output channel number of p, the parameter count of the bilinear operation drops from cubic in the feature dimension to roughly linear in it.
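As a sanity check on the description above, the following numpy sketch computes the full bilinear operation of formula (1) and compares parameter budgets with the factorised two-layer form. This is illustrative only: the patent's exact channel numbers appear as image placeholders, so C and p below are assumed example values.

```python
import numpy as np

# Full bilinear operation of formula (1) at one spatial position:
# x and z are the C-dim feature vectors at the same position in two
# adjacent frames; w is the C x C x C bilinear parameter tensor.
rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal(C)          # features from frame t
z = rng.standard_normal(C)          # features from frame t+1
w = rng.standard_normal((C, C, C))

# Y_k = sum_ij w[i, j, k] * x[i] * z[j]
Y = np.einsum('ijk,i,j->k', w, x, z)
assert Y.shape == (C,)

# Parameter budget: full bilinear tensor vs the grouped 3x1x1 conv
# + square + 1x1x1 conv approximation (groups = 4, temporal kernel 3,
# p assumed equal to the feature dimension).
C_big, p, groups, kt = 256, 256, 4, 3
full = C_big ** 3
factorised = kt * (C_big // groups) * p + p * C_big
print(full, factorised)             # factorised is orders of magnitude smaller
assert factorised < full // 100
```

The einsum line is a direct transcription of the double sum in formula (1); the parameter comparison shows why the decomposition matters at realistic channel counts.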

The extracted inter-frame features must be fused with the original spatial features to obtain the spatiotemporal features of the current layer. To limit the impact on the original network's output, the weighted fusion scheme of the NonLocal network is adopted, implemented as follows:

Z = X + W \cdot Y    (3)

where Z is the fused feature, X the spatial feature, Y the inter-frame temporal feature, and W the weighting coefficient. When W is initialized to 0, the output fused feature equals the input spatial feature, i.e., the module is an identity mapping; it therefore has no initial effect on the original network structure and can make better use of the backbone's pretrained parameters.
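The zero-initialization argument can be sketched in a few lines of numpy (toy tensors only; the 0.5 weight below is an arbitrary stand-in for a trained value):

```python
import numpy as np

# Weighted residual fusion of formula (3): Z = X + W * Y.
# With the fusion weight initialised to zero, the module is an identity
# mapping, so inserting it does not disturb the pretrained backbone.
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))    # spatial features
Y = rng.standard_normal((4, 8))    # inter-frame temporal features

Z_init = X + 0.0 * Y               # W = 0 at initialisation
assert np.array_equal(Z_init, X)   # exact identity

Z_trained = X + 0.5 * Y            # assumed non-zero W after training
assert not np.array_equal(Z_trained, X)
```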

A schematic of the inter-frame temporal information extraction module is shown in Figure 2. The spatial features are fed to a convolutional layer with a 3×1×1 kernel (the first convolution) to obtain convolutional features; these pass through a squaring layer that introduces the quadratic term, then through a convolutional layer with a 1×1×1 kernel (the second convolution), whose output is the inter-frame temporal feature. The inter-frame temporal feature is fed to a further 1×1×1 convolutional layer, and the resulting output is added to the input spatial features to produce the fused features.
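The data flow of Figure 2 can be sketched at a single spatial position with plain numpy. This is a minimal illustration, not the patented implementation: grouping is omitted for clarity and all weights are random stand-ins.

```python
import numpy as np

# Inter-frame module at one spatial position: 3x1x1 temporal conv ->
# square -> 1x1x1 conv -> 1x1x1 fusion conv -> residual add.
rng = np.random.default_rng(2)
T, C, p = 16, 8, 8                  # frames, channels, hidden channels
X = rng.standard_normal((T, C))     # spatial features over time

W1 = rng.standard_normal((3, C, p)) * 0.1   # temporal kernel of size 3
W2 = rng.standard_normal((p, C)) * 0.1      # 1x1x1 pointwise conv
Wf = np.zeros((C, C))                       # fusion conv, zero-initialised

Xpad = np.pad(X, ((1, 1), (0, 0)))          # pad the time axis
# temporal convolution over each 3-frame window
H = np.stack([np.einsum('kc,kcp->p', Xpad[t:t + 3], W1) for t in range(T)])
H = H ** 2                                  # squaring introduces quadratic terms
Y = H @ W2                                  # inter-frame temporal features
Z = X + Y @ Wf                              # weighted fusion (identity at init)
assert np.array_equal(Z, X)
```

With the fusion weights `Wf` zero-initialised, the whole module reduces to an identity at the start of training, matching the design goal stated above.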

In one embodiment, before step 104 the method further includes: initializing the backbone parameters of the video behavior recognition network with TSN-model parameters pretrained on the kinetics400 dataset; initializing the parameters of the inter-frame temporal feature extraction unit in the inter-frame temporal information extraction module to random numbers and the parameters of its feature fusion unit to 0; and initializing the parameters of the fully connected layer to random numbers.

The convolutional-layer parameters of the bilinear operation are the weights in formula (2), i.e., the parameters of the first two convolutional layers in Figure 2. The bilinear operation differs from traditional linear convolution: it is essentially a linear combination of quadratic terms of the feature vector, whereas traditional linear convolution is a linear combination of its first-order terms.

In one embodiment, step 106 further includes: acquiring the video to be recognized and sampling it uniformly into several equal-length video sequences; scaling the images in each sequence to 120×160 pixels, cropping the central 112×112-pixel region, dividing the grayscale of the cropped images by 255 to map them to [0, 1], and de-mean-normalizing each of the three RGB channels; inputting the processed sequences into the video behavior recognition network model to obtain classification prediction scores; and averaging the prediction scores and taking the category with the highest average score as the video behavior classification result.
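The score-averaging step above can be sketched as follows; the per-segment scores are made-up stand-ins for network outputs:

```python
import numpy as np

# Clip-level aggregation: per-segment class scores are averaged and the
# argmax of the mean score gives the predicted behavior category.
scores = np.array([
    [0.1, 0.7, 0.2],   # segment 1 class scores
    [0.2, 0.5, 0.3],   # segment 2
    [0.3, 0.4, 0.3],   # segment 3
])
mean_scores = scores.mean(axis=0)
predicted_class = int(np.argmax(mean_scores))
print(predicted_class)   # class 1 has the highest average score
```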

In a specific embodiment, with the ucf101 dataset as training samples and Resnet34 as the 2D backbone, the video behavior recognition model classifies the behavior categories in the dataset through the following steps.

Step 1: Obtain the data.

Download and prepare the ucf101 dataset, decode the videos frame by frame into image format, and store the frames for network training and testing.

ucf101 contains 101 behavior categories and about 13k videos in total. The first of the officially provided splits is used to divide the training and test sets: 9537 videos for training and 3743 for testing.

Randomly extract 16 consecutive frames from a video to form a video block, then preprocess it: (1) scale the original images to 120×160 and randomly crop a 112×112 image from each; (2) divide the image grayscale by 255, mapping it to [0, 1]; (3) de-mean-normalize the three RGB channels of the cropped images using the imagenet normalization statistics, with per-channel means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225]; (4) randomly flip the video block horizontally with 50% probability to augment the original data. These steps yield the network's final input, of size 16 (time) × 112 (height) × 112 (width) × 3 (channels).
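The normalization portion of this pipeline can be sketched per frame as below; resizing and cropping are elided, so the input is assumed to be an already-cropped 112×112 uint8 RGB frame:

```python
import numpy as np

# Per-frame preprocessing: map to [0, 1], then de-mean normalisation
# per RGB channel with the imagenet statistics quoted above.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

frame = np.full((112, 112, 3), 128, dtype=np.uint8)   # dummy grey frame
x = frame.astype(np.float32) / 255.0                  # map to [0, 1]
x = (x - mean) / std                                  # per-channel normalise

assert x.shape == (112, 112, 3)
assert np.allclose(x[0, 0, 0], (128 / 255 - 0.485) / 0.229)
```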

Step 2: Build the video behavior recognition network.

Resnet34 is used as the backbone. It contains 4 residual module groups, each comprising several residual modules; an inter-frame information extraction module is inserted after each residual module. After every residual module group except the last, spatial max pooling reduces the spatial size of the feature maps; the temporal dimension is not pooled. After the last group, global pooling produces the final 512-dimensional feature vector, which is fed to a fully connected layer whose output dimension is set to 101, with softmax as the activation function. The network's forward pass outputs the probability that the input sample belongs to each class. The structure of the video behavior recognition network with a Resnet34 backbone is shown in Figure 3.
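The classification head described here can be sketched in numpy; the weights below are random placeholders, not trained parameters:

```python
import numpy as np

# Classification head: a 512-dim globally pooled feature is mapped to
# 101 logits by a fully connected layer and turned into class
# probabilities with softmax.
rng = np.random.default_rng(3)
feat = rng.standard_normal(512)             # global pooling output
W = rng.standard_normal((512, 101)) * 0.01  # fully connected layer
b = np.zeros(101)

logits = feat @ W + b
probs = np.exp(logits - logits.max())       # numerically stable softmax
probs /= probs.sum()

assert probs.shape == (101,)
assert np.isclose(probs.sum(), 1.0)
```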

The Resnet34 backbone is initialized with TSN-model parameters pretrained on the kinetics400 dataset; the inter-frame temporal feature extraction unit of the inter-frame information extraction module is randomly initialized and its fusion convolutional layer is initialized to all zeros; the final fully connected layer is randomly initialized.

Step 3: Obtain the network parameters.

The network is trained with stochastic gradient descent with momentum, and the parameters are optimized with the standard cross-entropy loss. The batch size is 128, the initial learning rate is 0.001, and the momentum is 0.9; the learning rate is divided by 10 at the 10th epoch, and training runs for 20 epochs in total, yielding the trained video behavior recognition network.
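The optimizer schedule can be sketched on a toy problem; the quadratic "loss" below is a stand-in for the real cross-entropy objective, used only to show the momentum update and step decay:

```python
import numpy as np

# Momentum SGD with a step learning-rate schedule: lr divided by 10 at
# epoch 10, 20 epochs in total, as in the embodiment above.
lr, momentum = 0.001, 0.9
w, v = np.array([5.0]), np.array([0.0])

for epoch in range(20):
    if epoch == 10:
        lr /= 10                    # step decay at the 10th epoch
    grad = 2 * w                    # gradient of the toy loss w**2
    v = momentum * v - lr * grad    # momentum (velocity) update
    w = w + v

assert 0 < w[0] < 5.0               # the toy loss decreased
```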

Step 4: Classify video behaviors with the trained video behavior recognition network.

The training in steps 2 and 3 yields the optimal network model parameters, and the network is used to predict the behavior categories of the videos in the test set. At prediction time, each test video is divided evenly into segments at 16-frame intervals; the frames in each segment are scaled, center-cropped, grayscale-remapped, and de-mean-normalized, and each processed segment is fed to the network to compute classification scores. The scores of all segments are then accumulated, and the category with the highest score is taken as the final prediction.
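One simple way to compute the 16-frame segment boundaries described above is sketched below; the text does not specify how leftover frames are handled, so this version drops them:

```python
# Split a test video into consecutive, non-overlapping 16-frame segments.
num_frames, seg_len = 100, 16
starts = range(0, num_frames - seg_len + 1, seg_len)
segments = [(s, s + seg_len) for s in starts]
print(segments)   # 6 segments; the last 4 frames are dropped
```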

It should be understood that although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered, and they may be performed in other orders. Moreover, at least some of the steps in Figure 1 may comprise multiple sub-steps or stages, which need not be completed at the same time but may be executed at different times; their execution order likewise need not be sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these features that involves no contradiction should be regarded as within the scope of this specification.

The above embodiments express only several implementations of the present application and are described in relative detail, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of this patent shall therefore be governed by the appended claims.

Claims (5)

1. A deep-learning-based video behavior recognition method, characterized in that the method comprises:

acquiring video data and preprocessing the video data to obtain training samples;

constructing a video behavior recognition network, the network being a convolutional neural network that uses the two-dimensional convolutional neural network Resnet as its backbone, with inter-frame temporal information extraction modules inserted into the backbone, wherein the two-dimensional Resnet extracts static features of targets in the video and the inter-frame temporal information extraction modules optimize the backbone, using a bilinear operation to extract inter-frame features;

training the video behavior recognition network on the training samples and optimizing its parameters to obtain a trained video behavior recognition network model; and

acquiring a video to be recognized, preprocessing it, and inputting the preprocessed video into the video behavior recognition network model to obtain a video behavior classification result;

wherein the step of training the video behavior recognition network on the training samples and optimizing its parameters comprises: dividing the training samples into a training set and a test set; inputting the training set into the video behavior recognition network for training to obtain video behavior prediction classification results; and, according to the prediction classification results and the test set, optimizing the network parameters with momentum stochastic gradient descent based on cross-entropy loss, to obtain the trained model;

wherein the video behavior recognition network consists of one first feature extraction sub-module, three second feature extraction sub-modules, one third feature extraction sub-module, and one fully connected layer; the first feature extraction sub-module consists of one convolutional layer and one max-pooling layer; each second feature extraction sub-module consists of one spatiotemporal feature extraction module and a max-pooling layer; and the third feature extraction sub-module consists of one spatiotemporal feature extraction module and a global pooling layer;

and wherein the step of inputting the training set into the network for training comprises: inputting the training set into the convolutional layer of the first feature extraction sub-module to obtain first convolutional features, and applying spatial max pooling in the max-pooling layer of the first feature extraction sub-module to obtain first max-pooling features; inputting the first max-pooling features into the spatiotemporal feature extraction module of the first second feature extraction sub-module to obtain first spatiotemporal fusion features; inputting the first spatiotemporal fusion features into the max-pooling layer of the first second feature extraction sub-module to obtain second max-pooling features; inputting the second max-pooling features into the second second feature extraction sub-module to obtain third max-pooling features; inputting the third max-pooling features into the third second feature extraction sub-module to obtain fourth max-pooling features; inputting the fourth max-pooling features into the spatiotemporal feature extraction module of the third feature extraction sub-module to obtain spatiotemporal fusion features, and inputting these into the global pooling layer of the third feature extraction sub-module to obtain global pooling features; and inputting the global pooling features into the fully connected layer, with softmax as the activation function, to obtain the video behavior prediction classification results.

2. The method according to claim 1, wherein acquiring video data and preprocessing the video data to obtain training samples comprises: acquiring video data; using dense sampling to randomly extract several consecutive frames from the video data to form a video block; scaling the images in the video block to 120×160 pixels and randomly cropping a 112×112-pixel image from each; dividing the grayscale of the cropped images by 255 to map it to [0, 1]; de-mean-normalizing each of the three RGB channels of the cropped images; and randomly flipping the video block horizontally with 50% probability to obtain the training samples.

3. The method according to claim 1, wherein the spatiotemporal feature extraction module consists of several residual modules and inter-frame temporal information extraction modules connected alternately in series; the residual module is the basic building block of the Resnet network; the inter-frame temporal information extraction module comprises an inter-frame temporal feature extraction unit, which includes a bilinear-operation convolutional layer for extracting temporal features, and a feature fusion unit, which includes a convolutional layer for feature fusion; and wherein inputting the first max-pooling features into the spatiotemporal feature extraction module of the first second feature extraction sub-module to obtain the first spatiotemporal fusion features comprises: inputting the first max-pooling features into the first residual module of that spatiotemporal feature extraction module to obtain deep spatial features; inputting the deep spatial features into its first inter-frame temporal information extraction module to obtain fused features; and inputting the fused features into the second residual module and inter-frame temporal information extraction module of the first second feature extraction sub-module, repeating in this way until the feature information has passed through all residual modules and inter-frame temporal information extraction modules in the first second feature extraction sub-module, to obtain the first fused features.

4. The method according to claim 3, further comprising, before inputting the training set into the network for training: initializing the backbone parameters of the video behavior recognition network with TSN-model parameters pretrained on the kinetics400 dataset; initializing the parameters of the inter-frame temporal feature extraction unit in the inter-frame temporal information extraction module to random numbers and the parameters of its feature fusion unit to 0; and initializing the parameters of the fully connected layer to random numbers.

5. The method according to claim 1, wherein acquiring the video to be recognized, preprocessing it, and inputting the preprocessed video into the video behavior recognition network model to obtain the video behavior classification result comprises: acquiring the video to be recognized and sampling it uniformly into several equal-length video sequences; scaling the images in the sequences to 120×160 pixels, cropping the central 112×112-pixel region, dividing the grayscale of the cropped images by 255 to map it to [0, 1], and de-mean-normalizing each of the three RGB channels; inputting the processed video sequences into the video behavior recognition network model to obtain classification prediction scores; and averaging the prediction scores and taking the category with the highest average score as the video behavior classification result.
CN202110764936.1A 2021-07-07 2021-07-07 Video behavior identification method based on deep learning Active CN113255616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110764936.1A CN113255616B (en) 2021-07-07 2021-07-07 Video behavior identification method based on deep learning


Publications (2)

Publication Number Publication Date
CN113255616A CN113255616A (en) 2021-08-13
CN113255616B true CN113255616B (en) 2021-09-21

Family

ID=77190952



Similar Documents

Publication Publication Date Title
CN113255616B (en) Video behavior identification method based on deep learning
CN111639692B (en) Shadow detection method based on attention mechanism
CN107679465B (en) A Generation and Expansion Method for Person Re-ID Data Based on Generative Networks
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
CN112149504A (en) A hybrid convolutional residual network combined with attention for action video recognition
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN111310676A (en) Video action recognition method based on CNN-LSTM and attention
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Hara et al. Towards good practice for action recognition with spatiotemporal 3d convolutions
CN110097028A (en) Crowd anomaly detection method based on a three-dimensional pyramid image generation network
CN113505719B (en) Gait recognition model compression system and method based on local-whole joint knowledge distillation algorithm
CN112468888A (en) Video abstract generation method and system based on GRU network
CN107679462A (en) A deep multi-feature fusion classification method based on wavelets
CN109948721A (en) A video scene classification method based on video description
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN108416288A (en) First-person-view interactive action recognition method based on global and local network fusion
CN108985200A (en) A non-cooperative liveness detection algorithm based on terminal devices
CN118587449A (en) An RGB-D saliency detection method based on progressive weighted decoding
CN113361493A (en) Facial expression recognition method for robustness of different image resolutions
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN113014923B (en) Behavior recognition method based on motion vectors represented in the compressed domain
Yan et al. GLGFN: Global-local grafting fusion network for high-resolution image deraining
CN110738129B (en) End-to-end video time sequence behavior detection method based on R-C3D network
CN114743266B (en) A human behavior recognition method based on hybrid neural network
CN109002808B (en) Human behavior recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant