CN110532959A

CN110532959A - Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

Info

Publication number: CN110532959A
Application number: CN201910817372.6A
Authority: CN
Inventors: 沈小艳; 阴文佳; 毕胜
Original assignee: Dalian Maritime University
Current assignee: Dalian Maritime University
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2019-12-03
Anticipated expiration: 2039-08-30
Also published as: CN110532959B

Abstract

The invention provides a real-time violent behavior detection system based on a two-channel three-dimensional convolutional neural network, which comprises the following components: the video acquisition module captures video frames in real time and respectively sends the video frames to the video processing module and the playing module; the video processing module is used for extracting the characteristics of the received video frames by using the convolutional neural network, combining the extracted characteristics and classifying the image data according to the combined characteristics; the playing module is used for marking the image classification result obtained by the video processing module into the video frame sent by the video acquisition module and playing the video frame to a user; the video acquisition module, the video processing module and the playing module work in parallel. The invention improves the identification accuracy by introducing a double-channel idea, and realizes the accurate positioning of the occurrence time of violent behaviors by introducing the deconvolution layer.

Description

Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network

技术领域technical field

本发明涉及视频监控技术领域，具体而言，尤其涉及一种基于快、慢双通道三维卷积神经网络的实时暴力行为检测系统。The invention relates to the technical field of video surveillance, in particular, to a real-time violent behavior detection system based on fast and slow dual-channel three-dimensional convolutional neural networks.

背景技术Background technique

视频人体行为识别与检测是计算机视觉中最具挑战性的任务之一，其可以在视频监控、运动检索、人机交互、智能家居以及医疗保健等众多领域广泛应用。目前行为识别领域共有两大分支：以IDT(improved Dense Trajectories)算法为代表的传统方式，以及以二维卷积、三维卷积、RNN-LSTM为代表的深度学习方式。从发展趋势上看深度学习方式在性能上已经超过传统方式。Video human behavior recognition and detection is one of the most challenging tasks in computer vision, which can be widely used in many fields such as video surveillance, motion retrieval, human-computer interaction, smart home, and healthcare. At present, there are two branches in the field of behavior recognition: the traditional method represented by the IDT (improved Dense Trajectories) algorithm, and the deep learning method represented by 2D convolution, 3D convolution, and RNN-LSTM. From the perspective of development trends, deep learning methods have surpassed traditional methods in performance.

密集轨迹算法(improved Dense Trajectories，IDT)：传统方法与深度学习方法的主要区别在于用于分类的特征的来源。传统方法是根据经验手动提取某一种或某几种分类效果较好的特征混合起来进行分类。深度学习的方法是人将已经分好类的样本交给计算机让他自己学习一套模型，利用学习到的模型可以提取到某几种特征组合的特征进行分类，至于模型具体提取了哪些特征，人类无从知晓。传统方法提取到的特征种数是有限的，特征能选择范围也是有限的，所以手工提取到的特征不如模型提取到的特征精准。这也是深度学习的优势所在。Improved Dense Trajectories (IDT): The main difference between traditional methods and deep learning methods is the source of the features used for classification. The traditional method is to manually extract one or several kinds of features with better classification effect based on experience and mix them for classification. The method of deep learning is that people pass the classified samples to the computer and let them learn a set of models by themselves. Using the learned model, the features of certain feature combinations can be extracted for classification. As for which features are extracted by the model, Humans have no way of knowing. The number of features extracted by traditional methods is limited, and the range of features that can be selected is also limited, so the features extracted by hand are not as accurate as those extracted by the model. This is also the advantage of deep learning.

双通道卷积神经网络(Two-Stream-CNN)：是二维卷积解决行为识别问题的代表性算法。主要内容：两个通道同时处理RGB帧序列和光流帧序列,两个通道特征提取过程中无信息交流，特征提取结束后以某种方式将特征进行融合用于分类得到最终结果。因为网络每次只能处理一张图像，且序列中每一帧图像都需要进行处理，视频相邻帧图像之间又存在大量的重复信息，所以该算法中存在重复计算的现象，识别检测速度受到了很大的制约，无法满足实时性的要求。Two-channel convolutional neural network (Two-Stream-CNN): It is a representative algorithm for two-dimensional convolution to solve the problem of behavior recognition. Main content: Two channels process RGB frame sequence and optical flow frame sequence at the same time. There is no information exchange in the feature extraction process of the two channels. After the feature extraction is completed, the features are fused in some way for classification to obtain the final result. Because the network can only process one image at a time, and each frame of image in the sequence needs to be processed, and there is a lot of repeated information between adjacent frame images of the video, there is a phenomenon of repeated calculation in this algorithm, and the speed of recognition and detection It is greatly restricted and cannot meet the requirements of real-time performance.

长短期记忆网络(Long-Short Term Memory,LSTM)：由于独特的设计结构，LSTM适合于处理和预测时间序列中间隔和延迟非常长的重要事件。所以在行为识别与检测方向上，LSTM有着不错的效果，也是目前主流方向之一。Long-Short Term Memory (LSTM): Due to its unique design structure, LSTM is suitable for processing and predicting important events with very long intervals and delays in time series. Therefore, in the direction of behavior recognition and detection, LSTM has a good effect and is also one of the current mainstream directions.

二维卷积在图像识别与检测问题中已经发展的很成熟了，但是视频比较图像而言增加了一个时间维度的信息。传统的二维卷积核已经无法满足提取三维特征的需要。三维卷积运算速度是优势且能良好的捕捉帧间信息，目前是主流研究方向。但现有方法均存在识别准确率较低、识别速度较慢的问题，极大的限制了人体行为识别检测技术的发展和应用。Two-dimensional convolution has matured in image recognition and detection problems, but video comparison images add a temporal dimension of information. The traditional two-dimensional convolution kernel has been unable to meet the needs of extracting three-dimensional features. The speed of 3D convolution operation is an advantage and can capture inter-frame information well, and it is currently the mainstream research direction. However, the existing methods all have the problems of low recognition accuracy and slow recognition speed, which greatly limit the development and application of human behavior recognition detection technology.

发明内容SUMMARY OF THE INVENTION

根据上述提出识别准确率较低、识别速度较慢的技术问题，而提供一种基于快、慢双通道三维卷积神经网络的实时暴力行为检测系统，通过引入双通道思想提高了识别的准确率，同时通过引入反卷积层实现了暴力行为发生的时间的精确定位。According to the above technical problems of low recognition accuracy and slow recognition speed, a real-time violent behavior detection system based on fast and slow dual-channel three-dimensional convolutional neural network is provided, and the recognition accuracy is improved by introducing dual-channel ideas. , and at the same time, by introducing a deconvolution layer, the precise location of the time when the violent behavior occurs.

本发明采用的技术手段如下：The technical means adopted in the present invention are as follows:

一种基于双通道三维卷积神经网络的实时暴力行为检测系统，包括：A real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network, including:

视频采集模块，实时抓取视频帧，并分别送至所述视频处理模块及所述播放模块；a video capture module, which captures video frames in real time and sends them to the video processing module and the playback module respectively;

视频处理模块，利用卷积神经网络对接收的视频帧进行特征提取，并将提取的特征进行组合，进而根据组合特征对图像数据进行分类；The video processing module uses the convolutional neural network to perform feature extraction on the received video frames, combines the extracted features, and then classifies the image data according to the combined features;

播放模块，将所述视频处理模块得到的图像分类结果标记到所述视频采集模块发送的视频帧中，向用户进行播放；a playing module, marking the image classification result obtained by the video processing module into the video frame sent by the video acquisition module, and playing it to the user;

所述视频采集模块、视频处理模块及播放模块并行工作。The video acquisition module, video processing module and playback module work in parallel.

进一步地，所述视频处理模块对接收的视频帧进行特征提取之前，还对所述视频帧进行预处理，包括：将RGB图像分别送入慢速通道和快速通道处理，将得到的慢速通道预处理结果及快速通道预处理结果作为所述视频处理模块的输入。Further, before the video processing module performs feature extraction on the received video frame, it also preprocesses the video frame, including: sending the RGB images into the slow channel and the fast channel respectively for processing, and processing the obtained slow channel. The preprocessing result and the fast channel preprocessing result are used as the input of the video processing module.

进一步地，所述慢速通道用于将RGB图像等间隔采样为视频段，输入训练后的慢速通道网络模型预测得到慢速通道预处理数据。Further, the slow channel is used to sample the RGB images into video segments at equal intervals, and input the trained slow channel network model to predict the slow channel preprocessing data.

进一步地，所述快速通道用于将RGB图像处理为灰度图像数据并提取光流图像数据，输入训练后的快速通道网络模型预测得到快速通道预处理数据。Further, the fast channel is used to process the RGB image into grayscale image data and extract the optical flow image data, and input the trained fast channel network model to predict the fast channel preprocessing data.

进一步地，所述视频处理模块还对慢速通道特征提取结果及快速通道特征提取结果进行基于卷积特征融合的横向融合处理。Further, the video processing module further performs horizontal fusion processing based on convolution feature fusion on the slow channel feature extraction results and the fast channel feature extraction results.

进一步地，所述系统还包括存储模块，用以存储系统运行过程中的数据信息。Further, the system further includes a storage module for storing data information during the operation of the system.

较现有技术相比，本发明具有以下优点：Compared with the prior art, the present invention has the following advantages:

本发明利用多层卷积神经网络提取视频帧的时间关联特征，在一定程度上做到了参数共享，能够性能良好的捕捉帧间信息，提高运算速度，实时性强。同时结合快、慢两通道设置，通过卷积的方式在保证数据不丢失的情况下，使待融合的快、慢通道特征具有相似的形状，添加了特征融合结构后，将慢速通道中卷积层慢的输出与同一层次经过卷积变形的快速通道的输出叠加作为下一卷积层的输入，提高了识别的准确率。The invention utilizes the multi-layer convolutional neural network to extract the temporal correlation feature of the video frame, achieves parameter sharing to a certain extent, can capture the inter-frame information with good performance, improves the operation speed, and has strong real-time performance. At the same time, the fast and slow channel settings are combined, and the fast and slow channel features to be fused have similar shapes through convolution without data loss. After adding the feature fusion structure, the slow channel is rolled in the middle. The slow output of the stacked layer is superimposed with the output of the fast channel that has undergone convolution deformation at the same layer as the input of the next convolution layer, which improves the accuracy of recognition.

此外，本发明未在时间域上使用pooling操作，保证了最大程度上保留时间域信息，进而可以更加精确地定位暴力行为发生的时间。In addition, the present invention does not use the pooling operation in the time domain, which ensures that the information in the time domain is preserved to the greatest extent, so that the time when the violent behavior occurs can be more accurately located.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图做以简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description These are some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

图1为本发明检测系统结构功能框图。FIG. 1 is a structural and functional block diagram of the detection system of the present invention.

图2为本发明检测系统工作流程图。Fig. 2 is the working flow chart of the detection system of the present invention.

图3为本发明视频处理模块工作流程图。FIG. 3 is a working flow chart of the video processing module of the present invention.

具体实施方式Detailed ways

需要说明的是，在不冲突的情况下，本发明中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本发明。It should be noted that the embodiments of the present invention and the features of the embodiments may be combined with each other under the condition of no conflict. The present invention will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

为使本发明实施例的目的、技术方案和优点更加清楚，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。以下对至少一个示例性实施例的描述实际上仅仅是说明性的，决不作为对本发明及其应用或使用的任何限制。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments It is only a part of the embodiments of the present invention, but not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

如图1-3所示，本发明提供了一种基于双通道三维卷积神经网络的实时暴力行为检测系统，包括三个模块，即视频采集模块、视频处理模块、延时播放模块，为满足实时性，需要三线程同时进行。As shown in Figures 1-3, the present invention provides a real-time violent behavior detection system based on a dual-channel three-dimensional convolutional neural network, including three modules, namely a video acquisition module, a video processing module, and a delayed playback module. Real-time, requires three threads to run at the same time.

视频采集模块，实时抓取视频帧，并分别送至所述视频处理模块及所述延时播放模块。具体地，利用opencv库和网络摄像头进行视频帧实时抓取。将抓取的帧图像分成两路，一路存放到队列二中，为线程三延时播放模块提供输入，一路存放到队列一，为视频处理模块的图像预处理步骤提供素材。The video capture module captures video frames in real time and sends them to the video processing module and the delayed playback module respectively. Specifically, the opencv library and webcam are used to capture video frames in real time. The captured frame images are divided into two channels, one is stored in the second queue, which provides input for the delay playback module of thread three, and the other is stored in the queue one, which provides materials for the image preprocessing step of the video processing module.

视频处理模块，利用卷积神经网络对接收的视频帧进行特征提取，并将提取的特征进行组合，进而根据组合特征对图像数据进行分类。具体用于实现图像预处理、视频特征提取以及图像特征分类等功能。本发明为了提高识别的准确率引入了Slow_Fast思想，将数据分别送入快速通道以及慢速通道进行处理。The video processing module uses a convolutional neural network to perform feature extraction on the received video frames, combines the extracted features, and then classifies the image data according to the combined features. Specifically, it is used to realize functions such as image preprocessing, video feature extraction, and image feature classification. In order to improve the accuracy of identification, the invention introduces the idea of Slow_Fast, and the data is respectively sent to the fast channel and the slow channel for processing.

作为本发明较佳的实施方式，进行图像预处理时采用以下技术方案：As a preferred embodiment of the present invention, the following technical solutions are adopted when performing image preprocessing:

慢速通道：RGB帧等间隔取样：64帧为一个视频单元，对视频单元等间隔采样，每隔16帧取一帧，采样结果为4*h*w*3形状的视频段。Slow channel: RGB frames are sampled at equal intervals: 64 frames are a video unit, the video unit is sampled at equal intervals, and a frame is taken every 16 frames, and the sampling result is a video segment in the shape of 4*h*w*3.

调整：预测阶段，将每一帧RGB图像的宽高缩放到224*224的大小输出结果的形状为4*224*224*3。在训练阶段，首先将数据，缩放到4*256*256*3，然后随机裁剪成4*224*224*3，搭配随机翻转完成数据增广。从而增加网络模型的泛化能力，同时能够防止模型过拟合。Adjustment: In the prediction stage, the width and height of each frame of RGB image are scaled to 224*224 and the shape of the output result is 4*224*224*3. In the training phase, the data is first scaled to 4*256*256*3, then randomly cropped to 4*224*224*3, and the data augmentation is completed with random flipping. This increases the generalization ability of the network model and prevents overfitting of the model.

快速通道：主要分为两部分，第一部分将RGB图像转化为灰度图像，RGB包含三个颜色通道，灰度图只有一个颜色通道。将RGB三个通道的灰度值乘以对应的权重，求和结果作为灰度图对应点的灰度值，公式如下：Fast channel: It is mainly divided into two parts. The first part converts the RGB image into a grayscale image. RGB contains three color channels, and the grayscale image has only one color channel. Multiply the grayscale values of the three RGB channels by the corresponding weights, and the summation result is used as the grayscale value of the corresponding point in the grayscale image. The formula is as follows:

Gray＝R*0.299+G*0.587+B*0.114Gray=R*0.299+G*0.587+B*0.114

本环节输出数据形状为64*w*h*1The output data shape of this link is 64*w*h*1

第二部分将灰度图像转化为光流数据，其中光流反映的是相邻帧之间物体的运动信息。The second part converts the grayscale image into optical flow data, where the optical flow reflects the motion information of objects between adjacent frames.

本实施例优选采用Farneback光流算法提取稠密光流。每两帧计算一次光流，一个视频单元有64帧，所以本环节输出数据形状为32*w*h*2(光流图像有2个通道分别是x方向光流和y方向光流)In this embodiment, the Farneback optical flow algorithm is preferably used to extract the dense optical flow. The optical flow is calculated every two frames, and a video unit has 64 frames, so the output data shape of this link is 32*w*h*2 (the optical flow image has 2 channels, which are the optical flow in the x-direction and the optical flow in the y-direction)

调整:预测阶段，将每一帧图像的宽高缩放到224*224的大小，输出结果的形状为32*224*224*2。在训练阶段，首先将数据，缩放到32*256*256*2，然后随机裁剪成32*224*224*2，搭配随机翻转完成数据增广。从而增加网络模型的泛化能力，同时能够防止模型过拟合。此处随机裁剪和随机翻转要和slow通道保持一致。Adjustment: In the prediction stage, the width and height of each frame of image are scaled to 224*224, and the shape of the output result is 32*224*224*2. In the training phase, the data is first scaled to 32*256*256*2, then randomly cropped to 32*224*224*2, and the data augmentation is completed with random flipping. This increases the generalization ability of the network model and prevents overfitting of the model. Here, random cropping and random flipping should be consistent with the slow channel.

将两个通道的数据预处理结果，同时作为提取特征步骤的输入。The data preprocessing results of the two channels are used as the input of the feature extraction step at the same time.

作为本发明较佳的实施方式，提取视频特征时采用以下技术方案：As a preferred embodiment of the present invention, the following technical solutions are adopted when extracting video features:

将已经处理好的数据分别输入到对应的网络通道，输入已经训练好的网络模型，一层一层提取特征。图中以T*W*H*C的形式标注出了特征经过每一层之后的输出形状，举例32*112*112*8指的是经过上一个卷积模块后输出特征是32帧宽112高112卷积核通道数是8的形状。Input the processed data into the corresponding network channels, input the trained network model, and extract features layer by layer. In the figure, the output shape of the feature after each layer is marked in the form of T*W*H*C. For example, 32*112*112*8 means that the output feature after the previous convolution module is 32 frames wide 112 High 112 convolution kernel channel number is 8 shape.

快速通道：输入32帧光流图像，每帧图像宽224高224，分X方向光流和Y方向光流2个通道。快速通道内，共包含五个卷积模块，所有卷积模块结构相同，均包含一个三维卷积层、一个BN层、一个relu激励层、一个三维池化层，卷积模块的名字、卷积核大小、池化核大小均在图上有标示。举例Conv 1_3*3*3_1*2*2指的是该层的名字是Conv 1，卷积核大小是3*3*3，池化核大小是1*2*2。经过5层卷积层后，接一个Average Pooling 3D层，池化核大小为1*7*7，将特征形状由32*7*7*128缩减到32*1*1*128，减少运算开支。Fast channel: input 32 frames of optical flow images, each frame is 224 wide and 224 high, divided into 2 channels of X-direction optical flow and Y-direction optical flow. There are five convolution modules in the fast channel. All convolution modules have the same structure, including a 3D convolution layer, a BN layer, a relu excitation layer, and a 3D pooling layer. The name of the convolution module, the convolution layer The kernel size and pooling kernel size are marked on the figure. For example, Conv 1_3*3*3_1*2*2 means that the name of the layer is Conv 1, the size of the convolution kernel is 3*3*3, and the size of the pooling kernel is 1*2*2. After 5 layers of convolutional layers, an Average Pooling 3D layer is added, the size of the pooling kernel is 1*7*7, and the feature shape is reduced from 32*7*7*128 to 32*1*1*128, reducing the computational cost .

慢速通道：输入4帧RGB彩色图像，每帧图像宽224高224，分RGB三个颜色通道。在慢速通道内同样包含5个卷积模块，模块名字分别为Convx_S，x为1-5。第一个和最后一个模块层次结构和快速通道卷积模块层次机构相同，包含一个三维卷积层、一个BN层、一个relu激励层、一个三维池化层，卷积核大小为3*3*3，池化核大小为1*2*2。在慢速通道中间三个层次中，使用卷积与反卷积联合操作实现在时间域上上采样和空间域上下采样同时完成。卷积核和池化核大小如图所示。同样，经过5层卷积层后，接一个Average Pooling 3D层，池化核大小为1*7*7，将特征形状由32*7*7*128缩减到32*1*1*128，减少运算开支。Slow channel: Input 4 frames of RGB color images, each frame is 224 wide and 224 high, divided into three RGB color channels. There are also 5 convolution modules in the slow channel, the module names are Convx_S, and x is 1-5. The first and last module hierarchy is the same as the fast channel convolution module hierarchy, including a 3D convolution layer, a BN layer, a relu excitation layer, a 3D pooling layer, and the convolution kernel size is 3*3* 3. The size of the pooling kernel is 1*2*2. In the middle three levels of the slow channel, the joint operation of convolution and deconvolution is used to achieve simultaneous upsampling in the time domain and upsampling in the spatial domain. The convolution and pooling kernel sizes are shown in the figure. Similarly, after 5 layers of convolutional layers, followed by an Average Pooling 3D layer, the size of the pooling kernel is 1*7*7, and the feature shape is reduced from 32*7*7*128 to 32*1*1*128, reducing computing expenses.

横向融合：为了能够使两个通道充分利用对方通道所学习到的内容，很多方案都会用到横向特征融合，融合的方式也有很多种，由于本实施例中两个通道的第一个维度即时序信息维度并不相同，无法通过直接叠加的方式进行特征融合，所以本文选择卷积特征融合的方式。即通过卷积的方式在保证数据不丢失的情况下，使待融合的两个特征具有相似的形状，每一个卷层的输出形状在附图中有标识。添加了特征融合结构后，在慢速通道中，将本卷积层慢速通道的输出与同一层次经过卷积变形的快速通道的输出叠加作为下一卷积层的输入。Horizontal fusion: In order to enable the two channels to make full use of the content learned by the other channel, many schemes use horizontal feature fusion, and there are many fusion methods. Since the first dimension of the two channels in this embodiment is the time sequence The information dimensions are not the same, and feature fusion cannot be performed by direct superposition, so this paper chooses the method of convolution feature fusion. That is, the two features to be fused have similar shapes while ensuring that the data is not lost by means of convolution, and the output shape of each convolution layer is marked in the accompanying drawing. After adding the feature fusion structure, in the slow channel, the output of the slow channel of this convolution layer and the output of the fast channel that has undergone convolution deformation at the same layer are superimposed as the input of the next convolution layer.

作为本发明较佳的实施方式，全连接层和分类器的设置方案如下：As a preferred embodiment of the present invention, the setting scheme of the fully connected layer and the classifier is as follows:

两个通道的特征信息经过融合压缩后，提取到的是很多局部特征，需要经过一个全连接层把局部特征重新组装成完整的特征，全局特征将作为输入用于分类器分类。本实施例优选全连接层节点个数为1024。After the feature information of the two channels is fused and compressed, many local features are extracted. It needs to go through a fully connected layer to reassemble the local features into complete features, and the global features will be used as input for classifier classification. In this embodiment, the number of fully connected layer nodes is preferably 1024.

因为只需要将动作分为是与否是暴力行为，所以选择Sigmoid函数作为分类器。输出节点个数为2。Since it is only necessary to classify actions as yes or no violence, the sigmoid function is chosen as the classifier. The number of output nodes is 2.

分类器输出结果的形状为32*2，即一个视频单元64帧，每两帧图像能获得是与否暴力行为的两个概率值。取概率较大值作为预测结果，所以最终结果为一个长度为32的序列，对应于32帧图像。从而实现帧级预测。The shape of the output result of the classifier is 32*2, that is, a video unit has 64 frames, and two probability values of yes or no violent behavior can be obtained for every two frames of images. The larger probability value is taken as the prediction result, so the final result is a sequence of length 32, corresponding to 32 frames of images. This enables frame-level prediction.

Sigmoid函数公式如下：The sigmoid function formula is as follows:

g(x)＝1/(1+e^(-x))g(x)=1/(1+e^(-x))

需要说明的是，在整个网络中，均未在时间域上使用pooling操作，以保证最大程度上保留时间域信息。此外由于RGB主要关注细节信息，随时间变化速度慢，重复信息较多，为了避免重复计算节省计算开支，慢速通道以低帧率进行，每16帧采样1帧，在一个时间单元内，共采样4帧。由于光流图主要关注运动信息，随时间变化速度快，以高帧率进行，每2帧采样1帧，在一个时间单元内，共采样32帧。最后虽然输入帧数较少，但是慢速通道需要关注更多的细节信息，我们知道卷积核数目越多能够关注的细节信息也就越多，所以整个过程中快速通道卷积核数目要明显多于慢速通道，本文中快速通道卷积核数目设定为慢速通道的8倍。It should be noted that, in the entire network, the pooling operation is not used in the time domain to ensure that the time domain information is preserved to the greatest extent. In addition, since RGB mainly pays attention to detail information, it changes slowly with time, and there is a lot of repeated information. In order to avoid repeated calculation and save computing expenses, the slow channel is performed at a low frame rate, and one frame is sampled every 16 frames. In one time unit, a total of Sample 4 frames. Since the optical flow graph mainly focuses on motion information, it changes rapidly with time, and is performed at a high frame rate. One frame is sampled every two frames, and a total of 32 frames are sampled in one time unit. Finally, although the number of input frames is small, the slow channel needs to pay attention to more detailed information. We know that the more the number of convolution kernels, the more detailed information can be paid attention to, so the number of fast channel convolution kernels in the whole process should be obvious. More than the slow channel, the number of fast channel convolution kernels in this paper is set to 8 times that of the slow channel.

延时播放模块，将所述视频处理模块得到的图像分类结果标记到所述视频采集模块发送的视频帧中，向用户进行延时播放。The delayed playback module marks the image classification result obtained by the video processing module into the video frame sent by the video acquisition module, and performs delayed playback to the user.

此外，本实施例中系统还包括存储模块，用以存储系统运行过程中的数据信息。In addition, in this embodiment, the system further includes a storage module for storing data information during the operation of the system.

最后应说明的是：以上各实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述各实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention. scope.

Claims

1. a real-time violent behavior detection system based on dual-channel three-dimensional convolutional neural network, is characterized in that, comprises:

a video capture module, which captures video frames in real time and sends them to the video processing module and the playback module respectively;

The video processing module uses the convolutional neural network to perform feature extraction on the received video frames, combines the extracted features, and then classifies the image data according to the combined features;

a playing module, marking the image classification result obtained by the video processing module into the video frame sent by the video acquisition module, and playing it to the user;

The video acquisition module, video processing module and playback module work in parallel.

2. The real-time violent behavior detection system according to claim 1, is characterized in that, before described video processing module carries out feature extraction to the video frame received, also carries out preprocessing to described video frame, comprising: RGB images are respectively It is sent to the slow channel and fast channel for processing, and the obtained slow channel preprocessing result and fast channel preprocessing result are used as the input of the video processing module.

3. The real-time violent behavior detection system according to claim 2, wherein the slow channel is used to sample the RGB images into video segments at equal intervals, and the slow channel network model after input training predicts that the slow channel is obtained. Preprocess data.

4. The real-time violent behavior detection system according to claim 2, wherein the fast channel is used to process the RGB image into grayscale image data and extract the optical flow image data, and the fast channel network model after input training predicts Get fast-lane preprocessed data.

5 . The real-time violent behavior detection system according to claim 2 , wherein the video processing module further performs horizontal fusion processing based on convolution feature fusion on the slow channel feature extraction results and the fast channel feature extraction results. 6 .

6 . The real-time violent behavior detection system according to claim 1 , wherein the system further comprises a storage module for storing data information during the operation of the system. 7 .