CN108805036B - Unsupervised video semantic extraction method

Info

Publication number: CN108805036B
Application number: CN201810496579.3A
Authority: CN
Inventors: 林劼; 王芷若; 马骏; 崔建鹏; 杜亚伟; 钟德建
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-05-22
Filing date: 2018-05-22
Publication date: 2022-11-22
Anticipated expiration: 2038-05-22
Also published as: CN108805036A

Abstract

The invention discloses an unsupervised video semantic extraction method comprising: constructing a three-dimensional convolutional neural network model and training it with labeled video data in a video database; using a sliding window to process unlabeled video data in the video database into data that conforms to the input of the three-dimensional convolutional neural network; feeding the generated data to the three-dimensional convolutional neural network model and taking the output of its fully connected layer as the semantic features of each video segment; and using the resulting sequence of video-segment semantic features as the input of a video semantic autoencoder, which integrates them into the overall semantic features of the video. By combining a three-dimensional convolutional neural network with a recurrent autoencoder, the embodiments of the invention solve the problem of unsupervised video semantic analysis and extraction and improve the accuracy of video semantic extraction.

Description

An Unsupervised Video Semantic Extraction Method

Technical Field

The invention relates to the technical fields of artificial intelligence and pattern recognition, and in particular to an unsupervised video semantic extraction method based on a deep learning model.

Background

The concept of "semantics" originated at the end of the 19th century. It refers to the meanings that virtual data represent for the corresponding things in the real world, together with the relationships among those meanings; it is the interpretation and logical representation of virtual data within a given domain. "Video semantics" is defined with respect to human understanding: when a computer is used to understand the semantics of a video, it can directly recognize only low-level features such as color and shape. Methods are therefore needed to link these low-level features into higher-level meanings so that the information a video conveys can be expressed more effectively.

Video data is usually unstructured, so semantic extraction from video must consider several aspects. In terms of content, the spatial and temporal attributes of the video must be considered. In terms of semantics, the image features, subtitle text features, speech features, and video description text features contained in the video information must be considered. Physically, a video has four structural levels: frame, shot, scene, and video. A frame records the features of the objects in the video, such as color, texture, and shape. A shot consists of several consecutive frames and records the motion of objects across those frames, reflecting their temporal characteristics; in practice, the shot is the basic unit from which a video is produced, that is, the smallest unit obtained from one continuous camera take. A scene consists of a series of temporally continuous shots with related semantic content and records relatively complex semantic information. Several scenes make up a video file, whose content records the semantic information of the entire video.

(1) Key-frame-based video semantic extraction. The usual pipeline is: capture frames from the video; identify key frames among the captured frames and analyze their semantics; convert the speech in the video into text by speech recognition; perform semantic recognition on the speech text; and combine the key-frame semantics with the speech semantics to obtain the semantics of the video. In other words, the image features and the audio MFCC features of the video are converted into semantic features, and then, combined with subtitle recognition, the subtitles are processed with natural language processing (NLP) to obtain word vectors and document similarity. The advantage of this approach is that it works well for videos containing a large amount of on-screen text, such as educational videos. Its disadvantage is that for other types of video with little text, the key frames carry little subtitle information, so useful text is hard to obtain.

(2) Keyword extraction from the text information of a video. This approach operates on plain text only and depends heavily on the importance of each word and on its position: earlier words are treated as more important than later ones, and word frequency and the overall order in which words appear must also be taken into account. The title must therefore match the semantics of the video closely; otherwise accuracy is very low. The advantages are low computational complexity, mature text-processing algorithms, and readily available open-source implementations. The disadvantage is that some Internet slang means something very different from its literal meaning, which greatly interferes with video semantic extraction.

For the semantic analysis of sports videos, existing methods rarely consider semantic extraction from unlabeled data. When the test data does not belong to any of the categories in the training data, domain shift occurs, which reduces the accuracy of video semantic extraction.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a video semantic extraction method that combines a three-dimensional convolutional neural network model with a recurrent autoencoder, which solves the problem of unsupervised video semantic analysis and extraction and improves the accuracy of video semantic extraction.

Specifically, an unsupervised video semantic extraction method is characterized by comprising the following steps:

S1: Construct a three-dimensional convolutional neural network model and train it with the labeled UCF-101 video set in the video database;

S2: Use a sliding window to process the unlabeled video data in the video database into data that conforms to the input of the three-dimensional convolutional neural network;

S3: Use the data generated in step S2 as the input of the three-dimensional convolutional neural network model, and take the output of the fully connected layer of the model as the semantic features of each video segment;

S4: Use the sequence of video-segment semantic features generated in step S3 as the input of the video semantic autoencoder, which integrates them into the overall semantic features of the video.

Preferably, step S1 includes the following sub-steps:

S11: Construct a three-dimensional convolutional neural network model comprising five convolutional layers, pooling layers, two fully connected layers, and one softmax layer;

S12: Before the three-dimensional convolutional neural network is trained with the labeled UCF-101 video set in the video database, the videos in the data set are preprocessed: each original video in the UCF-101 video set is converted into a set of video frame pictures at a fixed frame rate (FPS), and the pictures undergo image preprocessing consisting of resizing and noise filtering, which brings them to a uniform size of 112*112;

S13: After preprocessing, each training video of the UCF-101 video set has the data form (X_n, L_n), where n indexes the training videos, X_n = [x_n(1), x_n(2), x_n(3), ..., x_n(m)] is the set of pictures obtained from video X_n by preprocessing, m is the number of picture frames the video is converted into, and L_n is the label type of video X_n; the method uses ffmpeg to convert each video into a picture sequence at 20 frames per second;

S14: Based on the three-dimensional convolutional neural network model and a learning algorithm, use the preprocessed UCF-101 video data set to train a video category recognition model with a high recognition rate.
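The frame extraction in S12 and S13 can be sketched as an ffmpeg invocation built from Python. This is a minimal sketch under assumptions: the function name `build_ffmpeg_frame_command`, the file names, and the JPEG output pattern are illustrative and not taken from the patent, and the noise filtering mentioned in S12 is not shown because the text does not say which filter is used. The `fps` and `scale` video filters are standard ffmpeg filters.

```python
from pathlib import Path


def build_ffmpeg_frame_command(video_path, out_dir, fps=20, size=112):
    """Return an ffmpeg argument list that samples frames and resizes them.

    Hypothetical helper: only the 20 fps rate and the 112x112 size come
    from the text; everything else is an illustrative choice.
    """
    pattern = str(Path(out_dir) / "%06d.jpg")
    # fps=... resamples to a fixed frame rate; scale=W:H resizes each frame.
    video_filter = f"fps={fps},scale={size}:{size}"
    return ["ffmpeg", "-i", str(video_path), "-vf", video_filter, pattern]


cmd = build_ffmpeg_frame_command("v_Example.avi", "frames")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would perform the conversion, provided ffmpeg is installed and the output directory exists.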

Preferably, step S2 includes the following sub-steps:

S21: For any video in the test data whose number of frame pictures m does not satisfy m = kw, where k is an integer and w is the size of the sliding window, pad the frame picture set by copying the last frame of the video until m is a multiple of w;

S22: Slide a window over the video frame sequence to read frame pictures, with a stride equal to half the window size; the frame pictures read at each window position form one input of the three-dimensional convolutional neural network. The window size is set to w = 16, so each test video is processed into a sequence of window picture sets, where w denotes the set of pictures captured by one window position and the k-th element of the sequence is the picture set obtained at the k-th slide of the window.
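The padding and windowing of S21 and S22 can be sketched as follows. The function names `pad_to_multiple` and `sliding_windows` are hypothetical, and integers stand in for decoded frame images; only the window size w = 16, the half-window stride, and the last-frame padding rule come from the text.

```python
def pad_to_multiple(frames, w=16):
    """Repeat the last frame until len(frames) is a multiple of w (step S21)."""
    if not frames:
        raise ValueError("need at least one frame")
    remainder = len(frames) % w
    padding = [] if remainder == 0 else [frames[-1]] * (w - remainder)
    return list(frames) + padding


def sliding_windows(frames, w=16):
    """Split padded frames into windows of w frames with stride w // 2 (step S22)."""
    padded = pad_to_multiple(frames, w)
    stride = w // 2
    return [padded[s:s + w] for s in range(0, len(padded) - w + 1, stride)]


frames = list(range(20))           # stand-in for 20 decoded frame images
windows = sliding_windows(frames)  # the 20 frames are padded to 32 first
print(len(windows), windows[-1][-1])
```

With a stride of w/2, a padded video of m = kw frames yields 2k - 1 overlapping windows, so consecutive segments share half of their frames.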

Preferably, step S3 includes the following sub-steps:

S31: Use the three-dimensional convolutional neural network model trained on the UCF-101 video set in S1 to recognize the test video data processed in S2;

S32: Fix the output of the fully connected layer of the three-dimensional convolutional neural network to the number of sub-action categories;

S33: The input of the three-dimensional convolutional neural network is the window picture set defined in S22, and the output is the output of the first fully connected layer, F_k = [f₁, f₂, f₃, ..., f₄₀₉₆], where the dimension 4096 of F_k is the output dimension of the first fully connected layer of the three-dimensional convolutional neural network;

S34: For the test video data, the corresponding output of the three-dimensional convolutional neural network is [F₁, F₂, F₃, ..., F_k], whose dimension is 4096*k.
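The shape of the result in S33 and S34 can be illustrated with a small sketch that maps each window to one feature vector. The trained network itself is not reproduced here: `fc_features` is a hypothetical callable standing in for "run the 3D CNN and read the first fully connected layer", and `dummy_fc_features` is a placeholder that only mimics the 4096-dimensional output.

```python
from typing import Callable, List, Sequence

FEATURE_DIM = 4096  # output size of the first fully connected layer per the text


def extract_feature_sequence(
    windows: Sequence[Sequence[object]],
    fc_features: Callable[[Sequence[object]], List[float]],
) -> List[List[float]]:
    """Map each window of frames to one feature vector F_k."""
    sequence = []
    for window in windows:
        feature = fc_features(window)
        if len(feature) != FEATURE_DIM:
            raise ValueError(f"expected {FEATURE_DIM} values, got {len(feature)}")
        sequence.append(feature)
    return sequence


def dummy_fc_features(window):
    # Hypothetical stand-in for the trained network: a constant-valued vector.
    return [float(len(window))] * FEATURE_DIM


features = extract_feature_sequence([[0] * 16, [1] * 16, [2] * 16], dummy_fc_features)
print(len(features), len(features[0]))
```

Three windows give three 4096-dimensional vectors, matching the 4096*k size stated in S34 for k windows.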

Preferably, step S4 includes the following sub-steps:

S41: Use the semantic feature extraction result [F₁, F₂, F₃, ..., F_k], obtained for the test video data by the three-dimensional convolutional neural network model in S3, as the input of the video semantic autoencoder to extract the overall semantic features of the video;

S42: The recurrent autoencoder converts the input feature sequence [F₁, F₂, F₃, ..., F_k] into the feature pair sequence [[F₁, F₂], [F₂, F₃], [F₃, F₄], ..., [F_k-1, F_k]] and proceeds greedily: each pair in the feature pair sequence is taken in turn and merged into a parent feature, expressed as F_1,2 = f(W⁽¹⁾[F₁, F₂] + b⁽¹⁾), where W⁽¹⁾ is an n*n parameter matrix and b⁽¹⁾ is a bias term, both learned from the feature sequence pairs. F_1,2 is reconstructed as [F₁', F₂'] = W⁽²⁾F_1,2 + b⁽²⁾, where W⁽²⁾ is an n*n parameter matrix and b⁽²⁾ is a bias term distinct from b⁽¹⁾, both learned from the reconstruction error. The reconstruction error of the autoencoder measures the difference between the pair [F₁, F₂] and its reconstruction [F₁', F₂']. In the objective function of the recurrent autoencoder, A(x) denotes all possible semantic trees for the input sequence [F₁, F₂, F₃, ..., F_k] and T(y) denotes all possible feature pairs. One encoding pass of the recurrent autoencoder selects the feature pair with the smallest reconstruction error among all candidate pairs, removes that pair from the feature sequence, and uses its parent feature as the representative of the pair to form a new feature sequence;
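Step S42 names a reconstruction error and an objective function without spelling them out in this text. One common form, borrowed from the recursive autoencoder literature and assumed here rather than confirmed by the patent, is sketched below; the first two lines restate the encoding and reconstruction given in S42, while the squared-distance error, the factor 1/2, and the symbol names E_rec and RAE are assumptions.

```latex
% Encoding of an adjacent pair into a parent feature (as stated in step S42)
F_{1,2} = f\left(W^{(1)}[F_1; F_2] + b^{(1)}\right)

% Reconstruction of the pair from the parent (as stated in step S42)
[F_1'; F_2'] = W^{(2)} F_{1,2} + b^{(2)}

% Assumed form of the reconstruction error (squared Euclidean distance)
E_{rec}([F_1; F_2]) = \frac{1}{2}\left\| [F_1; F_2] - [F_1'; F_2'] \right\|^2

% Assumed form of the objective: best tree y in A(x), summed over pairs s in T(y)
RAE_\theta(x) = \arg\min_{y \in A(x)} \sum_{s \in T(y)} E_{rec}\left([F_i; F_j]_s\right)
```

Under this reading, the greedy pass in S42 approximates the minimization over all trees in A(x) by always merging the pair with the smallest E_rec.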

S43: Repeat the autoencoding process of S42 until the feature sequence contains a single feature vector;

S44: The recurrent autoencoder outputs the final feature vector as the semantic feature vector of video X_n.
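The greedy merging loop of S42 to S44 can be sketched in plain Python. Several choices below are assumptions rather than details from the patent: the activation f is taken to be tanh, the error is the halved squared distance, the weights are random placeholders instead of learned parameters, and the encoder matrix is given shape n x 2n (the decoder 2n x n) so that a concatenated pair maps to an n-dimensional parent, whereas the text describes both matrices as n*n. All function names are hypothetical.

```python
import math
import random


def encode(pair, W1, b1):
    """Parent feature: tanh(W1 @ pair + b1)."""
    return [math.tanh(sum(w * x for w, x in zip(row, pair)) + b) for row, b in zip(W1, b1)]


def reconstruct(parent, W2, b2):
    """Linear reconstruction of the concatenated pair: W2 @ parent + b2."""
    return [sum(w * p for w, p in zip(row, parent)) + b for row, b in zip(W2, b2)]


def reconstruction_error(left, right, params):
    """Return (squared error, parent) for one adjacent pair."""
    W1, b1, W2, b2 = params
    pair = left + right
    parent = encode(pair, W1, b1)
    rebuilt = reconstruct(parent, W2, b2)
    error = 0.5 * sum((a - r) ** 2 for a, r in zip(pair, rebuilt))
    return error, parent


def greedy_merge(features, params):
    """Merge the lowest-error adjacent pair until one vector remains (S42-S43)."""
    sequence = [list(f) for f in features]
    while len(sequence) > 1:
        scored = [reconstruction_error(sequence[i], sequence[i + 1], params)
                  for i in range(len(sequence) - 1)]
        best = min(range(len(scored)), key=lambda i: scored[i][0])
        sequence[best:best + 2] = [scored[best][1]]  # parent replaces the pair
    return sequence[0]


def random_params(n, seed=0):
    """Untrained placeholder weights; a real model would learn them."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(n)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(2 * n)]
    return W1, [0.0] * n, W2, [0.0] * (2 * n)


n = 8  # small stand-in for the 4096-dimensional features
rng = random.Random(1)
segment_features = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(5)]
video_feature = greedy_merge(segment_features, random_params(n))
print(len(video_feature))
```

Each pass shortens the sequence by one, so k segment features need k - 1 merges, and the final vector has the same dimension as each input feature.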

The beneficial effects of the present invention are as follows:

By combining a three-dimensional convolutional neural network with a recurrent autoencoder, the invention solves the problem of unsupervised video semantic analysis and extraction and improves the accuracy of video semantic extraction.

Brief Description of the Drawings

Fig. 1 is a flow chart of the unsupervised video semantic extraction method proposed by the present invention.

Fig. 2 is a structural diagram of the three-dimensional convolutional neural network model constructed by the present invention.

Fig. 3 is a schematic flow chart of training the three-dimensional convolutional neural network model in the method of the present invention.

Fig. 4 is a schematic flow chart of extracting video semantic features in the method of the present invention.

Fig. 5 is an architecture diagram of the model of the present invention based on a three-dimensional convolutional neural network and a recurrent autoencoder.

Detailed Description

For a clearer understanding of the technical features, purposes, and effects of the present invention, specific embodiments are now described with reference to the accompanying drawings.

A flow chart of an embodiment of the unsupervised video semantic extraction method proposed by the present invention is shown in Fig. 1; the embodiment comprises steps S1 to S4 set out above.

As a preferred embodiment, step S1 includes the following sub-steps:

S11: Construct a three-dimensional convolutional neural network model comprising five convolutional layers, pooling layers, two fully connected layers, and one softmax layer; the structure of the constructed model is shown in Fig. 2;

S12: Before the three-dimensional convolutional neural network is trained with the labeled UCF-101 video set in the video database, the videos in the data set are preprocessed: each original video in the UCF-101 video set is converted into a set of video frame pictures at a fixed frame rate (FPS), and the pictures undergo image preprocessing consisting of resizing and noise filtering, which brings them to a uniform size of 112*112. Preprocessing is needed because, owing to various constraints and random interference, the picture sets often cannot be used directly, so resizing, noise filtering, and similar image preprocessing must be applied at an early stage of image processing;

The flow of training the three-dimensional convolutional neural network model is shown in Fig. 3. The parameters of the three-dimensional convolutional neural network are randomly initialized, the UCF-101 video data set is preprocessed, and the model is then trained with the backpropagation (BP) algorithm to obtain the optimal video action category recognition model.
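The text trains the network with the BP algorithm and ends the network in a softmax layer, but it does not name the loss. Assuming the usual cross-entropy loss for classification (an assumption, not a detail from the patent), the quantity that backpropagation starts from at the output layer can be sketched as follows; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_and_grad(logits, label):
    """Loss -log p[label] and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # For softmax followed by cross-entropy, d(loss)/d(logit_i) = p_i - y_i.
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return loss, grad


loss, grad = cross_entropy_and_grad([2.0, 0.5, -1.0], label=0)
print(round(loss, 4), [round(g, 4) for g in grad])
```

The gradient at the logits is the predicted distribution minus the one-hot label, and backpropagation carries it through the fully connected and convolutional layers to update their parameters.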

As a preferred embodiment, steps S2, S3, and S4 include sub-steps S21 to S22, S31 to S34, and S41 to S44, respectively, as set out above.

Fig. 4 is a schematic flow chart of extracting video semantic features in the method of an embodiment of the present invention: the video data set undergoes data preprocessing and sliding-window processing, the trained three-dimensional convolutional neural network extracts features to produce a feature sequence, and the recurrent autoencoder finally integrates the feature sequence into the semantic features.

Fig. 5 is an architecture diagram of an embodiment of the present invention based on the three-dimensional convolutional neural network and the recurrent autoencoder model. As shown, the video is processed into a video frame sequence; the three-dimensional convolutional neural network extracts frame features from the processed sequence to form a video frame feature sequence, which is converted into an encoded feature sequence and passed through the recurrent autoencoder to obtain the video semantic features.

By combining a three-dimensional convolutional neural network with a recurrent autoencoder, the embodiments of the present invention solve the problem of unsupervised video semantic analysis and extraction and improve the accuracy of video semantic extraction.

It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the order of actions described, because according to the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and units involved are not necessarily required by the present application.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, or the like.

The above discloses only preferred embodiments of the present invention and does not limit the scope of the rights of the present invention; equivalent changes made in accordance with the claims of the present invention therefore remain within the scope of the present invention.

Claims

1. An unsupervised video semantic extraction method, characterized by comprising the following steps:

S1: constructing a three-dimensional convolutional neural network model, and training the model with the labeled UCF-101 video set in a video database;

S2: using a sliding window to process unlabeled video data in the video database into data that conforms to the input of the three-dimensional convolutional neural network;

S3: using the data generated in step S2 as the input data of the three-dimensional convolutional neural network model, and taking the output data of a fully connected layer of the model as the semantic features of each video segment;

S4: using the sequence of video-segment semantic features generated in step S3 as the input of a video semantic autoencoder, and integrating them through the autoencoder to obtain the overall semantic features of the video;

wherein step S1 comprises the following sub-steps:

S11: constructing a three-dimensional convolutional neural network model comprising five convolutional layers, pooling layers, two fully connected layers, and one softmax layer;

S12: before the three-dimensional convolutional neural network is trained with the labeled UCF-101 video set in the video database, preprocessing the videos in the data set: converting each original video in the UCF-101 video set into a set of video frame pictures at a fixed frame rate (FPS), applying image preprocessing consisting of resizing and noise filtering to the pictures, and bringing the pictures to a uniform size of 112*112;

S13: after preprocessing, each training video of the UCF-101 video set has the data form (X_n, L_n), where n indexes the training videos, X_n = [x_n(1), x_n(2), x_n(3), ..., x_n(m)] is the set of pictures obtained from video X_n by preprocessing, m is the number of picture frames the video is converted into, and L_n is the label type of video X_n; ffmpeg is used to convert each video into a picture sequence at 20 frames per second;

S14: based on the three-dimensional convolutional neural network model and a learning algorithm, training a video category recognition model with a high recognition rate using the preprocessed UCF-101 video data set;

step S2 comprises the following sub-steps:

S21: for any video in the test data whose number of frame pictures m does not satisfy m = kw, where k is an integer and w is the size of the sliding window, padding the frame picture set by copying the last frame of the video until m is a multiple of w;

S22: sliding a window over the video frame sequence to read frame pictures, with a stride equal to half the window size, the frame pictures read at each window position forming one input of the three-dimensional convolutional neural network; the window size is set to w = 16, so each test video is processed into a sequence of window picture sets, where w denotes the set of pictures captured by one window position and the k-th element of the sequence is the picture set obtained at the k-th slide of the window;

step S3 comprises the following sub-steps:

S31: using the three-dimensional convolutional neural network model trained on the UCF-101 video set in S1 to recognize the test video data processed in S2;

S32: fixing the output of the fully connected layer of the three-dimensional convolutional neural network to the number of sub-action categories;

S33: the input of the three-dimensional convolutional neural network is the window picture set defined in S22, and the output is the output of the first fully connected layer, F_k = [f₁, f₂, f₃, ..., f₄₀₉₆], where the dimension 4096 of F_k is the output dimension of the first fully connected layer of the three-dimensional convolutional neural network;

S34: for the test video data, the corresponding output of the three-dimensional convolutional neural network is [F₁, F₂, F₃, ..., F_k], whose dimension is 4096*k.

2. The unsupervised video semantic extraction method according to claim 1, wherein step S4 comprises the following sub-steps:

S41: using the semantic feature extraction result [F₁, F₂, F₃, ..., F_k], obtained for the test video data by the three-dimensional convolutional neural network model in S3, as the input of the video semantic autoencoder to extract the overall semantic features of the video;

S42: the recurrent autoencoder converts the input feature sequence [F₁, F₂, F₃, ..., F_k] into the feature pair sequence [[F₁, F₂], [F₂, F₃], [F₃, F₄], ..., [F_k-1, F_k]] and proceeds greedily: each pair in the feature pair sequence is taken in turn and merged into a parent feature, expressed as F_1,2 = f(W⁽¹⁾[F₁, F₂] + b⁽¹⁾), where W⁽¹⁾ is an n*n parameter matrix and b⁽¹⁾ is a bias term, both learned from the feature sequence pairs; F_1,2 is reconstructed as [F₁', F₂'] = W⁽²⁾F_1,2 + b⁽²⁾, where W⁽²⁾ is an n*n parameter matrix and b⁽²⁾ is a bias term distinct from b⁽¹⁾, both learned from the reconstruction error; the reconstruction error of the autoencoder measures the difference between the pair [F₁, F₂] and its reconstruction [F₁', F₂']; in the objective function of the recurrent autoencoder, A(x) denotes all possible semantic trees for the input sequence [F₁, F₂, F₃, ..., F_k] and T(y) denotes all possible feature pairs; one encoding pass of the recurrent autoencoder selects the feature pair with the smallest reconstruction error among all candidate pairs, removes that pair from the feature sequence, and uses its parent feature as the representative of the pair to form a new feature sequence;

S43: repeating the autoencoding process of S42 until the feature sequence contains a single feature vector;

S44: the recurrent autoencoder outputs the final feature vector as the semantic feature vector of video X_n.