CN110738128A - repeated video detection method based on deep learning - Google Patents
- Publication number
- CN110738128A CN110738128A CN201910888907.9A CN201910888907A CN110738128A CN 110738128 A CN110738128 A CN 110738128A CN 201910888907 A CN201910888907 A CN 201910888907A CN 110738128 A CN110738128 A CN 110738128A
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- features
- layer
- library
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a repeated video detection method based on deep learning. A neural network is used to extract features from existing videos and build a video feature library; features are then extracted from the video to be detected, and the Euclidean distance between these features and the features in the library is computed as the similarity measure. When the distance is below a set threshold, the video is marked as a duplicate. Unlike the traditional approach of extracting multiple frames from a video and building the feature library from a selected frame sequence, the method merges the intermediate-layer outputs of the neural network to generate a single feature description file for each video. The scheme provided by the embodiment of the invention detects duplicate videos with deep learning; the features extracted by a deep neural network represent videos better than traditional image feature extraction operators, so duplicate videos are detected with higher accuracy.
Description
Technical Field
The invention belongs to the technical fields of computer vision, digital image processing, and deep learning, and in particular relates to a repeated video detection method based on deep learning technology.
Background Art
With the advent of the Internet era, producing and distributing videos has become increasingly convenient, and video data has grown on a massive scale; with the widespread use of short-video applications, for example, shooting videos has gradually become a way for many people to share their lives. At the same time, however, a large number of duplicate videos are inevitably generated. Video content has economic value, and pirated videos are often exploited for profit: some copyrighted videos are modified and uploaded to video websites without the producers' authorization, creating copyright problems that harm the producers' interests and expose the websites to legal risk. Duplicate videos also increase the bandwidth and storage costs of video websites. Since today's video websites recommend videos according to user preferences, recommending duplicates degrades the viewing experience. The existence of duplicate videos therefore poses challenges for video copyright protection and content recommendation. Because the volume of video data is enormous, duplicates cannot be screened manually and must be identified with computer technology. Duplicate video retrieval thus has great practical significance.
Common duplicate videos mainly involve format conversion; adding subtitles or watermarks; and compression, rotation, clipping, and similar edits. Traditional file-hash detection generates the same hash value for video files with identical content and judges whether two files are the same video by comparing hashes. However, this approach only detects videos whose content is exactly identical: after a video file is modified, the hash generated by the algorithm changes greatly, so hashing cannot be used for duplicate video detection. Digital image processing techniques are therefore needed to inspect the video content automatically and judge video similarity. A human can easily tell whether two videos are duplicates, but this task is difficult for a computer. Unlike images, videos have temporal characteristics, which makes feature extraction harder and requires dedicated image processing algorithms. Current duplicate video detection methods mainly rely on traditional image feature descriptors such as SIFT [1] and SURF [2], but their robustness is low.
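To make the limitation of file-level hashing concrete, here is a purely illustrative sketch (not part of the patent): flipping a single byte of the stream, as even a light edit such as a watermark would, produces a completely different hash, so a modified copy can never match by hash.

```python
import hashlib

def file_hash(data: bytes) -> str:
    # Hash the raw bytes, as a traditional file-level duplicate check would.
    return hashlib.sha256(data).hexdigest()

original = b"fake video bytes " * 1000   # stand-in for a video file's bytes
modified = bytearray(original)
modified[0] ^= 0x01                      # simulate a one-byte edit

print(file_hash(original) == file_hash(bytes(modified)))  # False
```

This is exactly why content-based features are needed: the hash carries no notion of visual similarity.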
In recent years, convolutional neural networks have been widely applied to computer vision tasks, improving accuracy across a range of problems. Visual features extracted by convolutional neural networks are generally more robust, yet few approaches apply neural networks to duplicate video detection: existing systems still represent a whole video by features extracted from several of its frames and store the features of multiple frames in the retrieval system.
[References]
[1] Liu H, Lu H, Xue X. A Segmentation and Graph-Based Video Sequence Matching Method for Video Copy Detection[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(8): 1706-1718.
[2] Yang G, Chen N, Jiang Q. A robust hashing algorithm based on SURF for video copy detection[J]. Computers & Security, 2012, 31(1): 33-39.
[3] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2014.
Summary of the Invention
Aimed at scenarios requiring duplicate video detection, the invention proposes a repeated video detection method based on deep learning. First, a neural network extracts features from the video frames, with the intermediate-layer outputs of the network serving as the feature representation of each image; the features of all video frames are then fused into the feature representation of the video; finally, the distance between video features is used to measure the similarity of different videos.
To solve the above technical problem, the repeated video detection method based on deep learning proposed by the invention uses a neural network to extract features from existing videos and build a video feature library; features are then extracted from the video to be detected, and the Euclidean distance between these features and the features in the library is computed as the similarity measure. When the distance is below a set threshold, the video is marked as a duplicate.
The duplicate video detection method includes the following steps:
Step 1: Obtain video frames from the existing video set, yielding the set of all video frames.
Step 2: Extract features from the video frames using the intermediate layers of a convolutional neural network; the network uses the VGG16 architecture.
First, for a video, obtain its frame set S; every frame in the set is scaled to a 3-channel image of size 224×224 and fed to the neural network as input. The intermediate-layer outputs serve as the video features: from the VGG16 architecture, the feature maps of the conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, and conv5_3 layers are taken, 11 layers in total, all of them convolutional-layer outputs. These layers all use 3×3 convolution kernels with zero padding and a stride of 1 pixel.
Combining the feature outputs of these intermediate layers finally yields a unique 4096-dimensional feature vector for each video.
Step 3: Extract features from the video library V to obtain the video feature library Fv.
Step 4: In the retrieval stage, extract features from the video v to be retrieved.
Step 5: Compare the features of the video v against the features in the video feature library; if the condition is met, v is a duplicate video. The comparison and condition are set as follows: during retrieval, compute the distance d between the features of two videos i and j; given a threshold t, the videos are judged similar when d is less than t, and not similar otherwise.
More specifically, in Step 1, video frames are obtained from the existing video set, yielding the set of all video frames as follows:
V = (S(1), S(2), …, S(n)), S = (P(1), P(2), …, P(n)); V is the collection of per-video frame sets, S(n) is the frame set of the n-th video, and P(n) is the n-th frame of a video.
In Step 2, the process of combining the intermediate-layer feature outputs into a unique 4096-dimensional vector per video is as follows. The feature map output by each layer has dimensions:
F(k) = d(W(k) × W(k) × C(k)), k = 1, 2, …, 11    (1)
Equation (1) states that the feature map output by the k-th layer has dimensions W(k) × W(k) × C(k), where W(k) × W(k) is the spatial size of the k-th layer's feature map and C(k) is its number of channels;
Compress the dimensions of the feature maps:
FM(k) = max(F(k)), k = 1, 2, …, 11    (2)
Equation (2) takes the maximum over each channel of the k-th layer's feature map F(k), yielding a one-dimensional vector FM(k) of length C(k);
Concatenating the feature representations of all layers gives the feature representation FPn of an entire video frame. The output channel counts of the convolutional layers are 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, and 512, so the final feature dimension is the sum of the dimensions of the corresponding layers:
128 + 128 + 256 + 256 + 256 + 512 + 512 + 512 + 512 + 512 + 512 = 4096    (3)
That is, for a video frame P(n) the extracted vector has size 4096. The n 4096-dimensional vectors of all frames S = (P(1), P(2), …, P(n)) of each video are then averaged into a single 4096-dimensional vector T, which is normalized to give the feature representation F(V(n)) of the whole video V(n);
The normalization formula is:
Tv = (T − μ) / σ    (4)
In Equation (4), μ is the mean of the vector T and σ is its standard deviation; Tv is the final video feature vector. Each video thus finally receives a unique 4096-dimensional vector Tvn, representing the features of the n-th video.
In Step 3, features are extracted from the video library V in the manner of Step 2, yielding the video feature library Fv; in Step 4, features are extracted from the video v to be retrieved in the manner of Step 2.
In Step 5, the distance between videos i and j is computed as the Euclidean distance between their feature vectors: d(Tvi, Tvj) = ‖Tvi − Tvj‖ = √( Σk (Tvik − Tvjk)² ).
Compared with the prior art, the beneficial effects of the invention are as follows:
In the duplicate video detection method of the invention, obtaining the features of a video to be detected requires no manual judgment of whether the video duplicates the existing library: the video features extracted by the deep neural network are compared directly with the existing video features to decide whether the video is a duplicate. Unlike the traditional approach of extracting multiple frames from a video and building the feature library from a selected frame sequence, the method merges the intermediate-layer outputs of the neural network to generate a single feature description file for each video. The scheme provided by this embodiment of the invention detects duplicate videos with deep learning; the features extracted by a deep neural network represent videos better than traditional image feature extraction operators, so duplicate videos are detected with higher accuracy.
Brief Description of the Drawings
Fig. 1 is a flow chart of the deep-learning-based duplicate video detection method of the invention;
Fig. 2 is a schematic diagram of video frame feature extraction in the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and specific embodiments, but the following embodiments in no way limit the invention.
The main steps of the proposed deep-learning-based duplicate video detection method are: use a neural network to extract features from existing videos and build a video feature library; then extract features from the video to be detected and compute the Euclidean distance between these features and the features in the library as the similarity measure; when the distance is below a set threshold, mark the video as a duplicate. By extracting feature outputs from different levels of the neural network, the invention obtains feature representations spanning the video's low-level to high-level semantics, and combining features from different levels yields a more accurate video feature representation.
As shown in Fig. 1, the duplicate video detection method includes the following steps:
Step 1: Obtain video frames from the existing video set, yielding the set of all video frames, where V is the collection of per-video frame sets and S is the frame set of a single video:
V = (S(1), S(2), …, S(n)), where S(n) is the frame set of the n-th video.
S = (P(1), P(2), …, P(n)), where P(n) is the n-th frame of the video.
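As a sketch of Step 1, the helper below picks uniformly spaced frame indices from a video. The patent does not fix a sampling strategy, so uniform sampling and the function name are assumptions for illustration; in practice the frames would be read with OpenCV and scaled to 224×224, as the embodiment notes later.

```python
def sample_frame_indices(total_frames, num_samples):
    # Evenly spaced indices into the video's frames (uniform sampling assumed).
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# A 300-frame video sampled at 10 frames:
print(sample_frame_indices(300, 10))  # [0, 30, 60, ..., 270]
```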
Step 2: Extract features from the video frames using the intermediate layers of the convolutional neural network VGG16 [3]. The network structure of VGG16 is shown in the following table:
Layer denotes the layers of the network, Output Shape the output dimensions of each layer, and Param the number of parameters in each layer.
First, for a video, obtain its frame set S; every frame in the set is scaled to a 3-channel 224×224 image and used as the network input. The intermediate-layer outputs serve as the video features: from VGG16, the feature maps of the conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, and conv5_3 layers are taken, 11 layers in total, all of them convolutional-layer outputs. These layers all use 3×3 convolution kernels with zero padding and a stride of 1 pixel.
The feature map output by each layer has dimensions:
F(k) = d(W(k) × W(k) × C(k)), k = 1, 2, …, 11    (1)
Equation (1) states that the feature map output by the k-th layer has dimensions W(k) × W(k) × C(k), where W(k) × W(k) is the spatial size of the k-th layer's feature map and C(k) is its number of channels.
Next, compress the dimensions of the feature maps:
FM(k) = max(F(k)), k = 1, 2, …, 11    (2)
Equation (2) takes the maximum over each channel of the k-th layer's feature map F(k), yielding a one-dimensional vector FM(k) of length C(k).
The feature representations of all layers are then concatenated, as shown in Fig. 2, giving the feature representation FPn of the entire video frame. As the table in Step 2 shows, the output channel counts of the convolutional layers are 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, and 512, so the final feature dimension is the sum of the dimensions of the corresponding layers:
128 + 128 + 256 + 256 + 256 + 512 + 512 + 512 + 512 + 512 + 512 = 4096    (3)
That is, for a video frame P(n) the extracted vector has size 4096. The n 4096-dimensional vectors of all frames S = (P(1), P(2), …, P(n)) of each video are then averaged into a single 4096-dimensional vector T, which is normalized to give the feature representation F(V(n)) of the whole video V(n).
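The construction of the 4096-dimensional frame descriptor from Eqs. (2) and (3) can be sketched with NumPy. Random arrays stand in for the real VGG16 activations here, and the spatial sizes below are those standard VGG16 produces for a 224×224 input — an assumption of this illustration, not a statement from the patent.

```python
import numpy as np

# (spatial size, channels) of the 11 feature maps, conv2_1 ... conv5_3.
LAYER_SHAPES = [(112, 128), (112, 128),
                (56, 256), (56, 256), (56, 256),
                (28, 512), (28, 512), (28, 512),
                (14, 512), (14, 512), (14, 512)]

def frame_descriptor(feature_maps):
    # Eq. (2): max over the spatial dimensions of each channel,
    # then Eq. (3): concatenate the 11 pooled vectors.
    pooled = [fm.max(axis=(0, 1)) for fm in feature_maps]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
maps = [rng.random((w, w, c)) for w, c in LAYER_SHAPES]
desc = frame_descriptor(maps)
print(desc.shape)  # (4096,)
```

Note that the channel counts sum to 4096, matching Eq. (3).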
The normalization formula is:
Tv = (T − μ) / σ    (4)
Here μ is the mean of the vector T, σ is its standard deviation, and Tv is the final video feature vector of dimension 4096. For each video's frame set, a unique 4096-dimensional vector Tvn is finally obtained, representing the features of the n-th video.
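Averaging the frame descriptors and applying Eq. (4) might look as follows. Dividing by the standard deviation (a z-score over the vector itself) is assumed as the reading of Tv = (T − μ)/σ; the source's wording on σ is ambiguous.

```python
import numpy as np

def video_feature(frame_descriptors):
    # Mean over all frame descriptors of a video, then Eq. (4):
    # Tv = (T - mu) / sigma, with mu and sigma taken over the vector T itself.
    T = np.mean(frame_descriptors, axis=0)
    return (T - T.mean()) / T.std()

rng = np.random.default_rng(1)
frames = rng.random((20, 4096))  # 20 frame descriptors of one video
Tv = video_feature(frames)
print(Tv.shape)  # (4096,)
```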
Step 3: Extract features from the video library V in the manner of Step 2 to obtain the video feature library Fv.
Step 4: In the retrieval stage, for the video v to be retrieved, extract its feature Tv in the manner of Step 2.
Step 5: Compare Tv against the features in the video feature library; if the condition is met, v is a duplicate video. The comparison and condition are set as follows:
During retrieval, the distance between the features of videos i and j is computed as the Euclidean distance: d(Tvi, Tvj) = ‖Tvi − Tvj‖ = √( Σk (Tvik − Tvjk)² ).
A threshold t is set; once the distance d is obtained, the video is judged a duplicate when d is less than t, and otherwise not a duplicate.
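The decision rule of Step 5 reduces to a few lines. The Euclidean distance follows the abstract; t = 0.3 is the value suggested in the discrimination stage, and the toy 3-dimensional vectors are for illustration only.

```python
import numpy as np

def is_duplicate(tv_i, tv_j, t=0.3):
    # Duplicate when the Euclidean distance between the two video
    # feature vectors falls below the threshold t.
    d = float(np.linalg.norm(np.asarray(tv_i) - np.asarray(tv_j)))
    return d < t

a = np.array([0.1, -0.2, 0.3])
print(is_duplicate(a, a))        # True: identical features, d = 0
print(is_duplicate(a, a + 1.0))  # False: d is well above 0.3
```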
In summary, the deep-learning-based duplicate video detection method of the invention comprises a feature library establishment stage and a discrimination stage.
1. Feature library establishment stage: features are extracted from existing videos with the neural network, and the merged intermediate-layer outputs serve as each video's feature representation; every video corresponds to one 4096-dimensional feature file (Fig. 2 shows the principle of single-frame feature extraction), which is stored in the database to form the video feature library.
2. Discrimination stage: for the video under examination, features are extracted with the same method and the Euclidean distance to each video feature already in the library is computed; if the distance is below the preset threshold t (typically set to 0.3), the video is judged a duplicate, otherwise not. The kinds of duplication this embodiment can identify include: added subtitles, picture scaling, picture-in-picture, format changes, video length changes, watermarking, mirror flipping, brightness changes, and picture cropping.
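The two stages can be tied together in a minimal in-memory sketch. The class, its linear scan, and the toy 3-dimensional vectors are illustrative assumptions; the patent only specifies storing 4096-dimensional features in a database and comparing Euclidean distances against the threshold.

```python
import numpy as np

class VideoFeatureLibrary:
    # Toy stand-in for the feature database of the establishment stage.
    def __init__(self, threshold=0.3):
        self.features = {}        # video id -> feature vector
        self.threshold = threshold

    def add(self, video_id, feature):
        self.features[video_id] = np.asarray(feature, dtype=float)

    def find_duplicate(self, query):
        # Discrimination stage: return the closest stored video if its
        # Euclidean distance to the query falls below the threshold.
        query = np.asarray(query, dtype=float)
        best_id, best_d = None, float("inf")
        for vid, feat in self.features.items():
            d = float(np.linalg.norm(query - feat))
            if d < best_d:
                best_id, best_d = vid, d
        return (best_id, best_d) if best_d < self.threshold else (None, best_d)

lib = VideoFeatureLibrary()
lib.add("v1", [0.0, 1.0, 0.0])
lib.add("v2", [5.0, 5.0, 5.0])
dup, dist = lib.find_duplicate([0.0, 1.0, 0.1])
print(dup)  # "v1": distance 0.1 is below the 0.3 threshold
```

A production system would replace the linear scan with an approximate nearest-neighbor index, but the decision rule stays the same.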
The above network uses the TensorFlow deep learning framework; the video frame sequence is obtained with OpenCV and scaled to the input size required by the neural network. In this embodiment, the feature representation of the whole video is obtained by merging the feature maps of the VGG16 intermediate layers.
Although the invention has been described above with reference to the accompanying drawings, it is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. Those of ordinary skill in the art, inspired by the invention, can make many variations without departing from its spirit, and all such variations fall within the protection of the invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910888907.9A CN110738128A (en) | 2019-09-19 | 2019-09-19 | repeated video detection method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910888907.9A CN110738128A (en) | 2019-09-19 | 2019-09-19 | repeated video detection method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110738128A true CN110738128A (en) | 2020-01-31 |
Family
ID=69268280
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910888907.9A Pending CN110738128A (en) | 2019-09-19 | 2019-09-19 | repeated video detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110738128A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111506773A (en) * | 2020-03-24 | 2020-08-07 | 中国科学院大学 | Video duplicate removal method based on unsupervised depth twin network |
CN111723692A (en) * | 2020-06-03 | 2020-09-29 | 西安交通大学 | A near-duplicate video detection method based on label features of convolutional neural network semantic classification |
CN112399236A (en) * | 2020-10-09 | 2021-02-23 | 北京达佳互联信息技术有限公司 | Video duplicate checking method and device and electronic equipment |
CN112528856A (en) * | 2020-12-10 | 2021-03-19 | 天津大学 | Repeated video detection method based on characteristic frame |
CN113779304A (en) * | 2020-08-19 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Method and device for detecting infringement video |
WO2022086767A1 (en) * | 2020-10-22 | 2022-04-28 | Micron Technology, Inc. | Accelerated video processing for feature recognition via an artificial neural network configured in a data storage device |
US11599856B1 (en) | 2022-01-24 | 2023-03-07 | My Job Matcher, Inc. | Apparatuses and methods for parsing and comparing video resume duplications |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701480A (en) * | 2016-02-26 | 2016-06-22 | 江苏科海智能系统有限公司 | Video semantic analysis method |
CN105913456A (en) * | 2016-04-12 | 2016-08-31 | 西安电子科技大学 | Video significance detecting method based on area segmentation |
CN106446015A (en) * | 2016-08-29 | 2017-02-22 | 北京工业大学 | Video content access prediction and recommendation method based on user behavior preference |
CN106991373A (en) * | 2017-03-02 | 2017-07-28 | 中国人民解放军国防科学技术大学 | A kind of copy video detecting method based on deep learning and graph theory |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108764019A (en) * | 2018-04-03 | 2018-11-06 | 天津大学 | A kind of Video Events detection method based on multi-source deep learning |
CN108848422A (en) * | 2018-04-19 | 2018-11-20 | 清华大学 | A kind of video abstraction generating method based on target detection |
CN109341703A (en) * | 2018-09-18 | 2019-02-15 | 北京航空航天大学 | A full-cycle visual SLAM algorithm using CNNs feature detection |
CN109766823A (en) * | 2019-01-07 | 2019-05-17 | 浙江大学 | A high-resolution remote sensing ship detection method based on deep convolutional neural network |
CN109815364A (en) * | 2019-01-18 | 2019-05-28 | 上海极链网络科技有限公司 | A kind of massive video feature extraction, storage and search method and system |
- 2019-09-19: CN CN201910888907.9A patent/CN110738128A/en active Pending
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701480A (en) * | 2016-02-26 | 2016-06-22 | 江苏科海智能系统有限公司 | Video semantic analysis method |
CN105913456A (en) * | 2016-04-12 | 2016-08-31 | 西安电子科技大学 | Video significance detecting method based on area segmentation |
CN106446015A (en) * | 2016-08-29 | 2017-02-22 | 北京工业大学 | Video content access prediction and recommendation method based on user behavior preference |
CN106991373A (en) * | 2017-03-02 | 2017-07-28 | 中国人民解放军国防科学技术大学 | A kind of copy video detecting method based on deep learning and graph theory |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108764019A (en) * | 2018-04-03 | 2018-11-06 | 天津大学 | A kind of Video Events detection method based on multi-source deep learning |
CN108848422A (en) * | 2018-04-19 | 2018-11-20 | 清华大学 | A kind of video abstraction generating method based on target detection |
CN109341703A (en) * | 2018-09-18 | 2019-02-15 | 北京航空航天大学 | A full-cycle visual SLAM algorithm using CNNs feature detection |
CN109766823A (en) * | 2019-01-07 | 2019-05-17 | 浙江大学 | A high-resolution remote sensing ship detection method based on deep convolutional neural network |
CN109815364A (en) * | 2019-01-18 | 2019-05-28 | 上海极链网络科技有限公司 | A kind of massive video feature extraction, storage and search method and system |
Non-Patent Citations (2)
Title |
---|
Wang Dongdong: "Research on Slowly-Varying Visual Feature Learning Algorithms Based on Deep Models" *
Zhao Yixuan: "Near-Duplicate Video Detection Based on Short Videos" *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111506773A (en) * | 2020-03-24 | 2020-08-07 | 中国科学院大学 | Video duplicate removal method based on unsupervised depth twin network |
CN111723692A (en) * | 2020-06-03 | 2020-09-29 | 西安交通大学 | A near-duplicate video detection method based on label features of convolutional neural network semantic classification |
CN111723692B (en) * | 2020-06-03 | 2022-08-09 | 西安交通大学 | Near-repetitive video detection method based on label features of convolutional neural network semantic classification |
CN113779304A (en) * | 2020-08-19 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Method and device for detecting infringement video |
CN112399236A (en) * | 2020-10-09 | 2021-02-23 | 北京达佳互联信息技术有限公司 | Video duplicate checking method and device and electronic equipment |
WO2022086767A1 (en) * | 2020-10-22 | 2022-04-28 | Micron Technology, Inc. | Accelerated video processing for feature recognition via an artificial neural network configured in a data storage device |
US11741710B2 (en) | 2020-10-22 | 2023-08-29 | Micron Technology, Inc. | Accelerated video processing for feature recognition via an artificial neural network configured in a data storage device |
CN112528856A (en) * | 2020-12-10 | 2021-03-19 | 天津大学 | Repeated video detection method based on characteristic frame |
CN112528856B (en) * | 2020-12-10 | 2022-04-15 | 天津大学 | Repeated video detection method based on characteristic frame |
US11599856B1 (en) | 2022-01-24 | 2023-03-07 | My Job Matcher, Inc. | Apparatuses and methods for parsing and comparing video resume duplications |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110738128A (en) | repeated video detection method based on deep learning | |
CN104618803B (en) | Information-pushing method, device, terminal and server | |
US9646358B2 (en) | Methods for scene based video watermarking and devices thereof | |
CN106156693B (en) | Robust error correction method based on multi-model representation for face recognition | |
JP5711387B2 (en) | Method and apparatus for comparing pictures | |
CN111325271B (en) | Image classification method and device | |
WO2021129435A1 (en) | Method for training video definition evaluation model, video recommendation method, and related device | |
Srinivas et al. | An improved algorithm for video summarization–a rank based approach | |
US20090290752A1 (en) | Method for producing video signatures and identifying video clips | |
CN103065153A (en) | Video key frame extraction method based on color quantization and clusters | |
US20090263014A1 (en) | Content fingerprinting for video and/or image | |
CN111325169B (en) | Deep video fingerprint algorithm based on capsule network | |
CN108154080B (en) | Method for quickly tracing to source of video equipment | |
CN111506773A (en) | Video duplicate removal method based on unsupervised depth twin network | |
CN107247919A (en) | The acquisition methods and system of a kind of video feeling content | |
CN110096945A (en) | Indoor Video key frame of video real time extracting method based on machine learning | |
Jeong et al. | Visual comfort assessment of stereoscopic images using deep visual and disparity features based on human attention | |
CN114494775A (en) | Video segmentation method, device, device and storage medium | |
US8988219B2 (en) | Alert system based on camera identification | |
CN112488072A (en) | Method, system and equipment for acquiring face sample set | |
CN106683074B (en) | A kind of distorted image detection method based on haze characteristic | |
Nie et al. | Robust video hashing based on representative-dispersive frames | |
CN108021927A (en) | A kind of method for extracting video fingerprints based on slow change visual signature | |
Wang et al. | A visual saliency based video hashing algorithm | |
Neelima et al. | Collusion and rotation resilient video hashing based on scale invariant feature transform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20200131 |