CN103824284B - Key frame extraction method and system based on a visual attention model

Key frame extraction method and system based on a visual attention model

Info

Publication number
CN103824284B
Authority
CN
China
Prior art keywords: saliency, key, salient, domain, key frame
Prior art date
Legal status
Expired - Fee Related
Application number
CN201410039072.7A
Other languages
Chinese (zh)
Other versions
CN103824284A
Inventor
纪庆革 (Ji Qingge)
赵杰 (Zhao Jie)
刘勇 (Liu Yong)
Current Assignee
Sun Yat Sen University
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Original Assignee
Sun Yat Sen University
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University and Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Priority to CN201410039072.7A
Publication of CN103824284A
Application granted
Publication of CN103824284B
Expired - Fee Related

Landscapes

  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses a key frame extraction method and system based on a visual attention model. In the spatial domain, the method performs saliency detection by filtering global contrast with binomial coefficients and extracts the target region with an adaptive threshold; this not only preserves the boundary of the salient target region well, but also keeps the saliency within the region fairly uniform. In the temporal domain, the method defines a motion saliency, estimates the target motion through homography matrices, and detects saliency on key points instead of the target itself; it then fuses the spatial saliency data and obtains a bounding box as the temporal salient target region through an energy-function-based boundary expansion. Finally, the method reduces the richness of the video through the salient target regions and extracts key frames with a shot-adaptive method combined with online clustering.

Description

A key frame extraction method and system based on a visual attention model

Technical Field

The present invention relates to the technical field of video analysis, and in particular to a key frame extraction method and system based on a visual attention model.

Background Art

With the rapid development of Internet technology, we have entered an era of information explosion in which a wide range of network applications and multimedia technologies are in widespread use. Video, as a common carrier of network information, is vivid and intuitive and has strong expressive power, so it is widely used in many fields, which has led to a massive growth of video data. Taking the well-known video website YouTube as an example, about 60 hours of video are uploaded by users every minute (figure as of January 23, 2012), and the volume is still growing. How to store, manage, and access massive video resources quickly and effectively has become an important problem in current video applications. Because video has temporal correlation, users traditionally have to watch a video from beginning to end to grasp its content. Irrelevant videos not only take up a great deal of the user's time but also waste a large amount of network bandwidth. Therefore, auxiliary information needs to be added to videos to help users filter them more effectively. Mature systems currently rely on traditional text annotation, in which videos are classified manually and given artificial semantics through titles, descriptions, and other text. For massive volumes of video this task is not only labor-intensive; different people also understand the same video differently, so others cannot judge from the author's text annotations whether a video matches their interests.

Therefore, there is an urgent need for an automated way to summarize videos effectively.

Summary of the Invention

To overcome the deficiencies of the prior art, the present invention first provides a video key frame extraction method based on a visual attention model, which can effectively obtain key frames that are highly representative of video shots.

Another object of the present invention is to provide a video key frame extraction system based on a visual attention model.

To achieve the above objects, the technical solution of the present invention is as follows:

A video key frame extraction method based on a visual attention model, comprising:

In the spatial domain, performing saliency detection by filtering global contrast with binomial coefficients, and extracting the target region with an adaptive threshold; this not only preserves the boundary of the salient target region well, but also keeps the saliency within the region fairly uniform.

In the temporal domain, defining a motion saliency, estimating the target motion through homography matrices, detecting saliency on key points instead of the target itself, fusing the spatial saliency data, and obtaining a bounding box as the temporal salient target region through an energy-function-based boundary expansion;

Reducing the richness of the video through the salient target regions, and extracting key frames with a shot-adaptive method combined with online clustering.

A video key frame extraction system based on a visual attention model, comprising a salient region extraction module and a key frame extraction module.

Specifically, the salient region extraction module comprises:

a spatial salient region extraction module, configured to extract salient regions in the spatial domain;

a temporal key point saliency acquisition module, configured to extract the saliency values of key points in the temporal domain;

a fusion module, configured to fuse the salient regions in the spatial domain with the key points in the temporal domain and finally obtain the salient regions.

The key frame extraction module comprises:

a static shot key frame extraction module, configured to extract key frames from static shots;

a dynamic shot key frame extraction module, configured to extract key frames from dynamic shots;

a shot adaptation module, configured to control switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.

Compared with the prior art, the beneficial effect of the present invention is that it can automatically and effectively summarize a video and obtain key frames that are highly representative of the video shots.

Brief Description of the Drawings

Fig. 1 is a flowchart of key frame extraction for static shots according to the present invention.

Fig. 2 is a flowchart of key frame extraction for dynamic shots according to the present invention.

Fig. 3 is a flowchart of shot-adaptive key frame extraction according to the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

A specific embodiment of the video key frame extraction method based on a visual attention model disclosed by the present invention is as follows:

First, in the spatial domain, saliency detection is performed by filtering global contrast with binomial coefficients, and the target region is extracted with an adaptive threshold, as follows:

(11) The binomial coefficients are constructed according to Pascal's (Yang Hui's) triangle, and the normalization factor of the N-th layer is 2^N. The fourth layer is selected, so the filter coefficients are B_4 = (1/16)[1 4 6 4 1];

(12) Let I be the original stimulus intensity, \bar{I} the mean of the surrounding stimulus intensities, and I_{B_4} the convolution of I with B_4. Each pixel is represented as a vector in the CIELAB color space to measure stimulus strength, and the contrast between stimuli is the Euclidean distance between two CIELAB vectors, so the stimulus (saliency) detection at pixel (x, y) is

S(x, y) = \| I_{B_4}(x, y) - \bar{I} \|    (1)

(13) After the saliency measurement set S_s = (s_{11}, s_{12}, ..., s_{NM}) is obtained, the target region is extracted with an adaptive threshold, where s_{ij} (0 ≤ i ≤ N, 0 ≤ j ≤ M) is the saliency of pixel (i, j), and M and N are the width and height of the image, respectively.
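
As an illustration, the following minimal sketch shows the spatial saliency of steps (11)–(13) in Python with OpenCV. It assumes the surrounding stimulus mean \bar{I} is approximated by the global mean CIELAB vector of the image; the function name and parameters are illustrative and not part of the patent.

```python
import cv2
import numpy as np

def spatial_saliency(bgr):
    """Minimal sketch of the binomial-filter global-contrast saliency of Eq. (1).

    The image is smoothed with the separable fourth-layer binomial kernel
    B4 = (1/16)[1 4 6 4 1], converted to CIELAB, and each pixel's saliency is
    the Euclidean distance between its filtered Lab vector and the mean Lab
    vector of the image (used here as the surrounding stimulus mean).
    """
    b4 = np.array([1, 4, 6, 4, 1], dtype=np.float32) / 16.0
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    smoothed = cv2.sepFilter2D(lab, -1, b4, b4)         # I * B4, separable filtering
    mean_lab = lab.reshape(-1, 3).mean(axis=0)          # global mean stimulus (assumption)
    sal = np.linalg.norm(smoothed - mean_lab, axis=2)   # Eq. (1): ||I_B4(x, y) - mean||
    return cv2.normalize(sal, None, 0.0, 1.0, cv2.NORM_MINMAX)
```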

Specifically, the extraction of the target region with the adaptive threshold is implemented as follows:

(21) Define the global saliency detection of pixel (x, y) as

S_g(x, y) = \frac{1}{A} \sum_{i=0}^{N} \sum_{j=0}^{M} \| I_{B_4}(x, y) - I(i, j) \|    (2)

where A is the detection area, I_{B_4}(x, y) is the stimulus intensity of pixel (x, y) after the original image is filtered by B_4, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;

(22) The computation is accelerated with a histogram: the original stimulus intensity I is mapped into the stimulus space I_{B_4}(I), and the saliency of the stimulus perceived by the user is

S(I_{B_4}(I)) = \frac{1}{(m-1) D(I_{B_4}(I))} \sum_{i=1}^{m} \left( D(I_{B_4}(I)) - \| I_{B_4}(I) - I_{B_4}(I_i) \| \right) S_g(I_{B_4}(I))    (3)

where D(I_{B_4}(I)) is the distance between the stimulus I_{B_4}(I) and its m nearest stimuli, and m is a manually set control parameter, taken as m = 8 in this embodiment;

(23) The foreground and background regions are specified by varying the threshold T_s, and the threshold that yields the minimum energy is taken as the optimal threshold. The energy function with threshold T_s is defined as

E(I, T_s, \lambda, \sigma) = \lambda \sum_{n=1}^{N} ( f(T_s, S_n) S_n ) + V(I, T_s, \sigma)    (4)

where S_n is obtained from formula (2), λ is the weight of the salient target energy (λ = 1.0 in this embodiment), N is the total number of pixels of the image, and f(T_s, S_n) = max(0, sign(S_n - T_s)). V(I, T_s, σ) measures the similarity to the surrounding stimuli and is computed over point pairs formed by each salient point under the current T_s and the pixels in its 8-neighborhood; dist(p, q) is the spatial distance between two points, and σ is a manually set control parameter, taken as σ = 10.0 in this embodiment.

Therefore, given an image and its saliency map, T_s is estimated by minimizing the energy function; a pixel is labeled 1 if it belongs to the salient target and 0 otherwise. The parameters λ and σ need to be set manually in advance.
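
A hedged sketch of how the optimal T_s could be found by brute-force search over candidate thresholds, given an energy of the form of Eq. (4). The pairwise term V(I, T_s, σ) is left to the caller as `energy_fn`, because its exact form is only partially specified above; all names are illustrative.

```python
import numpy as np

def estimate_threshold(sal, energy_fn, num_candidates=64):
    """Brute-force search for the threshold T_s minimising an energy of the
    form of Eq. (4).  `sal` is the saliency map; `energy_fn(ts, mask)` must
    return the scalar energy for candidate threshold `ts` and the foreground
    mask it induces (lambda, sigma and V(I, T_s, sigma) live inside it).
    """
    candidates = np.linspace(sal.min(), sal.max(), num_candidates)
    best_ts, best_e = candidates[0], np.inf
    for ts in candidates:
        mask = sal > ts                  # f(T_s, S_n) = max(0, sign(S_n - T_s))
        e = energy_fn(ts, mask)
        if e < best_e:
            best_e, best_ts = e, ts
    # Pixels above the optimal threshold are labelled 1 (salient), the rest 0.
    return best_ts, (sal > best_ts).astype(np.uint8)
```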

Then, in the temporal domain, the motion saliency is defined, the target motion is estimated through homography matrices, key points are used instead of the target for saliency detection, the spatial saliency data is then fused, and a bounding box is obtained as the temporal salient target region through an energy-function-based boundary expansion, as follows:

(31) Given an image, the key points of the image are obtained with the FAST (Features from Accelerated Segment Test) feature point detection algorithm, which has good real-time performance;

(32) Given two adjacent frames, FLANN (Fast Library for Approximate Nearest Neighbors) is used for fast matching of corresponding points;

(33) A homography matrix H is used to describe the motion of the key points. Since a single H describes only one form of motion and the motions present within a video are diverse, multiple H are needed to describe the different motions. In this embodiment, the RANSAC algorithm is applied iteratively to obtain a series of homography estimates H = {H_1, H_2, ..., H_n};

(34) The temporal saliency of a key point is defined as

S_t(p_m) = \frac{A_m}{W \times H} \sum_{i=1}^{n} A_i D(p_m, H_i)    (5)

where A_m is the distribution area of all key points in motion state H_m, and W and H are the width and height of the video image;

(35) The spatial saliency values are fused with the temporal saliency values of the acquired key points;

(36) A bounding box is obtained as the temporal salient target region with the energy-function-based boundary expansion method.
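
The following sketch illustrates steps (31)–(33) with OpenCV: FAST key points, FLANN matching between adjacent frames, and iterative RANSAC to collect several homographies. Using ORB descriptors on top of the FAST corners is an assumption made only so that FLANN has something to match; the patent itself names only FAST, FLANN, and RANSAC.

```python
import cv2
import numpy as np

def estimate_motions(prev_gray, cur_gray, max_models=5, min_inliers=12):
    """Collect several homographies H = {H1, ..., Hn}, one per motion, by
    repeatedly fitting RANSAC homographies to FAST/FLANN correspondences
    between two adjacent frames (steps (31)-(33))."""
    fast = cv2.FastFeatureDetector_create()
    orb = cv2.ORB_create()                              # assumed descriptor (see lead-in)
    kp1, des1 = orb.compute(prev_gray, fast.detect(prev_gray, None))
    kp2, des2 = orb.compute(cur_gray, fast.detect(cur_gray, None))
    if des1 is None or des2 is None:
        return []

    # FLANN matcher configured with LSH parameters for binary descriptors.
    flann = cv2.FlannBasedMatcher(
        dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1), {})
    matches = flann.match(des1, des2)
    if len(matches) < min_inliers:
        return []

    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    homographies, remaining = [], np.ones(len(matches), dtype=bool)
    for _ in range(max_models):
        if remaining.sum() < min_inliers:
            break
        H, inliers = cv2.findHomography(src[remaining], dst[remaining], cv2.RANSAC, 3.0)
        if H is None or inliers is None or inliers.sum() < min_inliers:
            break
        homographies.append(H)
        idx = np.flatnonzero(remaining)                       # drop this motion's inliers,
        remaining[idx[inliers.ravel().astype(bool)]] = False  # then look for the next one
    return homographies
```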

Specifically, the fusion of the spatial saliency values with the temporal saliency values of the acquired key points is implemented as follows:

(41) Define a motion saliency contrast, where the temporal saliency value S_t of a key point is obtained from formula (5) and \bar{S}_t is the mean of the temporal saliency values of the key points;

(42) The motion saliency should target objects that remain strongly discriminable in the spatial domain, so the statistical range of the temporal saliency S_t must be restricted: let p_i be the i-th key point of S_t; then p_i should satisfy a constraint defined with respect to \bar{S}_s, the mean spatial saliency value;

(43) Define a temporal weight and a spatial weight, and add the temporal and spatial saliency values of the key points satisfying (42) according to these weights.
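
A small sketch of one possible reading of steps (41)–(43). The patent's weight formulas and the exact spatial constraint of step (42) are not recoverable from the text above, so the weights are plain parameters here and the filter keeps key points whose spatial saliency is at least the mean; both are assumptions.

```python
import numpy as np

def fuse_saliency(spatial_vals, temporal_vals, w_t=0.5, w_s=0.5):
    """Fuse per-key-point spatial and temporal saliency values.

    `spatial_vals[i]` and `temporal_vals[i]` belong to the same key point.
    Returns the boolean mask of retained key points and their fused values.
    """
    spatial_vals = np.asarray(spatial_vals, dtype=np.float64)
    temporal_vals = np.asarray(temporal_vals, dtype=np.float64)
    keep = spatial_vals >= spatial_vals.mean()     # assumed form of the step (42) filter
    fused = w_t * temporal_vals[keep] + w_s * spatial_vals[keep]
    return keep, fused
```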

Specifically, the extraction of the temporal salient target region is implemented as follows:

A salient key point p in the spatial domain is taken as the seed point, and the seed region is a rectangular bounding box B. Let b_i be the four edges of bounding box B, with i ∈ {1, 2, 3, 4} indexing top, bottom, left, and right. The boundary expansion algorithm is as follows:

Initialization: the top, bottom, left, and right vertices of bounding box B are all set to the position of key point p, and p is an interior point of B.

Step 1: For i = 1 onward in increasing order, compute the saliency energy E_outer(i) on the outer boundary of b_i and the saliency energy E_inner(i) on the inner boundary, the energy function being computed as in formula (4); then compute the weight w(i) that determines whether edge i may be expanded outward, where l_i is the length of the i-th edge of the current bounding box B.

Step 2: If w(i) ≥ ε, the i-th edge is expanded outward by one pixel unit. ε is the threshold for the expansion decision and must be set in advance; in this embodiment it is set to 0.8·T_s′, where T_s′ is the mean spatial saliency inside the bounding box.

Step 3: If no new edge was expanded in Step 2, stop the algorithm and output bounding box B; otherwise, repeat Step 1 and Step 2.
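
A hedged sketch of the boundary-expansion loop (Steps 1–3). Because the exact expansion weight w(i) is not recoverable above, the mean saliency of the one-pixel strip just outside each edge is used as the expansion score and compared against ε = 0.8 times the mean saliency inside the current box; the names and the iteration cap are illustrative.

```python
import numpy as np

def grow_bounding_box(sal, seed, eps_scale=0.8, max_iter=10000):
    """Grow a rectangular box around a salient seed point (Steps 1-3)."""
    h, w = sal.shape
    y, x = seed
    top, bottom, left, right = y, y, x, x                  # box initialised to the seed point

    for _ in range(max_iter):
        eps = eps_scale * sal[top:bottom + 1, left:right + 1].mean()
        grown = False
        # One-pixel strips just outside each of the four edges (None at the image border).
        strips = {
            "top":    sal[top - 1, left:right + 1]    if top > 0        else None,
            "bottom": sal[bottom + 1, left:right + 1] if bottom < h - 1 else None,
            "left":   sal[top:bottom + 1, left - 1]   if left > 0       else None,
            "right":  sal[top:bottom + 1, right + 1]  if right < w - 1  else None,
        }
        for side, strip in strips.items():
            if strip is not None and strip.mean() >= eps:   # Step 2: expand this edge
                if side == "top":    top -= 1
                if side == "bottom": bottom += 1
                if side == "left":   left -= 1
                if side == "right":  right += 1
                grown = True
        if not grown:                                       # Step 3: no edge expanded
            break
    return top, bottom, left, right
```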

Finally, the richness of the video is reduced through the salient target regions, and key frames are extracted with a shot-adaptive method combined with online clustering, as follows:

(51) The RGB color space of the salient region is converted to the HSV color space, and the H (hue) and S (saturation) components are used to compute a hue-saturation histogram. Let H_p(i) denote the i-th bin value of the hue-saturation histogram of the salient target region of frame p. This embodiment uses the Bhattacharyya distance to measure the visual distance D_sal(p, q) between two frames p and q.

(52) Key frame extraction uses a shot-adaptive method combined with online clustering, with the clustering scheme for static shots as the primary mode and the clustering scheme for dynamic shots as a supplement. For static shots, online clustering is performed based on the hue-saturation histogram of the salient region, and an arbitrary frame in each cluster is selected as a key frame. For dynamic shots, the salient moving target is first tracked; the tracking of the salient moving target then serves as the basis for online clustering, and the position information of the salient target serves as the basis for extracting key frames from the clusters.
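
For step (51), the hue-saturation histogram and the Bhattacharyya distance can be computed directly with OpenCV, as sketched below; the bin counts are assumptions, since the text does not fix them. cv2.compareHist with HISTCMP_BHATTACHARYYA normalises the histograms internally, so the raw calcHist output can be passed in directly.

```python
import cv2

def hs_histogram(bgr_region, h_bins=30, s_bins=32):
    """Hue-saturation histogram of the salient region, step (51).
    The bin counts are assumptions; the patent does not fix them."""
    hsv = cv2.cvtColor(bgr_region, cv2.COLOR_BGR2HSV)
    return cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])

def visual_distance(hist_p, hist_q):
    """Bhattacharyya distance D_sal(p, q) between the histograms of frames p and q."""
    return cv2.compareHist(hist_p, hist_q, cv2.HISTCMP_BHATTACHARYYA)
```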

Specifically, as shown in Fig. 1, online clustering for static shots is implemented through the following steps:

Initialization: compute the hue-saturation histogram of the first frame of the static shot, set the initial number of cells to N = 1, and take this histogram as the vector of the centroid C_1 of cell Cell_1, C_1 = f_1.

S11: If the current frame p belongs to a static shot, compute the hue-saturation histogram H_p of the current frame.

S12: Compute the visual distance between p and the centroid of each cell, and find the cell Cell_m with the smallest visual distance, where m is the index of that cell.

S13: Compare D_sal(p, C_m) with the threshold ε_c. If D_sal(p, C_m) ≤ ε_c, assign p to cell Cell_m and then replace the centroid of Cell_m with H_p. Otherwise, add a new cell Cell_{N+1}, take H_p as the vector of its centroid C_{N+1}, and finally update the number of cells, N = N + 1.

S14: Repeat S11, S12, and S13 for all static shot frames.
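
A minimal sketch of the online clustering of steps S11–S14 for one static shot, reusing the `visual_distance` helper sketched above. Returning the first frame of each cell as its key frame is one admissible reading of "any frame in the cluster".

```python
def cluster_static_shot(frame_hists, eps_c):
    """Online clustering of a static shot's frames (steps S11-S14).

    `frame_hists` is the list of hue-saturation histograms of the shot's
    frames; `eps_c` is the clustering threshold epsilon_c.  Each cell keeps
    the histogram of its most recent member as its centroid, as in S13.
    """
    centroids = [frame_hists[0]]              # C1 = histogram of the first frame
    members = [[0]]                           # frame indices belonging to each cell
    for idx, hist in enumerate(frame_hists[1:], start=1):
        dists = [visual_distance(hist, c) for c in centroids]
        m = min(range(len(dists)), key=dists.__getitem__)
        if dists[m] <= eps_c:
            members[m].append(idx)
            centroids[m] = hist               # replace the centroid of Cell_m with H_p
        else:
            centroids.append(hist)            # open a new cell Cell_{N+1}
            members.append([idx])
    # Any frame of a cell may serve as the key frame; take the first one here.
    return [cell[0] for cell in members]
```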

Specifically, as shown in Fig. 2, key frame extraction for dynamic shots is implemented through the following steps:

Initialization: obtain the first frame of the dynamic shot.

S21: Obtain the tracked target region, initialize the particles or resample, extract the next frame of the video, and check whether the frame is empty; if it is empty, terminate.

S22: Obtain the FAST feature vectors, match them with the FLANN algorithm, and update the feature vector weights; if there are not enough feature vectors, terminate.

S23: Update the weight of each particle, compute the key frame weights and the target region, and jump back to S21.

The key frame extraction system based on a visual attention model disclosed by the present invention includes a salient region extraction module and a key frame extraction module.

The salient region extraction module includes:

a spatial salient region extraction module, configured to extract salient regions in the spatial domain;

a temporal key point saliency acquisition module, configured to extract the saliency values of key points in the temporal domain;

a fusion module, configured to fuse the salient regions in the spatial domain with the key points in the temporal domain and finally obtain the salient regions.

The key frame extraction module includes:

a static shot key frame extraction module, configured to extract key frames from static shots;

a dynamic shot key frame extraction module, configured to extract key frames from dynamic shots;

a shot adaptation module, configured to control switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. It should be understood that the present invention is not limited to the implementations described here; they are described to help those skilled in the art practice the invention. Those skilled in the art can easily make further improvements and refinements without departing from the spirit and scope of the present invention, and therefore any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (8)

1. A key frame extraction method based on a visual attention model, for extracting key frames from a video, characterized by comprising:

in the spatial domain, performing saliency detection by filtering global contrast with binomial coefficients, and extracting the target region with an adaptive threshold;

in the temporal domain, defining a motion saliency, estimating the target motion through homography matrices, detecting saliency on key points instead of the target, fusing the spatial saliency data, and obtaining a bounding box as the temporal salient target region through an energy-function-based boundary expansion;

reducing the richness of the video through the salient target regions, and extracting key frames with a shot-adaptive method combined with online clustering;

wherein, in the spatial domain, the saliency detection by filtering global contrast with binomial coefficients and the extraction of the target region with the adaptive threshold are performed as follows:

(11) the binomial coefficients are constructed according to Pascal's (Yang Hui's) triangle, and the normalization factor of the N-th layer is 2^N; the fourth layer is selected, so the filter coefficients are B_4 = (1/16)[1 4 6 4 1];

(12) let I be the original stimulus intensity, \bar{I} the mean of the surrounding stimulus intensities, and I_{B_4} the convolution of I with B_4; each pixel is represented as a vector in the CIELAB color space to measure stimulus strength, and the contrast between stimuli is the Euclidean distance between two CIELAB vectors, so the stimulus detection at pixel (x, y) is

S(x, y) = \| I_{B_4}(x, y) - \bar{I} \|    (1)

(13) after the saliency measurement set S_s = (s_{11}, s_{12}, ..., s_{NM}) is obtained, the target region is extracted with an adaptive threshold, where s_{ij} is the saliency of pixel (i, j), 0 ≤ i ≤ N, 0 ≤ j ≤ M, and M and N are the width and height of the image, respectively; the extraction of the target region with the adaptive threshold is implemented as follows:

(21) define the global saliency detection of pixel (x, y) as

S_g(x, y) = \frac{1}{A} \sum_{i=0}^{N} \sum_{j=0}^{M} \| I_{B_4}(x, y) - I(i, j) \|    (2)

where A is the detection area, I_{B_4}(x, y) is the stimulus intensity of pixel (x, y) after the original image is filtered by B_4, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;

(22) the computation is accelerated with a histogram, the original stimulus intensity I is mapped into the stimulus space I_{B_4}(I), and the saliency of the stimulus perceived by the user is

S(I_{B_4}(I)) = \frac{1}{(m-1) D(I_{B_4}(I))} \sum_{i=1}^{m} \left( D(I_{B_4}(I)) - \| I_{B_4}(I) - I_{B_4}(I_i) \| \right) S_g(I_{B_4}(I))    (3)

where D(I_{B_4}(I)) is the distance between the stimulus I_{B_4}(I) and its m nearest stimuli;

(23) the foreground and background regions are specified by varying the threshold T_s, and the threshold yielding the minimum energy is taken as the optimal threshold; the energy function with threshold T_s is defined as

E(I, T_s, \lambda, \sigma) = \lambda \sum_{n=1}^{N} ( f(T_s, S_n) S_n ) + V(I, T_s, \sigma)    (4)

where S_n is obtained from formula (2), λ is the weight of the salient target energy, N is the total number of pixels of the image, f(T_s, S_n) = max(0, sign(S_n - T_s)), V(I, T_s, σ) measures the similarity to the surrounding stimuli and is computed over point pairs formed by each salient point under the current T_s and the pixels in its 8-neighborhood, dist(p, q) is the spatial distance between two points, and σ is a control parameter.

2. The method according to claim 1, characterized in that, in the temporal domain, the motion saliency is defined, the target motion is estimated through homography matrices, key points are used instead of the target for saliency detection, the spatial saliency data is then fused, and a bounding box is obtained as the temporal salient target region through an energy-function-based boundary expansion, as follows:

(31) given an image, the key points of the image are obtained with the FAST feature point detection algorithm, which has good real-time performance;

(32) given two adjacent frames, FLANN is used for fast matching of corresponding points;

(33) multiple homography matrices H are used to describe the motion of the key points; the RANSAC algorithm is applied iteratively to obtain a series of homography estimates H = {H_1, H_2, ..., H_n};

(34) the temporal saliency of a key point is defined as

S_t(p_m) = \frac{A_m}{W \times H} \sum_{i=1}^{n} A_i D(p_m, H_i)    (5)

where A_m is the distribution area of all key points in motion state H_m, and W and H are the width and height of the video image;

(35) a bounding box is obtained as the temporal salient target region with the energy-function-based boundary expansion method.

3. The method according to claim 2, characterized in that the fusion of the spatial saliency values with the temporal saliency values of the acquired key points is implemented as follows:

(41) define a motion saliency contrast, where the temporal saliency value S_t of a key point is obtained from formula (5) and \bar{S}_t is the mean of the temporal saliency values of the key points;

(42) let p_i be the i-th key point of S_t; then p_i should satisfy a constraint defined with respect to \bar{S}_s, the mean spatial saliency value;

(43) define a temporal weight and a spatial weight, and add the temporal and spatial saliency values of the key points satisfying step (42) according to these weights.

4. The method according to claim 2, characterized in that the extraction of the temporal salient target region is implemented as follows:

a salient key point p in the spatial domain is taken as the seed point, and the seed region is a rectangular bounding box B; let b_i be the four edges of bounding box B, with i ∈ {1, 2, 3, 4} indexing top, bottom, left, and right; the boundary expansion algorithm is as follows:

initialization: the top, bottom, left, and right vertices of bounding box B are all set to the position of key point p, and p is an interior point of B;

step 1: for i = 1 onward in increasing order, compute the saliency energy E_outer(i) on the outer boundary of b_i and the saliency energy E_inner(i) on the inner boundary, the energy function being computed as in formula (4); then compute the expansion weight w(i) of the boundary, where l_i is the length of the i-th edge of the current bounding box B;

step 2: if w(i) ≥ ε, the i-th edge is expanded outward by one pixel unit; ε is the preset threshold for the expansion decision, set to 0.8·T_s′, where T_s′ is the mean spatial saliency inside the bounding box;

step 3: if no new edge was expanded in step 2, stop the algorithm and output bounding box B; otherwise, repeat step 1 and step 2.

5. The method according to claim 1, characterized in that the richness of the video is reduced through the salient target regions and key frames are extracted with a shot-adaptive method combined with online clustering, as follows:

(51) the RGB color space of the salient region is converted to the HSV color space, the H and S components are used to compute a hue-saturation histogram, H_p(i) denotes the i-th bin value of the hue-saturation histogram of the salient target region of frame p, and the Bhattacharyya distance is used to measure the visual distance D_sal(p, q) between frames p and q;

(52) key frame extraction uses a shot-adaptive method combined with online clustering, with the clustering scheme for static shots as the primary mode and the clustering scheme for dynamic shots as a supplement;

for static shots, online clustering is performed based on the hue-saturation histogram of the salient region, and an arbitrary frame in each cluster is selected as a key frame;

for dynamic shots, the salient moving target is first tracked; the tracking of the salient moving target then serves as the basis for online clustering, and the position information of the salient target serves as the basis for extracting key frames from the clusters.

6. The method according to claim 5, characterized in that online clustering for static shots is implemented through the following steps:

initialization: compute the hue-saturation histogram of the first frame of the static shot, set the initial number of cells to N = 1, and take this histogram as the vector of the centroid C_1 of cell Cell_1, C_1 = f_1;

S11: if the current frame p belongs to a static shot, compute the hue-saturation histogram H_p of the current frame;

S12: compute the visual distance between p and the centroid of each cell, and find the cell Cell_m with the smallest visual distance, where m is the index of that cell;

S13: compare D_sal(p, C_m) with the threshold ε_c; if D_sal(p, C_m) ≤ ε_c, assign p to cell Cell_m and replace the centroid of Cell_m with H_p; otherwise, add a new cell Cell_{N+1}, take H_p as the vector of its centroid C_{N+1}, and finally update the number of cells, N = N + 1;

S14: repeat S11, S12, and S13 for all static shot frames.

7. The method according to claim 5, characterized in that key frame extraction for dynamic shots is implemented through the following steps:

initialization: obtain the first frame of the dynamic shot;

S21: obtain the tracked target region, initialize the particles or resample, extract the next frame of the video, and check whether the frame is empty; if it is empty, terminate;

S22: obtain the FAST feature vectors, match them with the FLANN algorithm, and update the feature vector weights; if there are not enough feature vectors, terminate;

S23: update the weight of each particle, compute the key frame weights and the target region, and jump back to S21.

8. A system applying the key frame extraction method based on a visual attention model according to any one of claims 1 to 7, characterized by comprising a salient region extraction module and a key frame extraction module;

the salient region extraction module comprises:

a spatial salient region extraction module, configured to extract salient regions in the spatial domain;

a temporal key point saliency acquisition module, configured to extract the saliency values of key points in the temporal domain;

a fusion module, configured to fuse the salient regions in the spatial domain with the key points in the temporal domain and finally obtain the salient regions;

the key frame extraction module comprises:

a static shot key frame extraction module, configured to extract key frames from static shots;

a dynamic shot key frame extraction module, configured to extract key frames from dynamic shots;

a shot adaptation module, configured to control switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.
CN201410039072.7A 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system Expired - Fee Related CN103824284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Publications (2)

Publication Number Publication Date
CN103824284A CN103824284A (en) 2014-05-28
CN103824284B true CN103824284B (en) 2017-05-10

Family

ID=50759326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410039072.7A Expired - Fee Related CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Country Status (1)

Country Link
CN (1) CN103824284B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598908B (en) * 2014-09-26 2017-11-28 浙江理工大学 A kind of crops leaf diseases recognition methods
CN104778721B (en) * 2015-05-08 2017-08-11 广州小鹏汽车科技有限公司 The distance measurement method of conspicuousness target in a kind of binocular image
CN105472380A (en) * 2015-11-19 2016-04-06 国家新闻出版广电总局广播科学研究院 Compression domain significance detection algorithm based on ant colony algorithm
CN106210444B (en) * 2016-07-04 2018-10-30 石家庄铁道大学 Motion state self adaptation key frame extracting method
CN107967476B (en) * 2017-12-05 2021-09-10 北京工业大学 Method for converting image into sound
CN110197107B (en) * 2018-08-17 2024-05-28 平安科技(深圳)有限公司 Micro-expression recognition method, micro-expression recognition device, computer equipment and storage medium
CN110322474B (en) * 2019-07-11 2021-06-01 史彩成 Image moving target real-time detection method based on unmanned aerial vehicle platform
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN111191650B (en) * 2019-12-30 2023-07-21 北京市新技术应用研究所 Article positioning method and system based on RGB-D image visual saliency
CN111493935B (en) * 2020-04-29 2021-01-15 中国人民解放军总医院 Method and system for automatic prediction and recognition of echocardiography based on artificial intelligence
CN112418012B (en) * 2020-11-09 2022-06-07 武汉大学 A video summary generation method based on spatiotemporal attention model
CN114399729B (en) * 2021-12-20 2025-03-25 山东鲁软数字科技有限公司 Monitoring object movement identification method, system, terminal and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263660B2 (en) * 2002-03-29 2007-08-28 Microsoft Corporation System and method for producing a video skim

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2207111A1 (en) * 2009-01-08 2010-07-14 Thomson Licensing SA Method and apparatus for generating and displaying a video abstract
CN102088597A (en) * 2009-12-04 2011-06-08 成都信息工程学院 Method for estimating video visual salience through dynamic and static combination
CN102695056A (en) * 2012-05-23 2012-09-26 中山大学 Method for extracting compressed video key frames

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Efficient visual attention based framework for extracting key frames from videos; Naveed Ejaz et al.; Signal Processing: Image Communication; 2012-10-17; pp. 34-44 *
Visual attention detection in video sequences using spatiotemporal cues; Yun Zhai et al.; Proceedings of the 14th ACM International Conference on Multimedia; 2006-10-31; pp. 816-821, sections 1.2-4 *
Adaptive video key frame extraction based on a visual attention model (基于视觉注意模型的自适应视频关键帧提取); Jiang Peng et al.; Journal of Image and Graphics (中国图象图形学报); 2009-08-31; vol. 14, no. 8; pp. 1651-1653, sections 2-4 *

Also Published As

Publication number Publication date
CN103824284A (en) 2014-05-28

Similar Documents

Publication Publication Date Title
CN103824284B (en) Key frame extraction method based on visual attention model and system
CN103258193B (en) A kind of group abnormality Activity recognition method based on KOD energy feature
CN110059581A (en) People counting method based on depth information of scene
CN102256065B (en) Automatic video condensing method based on video monitoring network
US9626585B2 (en) Composition modeling for photo retrieval through geometric image segmentation
CN103853724B (en) multimedia data classification method and device
CN103530638B (en) Method for pedestrian matching under multi-cam
CN106845621A (en) Dense population number method of counting and system based on depth convolutional neural networks
CN104599275A (en) Understanding method of non-parametric RGB-D scene based on probabilistic graphical model
CN105701467A (en) Many-people abnormal behavior identification method based on human body shape characteristic
CN103577875A (en) CAD (computer-aided design) people counting method based on FAST (features from accelerated segment test)
CN102034267A (en) Three-dimensional reconstruction method of target based on attention
Xu et al. Weakly supervised deep semantic segmentation using CNN and ELM with semantic candidate regions
Yang et al. The large-scale crowd density estimation based on sparse spatiotemporal local binary pattern
CN103309982A (en) Remote sensing image retrieval method based on vision saliency point characteristics
CN103400155A (en) Pornographic video detection method based on semi-supervised learning of images
CN108961385B (en) SLAM composition method and device
CN106815576A (en) Target tracking method based on consecutive hours sky confidence map and semi-supervised extreme learning machine
CN109034258A (en) Weakly supervised object detection method based on certain objects pixel gradient figure
Ma et al. Scene invariant crowd counting using multi‐scales head detection in video surveillance
CN104200235A (en) Time-space local feature extraction method based on linear dynamic system
CN103500456B (en) A kind of method for tracing object based on dynamic Bayesian network network and equipment
CN105809673A (en) SURF (Speeded-Up Robust Features) algorithm and maximal similarity region merging based video foreground segmentation method
Roy et al. A comprehensive survey on computer vision based approaches for moving object detection
Sun et al. Unsupervised fast anomaly detection in crowds

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510