
CN103618900A - Video region-of-interest extraction method based on encoding information - Google Patents

Video region-of-interest extraction method based on encoding information

Info

Publication number
CN103618900A
Authority
CN
China
Prior art keywords
frame
mode
current
macroblock
coded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310591430.0A
Other languages
Chinese (zh)
Other versions
CN103618900B (en)
Inventor
刘鹏宇
贾克斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Hongyi Environmental Protection Technology Co ltd
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201310591430.0A priority Critical patent/CN103618900B/en
Publication of CN103618900A publication Critical patent/CN103618900A/en
Application granted granted Critical
Publication of CN103618900B publication Critical patent/CN103618900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical



Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a method for extracting video regions of interest based on visual perception features and coding information, and relates to the field of video coding. The method comprises the following steps: first, the luminance information of the current coded macroblock is extracted from the original video stream; next, the inter-frame prediction mode type of the current coded macroblock is used to identify regions of spatial visual feature saliency; then, taking the average horizontal and vertical motion vectors of the previous frame's coded macroblocks as dynamic dual thresholds, regions of temporal visual feature saliency are identified by comparing the horizontal and vertical motion vectors of the current coded macroblock against these thresholds; finally, the spatial and temporal saliency labels are combined to define video interest priorities, realizing automatic extraction of video regions of interest. The method can provide an important coding basis for ROI-based (Region of Interest) video coding techniques.

Description

Extraction method of video region of interest based on coding information

Technical Field

The invention belongs to the field of video information processing. It uses video coding techniques and the principles of human visual perception to realize a fast method for extracting video regions of interest. The method automatically analyzes an input video stream and uses its coding information to label and output the regions of interest in the video.

Background Art

The latest video coding standard, H.264/AVC, adopts a variety of advanced coding techniques. While these improve coding performance, they also sharply increase coding complexity, which limits the standard's wide application in multimedia information processing and real-time communication services. How to increase H.264/AVC encoding speed has been studied in depth, and a large number of fast encoding optimization algorithms have been proposed. Most of these algorithms, however, do not distinguish the visual importance of the different regions of a video image; they apply the same coding scheme to all content, ignoring the differences in how the human visual system (HVS) perceives video scenes.

Visual neuroscience research has demonstrated that the HVS perceives video scenes selectively, assigning different visual importance to different regions. Analyzing visual perception features from existing coding information, and then preferentially allocating computing resources to regions of interest according to those features, therefore has important theoretical and practical value for improving the real-time performance of video coding algorithms and reducing their computational complexity. Fast and effective visual feature analysis, especially effective detection of visually interesting regions, is an important foundation for optimizing coding resources and designing efficient video coding schemes.

Summary of the Invention

Unlike existing video moving-object extraction methods such as optical flow, frame differencing, motion energy detection, and background subtraction, the present invention is based on coding information in the video bitstream, such as prediction modes and motion vectors. Using the correlation between this coding information and visually interesting regions, it identifies regions of spatial visual feature saliency and regions of temporal visual feature saliency in the coded video content, thereby realizing automatic labeling and acquisition of video regions of interest.

According to the characteristics of the HVS, the human eye is more sensitive to luminance information than to chrominance information; the method of the present invention therefore operates on the coding information of the luminance component of the video sequence to automatically label and acquire video regions of interest.

The method of the present invention specifically comprises the following steps:

Step 1: Input a video sequence in YUV format with a GOP (Group of Pictures) structure of IPPP, read the luminance component Y of the coded macroblocks, configure the encoding parameters, and initialize parameters;

Step 2: Perform intra-frame predictive coding on the first frame of the video sequence, i.e. the I frame;

In the video coding standard, the I frame serves as a random-access reference point and contains a large amount of information. Since it cannot exploit the temporal correlation between adjacent frames, intra-frame predictive coding is used: the coding information of already-encoded and reconstructed macroblocks in the current frame is used to predict the current macroblock, eliminating spatial redundancy. Intra-coding the first frame of a video sequence, i.e. the I frame, is a conventional practice in video coding.

Step 3: Perform inter-frame predictive coding on the current P frame, exploiting the correlation between the content of adjacent frames to eliminate temporal redundancy. Record the inter prediction mode types of all coded macroblocks in the current frame, denoted Mode_pn;

where p = 1, 2, 3, …, L-1 denotes the p-th inter-coded video frame, L is the total number of frames encoded in the entire video sequence, and n is the index of the n-th coded macroblock in the current frame.

Step 4: Identify the regions of spatial visual feature saliency in the current P frame. Specifically, if the inter prediction mode Mode_pn of the current coded macroblock belongs to the sub-partition mode set or the intra prediction mode set, i.e. Mode_pn ∈ {8×8, 8×4, 4×8, 4×4} or {Intra16×16, Intra4×4}, mark the macroblock as S_Yp(x, y, Mode_pn) = 1, belonging to a region of spatial visual feature saliency; otherwise mark S_Yp(x, y, Mode_pn) = 0. Here Y denotes the luminance component of the coded macroblock, (x, y) the position coordinates of the macroblock, and p and Mode_pn are defined as above. Traverse all coded macroblocks in the current P frame;

FIG. 1 shows a schematic diagram of the inter prediction mode selection process in the H.264 standard.

Experiments show that in H.264/AVC encoding there is a strong correlation between the prediction results and the regions the human eye attends to. For moving or texture-rich regions that attract high attention, Mode_pn mostly falls in the sub-partition mode set {8×8, 8×4, 4×8, 4×4}. At shot changes, abrupt changes in video content, or the appearance of fast-moving objects, human attention is highest, and only then does Mode_pn select the intra prediction mode set {Intra16×16, Intra4×4}. For smooth background regions with low attention, Mode_pn mostly selects the macroblock partition mode set {Skip, 16×16, 16×8, 8×16}. FIG. 2 takes the Claire sequence as an example and shows the inter prediction mode distribution of its 50th frame; in the regions with high human attention, most coded macroblocks select the inter sub-partition prediction modes.
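As a minimal illustration (not the patent's implementation), the mode-based spatial saliency test of Step 4 can be sketched as follows; the string mode names are stand-ins for whatever mode identifiers the encoder exposes:

```python
# Hypothetical sketch of Step 4: spatial saliency from the prediction mode.
SUB_PARTITION_MODES = {"8x8", "8x4", "4x8", "4x4"}
INTRA_MODES = {"Intra16x16", "Intra4x4"}

def spatial_saliency(mode: str) -> int:
    """Return S_Yp = 1 if the macroblock's prediction mode indicates a
    spatially salient region (sub-partition or intra mode), else 0."""
    return 1 if mode in SUB_PARTITION_MODES or mode in INTRA_MODES else 0
```

Macroblocks coded with {Skip, 16×16, 16×8, 8×16} fall through to 0, matching the smooth-background case described above.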

Step 5: Record the horizontal motion vector V_xpn and the vertical motion vector V_ypn of every coded macroblock in the p-th frame, and compute the average horizontal motion vector $\overline{V}_{x(p-1)th}$ and the average vertical motion vector $\overline{V}_{y(p-1)th}$ of all coded macroblocks in the previous coded frame:

$$\overline{V}_{x(p-1)th} = \frac{1}{Num}\sum_{n=1}^{Num} V_{x(p-1)n}, \qquad \overline{V}_{y(p-1)th} = \frac{1}{Num}\sum_{n=1}^{Num} V_{y(p-1)n}$$

where V_x(p-1)n and V_y(p-1)n are the horizontal and vertical motion vectors of each coded macroblock in the previous coded frame; p and n are defined as in Step 3; and Num is the number of macroblocks contained in a coded frame, i.e. the number of accumulated terms. FIG. 3 takes QCIF-format video (176×144) as an example and shows the positions and indices n of all coded macroblocks (16×16) in a coded frame; in this case Num = (176/16) × (144/16) = 11 × 9 = 99.
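A minimal sketch of this per-frame averaging, assuming the encoder exposes each macroblock's motion vector as a numeric pair:

```python
# Hypothetical sketch of Step 5: per-frame average motion vectors, used as
# the dynamic dual thresholds when processing the next frame.
def average_motion_vectors(mvs):
    """mvs: list of (Vx, Vy) pairs, one per coded macroblock of a frame.
    Returns (mean_Vx, mean_Vy)."""
    num = len(mvs)  # e.g. 99 for QCIF: (176 // 16) * (144 // 16)
    mean_x = sum(vx for vx, _ in mvs) / num
    mean_y = sum(vy for _, vy in mvs) / num
    return mean_x, mean_y
```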

Step 6: Identify the regions of temporal visual feature saliency in the current P frame. Specifically, if the horizontal motion vector V_xpn of the current coded macroblock is greater than the previous frame's average horizontal motion vector $\overline{V}_{x(p-1)th}$, or its vertical motion vector V_ypn is greater than the previous frame's average vertical motion vector $\overline{V}_{y(p-1)th}$, then the macroblock belongs to a region of temporal visual feature saliency and is marked T_Yp(x, y, V_xpn, V_ypn) = 1; otherwise it is marked T_Yp(x, y, V_xpn, V_ypn) = 0. Traverse all coded macroblocks in the current P frame;

where Y denotes the luminance component of the coded macroblock, (x, y) the position coordinates of the macroblock, and p is defined as in Step 3.
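The dynamic dual-threshold test of Step 6 amounts to a single comparison per macroblock; a sketch with illustrative names, not the patent's code:

```python
# Hypothetical sketch of Step 6: temporal saliency against the dynamic dual
# thresholds (the previous frame's average motion vector components).
def temporal_saliency(vx, vy, mean_vx_prev, mean_vy_prev):
    """T_Yp = 1 if either motion component strictly exceeds the previous
    frame's average in that direction, else 0."""
    return 1 if vx > mean_vx_prev or vy > mean_vy_prev else 0
```

Note the comparison is strict, so a macroblock moving exactly at the frame average is not marked as temporally salient.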

Motion perception is one of the most important visual processing mechanisms of the human visual system. Experiments show that coded content with large motion vectors corresponds exactly to the moving regions the human eye is interested in (such as heads, arms, and people), while coded content with small or zero motion vectors corresponds to static background regions that receive little attention. FIG. 4 takes the Akiyo sequence as an example and shows the motion vector distribution of its 50th frame; in the face and head-and-shoulder regions with high human attention, the coded macroblocks usually have larger motion vectors.

The setting of the decision thresholds strongly affects whether the motion of the current coded macroblock is judged as strong. To reduce the misclassification rate, the present invention takes the horizontal and vertical motion decision thresholds to be $\overline{V}_{x(p-1)th}$ and $\overline{V}_{y(p-1)th}$, where $\overline{V}_{x(p-1)th}$ is the average horizontal motion vector of all coded macroblocks in the previous frame and $\overline{V}_{y(p-1)th}$ is the average vertical motion vector of all coded macroblocks in the previous frame. This dynamic threshold setting fully exploits the temporal correlation of the video sequence: the thresholds track the changes in the previous frame's average macroblock motion vectors, effectively reducing misclassification and allowing regions of temporal visual feature saliency to be obtained quickly and accurately.

Step 7: Label the video regions of interest in the current P frame. Specifically, traverse all coded macroblocks in the current P frame and label each one according to its spatial and temporal visual feature saliency, using the following rule:

$$ROI_{Yp}(x,y)=\begin{cases}3, & S_{Yp}(x,y,Mode_{pn})=1 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=1\\ 2, & S_{Yp}(x,y,Mode_{pn})=0 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=1\\ 1, & S_{Yp}(x,y,Mode_{pn})=1 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=0\\ 0, & S_{Yp}(x,y,Mode_{pn})=0 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=0\end{cases}$$

The video regions of interest are labeled according to the following cases:

If the current coded macroblock has both spatial and temporal visual feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 1, the macroblock is both rich in texture detail and has a large motion vector, so human interest is highest; label ROI_Yp(x, y) = 3.

If it has only temporal but not spatial visual feature saliency, i.e. T_Yp(x, y, V_xpn, V_ypn) = 1 and S_Yp(x, y, Mode_pn) = 0, the macroblock has a large motion vector. Since, according to the perceptual characteristics of the HVS, the human eye is highly sensitive to object motion, interest is second highest; label ROI_Yp(x, y) = 2.

If the macroblock has little motion and thus no temporal visual feature saliency, but has rich texture information and therefore only spatial visual feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 0, interest ranks third; label ROI_Yp(x, y) = 1.

If it has neither spatial nor temporal visual feature saliency, i.e. S_Yp(x, y, Mode_pn) = 0 and T_Yp(x, y, V_xpn, V_ypn) = 0, the macroblock has flat texture and little or no motion, usually corresponding to a static background region. It is then a non-interest region with the lowest human attention; label ROI_Yp(x, y) = 0.

Here ROI_Yp(x, y) denotes the visual interest priority of the current coded macroblock; T_Yp(x, y, V_xpn, V_ypn) its temporal visual feature saliency; S_Yp(x, y, Mode_pn) its spatial visual feature saliency; (x, y) the position coordinates of the macroblock; Y the luminance component of the macroblock; p the index of the p-th inter-coded video frame; and n the index of the n-th coded macroblock in the current frame.
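The four cases above can be sketched as a small function (illustrative only, assuming the 0/1 saliency flags defined in Steps 4 and 6):

```python
# Hypothetical sketch of Step 7: combining the spatial flag S_Yp and the
# temporal flag T_Yp into an interest priority ROI_Yp in {0, 1, 2, 3}.
def roi_priority(s: int, t: int) -> int:
    if s == 1 and t == 1:
        return 3  # texture-rich and moving: highest interest
    if s == 0 and t == 1:
        return 2  # moving only: motion dominates attention
    if s == 1 and t == 0:
        return 1  # texture-rich only
    return 0      # flat, static background: non-interest
```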

Step 8: Output the video bitstream. Specifically, according to the labeled interest priority ROI_Yp(x, y), process the luminance component Y of all macroblocks in the current P frame as follows, and output the labeled video stream:

$$Y_p(x,y)=\begin{cases}255, & ROI_{Yp}(x,y)=3\\ 150, & ROI_{Yp}(x,y)=2\\ 100, & ROI_{Yp}(x,y)=1\\ 0, & ROI_{Yp}(x,y)=0\end{cases}$$

Since the luminance component of a coded macroblock takes values Y ∈ [0, 255], with 0 to 255 representing the 256 levels from all black to all white, the invention maps the labeled interest priorities ROI_Yp(x, y) onto the luminance component Y as follows and outputs the labeled video stream.

If ROI_Yp(x, y) = 3, interest and human attention are highest; the luminance component of the coded macroblock is set to 255, the highest output value, i.e. Y_p(x, y) = 255;

If ROI_Yp(x, y) = 2, interest is second highest and human attention is relatively high; the luminance component is set to 150, a relatively high output value, i.e. Y_p(x, y) = 150;

If ROI_Yp(x, y) = 1, interest ranks third and human attention is relatively low; the luminance component is set to 100, a relatively low output value, i.e. Y_p(x, y) = 100;

If ROI_Yp(x, y) = 0, the macroblock is a non-interest region with the lowest human attention; the luminance component is set to 0, the lowest output value, i.e. Y_p(x, y) = 0.
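A minimal sketch of this priority-to-luminance mapping, applied per macroblock over a whole frame (the 2-D list layout is an assumption for illustration):

```python
# Hypothetical sketch of Step 8: overwrite each macroblock's luminance with a
# gray level encoding its interest priority, producing the labeled frame.
ROI_TO_LUMA = {3: 255, 2: 150, 1: 100, 0: 0}

def mark_frame(roi_map):
    """roi_map: 2-D list of ROI_Yp priorities, one entry per macroblock.
    Returns the per-macroblock luminance values Y_p of the labeled frame."""
    return [[ROI_TO_LUMA[level] for level in row] for row in roi_map]
```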

Step 9: Return to Step 3 and process the next frame, until the entire video sequence has been traversed.

FIG. 5 shows the flow chart of the video region-of-interest labeling and extraction method.

FIG. 6 shows the labeled region-of-interest output for typical video sequences.

Beneficial Effects

The method realizes fast extraction of video regions of interest from basic coding information. Using the correlation between basic coding information and the regions the human eye is interested in, it identifies regions of spatial visual feature saliency and regions of temporal visual feature saliency in the coded video content, then combines the two labeling results to define video interest priorities, finally realizing automatic extraction of video regions of interest. The method can provide an important coding basis for ROI-based (Region of Interest) video coding techniques.

Brief Description of the Drawings

FIG. 1. Schematic diagram of the inter prediction mode selection process in the H.264 standard;

FIG. 2. Inter prediction mode distribution of the 50th frame of the Claire sequence;

FIG. 3. Positions and indices of the coded macroblocks in a video frame;

FIG. 4. Motion vector distribution of the 50th frame of the Akiyo sequence;

FIG. 5. Flow chart of the method of the present invention;

FIG. 6. Output of labeling video regions of interest with the method of the present invention.

Detailed Description of the Embodiments

Since the human eye is more sensitive to luminance than to chrominance information, the method of the present invention encodes the luminance component of the video frames. The video sequence is first read in, its luminance component is extracted, and the region-of-interest extraction module of the present invention is invoked to complete the automatic labeling and extraction of the regions of interest.

In an implementation of the present invention, a video capture device (such as a digital camera) acquires the video images and transmits them to a computer, where the regions of interest are automatically labeled from the coding information in the video bitstream. Regions of spatial visual feature saliency are identified from the predictive coding mode of the current coded macroblock; regions of temporal visual feature saliency are then identified from its horizontal and vertical motion vectors, with dynamic motion-vector decision thresholds reducing the impact of different video motion types on extraction accuracy; finally, the video interest classification is obtained from the spatial/temporal saliency labels, realizing automatic extraction of the video regions of interest.

Specifically, the following procedure is carried out on the computer:

Step 1: Read in the video sequence according to the encoder configuration file encoder.cfg and configure the encoder with its parameters, for example: bitstream structure GOP = IPPP…; number of encoded frames FramesToBeEncoded = 100; frame rate FrameRate = 30 f/s; video width SourceWidth = 176 and height SourceHeight = 144; output file name OutputFile = ROI.264; quantization step sizes QPISlice = 28 and QPPSlice = 28; motion estimation search range SearchRange = ±16; number of reference frames NumberReferenceFrames = 5; rate-distortion cost function enabled, RDOptimization = on; entropy coding type SymbolMode = CAVLC. Initialize the parameters L = number of encoded frames and p = 1;

Step 2: Read the luminance component values Y of the coded macroblocks frame by frame, in order, from the input video sequence;

Step 3: Perform intra-frame predictive coding on the first frame of the video sequence, i.e. the I frame;

Step 4: Perform inter-frame predictive coding on the current P frame and record the inter prediction mode type Mode_pn of each coded macroblock, where p = 1, 2, 3, …, L-1 denotes the p-th inter-coded video frame, L is the total number of frames encoded in the entire video sequence, and n is the index of the n-th coded macroblock in the current frame.

Step 5: Identify the regions of spatial visual feature saliency. If the inter prediction mode Mode_pn of the current coded macroblock belongs to the sub-partition mode set or the intra prediction mode set, Mode_pn ∈ {8×8, 8×4, 4×8, 4×4} or {Intra16×16, Intra4×4}, mark the macroblock as S_Yp(x, y, Mode_pn) = 1, belonging to a region of spatial visual feature saliency; otherwise mark S_Yp(x, y, Mode_pn) = 0;

$$S_{Yp}(x,y,Mode_{pn})=\begin{cases}1, & Mode_{pn}\in\{8\times 8,8\times 4,4\times 8,4\times 4\}\ \text{or}\ \{Intra16\times 16,Intra4\times 4\}\\ 0, & \text{else}\end{cases}$$

Step 6: If p ≠ 1, record the horizontal motion vector V_xpn and the vertical motion vector V_ypn of every coded macroblock in the p-th frame, and compute the average horizontal motion vector $\overline{V}_{x(p-1)th}$ and the average vertical motion vector $\overline{V}_{y(p-1)th}$ of all coded macroblocks in the previous coded frame; otherwise, jump to Step 10;

Step 7: Identify the regions of temporal visual feature saliency. If the horizontal motion vector V_xpn of the current coded macroblock is greater than the previous frame's average horizontal motion vector $\overline{V}_{x(p-1)th}$, or its vertical motion vector V_ypn is greater than the previous frame's average vertical motion vector $\overline{V}_{y(p-1)th}$, i.e. if either condition is satisfied, the macroblock belongs to a region of temporal visual feature saliency and is marked T_Yp(x, y, V_xpn, V_ypn) = 1; otherwise it is marked T_Yp(x, y, V_xpn, V_ypn) = 0;

$$T_{Yp}(x,y,V_{xpn},V_{ypn})=\begin{cases}1, & V_{xpn}>\overline{V}_{x(p-1)th}\ \text{or}\ V_{ypn}>\overline{V}_{y(p-1)th}\\ 0, & \text{else}\end{cases}$$

Step 8: Label the video regions of interest.

$$ROI_{Yp}(x,y)=\begin{cases}3, & S_{Yp}(x,y,Mode_{pn})=1 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=1\\ 2, & S_{Yp}(x,y,Mode_{pn})=0 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=1\\ 1, & S_{Yp}(x,y,Mode_{pn})=1 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=0\\ 0, & S_{Yp}(x,y,Mode_{pn})=0 \text{ and } T_{Yp}(x,y,V_{xpn},V_{ypn})=0\end{cases}$$

If the current coded macroblock has both spatial and temporal visual feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 1, human interest is highest; label ROI_Yp(x, y) = 3;

If it has only temporal visual feature saliency, i.e. T_Yp(x, y, V_xpn, V_ypn) = 1 and S_Yp(x, y, Mode_pn) = 0, interest is second highest; label ROI_Yp(x, y) = 2;

If it has only spatial visual feature saliency, i.e. S_Yp(x, y, Mode_pn) = 1 and T_Yp(x, y, V_xpn, V_ypn) = 0, interest ranks third; label ROI_Yp(x, y) = 1;

若既不具有空域视觉特征显著度也不具有时域视觉特征显著度,即SYp(x,y,Modepn)=0并且TYp(x,y,Vxpn,Vypn)=0,则为人眼非感兴趣区域,标记ROIYp(x,y)=0;If there is neither spatial visual feature salience nor temporal visual feature saliency, that is, S Yp (x,y,Mode pn )=0 and T Yp (x,y,V xpn ,V ypn )=0, then For the non-interest area of the human eye, mark the ROI Yp (x,y)=0;
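As a minimal sketch (with illustrative names, not the patent's implementation), the four cases above reduce to a small decision on the two saliency flags:

```python
def roi_priority(s, t):
    """Combine the spatial flag s and temporal flag t into the
    four-level ROI priority described above."""
    if s == 1 and t == 1:
        return 3  # both salient: highest human-eye interest
    if s == 0 and t == 1:
        return 2  # temporal saliency only
    if s == 1 and t == 0:
        return 1  # spatial saliency only
    return 0      # neither: non-region-of-interest

assert [roi_priority(s, t) for s, t in [(1, 1), (0, 1), (1, 0), (0, 0)]] == [3, 2, 1, 0]
```

Note the ordering implied by the priorities: temporal saliency alone outranks spatial saliency alone, reflecting the stronger pull of motion on human attention.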

Step 9: Output the video coding stream.

Y_p(x,y) = 255,  if ROI_Yp(x,y)=3
           150,  if ROI_Yp(x,y)=2
           100,  if ROI_Yp(x,y)=1
             0,  if ROI_Yp(x,y)=0

If ROI_Yp(x,y)=3, the degree of interest and human-eye attention is highest; the luminance component of the coded macroblock is set to 255, giving the highest output luminance, i.e. Y_p(x,y)=255.

If ROI_Yp(x,y)=2, the degree of interest is second highest and human-eye attention is relatively high; the luminance component is set to 150, i.e. Y_p(x,y)=150.

If ROI_Yp(x,y)=1, the degree of interest is third and human-eye attention is relatively low; the luminance component is set to 100, i.e. Y_p(x,y)=100.

If ROI_Yp(x,y)=0, the macroblock is a non-region-of-interest with the lowest human-eye attention; the luminance component is set to 0, giving the lowest output luminance, i.e. Y_p(x,y)=0.
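The output mapping above is a fixed lookup from ROI level to luminance value; a brief sketch (illustrative names) for the visualization stream:

```python
ROI_TO_LUMA = {3: 255, 2: 150, 1: 100, 0: 0}

def output_luma(roi_level):
    """Luminance component written to the output stream for a
    macroblock with the given ROI priority, per Step 9."""
    return ROI_TO_LUMA[roi_level]

assert output_luma(3) == 255  # highest interest: brightest
assert output_luma(0) == 0    # non-ROI: black
```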

Step 10: If p≠L-1, set p=p+1 and jump to Step 3; otherwise, end coding.

A schematic diagram of the output results of marking video regions of interest with the method of the present invention is shown in FIG. 6. Taking a typical video surveillance sequence (Hall) and an indoor activity sequence (Salesman) as examples, the motion vector distribution and the inter-frame prediction mode selection results are used to mark the video region of interest: the higher the human-eye interest in a macroblock, the higher the luminance value at that position in the output video, and vice versa. The marking results in the rightmost column of FIG. 6 show that the region of interest obtained by the method of the present invention is irregular in shape; compared with regions of interest obtained by traditional moving-object detection methods that use fixed-shape templates, the results of the present method are closer to the shape of the target the human eye actually attends to, and thus mark the region of interest more accurately.

The method of the present invention can also be combined with other fast coding techniques: on the premise of guaranteeing the coding quality of regions the human eye is interested in, it reduces the coding complexity of background regions the human eye is not interested in and further shortens coding time. It can also be used in H.264-based scalable coding to realize selective enhancement coding of the region of interest.

Claims (1)

1. A method for extracting a video region of interest based on coding information, characterized by comprising the following steps:

Step 1: Input a video sequence in YUV format with a GOP (Group of Pictures) structure of IPPP, read the luminance component Y of the coded macroblocks, and configure the coding parameters.

Step 2: Perform intra-frame predictive coding on the first frame of the video sequence, i.e. the I frame.

Step 3: Perform inter-frame predictive coding on the current p-th frame and record the inter-frame prediction mode types of all coded macroblocks in it, denoted Mode_pn; p=1,2,3,…,L-1 indexes the p-th inter-coded video frame, L is the total number of coded frames in the video sequence, and n is the sequence number of the n-th coded macroblock in the current coded frame.

Step 4: Identify the spatial-domain visual feature saliency region of the current p-th frame, specifically: if the inter-frame prediction mode Mode_pn of the current coded macroblock belongs to the sub-partition mode set or the intra-frame prediction mode set, i.e. Mode_pn ∈ {8×8, 8×4, 4×8, 4×4} or {Intra16×16, Intra4×4}, mark the macroblock S_Yp(x,y,Mode_pn)=1 as belonging to the spatial saliency region; otherwise mark S_Yp(x,y,Mode_pn)=0. Y denotes the luminance component of the coded macroblock and (x,y) its position coordinates. Traverse all coded macroblocks in the current p-th frame.

Step 5: Record the horizontal motion vector V_xpn and vertical motion vector V_ypn of every coded macroblock in the p-th frame, and compute the average horizontal motion vector V̄_x(p-1) = (1/Num) Σ_{n=1..Num} V_x(p-1)n and the average vertical motion vector V̄_y(p-1) = (1/Num) Σ_{n=1..Num} V_y(p-1)n over all coded macroblocks of the previous coded frame, where Num is the number of macroblocks contained in a coded frame, i.e. the number of accumulation terms.

Step 6: Identify the time-domain visual feature saliency region of the current p-th frame, specifically: if the horizontal motion vector V_xpn of the current coded macroblock is greater than the previous frame's average horizontal motion vector V̄_x(p-1), or its vertical motion vector V_ypn is greater than the previous frame's average vertical motion vector V̄_y(p-1), the macroblock belongs to the temporal saliency region and is marked T_Yp(x,y,V_xpn,V_ypn)=1; otherwise it is marked T_Yp(x,y,V_xpn,V_ypn)=0. Traverse all coded macroblocks in the current p-th frame.

Step 7: Mark the video region of interest of the current p-th frame, specifically: traverse all coded macroblocks in the current p-th frame and mark each according to its spatial and temporal visual feature saliency, using the formula

ROI_Yp(x,y) = 3,  if S_Yp(x,y,Mode_pn)=1 and T_Yp(x,y,V_xpn,V_ypn)=1
              2,  if S_Yp(x,y,Mode_pn)=0 and T_Yp(x,y,V_xpn,V_ypn)=1
              1,  if S_Yp(x,y,Mode_pn)=1 and T_Yp(x,y,V_xpn,V_ypn)=0
              0,  if S_Yp(x,y,Mode_pn)=0 and T_Yp(x,y,V_xpn,V_ypn)=0

that is: if the current coded macroblock has both spatial and temporal saliency (S=1 and T=1), mark ROI_Yp(x,y)=3; if it has only temporal saliency (T=1 and S=0), mark ROI_Yp(x,y)=2; if it has only spatial saliency (S=1 and T=0), mark ROI_Yp(x,y)=1; if it has neither (S=0 and T=0), mark ROI_Yp(x,y)=0.

Step 8: Output the video coding stream, specifically: according to the marked ROI_Yp(x,y) interest priority, process the luminance component Y of all macroblocks in the current p-th frame as follows and output the marked video stream:

Y_p(x,y) = 255,  if ROI_Yp(x,y)=3
           150,  if ROI_Yp(x,y)=2
           100,  if ROI_Yp(x,y)=1
             0,  if ROI_Yp(x,y)=0

Step 9: Return to Step 3 and process the next frame until the entire video sequence has been traversed.
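The spatial-saliency test of Step 4 can likewise be sketched; the mode labels below are illustrative strings standing in for H.264 partition and intra modes, not an actual encoder API.

```python
SUBPARTITION_MODES = {"8x8", "8x4", "4x8", "4x4"}
INTRA_MODES = {"Intra16x16", "Intra4x4"}

def mark_spatial_saliency(mode):
    """Spatial-saliency flag S for one macroblock: 1 when its
    prediction mode is a sub-partition mode or an intra mode
    (fine partitioning suggests texture detail), else 0."""
    return 1 if mode in SUBPARTITION_MODES or mode in INTRA_MODES else 0

assert mark_spatial_saliency("4x4") == 1
assert mark_spatial_saliency("Intra4x4") == 1
assert mark_spatial_saliency("16x16") == 0  # large partition: smooth background
```

The design choice here is that the encoder's own mode decision doubles as a free texture detector: no extra analysis pass is needed, since fine sub-partitions are only chosen where prediction residuals are complex.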
CN201310591430.0A 2013-11-21 2013-11-21 Video area-of-interest exacting method based on coding information Active CN103618900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310591430.0A CN103618900B (en) 2013-11-21 2013-11-21 Video area-of-interest exacting method based on coding information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310591430.0A CN103618900B (en) 2013-11-21 2013-11-21 Video area-of-interest exacting method based on coding information

Publications (2)

Publication Number Publication Date
CN103618900A true CN103618900A (en) 2014-03-05
CN103618900B CN103618900B (en) 2016-08-17

Family

ID=50169604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310591430.0A Active CN103618900B (en) 2013-11-21 2013-11-21 Video area-of-interest exacting method based on coding information

Country Status (1)

Country Link
CN (1) CN103618900B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104079934A (en) * 2014-07-14 2014-10-01 武汉大学 Method for extracting regions of interest in real-time video communication
CN104539962A (en) * 2015-01-20 2015-04-22 北京工业大学 Layered video coding method fused with visual perception features
CN106331711A (en) * 2016-08-26 2017-01-11 北京工业大学 A Dynamic Bit Rate Control Method Based on Network Features and Video Features
CN107371029A (en) * 2017-06-28 2017-11-21 上海大学 Content-based Video Packet Priority Allocation Method
CN107483934A (en) * 2017-08-17 2017-12-15 西安万像电子科技有限公司 Decoding method, device and system
CN107563371A (en) * 2017-07-17 2018-01-09 大连理工大学 The method of News Search area-of-interest based on line laser striation
CN107623848A (en) * 2017-09-04 2018-01-23 浙江大华技术股份有限公司 A kind of method for video coding and device
CN109151479A (en) * 2018-08-29 2019-01-04 南京邮电大学 Significance extracting method based on H.264 compression domain model with feature when sky
CN109379594A (en) * 2018-10-31 2019-02-22 北京佳讯飞鸿电气股份有限公司 Video coding compression method, device, equipment and medium
CN109862356A (en) * 2019-01-17 2019-06-07 中国科学院计算技术研究所 A video coding method and system based on region of interest
CN110572579A (en) * 2019-09-30 2019-12-13 联想(北京)有限公司 image processing method and device and electronic equipment
CN110784716A (en) * 2019-08-19 2020-02-11 腾讯科技(深圳)有限公司 Media data processing method, device and medium
CN111079567A (en) * 2019-11-28 2020-04-28 中科驭数(北京)科技有限公司 Sampling method, model generation method, video behavior identification method and device
WO2021093059A1 (en) * 2019-11-15 2021-05-20 网宿科技股份有限公司 Method, system and device for recognizing region of interest
WO2022127865A1 (en) * 2020-12-18 2022-06-23 中兴通讯股份有限公司 Video processing method, apparatus, electronic device, and storage medium
CN115550536A (en) * 2021-06-29 2022-12-30 Oppo广东移动通信有限公司 Image processing method, image processor and electronic device

Citations (5)

Publication number Priority date Publication date Assignee Title
JPH11112973A (en) * 1997-10-01 1999-04-23 Matsushita Electric Ind Co Ltd Device and method for converting video signal
CN101640802A (en) * 2009-08-28 2010-02-03 北京工业大学 Video inter-frame compression coding method based on macroblock features and statistical properties
US20120020407A1 (en) * 2010-07-20 2012-01-26 Vixs Systems, Inc. Resource adaptive video encoding system with region detection and method for use therewith
CN102510496A (en) * 2011-10-14 2012-06-20 北京工业大学 Quick size reduction transcoding method based on region of interest
CN102740073A (en) * 2012-05-30 2012-10-17 华为技术有限公司 Coding method and device


Non-Patent Citations (1)

Title
LIU Pengyu, JIA Kebin: "Fast Extraction and Coding Algorithm for Video Regions of Interest", Journal of Circuits and Systems *

Cited By (26)

Publication number Priority date Publication date Assignee Title
CN104079934A (en) * 2014-07-14 2014-10-01 武汉大学 Method for extracting regions of interest in real-time video communication
US10313692B2 (en) 2015-01-20 2019-06-04 Beijing University Of Technology Visual perception characteristics-combining hierarchical video coding method
CN104539962A (en) * 2015-01-20 2015-04-22 北京工业大学 Layered video coding method fused with visual perception features
WO2016115968A1 (en) * 2015-01-20 2016-07-28 北京工业大学 Visual perception feature-fused scaled video coding method
CN104539962B (en) * 2015-01-20 2017-12-01 北京工业大学 It is a kind of merge visually-perceptible feature can scalable video coding method
CN106331711A (en) * 2016-08-26 2017-01-11 北京工业大学 A Dynamic Bit Rate Control Method Based on Network Features and Video Features
CN106331711B (en) * 2016-08-26 2019-07-05 北京工业大学 A kind of dynamic code rate control method based on network characterization and video features
CN107371029A (en) * 2017-06-28 2017-11-21 上海大学 Content-based Video Packet Priority Allocation Method
CN107371029B (en) * 2017-06-28 2020-10-30 上海大学 Content-based video packet priority allocation method
CN107563371A (en) * 2017-07-17 2018-01-09 大连理工大学 The method of News Search area-of-interest based on line laser striation
CN107563371B (en) * 2017-07-17 2020-04-07 大连理工大学 Method for dynamically searching interesting region based on line laser light strip
CN107483934A (en) * 2017-08-17 2017-12-15 西安万像电子科技有限公司 Decoding method, device and system
CN107623848A (en) * 2017-09-04 2018-01-23 浙江大华技术股份有限公司 A kind of method for video coding and device
CN107623848B (en) * 2017-09-04 2019-11-19 浙江大华技术股份有限公司 A kind of method for video coding and device
CN109151479A (en) * 2018-08-29 2019-01-04 南京邮电大学 Significance extracting method based on H.264 compression domain model with feature when sky
CN109379594A (en) * 2018-10-31 2019-02-22 北京佳讯飞鸿电气股份有限公司 Video coding compression method, device, equipment and medium
CN109862356A (en) * 2019-01-17 2019-06-07 中国科学院计算技术研究所 A video coding method and system based on region of interest
CN109862356B (en) * 2019-01-17 2020-11-10 中国科学院计算技术研究所 Video coding method and system based on region of interest
CN110784716A (en) * 2019-08-19 2020-02-11 腾讯科技(深圳)有限公司 Media data processing method, device and medium
CN110784716B (en) * 2019-08-19 2023-11-17 腾讯科技(深圳)有限公司 Media data processing method, device and medium
CN110572579A (en) * 2019-09-30 2019-12-13 联想(北京)有限公司 image processing method and device and electronic equipment
WO2021093059A1 (en) * 2019-11-15 2021-05-20 网宿科技股份有限公司 Method, system and device for recognizing region of interest
CN111079567A (en) * 2019-11-28 2020-04-28 中科驭数(北京)科技有限公司 Sampling method, model generation method, video behavior identification method and device
CN111079567B (en) * 2019-11-28 2020-11-13 中科驭数(北京)科技有限公司 Sampling method, model generation method, video behavior identification method and device
WO2022127865A1 (en) * 2020-12-18 2022-06-23 中兴通讯股份有限公司 Video processing method, apparatus, electronic device, and storage medium
CN115550536A (en) * 2021-06-29 2022-12-30 Oppo广东移动通信有限公司 Image processing method, image processor and electronic device

Also Published As

Publication number Publication date
CN103618900B (en) 2016-08-17

Similar Documents

Publication Publication Date Title
CN103618900B (en) Video area-of-interest exacting method based on coding information
Zhao et al. Real-time moving object segmentation and classification from HEVC compressed surveillance video
CN111670580B (en) Progressive compressed domain computer vision and deep learning system
WO2020097888A1 (en) Video processing method and apparatus, electronic device, and computer-readable storage medium
CN102158712B (en) Multi-viewpoint video signal coding method based on vision
CN101184221A (en) Video Coding Method Based on Visual Attention
CN101729891B (en) Method for encoding multi-view depth video
CN104065962B (en) The macroblock layer bit distribution optimization method that view-based access control model notes
Kong et al. Object-detection-based video compression for wireless surveillance systems
CN101937578A (en) A Color Image Rendering Method of Virtual Viewpoint
CN104796694A (en) Intraframe video encoding optimization method based on video texture information
CN103327327B (en) For the inter prediction encoding unit selection method of high-performance video coding HEVC
CN100593792C (en) A Text Tracking and Multi-Frame Enhancement Method in Video
WO2023005740A1 (en) Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device
CN112001308A (en) Lightweight behavior identification method adopting video compression technology and skeleton features
CN111083477A (en) HEVC Optimization Algorithm Based on Visual Saliency
CN103561261B (en) The panoramic locatable video coded method that view-based access control model notes
CN107820095A (en) A kind of long term reference image-selecting method and device
CN103957420B (en) Comprehensive movement estimation modified algorithm of H.264 movement estimation code
CN116437102B (en) Can learn general video coding methods, systems, equipment and storage media
CN101917627B (en) Video fault-tolerant coding method based on self-adaptation flexible macro-block order
CN106604029B (en) A kind of bit rate control method of the moving region detection based on HEVC
CN112449182B (en) Video encoding method, device, equipment and storage medium
Ko et al. An energy-quality scalable wireless image sensor node for object-based video surveillance
WO2020227911A1 (en) Method for accelerating coding/decoding of hevc video sequence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240130

Address after: 073099 Room 309, 3rd Floor, Commercial and Residential Building B, Xinhai Science and Technology Plaza, East Side of Beimen Street and South Side of Beimen Street Market, Dingzhou City, Baoding City, Hebei Province

Patentee after: HEBEI HONGYI ENVIRONMENTAL PROTECTION TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 100124 No. 100 Pingleyuan, Chaoyang District, Beijing

Patentee before: Beijing University of Technology

Country or region before: China

TR01 Transfer of patent right