CN112770116B - Method for extracting video key frame by using video compression coding information

Info

Publication number: CN112770116B
Application number: CN202011642920.5A
Authority: CN
Inventors: 艾达; 梁嘉倩
Original assignee: Xian University of Posts and Telecommunications
Current assignee: Xian University of Posts and Telecommunications
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-12-07
Anticipated expiration: 2040-12-31
Also published as: CN112770116A

Abstract

A method for extracting video key frames using video compression coding information consists of three steps: extracting depth and frame bit number features, detecting shot switches, and extracting key frames. The invention uses two compressed-domain features from the video bitstream, the coding unit depth and the number of bits per frame, to detect shot switches, obtain shot segments, and extract key frames. Because the video is processed in the compressed domain without decompression, the amount of computation is reduced and processing time is shortened. Experimental results show that, compared with the existing method, the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%, and the extracted key frames express the main content of the original video well. The method has a small computational load, high efficiency, high accuracy, and high processing speed, and can be used for video image processing.

Description

Method for extracting video key frame by using video compression coding information

Technical Field

The invention belongs to the technical field of digital video retrieval, and particularly relates to a method for extracting video key frames by using video compression coding information.

Background

With the rapid development of multimedia and network technology, the volume of video data is growing at an unprecedented rate, and how to manage videos effectively and quickly obtain the important information in them has become a research hotspot. Against this background, key frame extraction has become an effective solution: extracting key frames greatly reduces the amount of video data while still expressing the important information of the original video, which saves retrieval time and improves video retrieval efficiency.

Researchers at home and abroad have done a great deal of work on key frame extraction. According to the video data being processed, the methods can be divided into pixel-domain key frame extraction and compressed-domain key frame extraction. Pixel-domain methods operate after the video has been fully decompressed; they are computationally expensive and inefficient, and have difficulty meeting real-time requirements. Compressed-domain video processing works directly on the compressed video data, which is small in volume, and processes the video with no decompression or only partial decompression, so the processing speed can be greatly improved. Key frame extraction in the compressed domain has therefore attracted wide attention.

Ali Reza et al. proposed a method for extracting key frames in the H.265/HEVC compressed domain, which uses a normalized histogram of the intra-frame prediction modes extracted from H.265/HEVC coded video to detect similar frames, classifies the similar frames using fuzzy c-means clustering, and extracts key frames. Zhu Zhiming et al. proposed a video summary key frame extraction method in the video coding compressed domain, which counts, at the decoding end, the numbers of luma prediction modes of the intra-coded PU blocks, constructs a mode feature vector, clusters the mode feature vectors using an adaptive clustering algorithm that incorporates the iterative self-organizing data analysis algorithm (ISODATA) to obtain candidate key frames, and then filters the candidate key frames by similarity to remove redundant frames and obtain the final key frames.

What these methods have in common is that they use intra-frame prediction mode values as features and were tested only in the all-intra coding mode; as a result, video frames are processed slowly, processing time is long, and the methods are of limited practical use.

Disclosure of Invention

The technical problem to be solved by the present invention is to overcome the disadvantages of the above video frame processing methods and to provide a method for extracting video key frames by using video compression coding information that requires no decoding and has a small computational load, high processing speed, and high extraction efficiency.

The technical scheme adopted for solving the technical problems comprises the following steps:

(1) Extracting depth and frame bit number features

Determining a rate-distortion cost J of the coding unit according to equation (1):

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the pixel at (x, y) in the coding unit; x ∈ {1,2,…,H}, y ∈ {1,2,…,W}; W×H is the video resolution; λ is the Lagrangian coefficient; W and H are finite positive integers; and W > H.
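The Lagrangian cost defined by these symbols can be illustrated with a short sketch. It assumes the conventional form J = ΣD_{x,y} + λ·ΣR_{x,y}, summed over the pixels of one coding unit; the function name `rd_cost` and the sample values are hypothetical and are not taken from the patent.

```python
from typing import Sequence


def rd_cost(distortion: Sequence[Sequence[float]],
            bits: Sequence[Sequence[float]],
            lam: float) -> float:
    """Lagrangian rate-distortion cost of one coding unit.

    Assumption: J = sum(D_xy) + lam * sum(R_xy), the conventional form.
    """
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    total_d = sum(sum(row) for row in distortion)
    total_r = sum(sum(row) for row in bits)
    return total_d + lam * total_r


# Hypothetical 2x2 block: compare "no split" with "split", keep the cheaper one.
j_no_split = rd_cost([[4.0, 4.0], [4.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]], lam=2.0)
j_split = rd_cost([[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]], lam=2.0)
best = "split" if j_split < j_no_split else "no split"
```

An encoder that minimizes a cost of this kind at each quadtree level ends up with a depth value per coding unit, which is the quantity the depth feature vector collects.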

Determining the depth feature vector F_n of the coded frame according to equation (2):

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the nth coded frame of the video, n ∈ {1,2,…,N}, N is the total number of frames in the video and is a finite positive integer, round() is a round-up function, and f₁, f₂, …, f_α are the depth values of the coding units, each taking one of the values 0, 1, 2, or 3.

Determining the number of frame bits R_n according to equation (3):

(2) Shot switching detection

Counting the frame bit number R_n of the coded frames and plotting it as a line graph for analysis; positions where the curve first gradually rises and then gradually falls are marked as shot switches. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. K shot segments are obtained, where K is a finite positive integer.
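The rise-then-fall rule can be sketched as a simple peak search over the R_n series. This is only an illustration: the `ratio` threshold, the function names, and the sample bit counts are assumptions, and the sketch also counts the segments before the first and after the last switch as shots.

```python
from typing import List


def detect_shot_switches(frame_bits: List[int], ratio: float = 2.0) -> List[int]:
    """Return indices where R_n rises and then falls (a local peak).

    `ratio` is a hypothetical threshold: a peak must exceed `ratio` times
    the mean frame size, so that small fluctuations are ignored.
    """
    if not frame_bits:
        return []
    mean_bits = sum(frame_bits) / len(frame_bits)
    switches = []
    for n in range(1, len(frame_bits) - 1):
        rising = frame_bits[n] > frame_bits[n - 1]
        falling = frame_bits[n] > frame_bits[n + 1]
        if rising and falling and frame_bits[n] > ratio * mean_bits:
            switches.append(n)
    return switches


def shot_lengths(num_frames: int, switches: List[int]) -> List[int]:
    """Length M of each shot segment delimited by the switch positions."""
    bounds = [0] + switches + [num_frames]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Hypothetical per-frame bit counts with two large peaks.
bits = [100, 110, 105, 900, 120, 115, 110, 950, 130, 125]
cuts = detect_shot_switches(bits)
lengths = shot_lengths(len(bits), cuts)
```

For this input the peaks at frames 3 and 7 are reported, giving three segments of lengths 3, 4, and 3.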

(3) Extracting key frames

The Laplacian matrix L is determined according to equation (4):

where F_i and F_j denote the depth feature vectors of the ith and jth coded frames, respectively, i ∈ {1,2,…,N}, j ∈ {1,2,…,N}.

Determining the eigenvectors y corresponding to the first K eigenvalues of L according to equation (5), and constructing an N×K matrix Y according to equation (6):

L×y = β×D×y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N×1 eigenvectors corresponding to the first K eigenvalues.
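A minimal NumPy sketch of equations (5) and (6) follows. Several points are assumptions rather than the patent's definitions: a Gaussian affinity between depth feature vectors, D taken as the diagonal degree matrix, L = D - A, and "first K" read as the K smallest eigenvalues. The function name, `sigma`, and the sample data are hypothetical.

```python
import numpy as np


def spectral_embedding(features: np.ndarray, k: int, sigma: float = 1.0) -> np.ndarray:
    """Solve L y = beta D y and return the N x K matrix Y.

    Assumptions: A_ij = exp(-||F_i - F_j||^2 / (2 sigma^2)),
    D = diag(row sums of A), L = D - A, K smallest eigenvalues.
    """
    diff = features[:, None, :] - features[None, :, :]
    affinity = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    # L y = beta D y  <=>  (D^-1/2 L D^-1/2) z = beta z, with y = D^-1/2 z.
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    sym = d_inv_sqrt[:, None] * laplacian * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(sym)  # eigenvalues are returned in ascending order
    return d_inv_sqrt[:, None] * vecs[:, :k]


# Hypothetical depth vectors for 6 frames forming two distinct groups.
F = np.array([[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1],
              [3, 3, 2, 3], [3, 2, 3, 3], [3, 3, 3, 2]], dtype=float)
Y = spectral_embedding(F, k=2)
```

Rewriting the generalized problem as a symmetric one lets `numpy.linalg.eigh` be used directly; in this toy input the second column of Y takes opposite signs on the two groups of frames.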

K-means clustering is performed on the matrix Y, and the distance d_m between the cluster center μ and all other frames in the shot is determined according to equation (7):

d_m = ||y_m - μ||₂ (7)

where m ∈ {1,2,…,M}, M is the length of each shot segment, M is a finite positive integer, and M < N.

The frame with the smallest distance d_m is taken as the key frame.
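The selection rule of equation (7) can be sketched for a single shot as follows. The sketch assumes that μ is the mean of the embedding rows of that shot, as a k-means center would be; the function name and the sample rows are hypothetical.

```python
import math
from typing import List, Sequence


def select_key_frame(rows: Sequence[Sequence[float]]) -> int:
    """Index of the frame whose embedding row y_m is nearest the center mu.

    Assumption: mu is the mean of the rows belonging to one cluster/shot.
    """
    dim = len(rows[0])
    mu = [sum(r[c] for r in rows) / len(rows) for c in range(dim)]
    dists: List[float] = [math.dist(r, mu) for r in rows]  # d_m = ||y_m - mu||_2
    return min(range(len(rows)), key=dists.__getitem__)


# Hypothetical embedding rows for a 4-frame shot.
shot_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 0.8]]
key_index = select_key_frame(shot_rows)
```

Here the center is (1.0, 0.95), so the second frame (index 1) is the closest and would be chosen as the key frame of the shot.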

In step (1) of extracting the depth and frame bit number features, W ranges from 176 to 7680, H ranges from 144 to 4320, and N ranges from 1000 to 7000.

In step (2) of shot switching detection, K ranges from 5 to 20.

The invention uses two compressed-domain features from the video bitstream, the CU depth values and the number of bits per frame, to detect shot switches, obtain shot segments, and extract key frames. Because the video is processed in the compressed domain without decompression, the amount of computation is reduced and processing time is shortened. Experimental results show that, compared with the existing method, the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%, and the extracted key frames express the main content of the original video well. The method has a small computational load, high efficiency, high accuracy, and high processing speed, and can be used for video image processing.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following drawings and examples, but the present invention is not limited to these examples.

Example 1

Taking the video sequence A New Horizon, segment 02 from the international VSUMM dataset as an example, the method for extracting video key frames by using video compression coding information in this embodiment includes the following steps (see FIG. 1):

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the pixel at (x, y) in the coding unit; x ∈ {1,2,…,H}, y ∈ {1,2,…,W}; W×H is the video resolution; λ is the Lagrangian coefficient, with λ ≥ 0; W and H are finite positive integers; and W > H. In this embodiment, W is 352 and H is 240.

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the nth coded frame of the video, n ∈ {1,2,…,N}, N is the total number of frames in the video and is a finite positive integer (N is 1797 in this embodiment), round() is a round-up function, and f₁, f₂, …, f_α are the depth values of the coding units, each taking one of the values 0, 1, 2, or 3; the specific values of f_α depend on n.

Determining the number of frame bits R_n according to equation (3):

(2) Shot switching detection

Counting the frame bit number R_n of the coded frames and plotting it as a line graph for analysis; positions where the curve first gradually rises and then gradually falls are marked as shot switches. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. K shot segments are obtained, where K is a finite positive integer. In this embodiment, K is 13, and the values of M are 376, 232, 128, 108, 80, 76, 72, 80, 116, 120, 68, 72, and 108.

(3) Extracting key frames

The Laplacian matrix L is determined according to equation (4):

L×y = β×D×y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N×1 eigenvectors corresponding to the first K eigenvalues; the value of K in this step is the same as in step (2), and the value of N is the same as in step (1).

d_m = ||y_m - μ||₂ (7)

where m ∈ {1,2,…,M}, M is the length of each shot segment, M is a finite positive integer, M < N, and the specific values of M are the same as in step (2).

The frame with the smallest distance d_m is taken as the key frame.

Example 2

Taking the video sequence Ocean floor Legacy as an example, the method for extracting video key frames by using video compression coding information in this embodiment includes the following steps:

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the pixel at (x, y) in the coding unit; x ∈ {1,2,…,H}, y ∈ {1,2,…,W}; W×H is the video resolution; λ is the Lagrangian coefficient, with λ ≥ 0; W and H are finite positive integers; and W > H. In this embodiment, W is 176 and H is 144.

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the nth coded frame of the video, n ∈ {1,2,…,N}, N is the total number of frames in the video and is a finite positive integer (N is 1000 in this embodiment), round() is a round-up function, and f₁, f₂, …, f_α are the depth values of the coding units, each taking one of the values 0, 1, 2, or 3; the specific values of f_α depend on n.

Determining the number of frame bits R_n according to equation (3):

(2) Shot switching detection

Counting the frame bit number R_n of the coded frames and plotting it as a line graph for analysis; positions where the curve first gradually rises and then gradually falls are marked as shot switches. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. K shot segments are obtained, where K is a finite positive integer. In this embodiment, K is 5, and the values of M are 336, 216, 112, 96, and 296.

(3) Extracting key frames

The Laplacian matrix L is determined according to equation (4):

L×y = β×D×y (5)

Y = [y₁, y₂, …, y_K] (6)

d_m = ||y_m - μ||₂ (7)

The frame with the smallest distance d_m is taken as the key frame.

Example 3

Taking the video sequence exceptional Terrane as an example, the method for extracting video key frames by using video compression coding information in this embodiment includes the following steps:

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the pixel at (x, y) in the coding unit; x ∈ {1,2,…,H}, y ∈ {1,2,…,W}; W×H is the video resolution; λ is the Lagrangian coefficient, with λ ≥ 0; W and H are finite positive integers; and W > H. In this embodiment, W is 7680 and H is 4320.

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the nth coded frame of the video, n ∈ {1,2,…,N}, N is the total number of frames in the video and is a finite positive integer (N is 7000 in this embodiment), round() is a round-up function, and f₁, f₂, …, f_α are the depth values of the coding units, each taking one of the values 0, 1, 2, or 3; the specific values of f_α depend on n.

Determining the number of frame bits R_n according to equation (3):

(2) Shot switching detection

Counting the frame bit number R_n of the coded frames and plotting it as a line graph for analysis; positions where the curve first gradually rises and then gradually falls are marked as shot switches. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. K shot segments are obtained, where K is a finite positive integer. In this embodiment, K is 20, and the values of M are 156, 196, 596, 1068, 316, 452, 196, 96, 468, 240, 496, 176, 152, 376, 192, 112, 412, 336, 240, and 396.

(3) Extracting key frames

The Laplacian matrix L is determined according to equation (4):

L×y = β×D×y (5)

Y = [y₁, y₂, …, y_K] (6)

d_m = ||y_m - μ||₂ (7)

The frame with the smallest distance d_m is taken as the key frame.

To verify the beneficial effects of the present invention, the inventors carried out a comparison experiment between the method for extracting video key frames by using video compression coding information of Example 1 and an HEVC intra-frame-based compressed-domain video summarization method (hereinafter "comparison document 1"). The accuracy, recall rate, and F value of the two methods were determined as comprehensive indicators of video summary quality; the experimental and calculation results are shown in Table 1.

The accuracy is determined as follows:

where N_m is the number of key frames extracted by the experimental method that match the user summary, and N_AS is the number of key frames extracted by the experimental method.

The recall rate is determined as follows:

where N_US is the number of key frames in the user summary.

The value of F is determined as follows:

Table 1. Experimental results

As can be seen from Table 1, the method of the present invention performs significantly better than the method of comparison document 1: the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%.
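The three indicators can be computed from the counts N_m, N_AS, and N_US as in the sketch below. It assumes the standard definitions (accuracy as N_m / N_AS, recall as N_m / N_US, and the F value as their harmonic mean); the function name and the counts are hypothetical and are not the values of Table 1.

```python
from typing import Tuple


def summary_metrics(n_m: int, n_as: int, n_us: int) -> Tuple[float, float, float]:
    """Accuracy (precision), recall, and F value of a key-frame summary.

    Assumption: P = N_m / N_AS, R = N_m / N_US, F = 2PR / (P + R).
    """
    precision = n_m / n_as if n_as else 0.0
    recall = n_m / n_us if n_us else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


# Hypothetical counts, not taken from Table 1.
p, r, f = summary_metrics(n_m=8, n_as=10, n_us=16)
```

With these counts the accuracy is 0.8 and the recall is 0.5; the F value lies between the two and is pulled toward the smaller one.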

Claims

1. A method for extracting video key frames using video compression coding information, characterized in that it consists of the following steps:

(1) Extract depth and frame bit number features

Determine the rate-distortion cost J of the coding unit according to formula (1):

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the pixel at coordinates (x, y) in the coding unit, x ∈ {1,2,…,H}, y ∈ {1,2,…,W}, W×H is the video resolution, λ is the Lagrangian coefficient with λ ≥ 0, W and H are finite positive integers, and W > H;

Determine the depth feature vector F_n of the coded frame according to formula (2):

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the nth coded frame of the video, n ∈ {1,2,…,N}, N is the total number of video frames, N is a finite positive integer, round() is a round-up function, f₁, f₂, …, f_α are the depth values of the coding units, and each of f₁, f₂, …, f_α takes any one of the values 0, 1, 2, and 3;

Determine the number of frame bits R_n according to formula (3):

(2) Shot switching detection

Count the number of frame bits R_n of the coded frames and plot it as a line graph for analysis; mark each turning point where the curve first gradually rises and then gradually falls as a shot switch; the frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N; K shot segments are obtained, where K is a finite positive integer;

(3) Extract key frames

Determine the Laplacian graph matrix L according to formula (4):

where F_i and F_j denote the depth feature vectors of the ith and jth coded frames, respectively, i ∈ {1,2,…,N}, j ∈ {1,2,…,N};

Determine the eigenvectors y corresponding to the first K eigenvalues of L according to equation (5), and construct an N×K-order matrix Y according to equation (6):

L×y=β×D×y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N×1 eigenvectors corresponding to the first K eigenvalues;

Perform k-means clustering on the matrix Y, and determine the distance d_m between the cluster center μ and all other frames in the shot according to formula (7):

d_m = ||y_m - μ||₂ (7)

where m∈{1,2,…,M}, M is the length of each shot segment, M is a finite positive integer, and M<N;

The frame with the smallest distance d_m is recorded as the key frame.

2. The method for extracting video key frames using video compression coding information according to claim 1, characterized in that: in step (1) of extracting depth and frame bit number features, W ranges from 176 to 7680, H ranges from 144 to 4320, and N ranges from 1000 to 7000.

3. The method for extracting video key frames using video compression coding information according to claim 1, characterized in that: in step (2) of shot switching detection, K ranges from 5 to 20.