Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
The embodiment of the application provides a video frame extraction scheme. The principle is that an initial video frame set to be processed can be obtained, and frame extraction is performed on the initial video frame set to obtain a target video frame set. The initial video frame set comprises a plurality of initial video frames, which are consecutive video frames in one video. In a specific implementation, coding information corresponding to a candidate video frame set may be obtained, where the coding information includes the frame type and time stamp corresponding to each candidate video frame in the candidate video frame set, and frame extraction processing is performed based on the coding information to obtain the target video frame set. In one implementation, before the frame extraction processing is performed using the coding information, brightness filtering processing may be performed on each initial video frame in the initial video frame set to obtain the candidate video frame set, and the frame extraction processing is then performed on the candidate video frame set using the coding information to obtain the target video frame set. In this embodiment, the quality of the extracted video frames can be improved through the brightness screening of the video frames, and the number of key frames can be increased by combining the coding information of the video, so as to improve the frame extraction accuracy and the video frame quality.
In a specific implementation, the execution subject of the video frame extraction scheme mentioned above may be a computer device, which may be a terminal or a server. The terminal mentioned herein may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), big data, and artificial intelligence platforms.
When the computer device is a server, as shown in fig. 1, the embodiment of the application provides a video frame extraction system, which may include at least one terminal and at least one server. The terminal may acquire an initial video frame set to be processed and upload it to the server (i.e., the computer device), so that the server may acquire the initial video frame set to be processed and extract the required target video frame set from it; that is, the server may implement frame extraction on the initial video frame set by adopting the video frame extraction scheme.
Referring to fig. 2, fig. 2 is a flowchart of a video frame extraction method according to an embodiment of the present application, where the video frame extraction method described in the present embodiment may be applied to the above-mentioned computer device, and as shown in fig. 2, the method may include:
S201, acquiring an initial video frame set to be processed.
The initial video frame set may include a plurality of initial video frames, and the plurality of initial video frames may be consecutive video frames in one video. The video may be of any type, such as a movie, educational, or sports video, and of any duration, such as 20 seconds or 8 minutes.
In one implementation, after the initial video frame set is acquired, the initial video frame set may be scaled to control the resolution of each initial video frame in the initial video frame set; for example, the resolution of each initial video frame may be reduced to reduce the storage space occupied by the initial video frame set. In a specific implementation, a target resolution of the required initial video frames may be preset, where the target resolution may be smaller than the original resolution of each initial video frame in the initial video frame set; for example, the target resolution may be set to 720×1280. The initial video frame set is then scaled based on the target resolution, thereby obtaining the scaled initial video frame set. The scaling process may be understood as an interpolation process that uses a corresponding scaling algorithm (interpolation algorithm), for example, any one of nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, and the like.
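As an illustration only, a minimal sketch of this scaling step might look as follows, assuming OpenCV is available and each frame is held as a numpy array; the function name and the 720×1280 default are illustrative, not prescribed by the embodiment:

```python
import cv2

def downscale_frames(frames, target_w=720, target_h=1280):
    """Scale each initial video frame to a preset target resolution.

    cv2.resize takes the destination size as (width, height); bicubic
    interpolation is used here as one example of a scaling algorithm,
    and nearest-neighbor or bilinear interpolation would also fit.
    """
    return [cv2.resize(f, (target_w, target_h), interpolation=cv2.INTER_CUBIC)
            for f in frames]
```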
In one implementation, the execution of step S201 may be triggered when a frame extraction requirement for the video is obtained.
Alternatively, when a frame extraction request for the video is received at the computer device, it may be determined that the frame extraction requirement is obtained. The frame extraction request may be generated when an object (which may refer to any user) performs a related operation on a user operation interface. When the object needs to use the key frames of the video for a subsequent inference task, the related operation can be executed on the user operation interface output by the terminal in use, so as to send the frame extraction request for the video to the computer device. For example, referring to fig. 3, a user operation interface may be displayed on the terminal screen of the terminal used by the object, and the user operation interface may include at least a video input area 301 and a confirmation control 302. If the object wants to obtain, for example, a matching commodity corresponding to the video, related information of the video may be input (for example, the video may be input directly in the video input area 301, or a link address corresponding to the video may be input), and then a trigger operation (for example, clicking, pressing, etc.) may be performed on the confirmation control 302. After detecting that the confirmation control 302 is triggered, the terminal may acquire the video based on the information in the video input area 301 and send a frame extraction request carrying the video to the computer device, so that frame extraction processing may subsequently be performed on the video frames included in the video.
Alternatively, it may be determined that the frame extraction requirement for the video is obtained when a frame extraction timing task is triggered. For example, a frame extraction timing task may be set, and when the trigger condition in the frame extraction timing task is satisfied, it may be determined that the frame extraction requirement is obtained. In one embodiment, a large number of videos may be stored in a designated area, and the trigger condition may be that the current time reaches a preset processing time, or that the remaining storage space of the designated storage area exceeds a preset remaining storage space, or the like. The video to be subjected to frame extraction may be one video in the designated area.
S202, brightness filtering processing is carried out on each initial video frame in the initial video frame set, and a candidate video frame set is obtained.
The brightness filtering process may be a filtering process performed by using brightness values of pixels in the initial video frame, that is, filtering processes may be performed on each initial video frame in the initial video frame set based on the brightness values, so as to obtain a candidate video frame set. Considering that the same luminance filtering process may be performed for each initial video frame, the luminance filtering process will be specifically described below by taking any one of the initial video frames in the initial video frame set as an example.
It can be understood that the brightness of an image refers to the magnitude of its pixel values: if the pixel value of a certain pixel point in the image is larger, the image is brighter at that pixel point; correspondingly, if the pixel value is smaller, the image is darker at that pixel point. The range of pixel values in an image is [0,255], that is, the brightness of the image can be expressed by a value in [0,255]; the closer the pixel value is to 0, the lower the brightness, and the closer it is to 255, the higher the brightness.
Based on this, in the embodiment of the present application, the range of the brightness value of one pixel point in any initial video frame is [0,255]. For example, the luminance value of one pixel in any initial video frame is 200, and the luminance value of another pixel is 185.
In one implementation, for any initial video frame in the initial video frame set, the brightness value of each pixel point in that initial video frame can be obtained, and if the brightness values of the pixel points in the initial video frame meet a preset condition, the initial video frame can be used as a candidate video frame and added to the candidate video frame set. It should be understood that video frames are usually in YUV format, where Y represents luminance and U and V represent chrominance, so the Y value of the initial video frame can be obtained directly; the Y value is the luminance value.
The preset conditions may include any one of a first preset condition, a second preset condition and a third preset condition, where the first preset condition may be used for filtering out dark initial video frames, the second preset condition may be used for filtering out bright initial video frames, and the third preset condition may be used for filtering out initial video frames with an excessively large black duty ratio.
Based on this, the first preset condition may be that the minimum luminance value among the luminance values of the pixel points is greater than the preset minimum luminance value. That is, if the minimum luminance value among the luminance values of all the pixel points of any initial video frame is greater than the preset minimum luminance value, it can be determined that the initial video frame is not too dark, and the initial video frame can be retained; if the minimum luminance value is not greater than the preset minimum luminance value, it can be determined that the initial video frame is too dark, and the initial video frame can be filtered out.
The second preset condition may be that the maximum luminance value among the luminance values of the pixel points is smaller than the preset maximum luminance value. That is, if the maximum luminance value among the luminance values of all the pixel points of any initial video frame is smaller than the preset maximum luminance value, it can be determined that the initial video frame is not too bright, and the initial video frame can be retained; if the maximum luminance value is not smaller than the preset maximum luminance value, it can be determined that the initial video frame is too bright, and the initial video frame can be filtered out.
The third preset condition may be that the pixel proportion corresponding to any initial video frame is greater than the preset proportion, where the pixel proportion may be the proportion of pixel points whose luminance value is smaller than the preset threshold value. That is, if the pixel proportion corresponding to any initial video frame is greater than the preset proportion, it can be determined that the black proportion of the initial video frame is not very large, and the initial video frame can be retained; if the pixel proportion is not greater than the preset proportion, it can be determined that the black proportion of the initial video frame is relatively large, and the initial video frame can be filtered out.
The specific values of the above-mentioned preset minimum luminance value, preset maximum luminance value, preset threshold value, and preset ratio may be preset, and are not limited here. For example, the preset minimum luminance value ∈ [0,255], the preset maximum luminance value ∈ [0,255], the preset threshold value ∈ [0,255], and the preset ratio ∈ [0,1].
For example, assume a preset minimum luminance value of 16, a preset maximum luminance value of 235, a preset threshold value of 30, and a preset ratio of 32%.
Then any initial video frame may be added to the candidate video frame set if its minimum luminance value is greater than 16, and may be filtered out if its minimum luminance value is less than or equal to 16.
Any initial video frame may be added to the candidate video frame set if its maximum luminance value is less than 235, and may be filtered out if its maximum luminance value is greater than or equal to 235.
If the pixel proportion of any initial video frame, determined with the preset threshold value of 30, is greater than 32%, the initial video frame may be added to the candidate video frame set; if the pixel proportion is less than or equal to 32%, the initial video frame may be filtered out.
Through the above filtering operations, initial video frames that are too dark, too bright, or have an excessively large black duty ratio can be filtered out of the initial video frame set, so that the image quality of the final video frame set is ensured.
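For illustration, the following is a minimal sketch of the brightness filtering step, assuming each frame is a numpy array in a packed YUV layout with the Y component in channel 0; the thresholds mirror the example values above (16, 235, 30, 32%), and the retention expression follows the B-module condition given later in this description:

```python
import numpy as np

def luminance_filter(frames_yuv, th_y1=16, th_y2=235, th_black=30, th_bp=0.32):
    """Keep an initial video frame when its luminance satisfies any of the
    three preset conditions; threshold values mirror the example above."""
    candidates = []
    for frame in frames_yuv:
        y = frame[..., 0]  # Y (luminance) component of each pixel point
        y_min, y_max = int(y.min()), int(y.max())
        black_percentage = float((y < th_black).mean())
        # (Ymin > th_Y1) or (Ymax < th_Y2) or (black_percentage > th_bp)
        if y_min > th_y1 or y_max < th_y2 or black_percentage > th_bp:
            candidates.append(frame)
    return candidates
```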
S203, obtaining coding information corresponding to the candidate video frame set.
The encoding information corresponding to the candidate video frame set may include a frame type and a time stamp corresponding to each candidate video frame in the candidate video frame set, in other words, the encoding information may include encoding information of each candidate video frame in the candidate video frame set, and the encoding information of any candidate video frame may include a frame type and a time stamp corresponding to the candidate video frame. The frame type of a candidate video frame may include any of an I frame, a P frame, and a B frame, and the time stamp may refer to a play time of the candidate video frame in the video.
In one implementation, after the video is decoded, the consecutive video frames included in the video and the encoding information of the video may be obtained, where the encoding information of the video may include the frame type and time stamp corresponding to each video frame in the video. The consecutive video frames mentioned here are the initial video frame set in step S201. Then, after the encoding information of the video is acquired, the encoding information of each candidate video frame may be extracted from the encoding information of the video, so as to construct the encoding information corresponding to the candidate video frame set.
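As one possible illustration of collecting this information during decoding, the sketch below uses the PyAV library; the choice of PyAV, the function name, and returning raw av.VideoFrame objects are all assumptions of this sketch rather than requirements of the embodiment:

```python
import av  # PyAV, used here only as an example decoder

def decode_with_coding_info(path):
    """Decode a video into frames plus a (frame type, timestamp) pair per frame."""
    frames, coding_info = [], []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            frames.append(frame)  # av.VideoFrame; convert downstream as needed
            # pict_type.name is "I", "P", or "B"; frame.time is the play
            # time in seconds derived from the presentation timestamp
            coding_info.append((frame.pict_type.name, frame.time))
    return frames, coding_info
```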
S204, performing frame extraction processing on the candidate video frame set based on the coding information to obtain a target video frame set.
In one implementation, one or more of a first frame extraction strategy and a second frame extraction strategy may be employed to perform frame extraction processing on the candidate video frame set based on the encoding information, so as to obtain the target video frame set.
The first frame extraction strategy may be as follows: the candidate video frame set is first extracted by one or more of retaining candidate video frames of a target frame type and retaining candidate video frames sampled at a target interval, and the resulting set is then subjected to frame extraction processing again by using the pixel gap or the time stamp gap between two adjacent candidate video frames, so as to obtain the target video frame set. The target frame type may refer to an I frame, and the target interval may refer to sampling the candidate video frame set at an interval of N frames, where N may be any value, for example 3, 5, etc.
The second frame extraction strategy may be as follows: the candidate video frame set is first extracted by combining the pixel gap between two adjacent candidate video frames with the frame type, and the resulting set is then subjected to frame extraction processing again by combining the pixel gap between two adjacent candidate video frames with the time stamp gap between them, so as to obtain the target video frame set.
In the embodiment of the application, the initial video frame set to be processed is obtained, and each initial video frame in the initial video frame set is subjected to brightness filtering processing to obtain the candidate video frame set, so that the video frame quality of each candidate video frame in the candidate video frame set can be ensured. Furthermore, the coding information corresponding to the candidate video frame set can be obtained, so that the candidate video frame set is subjected to frame extraction processing based on the coding information to obtain the target video frame set; by combining the coding information of the video, the number of key frames is increased, and the frame extraction accuracy and video frame quality are further improved, so that efficient video frame extraction is realized.
Referring to fig. 4, fig. 4 is a flowchart of another video frame extraction method according to an embodiment of the present application, where the video frame extraction method described in the present embodiment may be applied to the above-mentioned computer device, and as shown in fig. 4, the method may include:
S401, acquiring an initial video frame set to be processed.
S402, brightness filtering processing is carried out on each initial video frame in the initial video frame set, and a candidate video frame set is obtained.
S403, obtaining coding information corresponding to the candidate video frame set.
Wherein the encoding information includes frame types and time stamps corresponding to each candidate video frame in the set of candidate video frames.
The specific embodiments in steps S401 to S403 may refer to the specific embodiments in steps S201 to S203, and will not be described herein.
S404, performing frame extraction processing on the candidate video frame set based on the coding information by adopting a first frame extraction strategy to obtain a first target video frame set.
In one implementation, first, a candidate video frame set may be extracted according to a target extraction manner to obtain a first video frame set. The target extraction mode may include one or more of a first extraction mode and a second extraction mode, where the first extraction mode may be to extract candidate video frames of a target frame type from the candidate video frame set, and the second extraction mode may be to sample the candidate video frame set at a target interval.
Based on the above, if the target extraction mode is the first extraction mode, the first video frame set may be determined by extracting candidate video frames of the target frame type from the candidate video frame set and using the extracted candidate video frames as the first video frames, thereby obtaining the first video frame set. The target frame type may refer to an I frame, that is, the first video frame set may be constructed from candidate video frames whose frame type is I frame. It will be appreciated that an I frame represents a key frame in a video, and a key frame may be understood as a frame whose picture is completely retained, so that the more critical video frames in the candidate video frame set can be retained in this manner.
If the target extraction mode is the second extraction mode, the first video frame set may be determined by sampling the candidate video frame set at the target interval. The target interval may be preset, and its specific value (i.e., the value of N) is not limited; for example, N may be 3, 5, etc. If the target interval is 3, one candidate video frame may be extracted every 3 candidate video frames, and the extracted candidate video frames are taken as the first video frames. In this way, the number of video frames in the final video frame set can be reduced, so that the calculation amount of subsequent video frame processing is reduced.
If the target extraction mode is the first extraction mode and the second extraction mode, the first video frame set may be determined as follows: candidate video frames of the target frame type are extracted from the candidate video frame set, and the extracted candidate video frames are used as a first reference video frame set; the candidate video frame set is sampled at the target interval to obtain a second reference video frame set; and after the first reference video frame set and the second reference video frame set are obtained, the two sets may be combined to obtain the first video frame set.
It will be appreciated that when the candidate video frame set is processed in the first extraction manner and the second extraction manner, the same candidate video frame may be extracted by both, so the subsequent combining process may be a union process, that is, the union of the first reference video frame set and the second reference video frame set may be used as the first video frame set. It can be seen that, by this video frame extraction method, more key frames can be retained while the number of video frames is appropriately reduced.
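A minimal sketch of this first-stage extraction is given below, assuming coding_info holds one (frame type, timestamp) pair per candidate frame as in the earlier decoding sketch; applying both conditions in a single pass yields the union directly, and the default interval n=3 is illustrative:

```python
def first_stage_extract(candidates, coding_info, n=3):
    """Union of the first extraction mode (retain I frames) and the second
    extraction mode (sample every n-th candidate frame)."""
    first_set = []
    for idx, (frame, (frame_type, _ts)) in enumerate(zip(candidates, coding_info)):
        if frame_type == "I" or idx % n == 0:  # I frame OR interval hit
            first_set.append(frame)
    return first_set
```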
Alternatively, after the first video frame set is obtained, the first video frame set may be used as the first target video frame set.
Optionally, after the first video frame set is obtained, the pixel values or time stamps of the first video frames may be used to screen the first video frame set, so as to obtain the first target video frame set. That is, the first video frame set may be subjected to frame extraction processing again, and the resulting set is referred to as the first target video frame set. In a specific implementation, each first video frame in the first video frame set may be traversed, the currently traversed first video frame is compared with the previous first video frame, and whether to add the currently traversed first video frame to the first target video frame set is determined based on the comparison result. For convenience of description, the currently traversed first video frame may be referred to as the current first video frame. If the pixel gap between the current first video frame and the previous first video frame is greater than a preset pixel gap, or the time stamp gap between the current first video frame and the previous first video frame is greater than a first preset time stamp gap, the current first video frame may be added to the first target video frame set. After each first video frame in the first video frame set has been traversed, the final first target video frame set is obtained.
The pixel gap may refer to the average value of the pixel differences between two video frames. Based on this, the pixel gap between the current first video frame and the previous first video frame may be determined as follows: first, the pixel value of each pixel point in the current first video frame and the pixel value of each pixel point in the previous first video frame are obtained; then, the pixel gap between the two frames is determined based on these pixel values. Optionally, for any pixel point in the current first video frame, the pixel difference of that pixel point may be determined based on its pixel value and the pixel value of the corresponding pixel point in the previous first video frame; for example, the difference between the two pixel values may be used as the pixel difference. Considering that the difference may be positive or negative, to better characterize the pixel difference between two video frames, the absolute value of the difference may be taken as the pixel difference.
In this way, the pixel difference of each pixel point in the current first video frame can be determined, and the average value of these pixel differences can then be used as the pixel gap between the current first video frame and the previous first video frame. It will be appreciated that the resolution of each video frame in the video is the same, that is, each pixel point in the current first video frame has a corresponding pixel point in the previous first video frame.
For example, if the pixel values of all pixel points in the current first video frame are a1, a2, ..., an, and the pixel values of all pixel points in the previous first video frame are b1, b2, ..., bn, the computer device may calculate the pixel gap between the current first video frame and the previous first video frame by the following formula (1):
(|a1 − b1| + |a2 − b2| + ... + |an − bn|) / n    (1)
The time stamp gap between the current first video frame and the previous first video frame may be determined as follows: first, the time stamp of the current first video frame and the time stamp of the previous first video frame are obtained; then, the difference between the two time stamps is determined, and this difference may be used as the time stamp gap.
The preset pixel gap and the first preset time stamp gap may be preset; both are greater than 0, and their specific values are not limited. In this way, first video frames with a larger pixel gap or a larger time gap can be retained, the difference between the retained video frames is ensured, and the extraction of repeated frames is effectively avoided. For example, if frames are extracted by methods such as "frames at interval time stamps" or "uniform extraction with a set total number of frames", more repeated video frames may be extracted for static, unchanged video segments; by using the pixel gap or time stamp gap between two frames as in the embodiment of the application, the number of extracted repeated frames can be effectively reduced, thereby improving the frame extraction accuracy.
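For illustration, the sketch below implements this screening, with diff computing formula (1); it assumes frames are equally sized numpy arrays, compares each frame against the most recently kept frame, always keeps the first frame, and uses placeholder threshold values:

```python
import numpy as np

def diff(frame_a, frame_b):
    """Formula (1): the average absolute pixel difference of two frames."""
    return float(np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16)).mean())

def screen_first_set(first_frames, timestamps, th_d_pixel=10.0, th_d_time=1.0):
    """Keep a first video frame when its pixel gap OR time stamp gap to the
    most recently kept frame exceeds the corresponding preset gap."""
    target, prev = [], None
    for i, frame in enumerate(first_frames):
        if prev is None:
            target.append(frame)
            prev = i
        elif (diff(frame, first_frames[prev]) > th_d_pixel
              or timestamps[i] - timestamps[prev] > th_d_time):
            target.append(frame)
            prev = i
    return target
```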
S405, performing frame extraction processing on the candidate video frame set based on the coding information by adopting a second frame extraction strategy to obtain a second target video frame set.
In one implementation, each candidate video frame in the candidate video frame set may be traversed first, and a first pixel gap between the currently traversed candidate video frame and the previous candidate video frame may be obtained. The first pixel gap refers to the pixel gap between the current candidate video frame and the previous candidate video frame, and may be determined in the manner described for the pixel gap between the current first video frame and the previous first video frame in step S404, which is not repeated here. The candidate video frame set may then be filtered based on the first pixel gap and the frame type of the current candidate video frame to obtain a second video frame set.
The specific implementation of determining the second video frame set may be as follows.
If the frame type of the current candidate video frame is the I frame type and the first pixel gap is greater than a first preset gap, the current candidate video frame may be added to the second video frame set; if the frame type is the P frame type and the first pixel gap is greater than a second preset gap, the current candidate video frame may be added to the second video frame set; and if the frame type is the B frame type and the first pixel gap is greater than a third preset gap, the current candidate video frame may be added to the second video frame set.
The first preset gap is smaller than the second preset gap, and the second preset gap is smaller than the third preset gap. Considering that the criticality of video frames of the I frame type, P frame type, and B frame type in a video decreases in that order, when the preset gaps (namely the first, second, and third preset gaps) are used to screen candidate video frames, the preset gap corresponding to the more critical frame type can be set smaller, that is, the first preset gap is smaller than the second preset gap, and the second preset gap is smaller than the third preset gap. In this way, more critical video frames can be retained as far as possible, improving the frame extraction accuracy.
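A minimal sketch of this type-dependent screening follows, reusing the diff helper from the sketch above; the threshold defaults are placeholders chosen only to satisfy the ordering first < second < third, and keeping the first frame unconditionally is an assumption:

```python
def filter_by_frame_type(candidates, coding_info,
                         th_e_pixel_i=5.0, th_e_pixel_p=10.0, th_e_pixel_b=15.0):
    """Type-dependent screening: the more critical the frame type, the lower
    the pixel-gap threshold it must exceed (I < P < B)."""
    thresholds = {"I": th_e_pixel_i, "P": th_e_pixel_p, "B": th_e_pixel_b}
    second_set, second_info, prev = [], [], None
    for frame, (frame_type, ts) in zip(candidates, coding_info):
        if prev is None or diff(frame, prev) > thresholds.get(frame_type, th_e_pixel_b):
            second_set.append(frame)
            second_info.append((frame_type, ts))
            prev = frame
    return second_set, second_info
```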
Through this frame extraction mode, the extraction of repeated frames can be avoided through the pixel gap between video frames, and key frames can be extracted in combination with the frame types, so that the frame extraction accuracy is improved. For example, if frames are extracted from the video only by methods such as "frames at interval time stamps" or "uniform extraction with a set total number of frames", more repeated video frames may be extracted for static, unchanged video segments, and more key frames may be missed for video segments with more transitions or frequent scene changes. By combining the frame types with the pixel gap between two frames as in the embodiment of the application, the number of extracted repeated frames can be effectively reduced and the number of extracted key frames can be increased, thereby effectively improving the extraction accuracy.
Alternatively, after the second set of video frames is obtained, the second set of video frames may be taken as the second set of target video frames.
Optionally, after the second video frame set is obtained, the second video frame set may be further screened based on the time stamp of each second video frame therein, so as to obtain the second target video frame set. In a specific implementation, each second video frame in the second video frame set may be traversed, and a second pixel gap between the currently traversed second video frame and the previous second video frame may be obtained. The second pixel gap refers to the pixel gap between the current second video frame and the previous second video frame, and may be determined in the manner described for the pixel gap between the current first video frame and the previous first video frame in step S404, which is not repeated here. Then, the second video frame set may be screened based on the second pixel gap, the time stamp of the current second video frame, and the time stamp of the previous second video frame, so as to obtain the second target video frame set.
The specific implementation of determining the second target video frame set may be as follows.
First, the time stamp gap between the time stamp of the current second video frame and the time stamp of the previous second video frame may be determined; the time stamp gap here may be determined in the manner described in step S404, which is not repeated here. Whether to add the current second video frame to the second target video frame set is then determined based on the time stamp gap and the second pixel gap. In one embodiment, the current second video frame may be added to the second target video frame set if the time stamp gap is greater than a second preset time stamp gap and the second pixel gap is greater than a fourth preset gap; and the current second video frame may be added to the second target video frame set if the time stamp gap is less than or equal to the second preset time stamp gap and the second pixel gap is greater than a fifth preset gap.
Wherein the fifth preset gap is greater than the fourth preset gap. It will be appreciated that, to guarantee the difference between two video frames, if the time stamp gap between them is larger, the required pixel gap may be set smaller; correspondingly, if the time stamp gap between them is smaller, the required pixel gap may be set larger. The fifth preset gap is therefore set greater than the fourth preset gap.
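The sketch below illustrates this screening, again reusing the diff helper from the earlier sketch; the threshold values are placeholders that only need to satisfy th_f_pixel_2 > th_f_pixel_1:

```python
def screen_second_set(second_frames, timestamps,
                      th_f_time=1.0, th_f_pixel_1=8.0, th_f_pixel_2=20.0):
    """Keep a second video frame when its pixel gap to the most recently kept
    frame exceeds a threshold chosen by the time stamp gap: the smaller
    threshold when the frames are far apart in time, the larger one when
    they are close together."""
    target, prev = [], None
    for i, frame in enumerate(second_frames):
        if prev is None:
            target.append(frame)
            prev = i
            continue
        time_gap = timestamps[i] - timestamps[prev]
        threshold = th_f_pixel_1 if time_gap > th_f_time else th_f_pixel_2
        if diff(frame, second_frames[prev]) > threshold:
            target.append(frame)
            prev = i
    return target
```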
Through this frame extraction mode, the extraction of repeated frames can be effectively avoided through the pixel gap and time stamp gap between video frames, and the extraction of key frames can be increased, so that the frame extraction accuracy is improved. For example, if frames are extracted from the video only by methods such as "frames at interval time stamps" or "uniform extraction with a set total number of frames", more repeated video frames may be extracted for static, unchanged video segments, and more key frames may be missed for video segments with more transitions or frequent scene changes. By combining the pixel gap and the time stamp gap as in the embodiment of the application, the number of extracted repeated frames can be effectively reduced and the number of extracted key frames can be increased, thereby effectively improving the extraction accuracy.
S406, combining the first target video frame set and the second target video frame set to obtain a target video frame set.
In one implementation, considering that the first target video frame set and the second target video frame set may contain the same video frames, the merging process may be a union process, that is, the union of the first target video frame set and the second target video frame set may be used as the target video frame set.
For example, if the first target set of video frames includes video frame 1, video frame 2, video frame 3, video frame 4, the second target set of video frames includes video frame 2, video frame 4, video frame 5, video frame 6, the target set of video frames includes video frame 1, video frame 2, video frame 3, video frame 4, video frame 5, video frame 6.
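For illustration, assuming each frame can be identified by its time stamp, the union can be sketched as follows; deduplicating by time stamp and sorting into play order are assumptions of this sketch:

```python
def merge_target_sets(first_target, second_target, timestamp_of):
    """Order-preserving union of the two target sets, deduplicated by time
    stamp; timestamp_of maps a frame to its play time."""
    merged = {timestamp_of(f): f for f in first_target}
    merged.update((timestamp_of(f), f) for f in second_target)
    return [merged[t] for t in sorted(merged)]
```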
In one implementation, after the target video frame set is acquired, quality control may be performed on the target video frame set to control the image compression quality of each target video frame in the target video frame set; for example, image compression may be performed on each target video frame to reduce the storage space occupied by the target video frame set. In a specific implementation, a required quality parameter for image compression may be preset, so that image compression processing is performed on the target video frame set based on the quality parameter, thereby obtaining the target video frame set after image compression processing. The image compression processing may be any of JPEG compression, WebP compression, and the like, which is not limited.
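As an example, JPEG compression with a preset quality parameter might be sketched as follows with OpenCV; the quality value 80 is illustrative:

```python
import cv2

def compress_frames(frames_bgr, quality=80):
    """JPEG-compress each target video frame with a preset quality parameter."""
    params = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
    return [cv2.imencode(".jpg", f, params)[1].tobytes() for f in frames_bgr]
```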
For ease of understanding, the video frame extraction method according to the embodiment of the present application is further described below with reference to fig. 5. In the description, the frame extraction processing of each stage is understood as the processing of one module, that is, the video frame extraction method may be divided into the processing of the modules shown in fig. 5. It can be understood that some modules output every video frame that flows into them, while other modules filter out some of the video frames input to them, so as to implement the frame extraction processing of the video. The functions of each module are described below.
The A module may be a scaling module, which may be used to control the resolution of each initial video frame in the initial video frame set. The configuration parameters of the A module may include the resolution of the output video frames (i.e., the target resolution). The processing of the A module may be to process each initial video frame in the initial video frame set based on the target resolution and a scaling algorithm, thereby obtaining the scaled initial video frame set; for example, the scaling algorithm may be bicubic interpolation or the like. The input of the A module is the initial video frame set, and the output is the scaled initial video frame set.
It should be noted that the configuration parameters mentioned in the embodiment of the present application may be understood as parameters that need to be preset (configured). Different parameter configurations may exist for different business scenarios (e.g., auditing, content understanding, etc.).
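For illustration, such a configuration might be grouped per business scenario as a single parameter set; the parameter names follow the module descriptions below, and every value here is a placeholder rather than a recommended setting:

```python
# A hypothetical configuration for one business scenario; all values are
# placeholders. Parameter names follow the module descriptions below.
FRAME_EXTRACTION_CONFIG = {
    "target_resolution": (720, 1280),          # A module
    "th_Y1": 16, "th_Y2": 235,                 # B module luminance bounds
    "th_black": 30, "th_bp": 0.32,             # B module black-ratio check
    "th_sample": 3,                            # C module sampling interval
    "th_D_pixel": 10.0, "th_D_time": 1.0,      # D module gaps
    "th_E_pixel_I": 5.0, "th_E_pixel_P": 10.0, # E module per-frame-type gaps
    "th_E_pixel_B": 15.0,
    "th_F_pixel_1": 8.0, "th_F_pixel_2": 20.0, # F module gaps
    "th_F_time": 1.0,
    "jpeg_quality": 80,                        # G module compression quality
}
```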
The B module may be configured to perform brightness filtering processing on each initial video frame in the initial video frame set to obtain the candidate video frame set, that is, it may be used to filter out video frames that are too bright, too dark, or have an excessively large black duty ratio. It will be appreciated that the input of the B module is each initial video frame in the initial video frame set, and the output is the candidate video frame set.
The processing procedure of the B module may be described as follows: for each input initial video frame, the luminance component Y (i.e., the luminance value) of each pixel point in the initial video frame is calculated, and the minimum and maximum of all the luminance values are counted and recorded as Ymin and Ymax, respectively.
The proportion of pixel points whose luminance value is smaller than the preset threshold (th_black) among all pixel points (i.e., the pixel proportion) may also be calculated, and this pixel proportion is denoted as black_percentage.
The condition by which the B module screens an input initial video frame may be the following expression; an initial video frame satisfying the expression is added to the candidate video frame set:
(Ymin>th_Y1)‖(Ymax<th_Y2)‖(black_percentage>th_bp)
The configuration parameters include th_Y1 (0-255), th_Y2 (0-255), th_black (0-255), and th_bp (0-1). Here, th_Y1 represents the preset minimum brightness value, th_Y2 represents the preset maximum brightness value, th_black represents the preset threshold, and th_bp represents the preset ratio; "Ymin > th_Y1" corresponds to the first preset condition above, "Ymax < th_Y2" to the second preset condition, and "black_percentage > th_bp" to the third preset condition.
The B module may then split into two parallel branches, and the final output frame set may contain the union of the output results of the two branches. The first branch may comprise the C module and the D module, and the second branch may comprise the E module and the F module. The first branch may be understood as the process of performing frame extraction processing on the candidate video frame set based on the coding information by using the first frame extraction strategy to obtain the first target video frame set, and the second branch may be understood as the process of performing frame extraction processing on the candidate video frame set based on the coding information by using the second frame extraction strategy to obtain the second target video frame set. Based on this, the output results of the two branches are the first target video frame set and the second target video frame set, respectively; that is, the union of the first target video frame set and the second target video frame set may be used as the final output frame set (i.e., the target video frame set).
The modules of each of these two branches are described in detail below.
For the C module:
Assume that the video frame currently input by the C module is Fn, where Fn may be understood as one candidate video frame in the candidate video frame set, for example the n-th frame of the candidate video frame set. The video frame Fn passes the C module if it satisfies the following condition:
(the frame type is I frame) ‖ (n mod th_sample == 0)
Mod is a modulo operation.
The configuration parameters include th_sample (value >= 1), where th_sample represents the target interval (i.e., N).
Here, "the frame type is I frame" indicates the first extraction mode, and "n mod th_sample == 0" indicates the second extraction mode.
As previously described, the C module can be understood as a selector that retains I frames and video frames with an interval of N.
For the D module:
Assume that the video frame currently input by the D module is Fn with corresponding time stamp Tn, and the video frame that most recently passed through the D module is Fn-1 with corresponding time stamp Tn-1. Here, Fn may be understood as the currently traversed first video frame in the first video frame set (i.e., the current first video frame), and Fn-1 may be understood as the first video frame previous to the current first video frame. The video frame Fn passes the D module if it satisfies the following condition:
(Diff(Fn − Fn-1) > th_D_pixel) ‖ (Tn − Tn-1 > th_D_time)
Where Diff represents the average of the pixel differences of two video frames (the current first video frame and the previous first video frame), i.e., the pixel gap.
The configuration parameters include th_D_pixel (value > 0) and th_D_time (value > 0). th_D_pixel represents the preset pixel gap, and th_D_time represents the first preset time stamp gap. "Diff(Fn − Fn-1) > th_D_pixel" indicates that the pixel gap between the current first video frame and the previous first video frame is greater than the preset pixel gap, and "Tn − Tn-1 > th_D_time" indicates that the time stamp gap between the current first video frame and the previous first video frame is greater than the first preset time stamp gap.
As previously described, the D module may be understood as a selector based on a frame difference determination or an adjacent-frame duration determination. The frame difference here may refer to the pixel gap between the current first video frame and the previous first video frame, and the adjacent-frame duration may refer to the time stamp gap between them.
For the E module:
Assume that the video frame currently input by the E module is Fn and the video frame that most recently passed through the E module is Fn-1, where Fn may be understood as the currently traversed candidate video frame in the candidate video frame set (i.e., the current candidate video frame), and Fn-1 may be understood as the candidate video frame previous to the current candidate video frame. The video frame Fn passes the E module if it satisfies the following condition:
(Diff(Fn − Fn-1) > th_E_pixel_I if Fn's frame type is I) ‖
(Diff(Fn − Fn-1) > th_E_pixel_P if Fn's frame type is P) ‖
(Diff(Fn − Fn-1) > th_E_pixel_B if Fn's frame type is B)
Where Diff represents the average of the pixel differences of two video frames (the current candidate video frame and the previous candidate video frame), i.e., the first pixel gap.
The configuration parameters include th_E_pixel_I (value > 0), th_E_pixel_P (value > 0), and th_E_pixel_B (value > 0). th_E_pixel_I represents the first preset gap, th_E_pixel_P represents the second preset gap, and th_E_pixel_B represents the third preset gap. "Diff(Fn − Fn-1) > th_E_pixel_I if Fn's frame type is I" indicates that the frame type of the current candidate video frame is the I frame type and the first pixel gap is greater than the first preset gap; "Diff(Fn − Fn-1) > th_E_pixel_P if Fn's frame type is P" indicates that the frame type of the current candidate video frame is the P frame type and the first pixel gap is greater than the second preset gap; and "Diff(Fn − Fn-1) > th_E_pixel_B if Fn's frame type is B" indicates that the frame type of the current candidate video frame is the B frame type and the first pixel gap is greater than the third preset gap.
As previously mentioned, the E-module can be understood as a selector that considers the frame type.
For the F module:
Assume that the video frame currently input by the F module is Fn with corresponding time stamp Tn, and the video frame that most recently passed through the F module is Fn-1 with corresponding time stamp Tn-1. Here, Fn may be understood as the currently traversed second video frame in the second video frame set (i.e., the current second video frame), and Fn-1 may be understood as the second video frame previous to the current second video frame. The video frame Fn passes the F module if it satisfies the following condition:
(Diff(Fn − Fn-1) > th_F_pixel_1 if Tn − Tn-1 > th_F_time) ‖
(Diff(Fn − Fn-1) > th_F_pixel_2 if Tn − Tn-1 <= th_F_time)
Wherein Diff represents the average of the pixel differences of two video frames (the current second video frame and the previous second video frame), i.e., the second pixel gap.
The configuration parameters include th_F_pixel_1 (value > 0), th_F_pixel_2 (value > 0), and th_F_time (value > 0). th_F_pixel_1 represents the fourth preset gap, th_F_pixel_2 represents the fifth preset gap, and th_F_time represents the second preset time stamp gap. "Diff(Fn − Fn-1) > th_F_pixel_1 if Tn − Tn-1 > th_F_time" indicates that the time stamp gap is greater than the second preset time stamp gap and the second pixel gap is greater than the fourth preset gap; "Diff(Fn − Fn-1) > th_F_pixel_2 if Tn − Tn-1 <= th_F_time" indicates that the time stamp gap is less than or equal to the second preset time stamp gap and the second pixel gap is greater than the fifth preset gap.
As previously described, the F module may be understood as a selector for frame difference determination in combination with adjacent frame durations. Here, the adjacent frame duration may refer to a time stamp difference between the current second video frame and the previous second video frame, and the frame difference may refer to a pixel difference between the current second video frame and the previous second video frame.
The G module may be a quality control module, which may be used to control the image compression quality of each target video frame in the target video frame set. The configuration parameters of the G module may include the quality parameter for image compression of the output target video frames. The processing of the G module may be to process each target video frame in the target video frame set based on the quality parameter and the compression mode, thereby obtaining the target video frame set after image compression processing; for example, the compression mode may be JPEG compression, WebP compression, and the like.
In one implementation manner, the target video frame set extracted by the embodiment of the application can be applied to various inference services, wherein the inference services can be cloud AI (advanced technology) inference services, such as auditing, content understanding and the like, and compared with the frame extraction modes of 'interval time frames', 'uniform frame extraction of setting total frames', and the like, the extraction of repeated frames can be effectively avoided, and the extraction of key frames is improved, so that the extraction accuracy is improved. In addition, when the video frame extraction is performed by adopting frame extraction modes such as interval time frames, uniform frame extraction of total frame setting and the like, if the frame extraction quantity is small, a part of key frames required by the reasoning service can be lost, and if the frame extraction quantity is increased to reserve more key frames, the calculation cost of the reasoning service can be obviously increased, and if the frame extraction quantity is increased, the calculation accuracy of the reasoning service can be reduced due to invalid frames or low-quality frames. The embodiment of the application can realize the extraction of key frames by using as few video frames as possible and reduce the extraction of repeated frames. As can be seen from the above detailed description of the video frame extraction method, the embodiment of the application can provide a more general, rapid and flexible-parameter adaptive video frame extraction method for various inference services, the method can combine video coding information and multi-level serial-parallel image processing algorithm, keep light calculation, and effectively improve frame extraction accuracy and frame extraction video quality.
In the embodiment of the application, an initial video frame set to be processed can be obtained, and brightness filtering processing is performed on each initial video frame in the initial video frame set to obtain a candidate video frame set, so that the video frame quality of each candidate video frame in the candidate video frame set can be ensured. Further, the coding information corresponding to the candidate video frame set can be obtained; the candidate video frame set is subjected to frame extraction processing based on the coding information by adopting the first frame extraction strategy to obtain the first target video frame set, and by adopting the second frame extraction strategy to obtain the second target video frame set, so that the first target video frame set and the second target video frame set can be combined to obtain the target video frame set. In this way, the coding information of the video can be combined with a multi-level serial-parallel image processing algorithm, keeping the computation light while effectively improving the frame extraction accuracy and the quality of the extracted video frames.
Fig. 6 is a schematic structural diagram of a video frame extracting apparatus according to an embodiment of the present application. The video frame extraction device described in this embodiment includes:
A first obtaining unit 601, configured to obtain an initial video frame set to be processed, where the initial video frame set includes a plurality of initial video frames, and the plurality of initial video frames are consecutive video frames in one video;
A filtering unit 602, configured to perform brightness filtering processing on each initial video frame in the initial video frame set, so as to obtain a candidate video frame set;
A second obtaining unit 603, configured to obtain encoding information corresponding to the candidate video frame set, where the encoding information includes frame types and time stamps corresponding to each candidate video frame in the candidate video frame set;
and the frame extraction unit 604 is configured to perform frame extraction processing on the candidate video frame set based on the coding information, so as to obtain a target video frame set.
In one implementation, the frame extraction unit 604 is specifically configured to:
performing frame extraction processing on the candidate video frame set based on the coding information by adopting a first frame extraction strategy to obtain a first target video frame set;
Performing frame extraction processing on the candidate video frame set based on the coding information by adopting a second frame extraction strategy to obtain a second target video frame set;
and merging the first target video frame set and the second target video frame set to obtain a target video frame set.
In one implementation, the frame extraction unit 604 is specifically configured to:
Extracting the candidate video frame set according to a target extraction mode to obtain a first video frame set, wherein the target extraction mode comprises one or more of a first extraction mode and a second extraction mode, the first extraction mode is to extract candidate video frames of a target frame type from the candidate video frame set, and the second extraction mode is to sample the candidate video frame set at a target interval;
And traversing each first video frame in the first video frame set, and adding the current first video frame into the first target video frame set if the pixel difference between the current traversed first video frame and the previous first video frame is larger than a preset pixel difference or the time stamp difference between the current first video frame and the previous first video frame is larger than a first preset time stamp difference.
In one implementation, the frame extraction unit 604 is specifically configured to:
traversing each candidate video frame in the candidate video frame set, and acquiring a first pixel difference between the current traversed candidate video frame and the previous candidate video frame;
Filtering the candidate video frame set based on the first pixel gap and the frame type of the current candidate video frame to obtain a second video frame set;
Traversing each second video frame in the second video frame set, and acquiring a second pixel difference between the current traversed second video frame and the previous second video frame;
and screening the second video frame set based on the second pixel gap, the timestamp of the current second video frame and the timestamp of the previous second video frame to obtain a second target video frame set.
In one implementation, the frame extraction unit 604 is specifically configured to:
if the frame type of the current candidate video frame is an I frame type and the first pixel difference is greater than a first preset difference, adding the current candidate video frame into a second video frame set;
If the frame type of the current candidate video frame is a P frame type and the first pixel difference is greater than a second preset difference, adding the current candidate video frame into a second video frame set;
if the frame type of the current candidate video frame is B frame type and the first pixel difference is larger than a third preset difference, adding the current candidate video frame into a second video frame set;
the first preset gap is smaller than the second preset gap, and the second preset gap is smaller than the third preset gap.
In one implementation, the frame extraction unit 604 is specifically configured to:
Determining a timestamp gap between a timestamp of the current second video frame and a timestamp of the previous second video frame;
if the time stamp difference is greater than a second preset time stamp difference and the second pixel difference is greater than a fourth preset difference, adding the current second video frame to a second target video frame set;
if the time stamp difference is smaller than or equal to the second preset time stamp difference and the second pixel difference is larger than a fifth preset difference, adding the current second video frame into a second target video frame set;
Wherein the fifth preset gap is greater than the fourth preset gap.
In one implementation, the filtering unit 602 is specifically configured to:
for any initial video frame in the initial video frame set, acquiring a brightness value of each pixel point in the initial video frame;
if the brightness values of the pixel points in the initial video frame meet a preset condition, adding the initial video frame to the candidate video frame set;
wherein the preset condition includes any one of a first preset condition, a second preset condition, and a third preset condition; the first preset condition is that the minimum brightness value among the brightness values of the pixel points is greater than a preset minimum brightness value, the second preset condition is that the maximum brightness value among the brightness values of the pixel points is smaller than a preset maximum brightness value, and the third preset condition is that a pixel proportion corresponding to the initial video frame is greater than a preset proportion, the pixel proportion being the proportion, among all pixel points of the initial video frame, of pixel points whose brightness values are smaller than a preset threshold value.
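Read here as "any one of the three conditions suffices", the brightness screening may be sketched as below; all threshold values are illustrative placeholders, and the Frame structure is the one assumed in the earlier sketches (its pixels field holding per-pixel brightness values such as a Y plane):

```python
import numpy as np

# Illustrative placeholder thresholds; the application leaves the exact values open.
PRESET_MIN_LUMA = 10      # first condition: the darkest pixel must exceed this
PRESET_MAX_LUMA = 245     # second condition: the brightest pixel must stay below this
PRESET_THRESHOLD = 128    # brightness threshold used by the pixel-proportion condition
PRESET_PROPORTION = 0.05  # third condition: required proportion of below-threshold pixels

def passes_brightness_filter(luma: np.ndarray) -> bool:
    """luma: H x W array of per-pixel brightness values."""
    below_ratio = float(np.mean(luma < PRESET_THRESHOLD))  # pixel proportion
    return (luma.min() > PRESET_MIN_LUMA         # first preset condition
            or luma.max() < PRESET_MAX_LUMA      # second preset condition
            or below_ratio > PRESET_PROPORTION)  # third preset condition

def brightness_filter(initial_frames):
    # Keep the initial video frames whose brightness values meet a preset condition.
    return [f for f in initial_frames if passes_brightness_filter(f.pixels)]
```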
It will be appreciated that the division of the units in the embodiments of the present application is illustrative and is merely a division by logical function; other division manners may be used in actual implementation. The functional units in the embodiments of the application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or as software functional units.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the application. The computer device includes a processor 701 and a memory 702. Optionally, the computer device may further include a network interface 703. Data may be exchanged between the processor 701, the memory 702, and the network interface 703.
The processor 701 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 702 may include read-only memory and random access memory, and provides program instructions and data to the processor 701. A portion of the memory 702 may also include non-volatile random access memory. The processor 701, when invoking the program instructions, is configured to execute:
acquiring an initial video frame set to be processed, wherein the initial video frame set comprises a plurality of initial video frames, and the plurality of initial video frames are continuous video frames in one video;
performing brightness filtering processing on each initial video frame in the initial video frame set to obtain a candidate video frame set;
acquiring coding information corresponding to the candidate video frame set, wherein the coding information comprises frame types and time stamps corresponding to all candidate video frames in the candidate video frame set;
and performing frame extraction processing on the candidate video frame set based on the coding information to obtain a target video frame set.
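By way of illustration only, the coding information mentioned above (a frame type and a timestamp per candidate video frame) could in practice be collected with the third-party PyAV library, an FFmpeg binding; this tooling is an assumption of the sketch, not part of the described method:

```python
import av  # PyAV (third-party FFmpeg binding; illustrative choice)

def coding_info(path: str):
    """Yield (frame_type, timestamp_in_seconds) for each decoded video frame."""
    with av.open(path) as container:
        stream = container.streams.video[0]
        for frame in container.decode(stream):
            # pict_type names the coded picture type ("I", "P" or "B");
            # frame.time is the presentation timestamp in seconds.
            yield frame.pict_type.name, frame.time
```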
In one implementation, the processor 701 is specifically configured to:
performing frame extraction processing on the candidate video frame set based on the coding information by adopting a first frame extraction strategy to obtain a first target video frame set;
performing frame extraction processing on the candidate video frame set based on the coding information by adopting a second frame extraction strategy to obtain a second target video frame set;
and merging the first target video frame set and the second target video frame set to obtain a target video frame set.
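Tying the pieces together, a minimal end-to-end sketch under the same assumptions as the earlier snippets (the merge is taken here to be a union deduplicated by timestamp, one plausible reading of "merging"):

```python
# Assumes brightness_filter, first_strategy, first_pass and second_pass
# from the earlier sketches in this description.
def merge_target_sets(first_target, second_target):
    """Union of the two target sets, deduplicated and ordered by timestamp (assumption)."""
    merged = {f.timestamp: f for f in first_target}
    merged.update((f.timestamp, f) for f in second_target)
    return [merged[t] for t in sorted(merged)]

def extract_target_frames(initial_frames):
    """End-to-end sketch: brightness screening, two strategies, then merging."""
    candidates = brightness_filter(initial_frames)
    first_target = first_strategy(candidates)
    second_target = second_pass(first_pass(candidates))
    return merge_target_sets(first_target, second_target)
```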
In one implementation, the processor 701 is specifically configured to:
extracting frames from the candidate video frame set according to a target extraction mode to obtain a first video frame set, wherein the target extraction mode comprises one or more of a first extraction mode and a second extraction mode, the first extraction mode being to extract candidate video frames of a target frame type from the candidate video frame set, and the second extraction mode being to sample the candidate video frame set at a target interval;
and traversing each first video frame in the first video frame set, and adding the currently traversed first video frame to the first target video frame set if the pixel difference between the currently traversed first video frame and the previous first video frame is greater than a preset pixel difference, or the timestamp difference between the current first video frame and the previous first video frame is greater than a first preset timestamp difference.
In one implementation, the processor 701 is specifically configured to:
traversing each candidate video frame in the candidate video frame set, and acquiring a first pixel difference between the currently traversed candidate video frame and the previous candidate video frame;
filtering the candidate video frame set based on the first pixel difference and the frame type of the current candidate video frame to obtain a second video frame set;
traversing each second video frame in the second video frame set, and acquiring a second pixel difference between the currently traversed second video frame and the previous second video frame;
and screening the second video frame set based on the second pixel difference, the timestamp of the current second video frame, and the timestamp of the previous second video frame to obtain a second target video frame set.
In one implementation, the processor 701 is specifically configured to:
if the frame type of the current candidate video frame is an I frame type and the first pixel difference is greater than a first preset difference, adding the current candidate video frame to the second video frame set;
if the frame type of the current candidate video frame is a P frame type and the first pixel difference is greater than a second preset difference, adding the current candidate video frame to the second video frame set;
if the frame type of the current candidate video frame is a B frame type and the first pixel difference is greater than a third preset difference, adding the current candidate video frame to the second video frame set;
wherein the first preset difference is smaller than the second preset difference, and the second preset difference is smaller than the third preset difference.
In one implementation, the processor 701 is specifically configured to:
determining a timestamp difference between the timestamp of the current second video frame and the timestamp of the previous second video frame;
if the timestamp difference is greater than a second preset timestamp difference and the second pixel difference is greater than a fourth preset difference, adding the current second video frame to the second target video frame set;
if the timestamp difference is smaller than or equal to the second preset timestamp difference and the second pixel difference is greater than a fifth preset difference, adding the current second video frame to the second target video frame set;
wherein the fifth preset difference is greater than the fourth preset difference.
In one implementation, the processor 701 is specifically configured to:
for any initial video frame in the initial video frame set, acquiring a brightness value of each pixel point in the initial video frame;
if the brightness values of the pixel points in the initial video frame meet a preset condition, adding the initial video frame to the candidate video frame set;
wherein the preset condition includes any one of a first preset condition, a second preset condition, and a third preset condition; the first preset condition is that the minimum brightness value among the brightness values of the pixel points is greater than a preset minimum brightness value, the second preset condition is that the maximum brightness value among the brightness values of the pixel points is smaller than a preset maximum brightness value, and the third preset condition is that a pixel proportion corresponding to the initial video frame is greater than a preset proportion, the pixel proportion being the proportion, among all pixel points of the initial video frame, of pixel points whose brightness values are smaller than a preset threshold value.
The embodiment of the application also provides a computer storage medium storing program instructions which, when executed, may perform some or all of the steps of the video frame extraction method in the embodiments corresponding to fig. 2 or fig. 4.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of combined actions; however, those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, where the storage medium may include a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
Embodiments of the present application also provide a computer program product or computer program comprising program instructions which, when executed by a processor, perform some or all of the steps of the above methods. For example, the program instructions are stored in a computer-readable storage medium; a processor of the computer device reads the program instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps performed in the method embodiments described above.
The foregoing describes in detail a video frame extraction method, apparatus, computer device, and medium provided in the embodiments of the present application. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is intended only to aid in understanding the method and core concept of the present application. Meanwhile, those skilled in the art may, following the concept of the present application, make changes to the specific implementations and the scope of application; in view of the above, the content of this specification should not be construed as limiting the present application.