
CN120201241A - Video generation method, device, electronic device and storage medium - Google Patents

Video generation method, device, electronic device and storage medium

Info

Publication number
CN120201241A
CN120201241A
Authority
CN
China
Prior art keywords
video
target object
video frames
target
scaling
Prior art date
Legal status
Pending
Application number
CN202510526725.2A
Other languages
Chinese (zh)
Inventor
王国升
陈虹萍
姚恺頔
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu com Times Technology Beijing Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu com Times Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, Baidu com Times Technology Beijing Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202510526725.2A priority Critical patent/CN120201241A/en
Publication of CN120201241A publication Critical patent/CN120201241A/en
Pending legal-status Critical Current

Classifications

    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04N  PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00  Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40  Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N21/43  Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N21/44  Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008  Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4402  Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440263  Reformatting by altering the spatial resolution, e.g. for displaying on a connected PDA

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video generation method and relates to the field of artificial intelligence, in particular to computer vision, intelligent editing, and related technical fields. The method comprises: acquiring a plurality of video frames containing a target object, the plurality of video frames indicating spatial position information of the target object along a time sequence; acquiring a plurality of display parameters of the target object from the plurality of video frames, the display parameters characterizing the size relationship between the target object and the video frames; determining a plurality of scaling coefficients for the respective video frames based on a first variation trend of the plurality of display parameters along the time sequence; and scaling the target object in the plurality of video frames based on the plurality of scaling coefficients to generate a target video. The disclosure also provides a video generating apparatus, an electronic device, and a storage medium.

Description

Video generation method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of computer vision, intelligent editing, and the like, and may be applied to video generation scenarios. More particularly, the present disclosure provides a video generation method, apparatus, electronic device, storage medium, and program product.
Background
It is common to film an object in motion and then edit and share the video; for example, people share their own sporting moments on social media.
During movement, the position of the object changes continuously, and so does its distance from the video capture device. If the object is far from the video capture device, it appears too small in the picture and the target object is difficult to highlight as the subject; conversely, if it is too close, it appears too large and the overall composition is unbalanced. Because the size of the object in the picture of the video capture device changes continuously during movement, enlarging or reducing the picture by a fixed ratio at the editing stage causes the picture to jitter during playback.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, electronic device, storage medium, and program product.
According to one aspect of the disclosure, a video generation method is provided. The method comprises: acquiring a plurality of video frames containing a target object, the plurality of video frames indicating spatial position information of the target object along a time sequence; acquiring a plurality of display parameters of the target object from the plurality of video frames, the display parameters characterizing the size relationship between the target object and the video frames; determining a plurality of scaling coefficients for the respective video frames based on a first variation trend of the plurality of display parameters along the time sequence; and scaling the target object in the plurality of video frames based on the plurality of scaling coefficients to generate a target video.
According to another aspect of the present disclosure, a video generating apparatus is provided, including: a video frame unit configured to acquire a plurality of video frames containing a target object, the plurality of video frames indicating spatial position information of the target object along a time sequence; a display parameter unit configured to acquire a plurality of display parameters of the target object from the plurality of video frames, the display parameters characterizing the size relationship between the target object and the video frames; a scaling coefficient unit configured to determine a plurality of scaling coefficients for the respective video frames based on a first variation trend of the plurality of display parameters along the time sequence; and a video generation unit configured to scale the target object in the plurality of video frames based on the plurality of scaling coefficients to generate a target video.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a video generation method according to one embodiment of the present disclosure;
FIG. 3 is a schematic illustration of a first trend of variation according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of scaling based on the display height of a target object according to one embodiment of the present disclosure;
FIG. 5 is a schematic illustration of a second trend of variation according to one embodiment of the present disclosure;
FIGS. 6A-6C are schematic diagrams of a cropping process according to one embodiment of the disclosure;
FIG. 7 is a flowchart of a video generation method according to another embodiment of the present disclosure;
FIG. 8 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure; and
Fig. 9 is a block diagram of an electronic device to which a video generation method may be applied according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a video capture device 101, a network 102, and a server 103. The network 102 is a medium used to provide a communication link between the video capture device 101 and the server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and the like.
The video capture device 101 may be a variety of devices having a camera and supporting image or video capture, including but not limited to cameras, smartphones, tablets, and drones, among others.
The server 103 may be a server providing various services, such as a server processing a video stream photographed by the video capture device 101. The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, network service, and middleware service.
Referring to fig. 1, taking skiing as an example, a skier may be filmed using the video capture device 101. For example, during skiing, a smartphone may be used to track and film the skier: pressing the record button in the phone's video recording function captures footage of the skier.
For example, in addition to using a smartphone, one or more cameras may be installed at the skiing venue to capture video of the skiers present there. In one approach, a skier wears a dedicated device; the server 103 controls a camera to film the skier once the device is detected, and stops filming when the device is no longer detected (for example, when it moves beyond the detection range). In another approach, a skier wears clothing with a specific identifier; after detecting the identifier in the video stream captured by a camera, the server 103 controls the camera to track and film the skier until the target can no longer be tracked. In a third approach, skiers need neither a dedicated device nor identifying clothing: one or more cameras film all skiers appearing at the skiing venue, and the server 103 identifies and tracks each of them with a multi-target tracking algorithm to obtain video clips containing a given skier.
Multi-target tracking is a technique for tracking specific targets across consecutive video frames. It can track one or more targets in frames captured continuously by a single camera, or in frames captured simultaneously by multiple cameras with overlapping fields of view.
It should be noted that although skiing is used as an example above, the present disclosure is not limited thereto. The video generation method, apparatus, electronic device, storage medium, and program product provided by the embodiments of the present disclosure may be applied to any scene in which, because the distance to the video capture device 101 changes, the target object appears too small in some video frames or the frames are compositionally unbalanced. For example, the method may be applied to sports scenes filmed with a person as the target object, such as running, marathon, or cycling; to amusement park scenes filmed with a person as the target object, such as roller coaster rides; to animal activity scenes filmed with an animal as the target object; and to scenes filmed with an inanimate object as the target object, such as racing scenes filming vehicles.
It should be noted that, the video generating method provided by the embodiments of the present disclosure may be generally performed by the server 103. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may be generally provided in the server 103. The video generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the video capture device 101 and/or the server 103. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 103 and is capable of communicating with the video capturing apparatus 101 and/or the server 103.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and application of the user information involved (including but not limited to personal information, image information, and device information such as location information) and of the data involved (including but not limited to data used for analysis, stored data, and displayed data) are all authorized by the user or fully authorized by all parties, comply with relevant laws, regulations, and standards, take the necessary security measures, do not violate public order and good morals, and provide corresponding operation entries for the user to grant or refuse authorization.
Fig. 2 is a flowchart of a video generation method according to one embodiment of the present disclosure. Fig. 3 is a schematic diagram of a first trend of variation according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210-S240.
In operation S210, a plurality of video frames containing a target object are acquired, the plurality of video frames indicating spatial position information of the target object along a time sequence.
For example, a video or a plurality of video frames actively uploaded by a user may be acquired, as may video from a video capture device. The video may be a complete recording, or may be obtained by loading a video stream. For example, a plurality of video frames containing the target object may be identified and extracted from the video by a target detection algorithm such as a kernel correlation filter (KCF) algorithm, a YOLO detection algorithm, Faster R-CNN, or SSD.
During the movement of the target object, its spatial position changes over time. From the changes of the target object between frames, spatial position information of the target object along the time sequence can be obtained. The spatial position information may comprise the spatial position trajectory of the target object over the time sequence.
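As an illustrative sketch only (not part of the disclosure), the frame-acquisition step might look like the following Python fragment, where detect_target() is a hypothetical wrapper around one of the detectors named above:

```python
# Minimal sketch: collect per-frame detections of a tracked target object.
# `detect_target` is a hypothetical wrapper around any detector named above
# (e.g., YOLO or Faster R-CNN); it returns (track_id, x, y, w, h) or None.
import cv2

def collect_target_frames(video_path, detect_target, target_id):
    """Return (frame, box) pairs for frames containing the target object."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        det = detect_target(frame)
        if det is not None and det[0] == target_id:
            frames.append((frame, det[1:]))  # box carries the spatial position
    cap.release()
    return frames
```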
In operation S220, a plurality of display parameters of the target object are acquired from the plurality of video frames, the display parameters characterizing the size relationship between the target object and the video frames.
In operation S230, a plurality of scaling coefficients for the plurality of video frames are determined based on a first variation trend of the plurality of display parameters along the time sequence.
Illustratively, as shown in fig. 3, each of times t1, t2, t3, and t4 corresponds to a video frame. A detection frame of the target object may be generated in each video frame using a target detection algorithm, and may include a detection frame identifier and detection frame coordinates. The detection frame identifier is used to continuously track the same target object, and the detection frame coordinates represent the position of the target object in the video frame.
For example, display parameters of the target object may be extracted from each video frame. The display parameters may include parameters positively correlated with the display size of the target object, such as the detection frame height, the face area, the torso area, or the display scale, each of which changes with the display size. The display size characterizes the size relationship and can be determined, for example, by the ratio of the number of pixels in the target object region to the total number of pixels in the video frame, or by the ratio of the detection frame area to the total area of the video frame. It will be appreciated that the size relationship may also be characterized in other ways, without specific limitation here.
In fig. 3, the vertical axis represents the magnitude of the display parameter, and the horizontal axis represents the time sequence. Referring to fig. 3, along the time sequence of times t1, t2, t3, and t4, a first variation trend of the plurality of display parameters may be obtained. The first variation trend reflects how the size relationship of the target object in the video frames changes over time.
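For illustration, one possible display parameter, the ratio of detection-frame height to frame height, could be computed per frame as sketched below (the helper name and the choice of parameter are assumptions, not requirements of the disclosure):

```python
# Sketch: one possible display parameter per frame -- the ratio of the
# detection-frame height to the frame height, which is positively
# correlated with the displayed size of the target object.
def display_parameters(frames_with_boxes):
    params = []
    for frame, (x, y, w, h) in frames_with_boxes:
        frame_h = frame.shape[0]      # frame height in pixels
        params.append(h / frame_h)    # size relationship in [0, 1]
    return params                     # ordered along the time sequence
```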
For example, an initial scaling coefficient may be assigned to each video frame, and the initial scaling coefficients of each pair of adjacent video frames may be adjusted according to the rate of change between their display parameters, so that the coefficients have a substantially uniform rate of change, finally yielding the plurality of scaling coefficients. Alternatively, different historical variation trends may be collected in advance and mapped to corresponding sets of scaling coefficients; the first variation trend in operation S230 is then matched against each collected historical trend, and the scaling coefficients of the matched historical trend are used as the result of operation S230. Alternatively, the first variation trend may be smoothed, and the plurality of scaling coefficients obtained based on the smoothed trend.
In operation S240, the target object in the plurality of video frames is scaled based on the plurality of scaling coefficients, respectively, to generate a target video. For example, the scaling may be applied to the whole video frame, or the target object may be extracted from the video frame and scaled.
Different video frames may have the same or different scaling coefficients. For example, in fig. 3, the video frame at time t1 may have a reduction coefficient and the video frame at time t4 an enlargement coefficient, or both video frames may have reduction coefficients of different values.
For example, the plurality of scaled video frames are subsequently cropped and then merged in time sequence to generate the target video.
According to the embodiments of the present disclosure, compared with scaling by a fixed ratio, picture jitter is reduced to a certain extent. Determining the plurality of scaling coefficients from the first variation trend takes into account the overall display of the target object across the video frames, so that the scaled target object is displayed more smoothly across frames and the generated target video provides a better viewing experience.
Fig. 4 is a schematic diagram of scaling based on the display height of a target object according to one embodiment of the present disclosure.
In some embodiments, determining the plurality of scaling coefficients for the plurality of video frames based on the first variation trend of the plurality of display parameters along the time sequence may include: correcting the plurality of display parameters based on a first fitted curve to obtain a plurality of fitting parameters, where a fitting parameter characterizes a corrected size relationship between the target object and the video frame and the first fitted curve characterizes the smoothed first variation trend of the plurality of display parameters; and determining the plurality of scaling coefficients based on the plurality of corrected size relationships characterized by the plurality of fitting parameters.
For example, the first variation trend may be fitted (for example, by polynomial fitting, exponential fitting, or piecewise fitting) to obtain the first fitted curve, thereby smoothing the first variation trend. The correction then consists of reading, for each video frame, the fitting parameter on the first fitted curve that corresponds to the display parameter of that frame at the same moment.
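A minimal sketch of the polynomial-fitting variant with NumPy might look as follows (the degree value is an illustrative assumption):

```python
# Sketch of the smoothing step, assuming a simple polynomial fit with NumPy;
# the disclosure also allows exponential or piecewise fitting.
import numpy as np

def fit_display_parameters(params, degree=3):
    t = np.arange(len(params))              # frame index as the time axis
    coeffs = np.polyfit(t, params, degree)  # first fitted curve
    return np.polyval(coeffs, t)            # fitting parameter per frame
```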
The horizontal axis of fig. 4 represents the video frames ordered in time sequence; for example, "50" on the horizontal axis refers to the 50th of the acquired video frames. The vertical axis represents the display height of the target object in each video frame, measured as the number of pixels of the target object in the height direction.
As shown in fig. 4, the first variation trend is a curve obtained from the raw display-height data (i.e., the display parameter) of the target object in each video frame, and it fluctuates frequently; in the skiing example, the fluctuation may be related to the surface of the snow track. Overall, the first variation trend shows the display height of the target object growing from small to large. At the end of the first variation trend, starting from turning point 1 marked in fig. 4, the display height of the target object drops sharply, possibly because the target object gradually approaches the video capture device and then leaves the capture range.
The first fitted curve in fig. 4 may be obtained by polynomial fitting of the first variation trend. The first fitted curve is smoother overall than the first variation trend: for example, the segment of the first fitted curve below 150 pixels on the vertical axis reduces the fluctuation of the corresponding part of the first variation trend, and the first fitted curve does not drop as sharply at turning point 1 marked in fig. 4.
The ordinate on the first fitted curve is the fitting parameter, here the corrected display height of the target object. The display height of the target object is proportional to its display scale, so the corrected display height characterizes the corrected size relationship between the target object and the video frame.
With the embodiments of the present disclosure, fluctuations of the plurality of display parameters along the time sequence are smoothed by the plurality of fitting parameters. Determining the plurality of scaling coefficients based on the plurality of corrected size relationships lets the displayed target object transition naturally between video frames and avoids picture jitter.
In some embodiments, determining the plurality of scaling coefficients based on the plurality of corrected size relationships characterized by the plurality of fitting parameters includes deriving the plurality of scaling coefficients from the differences between the plurality of corrected size relationships and a preset size relationship, respectively.
Illustratively, the preset size relationship is determined according to the expected display effect of the target object. The expected display effect can be determined from expert experience, or scores from different users can be collected and aggregated. For example, if the human body height should occupy 20% of the video frame height, the preset size relationship is 20% (written 0.2 below).
For example, if the video frame captured by the video capture device is 3840×1920 pixels and the fitting parameter is denoted h (the fitted display height of the target object in fig. 4), the corrected size relationship is h/1920. The scaling coefficient r can then be calculated by:

r = 0.2 / (h / 1920)

As can be seen from the above equation, the difference in this embodiment is the ratio between the preset size relationship and the corrected size relationship, and this ratio is used as the scaling coefficient.
According to the embodiments of the present disclosure, the scaling coefficient reflects the difference between the corrected size relationship and the preset size relationship, so that scaling by the coefficient closes that display gap.
It should be noted that the above ways of determining the difference are merely examples of what the present disclosure can implement and do not limit the present disclosure. For example, the scaling coefficient may instead be obtained from the subtraction difference between the preset size relationship and the corrected size relationship; for instance, the coefficient may be assigned according to the interval the difference falls into, with the mapping between intervals and values predetermined from expert experience.
Continuing with the preset size relationship of 0.2: when the corrected size relationship differs from 0.2, the goal of scaling is to make the size relationship between the scaled target object and the video frame equal to 0.2. However, too large a scaling ratio can also harm the visual effect of the target video, so the scaling coefficient may be limited, for example to [0.618, 1.618]. Further, the size relationship after scaling may also be limited; referring to fig. 4, the correction result shows the display-height curve of the target object after scaling. Turning point 2 marked in fig. 4 means that the display height of the target object has reached a maximum, and the size relationship is not scaled up further.
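Putting the pieces together, the scaling coefficient described above, with the preset size relationship 0.2 and the [0.618, 1.618] limit, could be computed as in the following sketch (function and parameter names are illustrative):

```python
# Sketch of the scaling-coefficient computation: the ratio of the preset
# size relationship (0.2 here) to the corrected size relationship, clamped
# to the [0.618, 1.618] range mentioned above. `fitted_h` is the corrected
# display height in pixels, `frame_h` the frame height (e.g. 1920).
def scaling_coefficient(fitted_h, frame_h=1920, preset=0.2,
                        lo=0.618, hi=1.618):
    r = preset / (fitted_h / frame_h)  # r = 0.2 / (h / 1920)
    return min(max(r, lo), hi)         # limit to avoid visual artifacts
```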
Fig. 5 is a schematic diagram of a second trend of variation according to one embodiment of the present disclosure.
In some embodiments, scaling the target object in the plurality of video frames based on the plurality of scaling coefficients to generate the target video may include: scaling the plurality of video frames based on the plurality of scaling coefficients, respectively, so as to scale the target object within them; obtaining a plurality of candidate centers of the target object from the scaled video frames, a candidate center representing the center position of the target object in a video frame; determining a plurality of crop frames for the scaled video frames based on a second variation trend of the plurality of candidate centers along the time sequence; and cropping the target object in the scaled video frames based on the crop frames to generate the target video. For example, the cropped video frames are merged to obtain the target video.
For example, after scaling, the detection frame of the target object is scaled accordingly. The center position of the target object in the video frame may be the center of the scaled detection frame (e.g., a rectangular frame) or the center of a region of interest determined from the target object. The region of interest may be a region determined from the contour of the target object, such as a human silhouette, whose center of gravity is taken as the candidate center.
Illustratively, as shown in fig. 5, candidate centers may be extracted from the video frames at times t1, t2, t3, and t4, respectively, yielding the second variation trend. The x-axis and y-axis of the coordinate system in fig. 5 correspond to the x-axis and y-axis of the coordinate system of the video frame. The second variation trend reflects the spatial position trajectory of the target object over time.
For example, the trend of the center positions of the plurality of crop frames may substantially follow the second variation trend, while the candidate center in each video frame may be offset from the center position of its crop frame. For example, the center position of the crop frame can be adjusted with a weighting applied to the candidate center according to the second variation trend. If the second variation trend indicates that the target object keeps moving in a certain direction, the center of the crop frame can be shifted slightly in that direction, with the offset determined by factors such as the movement speed and acceleration. The cropped result is then visually smoother and more natural, matching how people perceive motion.
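A minimal sketch of such a motion-weighted offset follows; the lead weight k is an assumed illustrative value, not one given in the disclosure:

```python
# Sketch of one way to bias the crop-frame center along the direction of
# motion, as described above; the weight `k` (how many frames of velocity
# to lead by) is an assumption for illustration.
def offset_center(centers, i, k=3):
    cx, cy = centers[i]
    if i > 0:
        vx = centers[i][0] - centers[i - 1][0]  # per-frame velocity in x
        vy = centers[i][1] - centers[i - 1][1]  # per-frame velocity in y
        cx, cy = cx + k * vx, cy + k * vy       # lead the motion slightly
    return cx, cy
```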
For example, the size of a crop frame may be determined from the content and display requirements of the corresponding video frame. The device expected to play the target video can be taken into account: a fixed-size crop frame, such as a 1080P one, may be used if the target video is played on a mobile phone. The crop frame size may also be adjusted dynamically; for example, if the picture around the target object is cluttered, the crop frame may be reduced, and vice versa.
For example, during video capture the spatial position of the target object changes, so its center position shifts continuously between video frames. The zoom operation shifts the center position further, and in some video frames it may become difficult to highlight the target object as the subject.
According to the embodiments of the present disclosure, the spatial position trajectory of the target object reflected by the second variation trend is used to determine the plurality of crop frames and to crop the scaled video frames. This effectively highlights the target object as the subject, keeps the transition between adjacent cropped frames natural, avoids to a certain extent sudden picture jumps and abrupt changes in the subject's position, and so optimizes the visual effect.
Figs. 6A-6C are schematic diagrams of a cropping process according to one embodiment of the disclosure.
In some embodiments, determining the plurality of crop frames for the scaled video frames based on the second variation trend of the plurality of candidate centers along the time sequence may include: correcting the plurality of candidate centers based on a second fitted curve to obtain a plurality of fitting centers, where the second fitted curve characterizes the smoothed second variation trend of the candidate centers and a fitting center characterizes the corrected center position of the target object in a video frame; and taking the plurality of fitting centers as the center positions of the plurality of crop frames, respectively, to obtain the crop frames.
For example, the second variation trend may be fitted (for example, by polynomial fitting, exponential fitting, or piecewise fitting) to obtain the second fitted curve, thereby smoothing the second variation trend. The correction consists of reading, for each video frame, the corrected center position on the second fitted curve at the same moment, i.e. the fitting center.
According to the embodiments of the present disclosure, fluctuations of the plurality of candidate centers along the time sequence are smoothed by the plurality of fitting centers; that is, the spatial position trajectory of the target object becomes smoother over the time sequence, and the crop frame of each video frame is determined on this basis. This improves transition consistency between video frames during cropping, gives the spatial trajectory of the target object better visual fluency, and simulates the effect of digital camera movement.
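For illustration, fitting the candidate-center trajectory could mirror the earlier display-parameter fit, smoothing x(t) and y(t) separately (a sketch under the polynomial-fitting assumption):

```python
# Sketch: smooth the candidate-center trajectory by fitting x(t) and y(t)
# separately, mirroring the polynomial fit used for the display parameters.
import numpy as np

def fit_centers(centers, degree=3):
    t = np.arange(len(centers))
    xs = [c[0] for c in centers]
    ys = [c[1] for c in centers]
    fx = np.polyval(np.polyfit(t, xs, degree), t)
    fy = np.polyval(np.polyfit(t, ys, degree), t)
    return list(zip(fx, fy))  # one fitting center per frame
```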
Fig. 6A shows a crop frame centered on the fitting center, and fig. 6B shows the crop frame size being adjusted to the boundary of the video frame. As shown in fig. 6A, the fitting center is taken as the center position of the crop frame and the crop frame is determined from that center; the fitting center deviates from the candidate center to some degree. For example, to give the target video a better playback effect on a mobile phone, a 1080P crop frame may be generated. Once the center position and size of the crop frame are determined, the crop frame may extend past the video frame boundary as shown in fig. 6A. In this case, the crop frame may be moved back inside the video frame, or it may be shrunk as shown in fig. 6B. Fig. 6C shows the cropping result obtained by processing a video frame with its crop frame.
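The two boundary strategies just described, moving the crop frame back inside the picture and shrinking it, might be sketched as follows (illustrative only; aspect-ratio handling is omitted for brevity):

```python
# Sketch of the boundary strategies described above: shrink the crop frame
# if it is larger than the video frame, then move it back inside the frame.
def clamp_crop(cx, cy, cw, ch, frame_w, frame_h):
    cw, ch = min(cw, frame_w), min(ch, frame_h)  # shrink if oversized
    x = min(max(cx - cw / 2, 0), frame_w - cw)   # move inside horizontally
    y = min(max(cy - ch / 2, 0), frame_h - ch)   # move inside vertically
    return int(x), int(y), int(cw), int(ch)
```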
To keep the 1080P size, the portion removed by shrinking in fig. 6B may instead be regenerated using AIGC techniques, or special-effect padding may be added. AIGC (Artificial Intelligence Generated Content) is a technology that learns from and recognizes patterns in existing data, in particular with large pre-trained models, to generate relevant content with reasonable generalization ability. The key idea of AIGC is to generate content of a certain creativity and quality with artificial intelligence algorithms; it can generate related articles, images, and audio according to input conditions or instructions.
Fig. 7 is a flowchart of a video generation method according to another embodiment of the present disclosure.
As shown in fig. 7, the video generation method 700 of this embodiment includes operations S710 to S730. Taking skiing as an example, a video capture device (such as one or more cameras) is set up beside the snow track; it can capture video of one or more skiers on the track and push it to a cloud server in real time, as described further below.
In operation S710, an automated, intelligent video analysis may be implemented based on artificial intelligence techniques. Operation S710 in this embodiment may include operations S711 to S715.
In operation S711, the cloud server pulls the real-time video stream.
In operation S712, the multiple skiers in the video stream are tracked with a multi-target tracking approach built on an artificial intelligence target detection algorithm (e.g., a YOLO detection algorithm); for example, for each skier, the sequence of video frames and the sequence of detection frames from entering to leaving the picture are recorded.
In operation S713, video "striping" is performed for the selected target object. The video "stripping" refers to extracting an initial frame sequence of the target object from the video stream according to the tracking record of operation S712, and the extracted start time and end time are the in-mirror time and out-mirror time of the target object, respectively. Wherein, the target object can be selected according to the user operation, or each tracked target can be used as the selected target object.
In operation S714, the initial frame sequences of target objects are filtered using a filter. In some embodiments, an initial frame sequence comprising a plurality of video frames may be obtained and evaluated against preset rules to obtain an evaluation result. The preset rules include rules for evaluating at least one of the size relationship, the video capture duration, the spatial position information, and the movement speed of the target object; when the evaluation passes, the plurality of video frames are obtained from the initial frame sequence.
Illustratively, the filter may be a software service that invokes and executes the preset rules. For example, the rule on the size relationship may check whether the average size of the detection frames in the initial frame sequence is at least a detection frame threshold, where size may be at least one of frame height, frame width, and frame area; the rule on capture duration may check whether the time between the target object entering and leaving the picture is at least a duration threshold; the rule on spatial position information may check whether the spatial position has changed; and the rule on movement speed may check whether the speed is at least a speed threshold. It will be appreciated that if the average detection frame size is too small, the capture duration too short, the target object stationary, or its speed too low, it may be difficult to obtain a target video that meets expectations; the evaluation then fails and the initial frame sequence is filtered out.
Illustratively, when the evaluation passes, the initial frame sequence is taken as the plurality of video frames of the target object, or a subset of frames is extracted from it as the plurality of video frames. The extraction may involve sampling at intervals and/or evaluating video frame quality. Frame quality may be assessed from one or more of the detection frame size, the degree of occlusion of the target object, the resolution of the video frame, and the contrast of the video frame.
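An illustrative sketch of such a filter follows; all threshold values are assumptions, since the disclosure does not specify them:

```python
# Sketch of the filter in operation S714, with illustrative thresholds;
# the actual threshold values are not specified in the disclosure.
def passes_filter(boxes, duration_s, positions, speeds,
                  min_box_h=80, min_duration=3.0, min_speed=0.5):
    avg_h = sum(h for (x, y, w, h) in boxes) / len(boxes)
    moved = positions[0] != positions[-1]   # spatial position changed
    fast_enough = max(speeds) >= min_speed  # movement speed rule
    return (avg_h >= min_box_h and duration_s >= min_duration
            and moved and fast_enough)
```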
In operation S715, attribute analysis is performed on the target object. Attribute analysis extracts object features of the target object so that they can be stored in association with its target video.
For example, object features of the target object are acquired from the plurality of video frames; the object features are derived from at least one of the target object's body shape information, clothing information, equipment information, the video capture time, the video capture device information, and the target object's spatial position trajectory; and the object features are stored in association with the target video.
The body shape information may include height, body contour, hairstyle, face information, and the like; the clothing information may include color and style information; the equipment information may include the type of equipment carried, such as a snowboard; the video capture device information may include a camera identifier; and the spatial position trajectory records how the target object's position changes. For example, the target object may trigger a "start skiing" function on a smartphone, and on receiving that instruction the cloud server records the target object's spatial position trajectory from the smartphone's positioning information.
For example, the largest human-body frame can be cropped from the frame sequence and position frames in the tracking record of the target object and then analyzed (for example, with a neural network algorithm) to identify the clothing color and snowboard type; the color, snowboard type, entry time, exit time, and camera number are saved together for subsequent fast retrieval of the target object's target video.
The video capture device may capture several skiers on the snow track, and a plurality of video frames can be extracted for each skier as a target object to generate a target video, so several skiers yield several target videos. The cloud may generate a large number of target videos, and the detection frame identifier used to distinguish tracked targets in the multi-target tracking stage can be hard to match to a specific skier; for example, an identifier such as "0001" says little about the target object it frames. Therefore, to push a target video to its target object accurately, or to let the target object retrieve its video quickly, the object features are stored in association with the target video.
In operation S720, the object features of the target object and the plurality of video frames may be packaged as a video production task and pushed to a video production task queue. The video production task queue may, for example, be a distributed queue. Operation S720 separates video analysis from video production, which eases horizontal scaling of computing resources.
In operation S730, several worker processes may be spawned. The number of worker processes is determined from the number of video production tasks, the resources each task consumes, and the available computing resources. Operations S731 to S735 are performed by the worker processes.
In operation S731, a worker process pulls a task. The worker processes may work in parallel in a distributed manner, each fetching a video production task from the queue and then executing operations S732 to S735.
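As a sketch only, a worker loop over a distributed queue might look like the following; the use of a Redis list and BLPOP is an assumption for illustration, not part of the disclosure:

```python
# Sketch of a worker-process loop pulling video production tasks from a
# distributed queue; the Redis-backed queue shown here is an assumption
# for illustration, not part of the disclosure.
import json
import redis

def worker_loop(queue_name="video_production_tasks"):
    client = redis.Redis()
    while True:
        _, payload = client.blpop(queue_name)  # blocks until a task arrives
        task = json.loads(payload)             # object features + frame refs
        produce_video(task)                    # operations S732-S735

def produce_video(task):
    ...  # scaling, digital camera movement, encoding, upload
```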
In operation S732, scaling is performed. For example, a plurality of display parameters of the target object are acquired from the plurality of video frames, a plurality of scaling coefficients for the video frames are determined based on the first variation trend of the display parameters along the time sequence, and the video frames are then scaled with those coefficients.
In some embodiments, the cloud server may monitor the size of the target object's detection frame in the real-time stream, and when the detection frame falls below a certain size, it may control the camera to change its zoom magnification so as to enlarge the detection frame, which helps provide clear video frames for the scaling in operation S732.
In operation S733, digital camera movement is applied. For example, a plurality of candidate centers of the target object are obtained from the scaled video frames, a plurality of crop frames are determined based on the second variation trend of the candidate centers along the time sequence, and the video frames are then cropped to generate the target video, simulating the effect of digital camera movement.
In some embodiments, the action types of the target object in the plurality of video frames can be identified, a plurality of target video frames matching a preset action type can be determined, a target video segment can be generated from those frames, and the segment can be added at a preset position in the target video. For example, a target video segment may be added at the beginning of the target video so that it plays first.
For example, the preset action types may include small-radius turns (small-amplitude S-shaped tracks on the snow), large-radius turns (large-amplitude S-shaped tracks on the snow), obstacle clearing, jump turns (rotating in the air after take-off and then landing), flips (controlling a flip after take-off and landing again), and the like.
For other scenarios, the preset action type may be a passing action, a dribbling action, a shooting action, a cut-in action, and so on, without limitation.
In operation S734, the video is encoded. For example, the video frames are in YUV format, or in an image format such as JPEG or PNG obtained by converting from YUV; after cropping, the frames are merged and encoded into a target video in a format such as MP4 or MKV.
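For illustration, merging cropped frames into an MP4 file could be sketched with OpenCV as below; the "mp4v" codec tag and the fps value are assumptions, not mandated by the disclosure:

```python
# Sketch of merging cropped frames into an MP4 file with OpenCV; the
# 'mp4v' codec tag is one common choice, not required by the disclosure.
import cv2

def encode_video(cropped_frames, out_path="target.mp4", fps=30):
    h, w = cropped_frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
    for frame in cropped_frames:  # frames are BGR arrays of equal size
        writer.write(frame)
    writer.release()
```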
In operation S735, the target video is saved to cloud storage for download by the user.
The embodiments of the present disclosure thus provide video generation methods based on multi-target tracking, video "stripping", and digital camera movement. Taking skiing as an example, a video capture device installed beside the snow track captures video in real time, and the cloud server applies multi-target tracking, stripping, and digital camera movement to the video stream, generating high-quality target videos for target objects to view and share.
Fig. 8 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure.
As shown in fig. 8, the video generating apparatus 800 may include a video frame unit 810, a display parameter unit 820, a scaling factor unit 830, and a video generating unit 840.
The video frame unit 810 is configured to acquire a plurality of video frames containing the target object, the plurality of video frames indicating spatial position information of the target object along the time sequence.
The display parameter unit 820 is configured to obtain a plurality of display parameters of the target object from a plurality of video frames, wherein the display parameters characterize a dimensional relationship of the target object to the video frames.
The scaling factor unit 830 is configured to determine a plurality of scaling factors for the plurality of video frames, respectively, based on a first trend of the plurality of display parameters along the time series.
The video generation unit 840 is configured to perform scaling processing on target objects in the plurality of video frames based on the plurality of scaling coefficients, respectively, to generate target videos.
Illustratively, the scaling coefficient unit 830 is further configured to: correct the plurality of display parameters based on a first fitted curve to obtain a plurality of fitting parameters, the fitting parameters characterizing corrected size relationships between the target object and the video frames and the first fitted curve characterizing the smoothed first variation trend of the display parameters; and determine the plurality of scaling coefficients based on the plurality of corrected size relationships characterized by the fitting parameters.
Illustratively, the scaling coefficient unit 830 is further configured to derive the plurality of scaling coefficients from the differences between the plurality of corrected size relationships and the preset size relationship, respectively.
Illustratively, the video generation unit 840 is further configured to: scale the plurality of video frames based on the plurality of scaling coefficients, respectively, so as to scale the target object within them; obtain a plurality of candidate centers of the target object from the scaled video frames, a candidate center representing the center position of the target object in a video frame; determine a plurality of crop frames for the scaled video frames based on the second variation trend of the candidate centers along the time sequence; and crop the target object in the scaled video frames based on the crop frames to generate the target video.
Illustratively, the video generation unit 840 is further configured to correct the plurality of candidate centers based on a second fitted curve to obtain a plurality of fitting centers, the second fitted curve characterizing the smoothed second variation trend of the candidate centers and a fitting center characterizing the corrected center position of the target object in a video frame, and to take the fitting centers as the center positions of the plurality of crop frames, respectively, to obtain the crop frames.
Illustratively, the video generating apparatus 800 may further include an association storage unit configured to acquire object features of the target object from the plurality of video frames, the object features derived from at least one of body shape information, clothing information, equipment information, video capture time, video capture device information, and the spatial position trajectory of the target object, and to store the object features in association with the target video.
The video frame unit 810 is further configured to obtain an initial frame sequence comprising a plurality of video frames and to evaluate it against preset rules to obtain an evaluation result, the preset rules including rules for evaluating at least one of the size relationship, the video capture duration, the spatial position information, and the movement speed of the target object, and to obtain the plurality of video frames from the initial frame sequence when the evaluation passes.
Illustratively, the video generating apparatus 800 may further include an identification unit and a segment insertion unit. The identification unit is configured to identify the action types of the target object in the plurality of video frames and determine a plurality of target video frames matching a preset action type; the video generation unit 840 is further configured to generate a target video segment from those frames; and the segment insertion unit is configured to add the target video segment at a preset position in the target video.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 is a block diagram of an electronic device to which a video generation method may be applied according to one embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 can also store the various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard or a mouse; an output unit 907 such as various types of displays or speakers; a storage unit 908 such as a magnetic disk or an optical disc; and a communication unit 909 such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine-learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the methods and processes described above, such as the video generation method. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a cathode ray tube (CRT) display or a liquid crystal display (LCD)) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A video generation method, comprising:
acquiring a plurality of video frames containing a target object, wherein the plurality of video frames indicate spatial position information of the target object along a time series;
obtaining a plurality of display parameters of the target object from the plurality of video frames, wherein the display parameters characterize a size relationship between the target object and the video frames;
determining a plurality of scaling coefficients for the plurality of video frames, respectively, based on a first variation trend of the plurality of display parameters along the time series; and
scaling the target object in the plurality of video frames based on the plurality of scaling coefficients, respectively, to generate a target video.
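By way of a non-limiting illustration, the following Python sketch shows one possible realization of the method of claim 1. The choice of bounding-box height as the display parameter, the moving-average smoothing, the preset ratio of 0.5, and all function names are assumptions made for this sketch; the claim does not mandate any of them.

```python
# Illustrative sketch of claim 1, assuming OpenCV-style frames (NumPy arrays)
# and per-frame bounding boxes (x, y, w, h) from an upstream detector.
import cv2
import numpy as np

def display_parameter(frame, bbox):
    # Size relationship between the target object and the video frame:
    # here, the ratio of the bounding-box height to the frame height.
    _, _, _, h = bbox
    return h / frame.shape[0]

def smooth(series, window=9):
    # First variation trend along the time series: a simple moving average.
    window = min(window, len(series))
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")

def generate_target_video(frames, bboxes, preset_ratio=0.5):
    params = np.array([display_parameter(f, b) for f, b in zip(frames, bboxes)])
    trend = smooth(params)
    # One scaling coefficient per frame, driving the smoothed display
    # parameter toward the preset size relationship.
    coeffs = preset_ratio / np.maximum(trend, 1e-6)
    return [
        cv2.resize(f, None, fx=c, fy=c, interpolation=cv2.INTER_LINEAR)
        for f, c in zip(frames, coeffs)
    ]
```

Deriving the coefficients from the smoothed trend rather than the raw per-frame ratios is what keeps the simulated zoom from jittering frame to frame.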
2. The method of claim 1, wherein the determining the plurality of scaling coefficients for the plurality of video frames, respectively, based on the first variation trend of the plurality of display parameters along the time series comprises:
correcting the plurality of display parameters based on a first fitting curve to obtain a plurality of fitting parameters, wherein the fitting parameters characterize corrected size relationships between the target object and the video frames, and the first fitting curve characterizes the first variation trend of the plurality of display parameters after smoothing; and
determining the plurality of scaling coefficients based on the plurality of corrected size relationships characterized by the plurality of fitting parameters.
3. The method of claim 2, wherein the determining the plurality of scaling coefficients based on the plurality of corrected size relationships characterized by the plurality of fitting parameters comprises:
obtaining the plurality of scaling coefficients based on differences between the plurality of corrected size relationships and a preset size relationship.
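Claims 2 and 3 leave the family of the first fitting curve open. The sketch below uses a low-order polynomial fit as one plausible choice; the degree and the preset ratio are assumptions for illustration only.

```python
import numpy as np

def fit_display_parameters(params, degree=3):
    # First fitting curve: a low-order polynomial fitted over the time
    # series, yielding one corrected size relationship (fitting parameter)
    # per frame.
    t = np.arange(len(params))
    degree = min(degree, len(params) - 1)  # guard against short sequences
    return np.polyval(np.polyfit(t, params, degree), t)

def scaling_coefficients(params, preset_ratio=0.5):
    fitted = fit_display_parameters(np.asarray(params, dtype=float))
    # Claim 3: each coefficient follows from the difference between the
    # corrected and preset size relationships; rearranged, the correction
    # that closes that difference is preset_ratio / fitted.
    return 1.0 + (preset_ratio - fitted) / np.maximum(fitted, 1e-6)
```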
4. The method of claim 1, wherein the scaling the target object in the plurality of video frames based on the plurality of scaling coefficients, respectively, to generate the target video comprises:
scaling the plurality of video frames based on the plurality of scaling coefficients to scale the target object in the plurality of video frames;
obtaining a plurality of candidate centers of the target object from the scaled plurality of video frames, wherein the candidate centers characterize center positions of the target object in the video frames;
determining a plurality of crop boxes for the scaled plurality of video frames, respectively, based on a second variation trend of the plurality of candidate centers along the time series; and
cropping the target object in the scaled plurality of video frames based on the plurality of crop boxes, respectively, to generate the target video.
5. The method of claim 4, wherein the determining the plurality of crop boxes for the scaled plurality of video frames, respectively, based on the second variation trend of the plurality of candidate centers along the time series comprises:
correcting the plurality of candidate centers based on a second fitting curve to obtain a plurality of fitting centers, wherein the second fitting curve characterizes the second variation trend of the plurality of candidate centers after smoothing, and the fitting centers characterize corrected center positions of the target object in the video frames; and
taking the plurality of fitting centers as the center positions of the plurality of crop boxes, respectively, to obtain the plurality of crop boxes.
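Claims 4 and 5 apply the same smoothing idea to the object's center before cropping. A minimal sketch, assuming a fixed 720x1280 crop box and a polynomial second fitting curve (neither is specified by the claims):

```python
import numpy as np

def fit_centers(centers, degree=3):
    # Second fitting curve: one polynomial per axis over the candidate
    # centers, giving a smoothed (fitted) center for every frame.
    c = np.asarray(centers, dtype=float)  # shape: (num_frames, 2)
    t = np.arange(len(c))
    degree = min(degree, len(c) - 1)      # guard against short sequences
    return np.stack(
        [np.polyval(np.polyfit(t, c[:, axis], degree), t) for axis in (0, 1)],
        axis=1,
    )

def crop_at(frame, center, crop_w=720, crop_h=1280):
    # Crop box whose center position is the fitted center, clamped so the
    # box stays inside the frame; assumes the scaled frame is at least
    # crop_w x crop_h.
    h, w = frame.shape[:2]
    cx = int(np.clip(center[0], crop_w // 2, w - crop_w // 2))
    cy = int(np.clip(center[1], crop_h // 2, h - crop_h // 2))
    return frame[cy - crop_h // 2 : cy + crop_h // 2,
                 cx - crop_w // 2 : cx + crop_w // 2]
```

Fitting the centers before cropping plays the same stabilizing role for panning that the first fitting curve plays for zooming.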
6. The method of claim 1, further comprising:
acquiring object features of the target object from the plurality of video frames, wherein the object features are obtained according to at least one of form information of the target object, clothing information of the target object, equipment information of the target object, a video acquisition time, video acquisition device information, and a spatial position change trajectory of the target object; and
storing the object features in association with the target video.
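One way to store the object features in association with the target video is a sidecar metadata file; the field names and the JSON format below are assumptions for this sketch, not part of the claim.

```python
import json

def store_with_features(video_path, features):
    # Associate the features with the video by writing a sidecar file next
    # to it; a database row keyed by the video path would work equally well.
    with open(video_path + ".features.json", "w", encoding="utf-8") as f:
        json.dump(features, f, ensure_ascii=False, indent=2)

store_with_features("target_video.mp4", {
    "form": "running person",
    "clothing": "red jersey, number 7",
    "equipment": "none",
    "capture_time": "2025-04-24T10:00:00",
    "capture_device": "camera-03",
    "trajectory": [[320, 540], [328, 537], [341, 530]],
})
```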
7. The method of claim 1, wherein the acquiring the plurality of video frames containing the target object comprises:
acquiring an initial frame sequence containing the plurality of video frames;
evaluating the initial frame sequence based on preset rules to obtain an evaluation result, wherein the preset rules comprise rules for evaluating at least one of the size relationship, a video acquisition duration, the spatial position information, and movement speed information of the target object; and
acquiring the plurality of video frames based on the initial frame sequence in a case where the evaluation result is a pass.
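The preset rules of claim 7 can be read as a simple gate over the raw sequence. In the sketch below, the specific thresholds (three seconds minimum duration, a 5% minimum size ratio, a maximum speed in pixels per second) are invented for illustration.

```python
def evaluate_initial_sequence(params, centers, fps,
                              min_seconds=3.0, min_ratio=0.05, max_speed=800.0):
    # Returns True (a pass) only if every preset rule is satisfied.
    duration_ok = len(params) / fps >= min_seconds   # acquisition duration
    size_ok = all(p >= min_ratio for p in params)    # size relationship
    # Per-frame displacement of the object center, in pixels per second.
    speeds = [
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 * fps
        for (x1, y1), (x2, y2) in zip(centers, centers[1:])
    ]
    speed_ok = all(s <= max_speed for s in speeds)   # movement speed
    return duration_ok and size_ok and speed_ok
```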
8. The method of claim 1, further comprising:
identifying action types of the target object in the plurality of video frames, respectively, and determining a plurality of target video frames conforming to a preset action type;
generating a target video clip based on the plurality of target video frames; and
adding the target video clip to the target video at a preset position.
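Claim 8 splices a clip of a preset action type into the target video. The sketch below assumes some per-frame action recognizer is available; `classify_action` is hypothetical and stands in for whatever model an implementation actually uses.

```python
def insert_highlight(frames, classify_action, preset_action="jump", preset_index=0):
    # Collect the frames whose recognized action type matches the preset
    # type, then splice that clip into the video at the preset position.
    highlight = [f for f in frames if classify_action(f) == preset_action]
    return frames[:preset_index] + highlight + frames[preset_index:]
```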
9. A video generation apparatus, comprising:
a video frame unit configured to acquire a plurality of video frames containing a target object, wherein the plurality of video frames indicate spatial position information of the target object along a time series;
a display parameter unit configured to obtain a plurality of display parameters of the target object from the plurality of video frames, wherein the display parameters characterize a size relationship between the target object and the video frames;
a scaling coefficient unit configured to determine a plurality of scaling coefficients for the plurality of video frames, respectively, based on a first variation trend of the plurality of display parameters along the time series; and
a video generation unit configured to scale the target object in the plurality of video frames based on the plurality of scaling coefficients, respectively, to generate a target video.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video generation method of any one of claims 1-8.
11. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the video generation method of any one of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the video generation method of any one of claims 1-8.
CN202510526725.2A 2025-04-24 2025-04-24 Video generation method, device, electronic device and storage medium Pending CN120201241A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510526725.2A | 2025-04-24 | 2025-04-24 | Video generation method, device, electronic device and storage medium

Publications (1)

Publication Number | Publication Date
CN120201241A | 2025-06-24

Family

ID=96064813

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202510526725.2A (pending) | Video generation method, device, electronic device and storage medium | 2025-04-24 | 2025-04-24

Country Status (1)

Country Link
CN (1) CN120201241A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination