CN119012024A - Video fusion method and fusion system - Google Patents
Video fusion method and fusion system
- Publication number
- CN119012024A (application CN202411073252.7A)
- Authority
- CN
- China
- Prior art keywords
- frame
- mapping matrix
- images
- frame images
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H04N23/85 — Camera processing pipelines; Components thereof, for processing colour signals, for matrixing
- H04N23/951 — Computational photography systems, e.g. light-field imaging systems, by using two or more images to influence resolution, frame rate or aspect ratio
- H04N5/147 — Scene change detection
Abstract
The invention provides a video fusion method and a fusion system in the technical field of video processing, applied to a plurality of video streams that have different shooting angles and overlapping areas in the same scene. For the first frame, image homography transformation is performed on all single-frame images through a comprehensive mapping matrix obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix. For non-first frames, an inter-frame change detection mechanism is introduced, and the mapping matrix is dynamically adjusted according to the detection result, ensuring the accuracy of video fusion. Finally, the transformed images are stitched according to the relative positions of the cameras, realizing high-quality video fusion. The technical scheme effectively addresses key problems in multi-view video stream fusion, improves fusion efficiency and quality, and is applicable to video surveillance, virtual reality, augmented reality and other fields.
Description
Technical Field
The invention relates to the technical field of video processing, in particular to a video fusion method and a fusion system.
Background
Multi-camera systems achieve multi-angle, all-around monitoring of the same scene by arranging multiple cameras at different positions. However, the video streams captured by these cameras differ in shooting angle and have overlapping fields of view; how to effectively integrate them into a complete, coherent and information-rich monitoring picture has become an important research topic in the video processing field.
Most traditional video fusion methods are based on simple image stitching techniques. When processing video streams with large viewing-angle differences and complex backgrounds, they often suffer from obvious stitching seams, image distortion and information loss. In addition, when the scene in a video stream changes dynamically, for example through illumination changes or object movement, conventional stitching methods struggle to maintain a stable and accurate stitching result.
Therefore, it is necessary to provide a video fusion method and a fusion system to solve the above technical problems.
Disclosure of Invention
To solve the above technical problems, the invention provides a video fusion method and fusion system that perform homography transformation on single-frame images from multiple video streams through a comprehensive mapping matrix, and dynamically adjust the mapping matrix according to inter-frame change detection results, thereby achieving high-quality video fusion.
The invention provides a video fusion method applied to a plurality of video streams with different shooting angles and overlapping areas in the same scene, the video streams being acquired by a plurality of cameras, the fusion method comprising the following steps:
acquiring single-frame images from the plurality of video streams respectively, wherein all the single-frame images are images of a first time frame;
if the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio;
if the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity;
judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold;
if yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras;
if not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
Preferably, the regions to be stitched between the single-frame images are obtained using a feature point matching method.
Preferably, the feature point matching method for determining the regions to be stitched between the single-frame images comprises the following steps:
detecting feature points in all the single-frame images belonging to the first time frame in the plurality of video streams, wherein the feature points comprise corner points and edge points;
determining adjacent single-frame images according to the relative positions of the cameras, and matching the feature points in each pair of adjacent single-frame images using the SIFT method to obtain successfully matched feature point pairs;
estimating a transformation model between the single-frame images by the least squares method according to the successfully matched feature point pairs, wherein the transformation model characterizes the translation, rotation and scaling relationships between the images;
predicting the regions to be stitched between the single-frame images according to the transformation model.
Preferably, the step of obtaining the single-frame change result comprises:
performing pixel difference calculation between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame to obtain the pixel difference value;
calculating the feature point position changes between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame using a feature point detection method to obtain the feature point change value;
dividing the single-frame images of adjacent frames into a plurality of blocks, and calculating the similarity between corresponding blocks to obtain the block similarity.
Preferably, the judging whether the single-frame change result meets a preset condition comprises:
comparing the calculated pixel difference value, feature point change value and block similarity with the preset pixel difference threshold, feature point change threshold and block similarity threshold, respectively;
if the pixel difference value is less than or equal to the pixel difference threshold, the feature point change value is less than or equal to the feature point change threshold, and the block similarity is greater than or equal to the block similarity threshold, judging that the single-frame change result meets the preset condition;
if any one of these conditions is not satisfied, judging that the single-frame change result does not meet the preset condition.
Preferably, the updating of the comprehensive mapping matrix comprises:
recalculating the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix.
Preferably, the fusion method further comprises:
post-processing the fused video, wherein the post-processing comprises color correction, brightness adjustment and denoising.
The invention also provides a video fusion system for performing the above video fusion method, applied to a plurality of video streams with different shooting angles and overlapping areas in the same scene, the video streams being acquired by a plurality of cameras, the fusion system comprising:
an image acquisition module, configured to respectively acquire single-frame images from the plurality of video streams, wherein all the single-frame images are images of a first time frame;
an image processing module, configured to perform the following steps:
if the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio;
if the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity;
judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold;
if yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras;
if not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
Compared with the related art, the video fusion method and fusion system provided by the invention have the following beneficial effects:
The invention adopts the product of the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix as the comprehensive mapping matrix, separately accounting for perspective deformation between video streams, camera distortion and the required scaling, thereby aligning the images more accurately.
In the processing of non-first frames of the video streams, an inter-frame change detection mechanism is introduced; the degree of change between adjacent frames is evaluated by calculating the pixel difference value, the feature point change value and the block similarity, providing a basis for dynamically adjusting the mapping matrix.
According to the inter-frame change detection result, if the change between the current frame and the previous frame meets the preset condition, the comprehensive mapping matrix of the previous frame continues to be used for the transformation; if not, the mapping matrix is updated to ensure the accuracy and stability of video fusion.
After homography transformation, the transformed single-frame images are stitched according to the relative positions of the cameras to form coherent, high-quality video content.
In summary, the invention performs homography transformation on single-frame images in multiple video streams through the comprehensive mapping matrix and dynamically adjusts the mapping matrix according to inter-frame change detection results, thereby realizing high-quality video fusion.
Drawings
FIG. 1 is a flow chart of a video fusion method provided by the invention;
FIG. 2 is a block diagram of the video fusion system provided by the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as sequential, many of the operations can be performed in parallel or concurrently, and the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not shown in the figures; processes may correspond to methods, functions, procedures, subroutines, and the like.
Example 1
The invention provides a video fusion method applied to a plurality of video streams with different shooting angles and overlapping areas in the same scene, the video streams being acquired by a plurality of cameras. The fusion method comprises the following steps.
Acquiring single-frame images from the plurality of video streams respectively, wherein all the single-frame images are images of a first time frame.
In this embodiment, single-frame images of the current first time frame (i.e., the frame corresponding to the current processing time point) are first extracted synchronously from the video streams captured by the multiple cameras. These images serve as the input data for subsequent processing. To ensure data synchronization, the system records the clock information of each camera and uses timestamps to align the video frames from the different cameras.
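For illustration, the timestamp alignment described above might look like the following minimal Python sketch; the `streams` data structure, the reference-time choice and the 20 ms tolerance are all assumptions made for this example, not details prescribed by the invention.

```python
def align_frames(streams, tolerance_ms=20):
    """Select, from each camera stream, the frame whose timestamp is
    closest to a common reference time. `streams` is assumed to be a
    list of lists of (timestamp_ms, frame) pairs, one list per camera."""
    ref_ts = streams[0][0][0]  # reference: first camera's current frame
    aligned = []
    for stream in streams:
        ts, frame = min(stream, key=lambda tf: abs(tf[0] - ref_ts))
        if abs(ts - ref_ts) > tolerance_ms:
            raise ValueError("no frame within tolerance for this stream")
        aligned.append(frame)
    return aligned
```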
If the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio.
In this embodiment, when the system detects that the first frame of the video streams is being processed, it first calculates the perspective correction mapping matrix: the overlapping areas between the single-frame images are analyzed, and the relative positional relationship between the images is determined using methods such as feature point matching. Next, the distortion transformation mapping matrix is calculated from the cameras' internal parameters (e.g., focal length and optical center) and external parameters (e.g., the camera's position and orientation in the world coordinate system). Finally, the scaling mapping matrix is calculated according to the required scaling ratio. The three matrices are multiplied together to form the comprehensive mapping matrix, which is then applied to all single-frame images to perform image homography transformation, correcting perspective deformation and lens distortion and scaling the images as required.
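A minimal OpenCV/NumPy sketch of this composition is shown below. The composition order (perspective correction first, scaling last) and the use of 3x3 homographies for all three factors are assumptions; the patent only states that the three matrices are combined by matrix multiplication.

```python
import cv2
import numpy as np

def build_comprehensive_matrix(P, D, S):
    """Compose the comprehensive mapping matrix from the perspective
    correction matrix P, the distortion transformation matrix D and the
    scaling matrix S (all 3x3). S @ D @ P applies P first and S last."""
    return S @ D @ P

def apply_homography(frame, H, out_size):
    """Homography-transform a single-frame image with the comprehensive
    mapping matrix H; out_size is (width, height) of the output."""
    return cv2.warpPerspective(frame, H, out_size)

# Example scaling matrix for a 0.5x reduction:
S = np.diag([0.5, 0.5, 1.0])
```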
If the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity.
In this embodiment, for non-first frames, all the single-frame images of the current first time frame are compared with the single-frame images at corresponding positions in the previous frame. Pixel difference values are calculated to evaluate pixel-level changes in the image content; the feature point detection method is used to detect and match feature points and calculate the change in feature point positions; and each image is divided into several blocks so that the similarity between corresponding blocks of adjacent frames can be calculated, yielding more comprehensive inter-frame change information.
Judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold.
In this embodiment, the calculated pixel difference value, feature point change value and block similarity are compared with the preset thresholds. The thresholds are set in advance according to the actual application scenario and the required fusion quality. If all change results meet the preset condition (i.e., the pixel difference value is less than or equal to the pixel difference threshold, the feature point change value is less than or equal to the feature point change threshold, and the block similarity is greater than or equal to the block similarity threshold), the current frame is considered sufficiently similar to the previous frame, and the comprehensive mapping matrix of the previous frame can be reused for the transformation.
If yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
In this embodiment, if the single-frame change result meets the preset condition, the system directly applies the comprehensive mapping matrix of the previous frame to perform image homography transformation on all single-frame images of the current frame. The transformed images are stitched according to the relative positions of the cameras to form a complete fused video frame.
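The stitching step could be sketched as below; the per-camera pixel offsets, the canvas size and the last-write-wins handling of overlapping pixels are illustrative simplifications (the patent does not prescribe a blending strategy).

```python
import numpy as np

def stitch(warped_frames, offsets, canvas_size):
    """Place homography-transformed frames on a shared canvas according
    to the cameras' relative positions. `offsets` holds an (x, y) pixel
    offset per camera; overlapping pixels are simply overwritten."""
    canvas = np.zeros((canvas_size[1], canvas_size[0], 3), dtype=np.uint8)
    for frame, (ox, oy) in zip(warped_frames, offsets):
        h, w = frame.shape[:2]
        canvas[oy:oy + h, ox:ox + w] = frame
    return canvas
```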
If not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
In this embodiment, if the single-frame change result does not meet the preset condition, the difference between the current frame and the previous frame is large, and the comprehensive mapping matrix must be updated. The system recalculates the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix, and forms a new comprehensive mapping matrix by matrix multiplication. The updated comprehensive mapping matrix is then used to perform image homography transformation on all single-frame images of the current frame, followed by stitching according to the relative positions of the cameras.
Specifically, the regions to be stitched between the single-frame images are obtained using a feature point matching method, which comprises the following steps.
In the plurality of video streams, detecting feature points in all the single-frame images belonging to the first time frame, wherein the feature points comprise corner points and edge points.
In this embodiment, feature point detection is first performed on the single-frame images belonging to the first time frame in each video stream to identify corner points and edge points. These feature points are typically locations in the image that are easy to identify and remain relatively stable across different viewing angles.
Determining adjacent single-frame images according to the relative positions of the cameras, and matching the feature points in each pair of adjacent single-frame images using the SIFT method to obtain successfully matched feature point pairs.
In this embodiment, after the adjacent single-frame image pairs have been determined (based on the relative position information of the cameras), the feature points in each pair of images are matched using the SIFT algorithm, which finds matches by comparing feature point descriptors (gradient orientation histograms computed around each feature point).
During matching, a matching threshold is set to screen out pairs with low descriptor similarity, so that only feature point pairs of high matching quality are retained.
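A minimal sketch of the SIFT matching with a similarity screen might look as follows; the Lowe ratio test with a 0.75 threshold stands in for the matching threshold mentioned above and is an illustrative choice.

```python
import cv2

def match_features(img_a, img_b, ratio=0.75):
    """Detect SIFT feature points in two adjacent single-frame images
    and keep only high-quality matches via the ratio test."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    pts_a = [kp_a[m.queryIdx].pt for m in good]
    pts_b = [kp_b[m.trainIdx].pt for m in good]
    return pts_a, pts_b
```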
Estimating a transformation model between the single-frame images by the least squares method according to the successfully matched feature point pairs, wherein the transformation model characterizes the translation, rotation and scaling relationships between the images.
In this embodiment, after the successfully matched feature point pairs are obtained, the least squares method is used to estimate a transformation model between the single-frame images. This transformation model is an affine transformation characterizing the translation, rotation and scaling relationships between the images.
In addition, the RANSAC (Random Sample Consensus) algorithm is used to reject false matches (outliers) among the matched pairs, improving the accuracy of the transformation model. The parameters of the transformation model are then solved by least squares over the remaining matched pairs (the inliers).
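In OpenCV, RANSAC rejection followed by a least-squares fit over the inliers can be sketched with cv2.estimateAffinePartial2D, which estimates exactly a translation + rotation + scaling model; using this particular function is an implementation assumption, not prescribed by the patent.

```python
import cv2
import numpy as np

def estimate_transform(pts_a, pts_b):
    """Estimate a similarity transform (translation, rotation, scaling)
    between matched point sets, rejecting outliers with RANSAC and
    fitting the model on the inliers."""
    src = np.float32(pts_a).reshape(-1, 1, 2)
    dst = np.float32(pts_b).reshape(-1, 1, 2)
    M, inlier_mask = cv2.estimateAffinePartial2D(
        src, dst, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return M, inlier_mask  # M is the 2x3 transformation model
```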
Predicting the regions to be stitched between the single-frame images according to the transformation model.
In this embodiment, after the transformation model is obtained, it is used to predict the regions to be stitched between the single-frame images. Specifically, the region to be stitched can be determined by applying the transformation model to a region of one image (e.g., a boundary region or a region dense in feature points) and observing its corresponding position in the other image.
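As a sketch, the predicted region can be obtained by projecting one image's border through the estimated 2x3 model and taking the bounding box of the result; projecting the border (rather than another region) is an illustrative choice.

```python
import cv2
import numpy as np

def predict_overlap(img_shape, M):
    """Project one image's border into the other image's coordinates
    using the affine model M (2x3) and return the bounding box of the
    projected border as the predicted region to be stitched."""
    h, w = img_shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.transform(corners, M)
    return cv2.boundingRect(projected.astype(np.int32))  # (x, y, w, h)
```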
Specifically, the step of obtaining the single-frame change result comprises the following.
Performing pixel difference calculation between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame to obtain the pixel difference value.
In this embodiment, the system compares each single-frame image of the first time frame pixel by pixel with the single-frame image at the corresponding position in the previous frame. The comparison computes the difference (specifically, the absolute difference) between corresponding pixel values of the two images, and all differences are accumulated or averaged to obtain an overall pixel difference value, which reflects the degree of pixel-level change between adjacent frames.
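A one-function sketch of this computation (averaging rather than summing the absolute differences is an illustrative choice):

```python
import cv2
import numpy as np

def pixel_difference(prev_frame, cur_frame):
    """Mean absolute per-pixel difference between single-frame images at
    corresponding positions in adjacent frames."""
    return float(np.mean(cv2.absdiff(prev_frame, cur_frame)))
```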
Calculating the feature point position changes between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame using a feature point detection method to obtain the feature point change value.
In this embodiment, the SIFT feature point detection method is used to extract feature points from the single-frame images of the first time frame and of the previous frame; a feature descriptor matching algorithm then finds matched feature point pairs between the adjacent frames, and for each matched pair the positional difference in the image (the feature point position change value) is computed as the Euclidean distance.
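Assuming matched point lists such as those returned by the match_features sketch above, the feature point change value could be computed as a mean Euclidean displacement:

```python
import numpy as np

def feature_point_change(pts_prev, pts_cur):
    """Mean Euclidean distance between matched feature point positions
    in adjacent frames; pts_prev and pts_cur are matched (x, y) lists."""
    a = np.float32(pts_prev)
    b = np.float32(pts_cur)
    return float(np.mean(np.linalg.norm(a - b, axis=1)))
```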
Dividing the single-frame images of adjacent frames into a plurality of blocks, and calculating the similarity between corresponding blocks to obtain the block similarity.
In this embodiment, the single-frame images of adjacent frames are divided into several blocks (also called windows or regions) of the same size; then, for each pair of corresponding blocks (blocks at the same position in the previous frame and the current frame), the similarity between them is calculated using the structural similarity index (SSIM).
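A block-SSIM sketch using scikit-image, assuming uint8 grayscale inputs; the 64x64 block size is an illustrative choice.

```python
import numpy as np
from skimage.metrics import structural_similarity

def block_similarity(prev_gray, cur_gray, block=64):
    """Mean SSIM over corresponding same-size blocks of adjacent frames
    (prev_gray and cur_gray are uint8 grayscale arrays)."""
    h, w = prev_gray.shape
    scores = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            p = prev_gray[y:y + block, x:x + block]
            c = cur_gray[y:y + block, x:x + block]
            scores.append(structural_similarity(p, c))
    return float(np.mean(scores))
```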
Specifically, the judging whether the single-frame change result meets a preset condition comprises the following comparisons, combined in the sketch after this list:
comparing the calculated pixel difference value, feature point change value and block similarity with the preset pixel difference threshold, feature point change threshold and block similarity threshold, respectively;
if the pixel difference value is less than or equal to the pixel difference threshold, the feature point change value is less than or equal to the feature point change threshold, and the block similarity is greater than or equal to the block similarity threshold, judging that the single-frame change result meets the preset condition;
if any one of these conditions is not satisfied, judging that the single-frame change result does not meet the preset condition.
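Combined, the three comparisons reduce to a single predicate, as in this sketch (the threshold values are application-dependent and left as parameters):

```python
def change_within_limits(pix_diff, feat_change, blk_sim,
                         pix_thr, feat_thr, sim_thr):
    """Preset-condition check: the previous frame's comprehensive mapping
    matrix is reused only if all three criteria hold simultaneously."""
    return (pix_diff <= pix_thr
            and feat_change <= feat_thr
            and blk_sim >= sim_thr)
```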
Furthermore, the fusion method further comprises:
post-processing the fused video, wherein the post-processing comprises color correction, brightness adjustment and denoising.
The working principle of the video fusion method and fusion system provided by the invention is as follows.
First, the system synchronously extracts the single-frame images of the current first time frame from the video streams captured by multiple cameras shooting the same scene from different angles. This is the starting point of the video fusion process.
For the first frame of the video streams, the system performs homography transformation on the single-frame images using the comprehensive mapping matrix. The comprehensive mapping matrix is the product of the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix; it corrects the perspective deformation caused by differences in camera angle and position as well as camera lens distortion, and scales the images according to actual requirements. Through this step, the system initially aligns the single-frame images in preparation for subsequent stitching.
For non-first frames, the system first performs inter-frame change detection between these single-frame images and the corresponding images of the previous frame. This includes calculating the pixel difference value, the feature point change value and the block similarity to comprehensively evaluate the degree of inter-frame change. The purpose of this step is to measure the similarity between the current frame and the previous frame and thereby determine whether the comprehensive mapping matrix needs to be updated.
Next, the system judges whether the inter-frame change result satisfies the preset condition according to the preset thresholds (the pixel difference threshold, the feature point change threshold and the block similarity threshold). If the condition is met, the current frame is sufficiently similar to the previous frame, and the comprehensive mapping matrix of the previous frame can be used directly to homography-transform the current frame, followed by stitching according to the relative positions of the cameras. If the condition is not met, the inter-frame change is large and the comprehensive mapping matrix must be updated.
When the comprehensive mapping matrix needs to be updated, the system recalculates the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix so as to reflect the current frame more accurately. The updated comprehensive mapping matrix is then used to homography-transform the current frame before stitching.
Finally, to improve the quality of the fused video, the system also post-processes the stitched video, including color correction, brightness adjustment and denoising, so that the fused video is more natural and coherent in visual effect.
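A minimal post-processing sketch, assuming OpenCV; the gain/offset values and denoising strength are illustrative, and color correction is reduced here to a simple linear gain for brevity.

```python
import cv2

def post_process(frame, alpha=1.05, beta=5):
    """Brightness/contrast adjustment followed by denoising of a fused
    video frame (BGR uint8)."""
    adjusted = cv2.convertScaleAbs(frame, alpha=alpha, beta=beta)
    return cv2.fastNlMeansDenoisingColored(adjusted, None, 3, 3, 7, 21)
```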
In summary, the video fusion method provided by the invention achieves efficient and accurate fusion of multiple video streams with different shooting angles and overlapping areas in the same scene through the construction and dynamic adjustment of the comprehensive mapping matrix, inter-frame change detection, post-processing and other technical means.
Example two
The invention also provides a video fusion system for performing the above video fusion method, applied to a plurality of video streams with different shooting angles and overlapping areas in the same scene, the video streams being acquired by a plurality of cameras, the fusion system comprising:
An image acquisition module 100, configured to respectively acquire single-frame images from the plurality of video streams, wherein all the single-frame images are images of a first time frame;
An image processing module 200, configured to perform the following steps:
if the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio;
if the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity;
judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold;
if yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras;
if not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the associated hardware, the program being stored in a computer-readable storage medium, including Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Claims (8)
1. A video fusion method, characterized in that it is applied to a plurality of video streams having different shooting angles and overlapping areas in the same scene, the plurality of video streams being acquired by a plurality of cameras, and the fusion method comprises the following steps:
acquiring single-frame images from the plurality of video streams respectively, wherein all the single-frame images are images of a first time frame;
if the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio;
if the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity;
judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold;
if yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras;
if not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
2. The video fusion method according to claim 1, wherein the regions to be stitched between the single-frame images are obtained using a feature point matching method.
3. The video fusion method according to claim 2, wherein the feature point matching method for determining the regions to be stitched between the single-frame images comprises the following steps:
detecting feature points in all the single-frame images belonging to the first time frame in the plurality of video streams, wherein the feature points comprise corner points and edge points;
determining adjacent single-frame images according to the relative positions of the cameras, and matching the feature points in each pair of adjacent single-frame images using the SIFT method to obtain successfully matched feature point pairs;
estimating a transformation model between the single-frame images by the least squares method according to the successfully matched feature point pairs, wherein the transformation model characterizes the translation, rotation and scaling relationships between the images;
predicting the regions to be stitched between the single-frame images according to the transformation model.
4. The video fusion method according to claim 3, wherein the step of obtaining the single-frame change result comprises:
performing pixel difference calculation between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame to obtain the pixel difference value;
calculating the feature point position changes between all the single-frame images of the first time frame and the single-frame images at corresponding positions in the frame preceding the first time frame using a feature point detection method to obtain the feature point change value;
dividing the single-frame images of adjacent frames into a plurality of blocks, and calculating the similarity between corresponding blocks to obtain the block similarity.
5. The video fusion method according to claim 4, wherein the judging whether the single-frame change result meets a preset condition comprises:
comparing the calculated pixel difference value, feature point change value and block similarity with the preset pixel difference threshold, feature point change threshold and block similarity threshold, respectively;
if the pixel difference value is less than or equal to the pixel difference threshold, the feature point change value is less than or equal to the feature point change threshold, and the block similarity is greater than or equal to the block similarity threshold, judging that the single-frame change result meets the preset condition;
if any one of these conditions is not satisfied, judging that the single-frame change result does not meet the preset condition.
6. The video fusion method according to claim 5, wherein the updating of the comprehensive mapping matrix comprises:
recalculating the perspective correction mapping matrix, the distortion transformation mapping matrix and the scaling mapping matrix.
7. The video fusion method according to claim 6, further comprising:
post-processing the fused video, wherein the post-processing comprises color correction, brightness adjustment and denoising.
8. A video fusion system for performing the video fusion method according to any one of claims 1 to 7, applied to a plurality of video streams having different shooting angles and overlapping areas in the same scene, the plurality of video streams being acquired by a plurality of cameras, the fusion system comprising:
an image acquisition module, configured to respectively acquire single-frame images from the plurality of video streams, wherein all the single-frame images are images of a first time frame;
an image processing module, configured to perform the following steps:
if the first time frame is the first frame of the plurality of video streams, performing image homography transformation on all the single-frame images through a comprehensive mapping matrix, wherein the comprehensive mapping matrix is obtained by matrix multiplication of a perspective correction mapping matrix, a distortion transformation mapping matrix and a scaling mapping matrix; the perspective correction mapping matrix is calculated according to the regions to be stitched between the single-frame images, the distortion transformation mapping matrix is calculated according to the internal and external parameters of the cameras, and the scaling mapping matrix is calculated according to the required scaling ratio;
if the first time frame is a non-first frame of the plurality of video streams, performing inter-frame change detection between all the single-frame images of the first time frame and the corresponding single-frame images of the frame preceding the first time frame to obtain a single-frame change result, wherein the single-frame change result comprises a pixel difference value, a feature point change value and a block similarity;
judging whether the single-frame change result meets a preset condition, wherein the preset condition comprises a pixel difference threshold, a feature point change threshold and a block similarity threshold;
if yes, performing image homography transformation on all the single-frame images of the first time frame through the comprehensive mapping matrix of the frame preceding the first time frame, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras;
if not, updating the comprehensive mapping matrix, performing image homography transformation on all the single-frame images of the first time frame using the updated comprehensive mapping matrix, and stitching all the homography-transformed single-frame images according to the relative positions of the cameras.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411073252.7A CN119012024A (en) | 2024-08-06 | 2024-08-06 | Video fusion method and fusion system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411073252.7A CN119012024A (en) | 2024-08-06 | 2024-08-06 | Video fusion method and fusion system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119012024A true CN119012024A (en) | 2024-11-22 |
Family
ID=93483513
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411073252.7A Pending CN119012024A (en) | 2024-08-06 | 2024-08-06 | Video fusion method and fusion system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119012024A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119323831A (en) * | 2024-12-18 | 2025-01-17 | 广州炫视智能科技有限公司 | Man-machine interaction control method and control system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102006425A (en) * | 2010-12-13 | 2011-04-06 | 交通运输部公路科学研究所 | Method for splicing video in real time based on multiple cameras |
CN112037128A (en) * | 2020-08-21 | 2020-12-04 | 苏州巨能图像检测技术有限公司 | Panoramic video splicing method |
CN116824184A (en) * | 2023-07-18 | 2023-09-29 | 中国计量大学 | Image matching method, device, media and equipment based on ORB and feature pixel grayscale information |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102006425A (en) * | 2010-12-13 | 2011-04-06 | 交通运输部公路科学研究所 | Method for splicing video in real time based on multiple cameras |
CN112037128A (en) * | 2020-08-21 | 2020-12-04 | 苏州巨能图像检测技术有限公司 | Panoramic video splicing method |
CN116824184A (en) * | 2023-07-18 | 2023-09-29 | 中国计量大学 | Image matching method, device, media and equipment based on ORB and feature pixel grayscale information |
Non-Patent Citations (2)
Title |
---|
ZHANG Chunyu et al., "Optimal adaptive homography matrix estimation algorithm for video stitching," Journal of Jilin University (Engineering and Technology Edition), vol. 43, no. 04, 31 July 2013 (2013-07-31), pages 1116-1119 *
WEN Yamei et al., "A high-precision automatic mosaic algorithm for remote sensing images," Journal of Xiangnan University, vol. 27, no. 05, 31 October 2006 (2006-10-31), pages 62-65 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119323831A (en) * | 2024-12-18 | 2025-01-17 | 广州炫视智能科技有限公司 | Man-machine interaction control method and control system |
CN119323831B (en) * | 2024-12-18 | 2025-03-28 | 广州炫视智能科技有限公司 | Human-computer interaction control method and control system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |