CN101009021A

CN101009021A - Video stabilization method based on feature matching and tracking

Info

Publication number: CN101009021A
Application number: CN 200710036817
Authority: CN
Inventors: 胡蓉; 施荣杰; 沈一帆; 陈文斌
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2007-01-25
Filing date: 2007-01-25
Publication date: 2007-08-01
Anticipated expiration: 2027-01-25
Also published as: CN100530239C

Abstract

The invention belongs to the technical field of computer digital image and video processing, and specifically relates to a video stabilization method based on feature matching and tracking. The invention applies SIFT-based feature matching to the video de-jittering problem. The steps include: finding the SIFT feature points of each video frame; using an affine model as the parameter estimation model to perform global motion parameter estimation; smoothing the motion of the video sequence with Gaussian filtering and curve fitting; and filling in the unknown regions. The method is robust, is little affected by environmental factors, estimates motion parameters accurately, has a small image alignment error, and has a low time cost for video repair.

Description

Video stabilization method based on feature matching and tracking

Technical field

The invention belongs to the technical field of computer digital image and video processing, and specifically relates to a video stabilization method based on feature point matching and tracking.

Background technology

Video de-jittering (also called video stabilization) is a key video enhancement technique. With the sharp fall in the price of digital camera equipment and the growth in computing power, personal and mobile digital cameras have become increasingly common, and digital image and video processing has received more attention. With these devices, people can easily record what happens around them anytime and anywhere, and large amounts of personally shot video can be uploaded to the Internet for others to watch and download. Because the capture device is unstable, home videos, video from security monitoring equipment, and video shot by UAVs (Unmanned Aerial Vehicles) usually contain pronounced high-frequency jitter, which blurs the image and tires the viewer. In addition, stable video compresses better. If the whole image is shaking, more bits are needed to record these motion changes, which wastes storage space and transmission bandwidth. A stable image gives a better compression ratio and quality, which benefits remote and network browsing. In recent years, much research has addressed this problem, and many new methods and techniques have been proposed to improve the quality and speed of video stabilization.

Video jitter refers to the shaking and blurring of a video sequence caused by inconsistent motion noise of the camera during shooting. To eliminate this jitter, the true global motion parameters of the camera must be extracted, and the camera motion is then compensated with a suitable transformation so that the picture is smooth and stable. This technique is commonly called video de-jittering or video stabilization. Current video de-jittering techniques fall into two kinds: hardware methods and image processing methods. The hardware method, also called optical stabilization, comprises an optical system and uses motion sensors to compensate for camera motion. Although this method is very effective, it greatly increases the cost of the camera and can usually handle only fairly small motions, so many cameras do not adopt it. Image processing methods post-process the captured video clip to remove jitter caused by human or mechanical vibration. There are two main approaches: feature matching (Feature Matching) and optical flow (Optical Flow).

The feature matching method extracts feature points from each frame, matches features between adjacent frames, computes the global motion parameters of the camera from the matching result, and finally compensates the original sequence with the filtered global motion transformation. Its effectiveness depends largely on the precision of feature matching, and its use is limited when the scene contains moving objects or has weak texture. The optical flow method first computes the optical flow between adjacent frames, obtains the global motion parameters from the flow through motion analysis, and then compensates the original sequence according to the filtered motion parameters. Its advantage is that a motion vector is obtained for every pixel. However, if the scene contains regions with inconsistent motion, global motion estimation usually has to be combined with video segmentation. These methods also generally require considerable computation because every pixel must be analyzed, and the aperture problem inherent in optical flow computation must be taken into account.

Finally, because de-jittering translates or rotates the original video sequence, unknown regions appear at the frame edges, so a fast and effective video repair method is also important. The main methods in use today are video mosaicing (Video Mosaic) and estimation-based video completion (Video Completion). Both have drawbacks: a simple Mosaic method produces blur and ghosting, while the estimation-based method, although its repair result is better, has a larger time cost because the motion vector of every pixel must be computed.

Summary of the invention

The objective of the invention is to propose a video stabilization method with a low time cost and good algorithmic robustness.

The invention uses scale-invariant features (Scale-Invariant Feature Transform, SIFT) for inter-frame motion estimation. SIFT features are invariant to image scale and rotation, and are partially invariant to changes in illumination and 3D camera viewpoint. Because SIFT features are localized in both the spatial and frequency domains, the influence of occlusion, clutter, and noise is greatly reduced. The features are also highly distinctive, so they can be matched with high accuracy. These advantages make SIFT-based matching more robust and reliable. So far the method has been used only for image matching and panorama generation; the invention is the first to apply it to video de-jittering, and it obtains satisfactory experimental results. Given a jittery video sequence, de-jittering is carried out in the following steps.

1. Find the SIFT feature points of each frame, and give each feature point a descriptor that contains spatial and frequency domain features. Each frame is smoothed with Gaussian functions of different scales, and the SIFT feature points are located at the extrema (maxima and minima) of the difference between adjacent scales. The image is then upsampled and smoothed in the same way, and so on, building a pyramid structure in which the feature points at each scale are found. Next, the gradient direction of each feature point is computed from the local image features, so that each feature point has a position, a scale, and a direction. A descriptor is then computed for the local region of the feature point, so that it stays as unchanged as possible when the environment changes, for example under changes in lighting or viewpoint. Specifically, as shown in Fig. 1, the gradient and direction of each point are computed on the smoothed image at the scale of the feature point. In the left panel (a) of Fig. 1, the gradients and directions around the feature point are sampled and the whole sampling window is divided into 2×2 sampling regions; in each region a direction histogram with 8 directions is computed, as in the right panel (b) of Fig. 1. Each local descriptor thus contains the magnitude in every direction of the direction histogram of every sampling region, so the local descriptor of a feature point is a feature vector of length 2 × 2 × 8 = 32.

2. Global motion parameter estimation. An affine model is used as the motion parameter estimation model. The affine model is expressed as:

A = \begin{bmatrix} a_{1} & a_{2} & a_{3} \\ a_{4} & a_{5} & a_{6} \\ 0 & 0 & 1 \end{bmatrix},

In the model, a1, a2, a4, and a5 describe scaling and rotation, and a3 and a6 describe translation. First, a fast nearest-neighbor algorithm matches the feature points above, taking the nearest neighbor of each point as its match. Then a Hough transform determines, by voting, all the feature points that belong to the same object. Finally, least squares (least-square solution) is applied to these feature points to determine the 6 parameters of the motion model.
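The final least-squares step can be illustrated with a small dependency-free sketch. It assumes the matched point pairs have already been filtered to those sharing one motion; the function names `solve3` and `fit_affine` are hypothetical and not taken from the patent.

```python
def solve3(m, v):
    """Solve a 3x3 linear system m * p = v by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def fit_affine(src, dst):
    """Least-squares affine parameters (a1..a6) mapping src points onto dst points."""
    # Both output coordinates share the same 3x3 normal-equation matrix.
    ata = [[0.0] * 3 for _ in range(3)]
    atx = [0.0] * 3
    aty = [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atx[i] += row[i] * u
            aty[i] += row[i] * v
    a1, a2, a3 = solve3(ata, atx)   # first row of A
    a4, a5, a6 = solve3(ata, aty)   # second row of A
    return a1, a2, a3, a4, a5, a6
```

At least three non-collinear matches are needed for the system to be solvable; with more matches the fit averages out small localization errors.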

3. Smooth the motion of the video sequence by combining Gaussian filtering and curve fitting. Both are widely used for smoothing and denoising signals, and each has its own strengths and weaknesses. Combining them exploits the strengths of both: it keeps the video stream stable, and it avoids the over-smoothing caused by using a Gaussian filter with a large window alone, so that the unknown border regions produced are as small as possible. A stable video here is not completely motionless; the motion of the generated video is expected to be smooth, giving the viewer a fluent and pleasant visual experience. The idea of combining these two methods has not appeared in earlier research. Experiments show that the combination yields a satisfactory stable video sequence. A quadratic curve is first fitted to the motion curve estimated above, and the smoothed motion curve is then Gaussian filtered. The parameter σ of the Gaussian kernel need not be large (σ is generally taken between 0.6 and 1.2), which avoids over-smoothing. The two possible orderings of the methods were compared; the difference is small, and what matters most is the choice of parameters.

4. Filling in the unknown regions has always been a difficult problem. In this method, on the stabilized video stream obtained above, the neighbor frames around the target frame (the frame to be filled), 4 to 6 on each side, are first aligned to the target frame. The difference between each neighbor frame and the target frame is computed and the neighbors are sorted by it; in general, the farther a frame is from the target frame, the larger the difference. Filling starts with the neighbor frame of smallest difference. If unknown regions remain, the neighbor frame with the second smallest difference is used, and so on. Unlike the usual approach, a path of minimum difference is found on the difference image by dynamic programming (DP), and the two images are stitched with this path as the boundary. To guarantee temporal continuity, the search is restricted to a band (10 pixels) along the border of the unknown region. Combining the DP algorithm with the Mosaic method is also an innovation of this work.
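The minimum-difference path can be sketched with a standard seam-style dynamic program. This is an illustration under assumptions not stated in the patent: the difference image is taken to be already cropped to the 10-pixel search band, the path runs from the top row to the bottom row, and it may shift by at most one column per row. The name `min_difference_path` is hypothetical.

```python
def min_difference_path(diff):
    """Return one column index per row for the top-to-bottom path of minimum total difference."""
    rows, cols = len(diff), len(diff[0])
    cost = [list(diff[0])]   # accumulated cost of the best path ending at each pixel
    back = []                # back[r-1][c] = column chosen in row r-1 for pixel (r, c)
    for r in range(1, rows):
        prev = cost[-1]
        cur, ptr = [], []
        for c in range(cols):
            # The path may continue from the same column or a horizontal neighbour.
            best = min(range(max(0, c - 1), min(cols, c + 2)), key=lambda k: prev[k])
            cur.append(diff[r][c] + prev[best])
            ptr.append(best)
        cost.append(cur)
        back.append(ptr)
    # Start from the cheapest end point and walk the back-pointers upwards.
    c = min(range(cols), key=lambda k: cost[-1][k])
    path = [c]
    for ptr in reversed(back):
        c = ptr[c]
        path.append(c)
    path.reverse()
    return path
```

Pixels on one side of the returned path would be taken from the target frame and pixels on the other side from the aligned neighbor frame, so the stitch follows the places where the two images agree best.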

Advantages of the invention:

(1) The algorithm is robust and is little affected by adverse factors such as illumination and occlusion.

(2) Motion parameters are estimated accurately, and the image alignment error is small.

(3) Combining the two smoothing algorithms overcomes the weaknesses of each and gives better results.

(4) Video repair incorporates the DP algorithm, which guarantees temporal and spatial continuity while greatly reducing the time cost compared with optical flow methods.

Description of drawings

Fig. 1 illustrates the descriptor: (a) sampling of the gradients and directions around a feature point; (b) direction histograms with 8 directions.

Fig. 2 illustrates the computation of the difference of Gaussians.

Fig. 3 illustrates the neighborhood used to find extremum points in the difference of Gaussians.

Fig. 4 illustrates the feature points in an image and their gradients.

Fig. 5 compares image filling results: (a) the result of the invention; (b) the result of the Mosaic method.

Fig. 6 illustrates experimental results of the invention. The first row shows the original video stream, the second row the video stream after de-jittering, and the third row the result after repair.

Embodiment

1. The test data is a jittery video clip shot with a hand-held camera.

2. Each frame is smoothed with Gaussian functions whose scale grows by a factor of 2, and the extrema of the differences between layers are taken as feature points. The upsampled image is then smoothed with Gaussian functions of different scales in the same way, differences are taken and extrema found, and so on. In the experiment the image was upsampled 3 times. The difference of Gaussians is illustrated in Fig. 2; Fig. 3 shows the neighborhood of an extremum point, which includes neighboring nodes in the same layer and in the layers above and below.
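The extremum search over the same layer and the layers above and below can be sketched as follows. This is a minimal illustration, assuming the Gaussian-smoothed layers of one pyramid level are already available as equally sized 2D lists; the name `dog_extrema` and the choice of a strict comparison against all 26 neighbors are assumptions, not details given in the patent.

```python
def dog_extrema(layers):
    """Find (layer, row, col) positions that are strict extrema among their 26 neighbours
    in the stack of differences between adjacent smoothed layers."""
    # Difference of each pair of adjacent smoothed layers.
    dog = [[[b - a for a, b in zip(row_a, row_b)] for row_a, row_b in zip(la, lb)]
           for la, lb in zip(layers, layers[1:])]
    found = []
    for s in range(1, len(dog) - 1):              # needs a layer above and below
        for r in range(1, len(dog[s]) - 1):       # skip the image border
            for c in range(1, len(dog[s][r]) - 1):
                v = dog[s][r][c]
                neigh = [dog[s + ds][r + dr][c + dc]
                         for ds in (-1, 0, 1) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (ds, dr, dc) != (0, 0, 0)]
                if v > max(neigh) or v < min(neigh):
                    found.append((s, r, c))
    return found
```

Four smoothed layers give three difference layers, which is the minimum for which the middle layer can contain an extremum.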

3. Compute the gradient direction of each feature point. The formulas are:

m(x, y) = \sqrt{(L(x + 1, y) - L(x - 1, y))^{2} + (L(x, y + 1) - L(x, y - 1))^{2}}

\theta(x, y) = \tan^{-1}\left(\frac{L(x, y + 1) - L(x, y - 1)}{L(x + 1, y) - L(x - 1, y)}\right)

L is the smoothed image at the scale of the feature point, m(x, y) is the gradient magnitude, and θ(x, y) is the gradient direction. Fig. 4 shows the feature points found in an image and their gradients.
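The two formulas translate directly into code. The sketch below assumes the smoothed image is a 2D list indexed as `L[y][x]` and uses `atan2` in place of the plain inverse tangent so that the quadrant is preserved and a zero horizontal difference causes no division error; both choices, and the name `gradient`, are assumptions for illustration.

```python
import math


def gradient(L, x, y):
    """Gradient magnitude m and direction theta at (x, y) from central differences."""
    dx = L[y][x + 1] - L[y][x - 1]   # L(x + 1, y) - L(x - 1, y)
    dy = L[y + 1][x] - L[y - 1][x]   # L(x, y + 1) - L(x, y - 1)
    m = math.hypot(dx, dy)
    theta = math.atan2(dy, dx)       # assumption: atan2 instead of tan^-1(dy / dx)
    return m, theta
```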

4. Use the formulas in step 3 to compute the gradient at every point. As shown in Fig. 1, compute the gradient histogram (8 directions) in each sampling region around the feature point. The gradient magnitudes in each direction of each region form a feature vector, which serves as the local descriptor of the feature point.
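A minimal sketch of the 2×2×8 descriptor follows. It assumes a square sampling window with an even side length, directions in radians, and simple hard binning into 45-degree sectors with no weighting or interpolation; these details and the name `descriptor` are illustrative assumptions, since the text does not specify them.

```python
import math


def descriptor(mags, thetas):
    """Build a 2x2x8 = 32-element descriptor from a square window of gradient
    magnitudes and directions (radians)."""
    n = len(mags)
    half = n // 2
    desc = [0.0] * 32
    for r in range(n):
        for c in range(n):
            # Index of the 2x2 sampling region this sample falls in.
            region = (0 if r < half else 1) * 2 + (0 if c < half else 1)
            # Map the direction to one of 8 bins of 45 degrees each.
            b = int((thetas[r][c] % (2 * math.pi)) / (math.pi / 4)) % 8
            # Each sample votes with its gradient magnitude.
            desc[region * 8 + b] += mags[r][c]
    return desc
```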

5. Use a nearest-neighbor algorithm (Nearest Neighbor) to find, in two adjacent frames, the nearest match of each feature point. A Hough transform then votes for the most likely motion, and the feature points that voted for it are recovered; these should share the same motion. Finally, the 6 parameters of the affine model are determined from these feature points by least squares.
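The matching and voting stages can be sketched as below. Two simplifications are assumed and are not the patent's stated procedure: the nearest-neighbor search is brute force rather than a fast approximate search, and the Hough vote is replaced by a much simpler vote over quantized displacement vectors. The names `match_nearest`, `vote_inliers`, and `bin_size` are hypothetical.

```python
from collections import Counter


def match_nearest(desc_a, desc_b):
    """For each descriptor in desc_a, return the index of its nearest descriptor in desc_b."""
    def dist2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return [min(range(len(desc_b)), key=lambda j: dist2(d, desc_b[j])) for d in desc_a]


def vote_inliers(pts_a, pts_b, bin_size=4.0):
    """Keep the matches whose displacement falls in the most-voted bin
    (a simplified stand-in for Hough voting)."""
    bins = [(round((xb - xa) / bin_size), round((yb - ya) / bin_size))
            for (xa, ya), (xb, yb) in zip(pts_a, pts_b)]
    winner, _ = Counter(bins).most_common(1)[0]
    return [i for i, b in enumerate(bins) if b == winner]
```

The indices returned by `vote_inliers` select the point pairs that would be passed on to the least-squares fit of the affine parameters.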

6. Once the motion path has been found, smooth the motion curve of each of the 6 parameters by quadratic curve fitting. The quadratic used here is y = ax² + bx + c, and the coefficients a, b, and c are determined by least squares.
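As an illustration, the least-squares coefficients can be obtained from the 3×3 normal equations. The sketch below solves them with Cramer's rule and assumes x is the frame index and y the value of one motion parameter; the names `det3` and `fit_quadratic` are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    s = [sum(x ** k for x in xs) for k in range(5)]                   # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # sums of y*x^0 .. y*x^2
    # Normal equations obtained by setting the derivatives of the squared error to zero.
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(m)
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(det3(mc) / d)   # Cramer's rule
    return tuple(coeffs)
```

For long sequences the sums of x⁴ become large, so shifting the frame indices to be centered on zero before fitting is a reasonable precaution.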

7. Further smooth the motion curve of each of the 6 parameters with a Gaussian function. The Gaussian kernel is

G(k) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp\left(-\frac{k^{2}}{\sigma^{2}}\right).

σ is the standard deviation; σ = 1 in the experiment. k is the distance between a neighbor frame and the target frame. The smoothing is computed as follows:

T_{i} = \sum_{j \in N} A_{i}^{j} G(j - i)

{\hat{I}}_{i} = T_{i} I_{i}

N is the neighborhood of frame i, N = {j | i-k ≤ j ≤ i+k}. A_{i}^{j} denotes the motion parameters from frame i to frame j. T_{i} denotes the motion compensation applied to frame i after smoothing.

\hat{I}_{i} denotes frame i after smoothing.
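The Gaussian weighting can be sketched on a single scalar parameter curve. This is a simplification that treats each of the 6 parameters independently; the kernel exponent is taken exactly as written in the text, the weights are additionally normalized to sum to 1, and frames outside the sequence are skipped. The normalization, the border handling, and the names `gaussian_weights` and `smooth_curve` are assumptions.

```python
import math


def gaussian_weights(k, sigma=1.0):
    """Weights G(j - i) for offsets -k..k, with the kernel as written in the text."""
    w = [math.exp(-(d * d) / (sigma * sigma)) / math.sqrt(2 * math.pi * sigma * sigma)
         for d in range(-k, k + 1)]
    total = sum(w)
    return [x / total for x in w]   # assumption: normalise so the weights sum to 1


def smooth_curve(values, k=3, sigma=1.0):
    """Gaussian-weighted average of each sample with its neighbours in [i - k, i + k]."""
    w = gaussian_weights(k, sigma)
    n = len(values)
    out = []
    for i in range(n):
        acc = norm = 0.0
        for d in range(-k, k + 1):
            j = i + d
            if 0 <= j < n:          # frames outside the sequence are skipped
                acc += w[d + k] * values[j]
                norm += w[d + k]
        out.append(acc / norm)      # renormalise near the ends of the sequence
    return out
```

With σ = 1 the weights fall off quickly, so frames more than two or three positions away contribute very little, which is consistent with the stated aim of avoiding over-smoothing.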

8. Finally, fill in the unknown regions that motion compensation produces at the borders. Using the motion parameters, the neighbor frames are aligned to the target frame; 5 frames before and 5 after are used here. The difference between each neighbor and the target frame is computed, the neighbors are sorted from smallest to largest difference, and the frame with the smallest difference is used first. On the difference image, within a band extending 10 pixels inward from the border of the unknown region, a path of minimum difference is found by dynamic programming (DP), and the two images are stitched along this path. If unknown regions remain, the frame with the second smallest difference is used next, and so on. If unknown regions still remain, more neighbor frames are needed. Fig. 5 compares filling results. The right panel (a) is the result of the proposed method; the left panel (b) is the result of a general Mosaic method and shows obvious stitching errors. Boxes mark part of the filled region for easier comparison.
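The ordering of neighbor frames and the successive filling can be sketched as follows. The sketch assumes the neighbor frames are already aligned to the target, represents frames as 2D lists of gray values with `None` where a neighbor has no data, measures difference as the mean absolute difference over the known target pixels, and copies pixels directly instead of stitching along a DP path. All of these, and the name `fill_unknown`, are illustrative assumptions.

```python
def fill_unknown(target, known, neighbours):
    """Fill unknown pixels of target from aligned neighbour frames, most similar first."""
    def difference(frame):
        # Mean absolute difference over pixels that are known in the target.
        vals = [abs(frame[r][c] - target[r][c])
                for r in range(len(target)) for c in range(len(target[0]))
                if known[r][c] and frame[r][c] is not None]
        return sum(vals) / len(vals) if vals else float("inf")

    result = [row[:] for row in target]
    filled = [row[:] for row in known]
    for frame in sorted(neighbours, key=difference):    # smallest difference first
        for r in range(len(target)):
            for c in range(len(target[0])):
                if not filled[r][c] and frame[r][c] is not None:
                    result[r][c] = frame[r][c]
                    filled[r][c] = True
    return result, filled
```

Any pixel still marked `False` in the returned mask after all neighbors are used corresponds to the case in the text where more neighbor frames are needed.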

9. Fig. 6 shows the final experimental results for 4 frames of the test video. The first row is the original video stream, and the second and last rows are the results after de-jittering and after repair, respectively. The coordinate axes are drawn to make the stabilization results easier to compare.

Claims

1. A video stabilization method based on feature matching and tracking, characterized in that the specific steps are as follows:

(1) For a jittery video sequence, find the SIFT feature points of each frame, and give each feature point a descriptor containing spatial and frequency domain features, where SIFT denotes scale-invariant features;

(2) Global motion parameter estimation: an affine model is used as the motion parameter estimation model, and the affine model is expressed as:

A = \begin{bmatrix} a_{1} & a_{2} & a_{3} \\ a_{4} & a_{5} & a_{6} \\ 0 & 0 & 1 \end{bmatrix},

In the model, a1, a2, a4, and a5 describe scaling and rotation, and a3 and a6 describe translation; first, a fast nearest-neighbor algorithm is used to match the above feature points, taking the nearest neighbor as the matching point; then a Hough transform is used to determine, by the voting principle, all feature points belonging to the same object; finally, for these feature points, the least squares method is used to determine each parameter of the motion model;

(3) For the curve estimated in step (2), Gaussian filtering and curve fitting are used to smooth the motion of the video sequence, with the Gaussian kernel parameter σ between 0.6 and 1.2;

(4) For filling the unknown regions, on the stabilized video stream produced by step (3), first align the 4-6 neighbor frames on each side of the target frame to the target frame, compute the difference between each neighbor frame and the target frame, and sort the neighbor frames by the size of the difference; the target frame is filled from the neighbor frame with the smallest difference; if an unknown region remains, it is filled from the neighbor frame with the second smallest difference, and so on.

2. The video stabilization method according to claim 1, characterized in that the step of finding the SIFT feature points of each frame is as follows: each frame is smoothed with Gaussian functions of different scales, and the SIFT feature points are located at the extrema of the difference between adjacent scales; the image is then upsampled and smoothed in the same way, and so on, building a pyramid structure in which the feature points at each scale are found; the step of giving each feature point a descriptor is as follows: the gradient direction of each feature point is computed from the local features of the image, giving the position, scale, and direction of each feature point; then the gradient and direction of each point are computed on the smoothed image at the scale of the feature point; the gradients and directions around the feature point are sampled, the whole sampling window is divided into 2×2 sampling regions, and a direction histogram with 8 directions is computed in each region, so that the local descriptor of a feature point is expressed as a feature vector of length 2×2×8 = 32.