CN105701823A - Method of using occlusion relation to recover depth order
- Publication number
- CN105701823A (application number CN201610024311.0A)
- Authority
- CN
- China
- Prior art keywords
- moving object
- background
- occlusion
- scene
- segmentation
- Prior art date
- Legal status
- Pending
Classifications

- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
      - G06T2207/00—Indexing scheme for image analysis or image enhancement
        - G06T2207/10—Image acquisition modality
          - G06T2207/10016—Video; Image sequence
        - G06T2207/20—Special algorithmic details
          - G06T2207/20021—Dividing image into blocks, subimages or windows
Abstract
The invention discloses a method of using an occlusion relation to recover a depth order. The method comprises the following steps: performing background scene segmentation; performing moving object segmentation, i.e., based on the background scene segmentation map, updating the superpixel image of the scene frame by frame, which means that the superpixel region corresponding to a moving object is determined by background subtraction in each frame of the background scene in which the moving object appears, yielding a segmentation map of the moving object in the background scene; constructing a region-level Markov random field; performing depth order inference on paired regions based on the resulting space-time graph, so as to obtain an occlusion matrix; and performing global depth order inference based on the occlusion matrix, so as to obtain the depth order relation of the background image. The method enhances the depth level recovery effect.
Description
Technical Field
The invention relates to the field of computer vision and pattern recognition, and in particular to a method for recovering depth order from occlusion relations.
Background
Recovering the depth hierarchy of a scene from a sequence of two-dimensional images has wide application. Video surveillance is typically equipped with a fixed camera that records one or more moving objects in a scene. As an object moves through the scene, the occlusion information and a layered representation of the scene can be recovered accordingly. However, it is difficult for a still camera to obtain effective matching cues, such as edges and lighting, for the scene image, which leads to inaccurate object edge detection and depth level inference results.
Learning-based methods for estimating the depth order of a scene from a single image have been explored previously. Later, researchers moved from single images to image sequences. Fouhey et al. found that approximate scene geometry in a cluttered room can be obtained from human body poses.
Since a moving object always has an occlusion relation with part of the scene as it passes through, G. Brostow et al. obtained sparse pairwise hierarchical order relations from a moving object traversing a static scene. These cues, while effective, are sparse, which makes a dense depth ordering of every pixel difficult. A. Schodl also used paired occlusion cues to obtain the hierarchical relation between a moving object and the background regions it passes through. However, a moving object cannot pass through all regions of the background scene, so the regions that interact with it are limited, and some regions are never assigned a place in the depth-level sequence. In addition, there is also a problem of over-segmentation.
Depth level estimation of images is a very important and classical problem in computer vision. However, conventional inference methods are limited: for example, it is difficult for a still camera to obtain effective matching cues for the scene image, or only a single cue is used for depth inference, so the inference results are unsatisfactory.
Disclosure of Invention
The present invention is directed to solving the above problems, and its object is to provide a method for recovering depth order from occlusion relations, so as to enhance the depth level recovery effect.
To achieve this purpose, the invention adopts the following technical scheme:
a method of recovering depth order from occlusion relations, comprising:
step 1, background scene segmentation: obtaining a background image with a static camera and over-segmenting it with the Meanshift segmentation method, so as to obtain a background scene segmentation map composed of small superpixel regions;
step 2, moving object segmentation: based on the background scene segmentation map, updating the superpixel image of the scene frame by frame, i.e., determining the superpixel region corresponding to the moving object in each frame of the background scene in which the moving object appears, so as to obtain a segmentation map of the moving object in the background scene;
step 3, constructing a region-level Markov random field: based on the obtained segmentation map of the moving object in the background scene, taking each region in the segmentation map as a node and, if two regions are adjacent, i.e., share an edge, connecting the two corresponding nodes; taking the moving object in each frame captured by the static camera as a node as well; meanwhile, adding temporal edges to connect the moving object nodes of adjacent frames, so as to construct a space-time graph encoding region occlusion relations that contains both spatial information and information about the moving object at different times;
step 4, performing depth order inference on paired regions based on the space-time graph, so as to obtain an occlusion matrix;
and step 5, performing global depth order inference from the occlusion matrix, so as to obtain the depth order relation of the background image.
Preferably, determining the superpixel region corresponding to the moving object by background subtraction in the background scene in which the moving object appears in step 2 is specifically as follows:
first, a scene model of the background is built, in which each pixel follows a Gaussian distribution A_p centered on its mean color value over all frames; given A_p, the probability that each pixel belongs to the background is estimated for each frame; if the probability is greater than 90%, the pixel is taken as a background pixel, and if it is lower than 10%, it is taken as a moving object pixel.
Preferably, the depth order inference on paired regions performed in step 4 based on the space-time graph is specifically as follows:
based on the superpixel representation of the scene, an occlusion matrix O is established to record the occlusion relation of each pair of regions i and j, where O_{i,j} ∈ {+1, -1, 0} corresponds to three cases: region i occludes region j, region i is occluded by region j, and no occlusion cue; the depth order of regions is inferred from two cues, and the occlusion matrix O is updated frame by frame;
first, the depth order of the moving object and a background region is judged from the motion occlusion cue: from the segmentation of the moving object, the boundary pixels of the moving object and of the background region are obtained, from which it is inferred whether the moving object is in front of or behind the background region, and the occlusion matrix O is updated at the corresponding position;
then, for spatially adjacent regions of the scene that have not yet been updated, region occlusion is judged using a monocular cue, and the occlusion matrix O is updated at the corresponding positions.
Preferably, the global depth order inference performed in step 5 from the occlusion matrix is specifically as follows:
a depth label from {1, 2, ..., L} is assigned to each superpixel region in the background scene, where L is predetermined and a larger label indicates a greater distance from the camera; the multi-label segmentation problem is thereby converted into an energy minimization problem based on the space-time graph model, where the space-time graph comprises n + F nodes, n corresponding to the n superpixel region nodes of the background scene and F corresponding to the moving object node in each of the F frames of the video, so that the goal is to obtain the depth label arrangement X = {X_1, ..., X_{n+F}};
an energy function based on the MRF space-time graph is defined as

E(X) = Σ_i E_i(X_i) + Σ_{(i,j)∈N_S} E^S_{i,j}(X_i, X_j) + Σ_{(i,j)∈N_T} E^T_{i,j}(X_i, X_j)

where E_i represents the cost of assigning a depth order label to a region node, X_i ∈ X, X_j ∈ X; E^S_{i,j} is the spatial pairwise term and N_S denotes the interacting region pairs in the background scene, with

E^S_{i,j}(X_i, X_j) = Σ_f E^{S,f}_{i,j}(X_i, X_j)

where E^{S,f}_{i,j} is the spatial pairwise term of the f-th frame image, O^f_{i,j} is the occlusion relation of regions i and j in the f-th frame, w^f_{i,j} is the confidence of the corresponding occlusion relation, β and γ are known coefficients, and P^f(i, j) represents the likelihood, judged from the relevant region features in each frame f, that regions i and j are coplanar; E^T_{i,j} is the temporal pairwise term, with N_T denoting the temporal edges of the moving object.
The technical scheme of the invention has the following beneficial effects:
(1) A framework that combines motion occlusion cues obtained from a moving object with monocular image occlusion cues for depth level inference in a static scene is proposed for the first time, overcoming the prior art's reliance on a single depth inference cue; (2) using graph theory, an MRF is constructed from adjacent regions and the edges between them, and moving object nodes are connected across frame images by temporal edges to ensure the smoothness of the object's motion between video frames, so that the constructed MRF is spatio-temporal; (3) the moving object is arbitrary, a person or any other autonomously moving object, and the method remains applicable when generalized to hierarchical judgment of multiple moving objects in a scene; (4) in global depth order inference, the MRF energy minimization function built over the moving object's passage through all frames of the video is solved, and terms whose individual inference capability is weak are combined into a powerful global function, so that the final depth inference result is more effective.
In this technical scheme, depth order inference is revisited on the basis of occlusion cues, and previous limitations are addressed and improved. The depth level segmentation problem is converted into a discrete labeling problem over an image sequence under a spatio-temporal Markov random field (MRF), and the depth level recovery effect is superior to that of current comparable research; correctness is verified on two video data sets published by L. Guan et al., SET-A (single moving object) and SET-B (multiple moving objects). The technical scheme is simple and effective.
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a method for recovering depth order from occlusion relations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of background scene segmentation;
FIG. 3 is a schematic diagram of segmentation of a moving object;
FIG. 4 is a schematic diagram of a region level MRF model;
FIG. 5 is a schematic diagram of pairwise depth inference;
FIG. 6 is a schematic diagram of the spatial pairwise term;
FIG. 7 is a schematic diagram of the temporal pairwise term;
FIG. 8 is a diagram illustrating depth inference for multiple moving objects.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
As shown in FIG. 1, a method for recovering depth order from occlusion relations comprises:
step 1, background scene segmentation: obtaining a background image with a static camera and over-segmenting it with the Meanshift segmentation method, so as to obtain a background scene segmentation map composed of small superpixel regions;
step 2, moving object segmentation: based on the background scene segmentation map, updating the superpixel image of the scene frame by frame, i.e., determining the superpixel region corresponding to the moving object in each frame of the background scene in which the moving object appears, so as to obtain a segmentation map of the moving object in the background scene;
step 3, constructing a region-level Markov random field: based on the obtained segmentation map of the moving object in the background scene, taking each region in the segmentation map as a node and, if two regions are adjacent, i.e., share an edge, connecting the two corresponding nodes; taking the moving object in each frame captured by the static camera as a node as well; meanwhile, adding temporal edges to connect the moving object nodes of adjacent frames, so as to construct a space-time graph encoding region occlusion relations that contains both spatial information and information about the moving object at different times;
step 4, performing depth order inference on paired regions based on the space-time graph, so as to obtain an occlusion matrix;
and step 5, performing global depth order inference from the occlusion matrix, so as to obtain the depth order relation of the background image.
The following description of the embodiments is made with reference to the accompanying drawings:
1. Background scene segmentation:
The background scene is the scene without moving objects in each frame of the video. A background image is therefore obtained with a static camera and over-segmented with the Meanshift segmentation method, yielding a background scene segmentation map of roughly 300 small superpixel regions, which serves as the background template for the series of frame images, i.e., it is applied to every frame. In this way, the edge information of the background scene is largely preserved and the resulting regions are regular in shape, which makes it convenient to handle a moving object passing through the background scene. The resulting image is shown in FIG. 2.
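By way of illustration only, the following minimal Python sketch shows one way this step could be realized; OpenCV's pyrMeanShiftFiltering stands in for the Meanshift segmentation named above, and the color-quantization step, the radii, and the file name background.png are assumptions of the sketch, not parts of the invention.

```python
import cv2
import numpy as np
from skimage.measure import label

def segment_background(background_bgr, spatial_radius=20, color_radius=30):
    """Over-segment a static background image into superpixel-like regions."""
    # Mean-shift filtering flattens color regions while preserving edges.
    filtered = cv2.pyrMeanShiftFiltering(background_bgr, spatial_radius, color_radius)
    # Quantize the filtered colors so pixels of one flat region share one key.
    q = (filtered // 8).astype(np.int32)
    key = q[..., 0] * 1_000_000 + q[..., 1] * 1_000 + q[..., 2]
    # Connected components of equal-color pixels become the small regions
    # (around 300 in the patent's setting, depending on the radii).
    # +1 keeps dark regions from being treated as "background" by label().
    return label(key + 1, connectivity=1)  # H x W map of region ids

background = cv2.imread("background.png")  # hypothetical file name
region_map = segment_background(background)
print("number of regions:", int(region_map.max()))
```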
2. Moving object segmentation
In short, moving object segmentation distinguishes the moving object from the scene as it passes through the background. Given the background scene template, the superpixel image of the scene is updated frame by frame; that is, the superpixel region corresponding to the moving object is determined, by background subtraction, in each frame of the background scene in which the moving object appears. The specific method is as follows: a scene model of the background is first built, in which each pixel follows a Gaussian distribution A_p centered on its mean color value over all frames. Given A_p, the probability that each pixel belongs to the background is estimated for each frame. If the probability is greater than 90%, the pixel can confidently be taken as a background pixel; if it is lower than 10%, it can confidently be taken as a moving object pixel. This estimate is used as the preliminary motion segmentation. On this basis, scene models are then learned for the background and the moving object to distinguish the two: the scene color models of the background and the moving object are updated with an iterative Graph-cuts segmentation to segment the moving object, each iteration resembling the GrabCut algorithm of C. Rother et al. This yields the superpixel image of the background scene and the moving object segmentation result for each frame. The resulting image is shown in FIG. 3.
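A minimal sketch of the per-pixel background model and the 90%/10% decision rule described above; stacking the frames into one array and using an independent per-channel Gaussian are assumptions of the sketch, since the text does not fix these details.

```python
import numpy as np
from scipy.stats import norm

def fit_background_model(frames):
    """frames: (F, H, W, 3) float array holding all video frames."""
    mu = frames.mean(axis=0)            # center of A_p: per-pixel mean color
    sigma = frames.std(axis=0) + 1e-6   # per-pixel spread; avoid division by zero
    return mu, sigma

def classify_pixels(frame, mu, sigma):
    # Two-sided tail probability that the observed color was drawn from A_p,
    # combined over channels by taking the least likely channel.
    z = np.abs(frame - mu) / sigma
    p = (2.0 * (1.0 - norm.cdf(z))).min(axis=-1)
    background = p > 0.90   # confidently background
    moving = p < 0.10       # confidently moving object
    return background, moving  # pixels in between stay undecided
```

The two masks would then seed the iterative Graph-cuts refinement described above.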
3. Constructing a region-level Markov Random Field (MRF):
here, a undirected graph model is mainly constructed. Based on the superpixel graph of the background scene, each area (superpixel) is taken as a node, and if two areas are adjacent, namely edges exist, the two corresponding nodes are connected. And the moving object on each frame image also serves as a node. At the same time, time edges are added to connect nodes of moving objects on adjacent frames. As shown in fig. 4, the region of the background scene is in an orange square, f represents a frame number, a node represents a moving object in each frame, a connecting edge a2 represents that the two regions have a pair of depth order relationships, and the connecting edge a1 is to enhance the fluency of the motion model of the moving object. The graph model constructed in this way is a space-time graph which contains both spatial information and information of moving objects at different times.
4. Depth order reasoning for paired regions:
when a moving object passes through a background scene, the moving object either occludes a scene object or is occluded by the scene object, which is called a motion occlusion event. Establishing an occlusion matrix O for judging the occlusion relation of a pair of areas i and j based on the super-pixel expression of a scene, wherein Oi,j∈ { +1, -1,0} corresponding to three cases, area i occludes area j, area i is occluded by area j, and no occlusion cues, respectively.
First, the depth order of the moving object and a background region is judged from the motion occlusion cue proposed by G. Brostow et al. From the segmentation of the moving object, the boundary pixels of the moving object and the boundary pixels of the background region are obtained, from which it can be inferred whether the moving object is in front of or behind the background region; the occlusion matrix O is then updated at the corresponding position. Notably, this cue is transitive. For example, let the moving object region be m: when a motion occlusion event is triggered in which the moving object m is occluded by background region k while region s is occluded by the moving object m, then, by transitivity through m, region s is occluded by region k. Moreover, since k and s are not restricted to adjacent regions, such long-range edges between non-adjacent regions also apply.
However, since the moving object is unlikely to overlap every region of the background image, region occlusion is judged for the other regions of the scene using the monocular cues proposed by D. Hoiem et al.; here, hierarchical inference mainly targets spatially adjacent regions that have not been updated before. Monocular cues are less reliable than motion occlusion cues and are therefore not treated as transitive. The resulting image is shown in FIG. 5.
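The bookkeeping just described can be sketched as follows; the toy indices and the fixed-point propagation loop are illustrative assumptions, and monocular-cue entries would be written with the same kind of assignment but excluded from the propagation, matching the non-transitivity noted above.

```python
import numpy as np

def update_from_motion(O, occluder, occluded):
    O[occluder, occluded] = +1
    O[occluded, occluder] = -1

def propagate_transitivity(O):
    """If k occludes m and m occludes s, also record that k occludes s."""
    n = O.shape[0]
    changed = True
    while changed:
        changed = False
        front = (O == +1).astype(np.int32)
        implied = (front @ front) > 0          # two-step occlusion chains
        new = implied & (O != +1) & ~np.eye(n, dtype=bool)
        if new.any():
            O[new] = +1
            O[new.T] = -1
            changed = True
    return O

O = np.zeros((5, 5), dtype=np.int8)            # toy example with 5 regions
update_from_motion(O, occluder=3, occluded=1)  # background k=3 occludes object m=1
update_from_motion(O, occluder=1, occluded=0)  # object m=1 occludes region s=0
propagate_transitivity(O)
assert O[3, 0] == +1                           # k occludes s by transitivity
```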
5. Global depth order reasoning:
The pairwise region depth order inference of the previous step is a local hierarchy inference. An energy function is therefore established to constrain the local inference results globally and consistently, and the final result is obtained by solving the corresponding minimization problem.
A depth label from {1, 2, ..., L} is assigned to each superpixel region in the background scene, where L is predefined and a larger label indicates a greater distance from the camera. In this way, the multi-label segmentation problem becomes an energy minimization problem based on the space-time graph model. The space-time graph comprises n + F nodes: n corresponding to the n superpixel region nodes of the background scene, and F corresponding to the moving object node in each of the F frames of the video. The goal is to obtain the depth label arrangement X = {X_1, ..., X_{n+F}}. To this end, an energy function based on the MRF space-time graph is defined as

E(X) = Σ_i E_i(X_i) + Σ_{(i,j)∈N_S} E^S_{i,j}(X_i, X_j) + Σ_{(i,j)∈N_T} E^T_{i,j}(X_i, X_j)
the function contains three parts:
unary item EiRepresenting the cost assigned to a node depth order label for a certain region. Since a moving object may move between adjacent background regions, if the tags of the two regions are connected at this time, no tag is assigned to the moving object. In order to avoid the phenomenon, only odd labels, namely modulo-2 labels, are allocated to the background area, so that the layer labels are left between adjacent background layers for moving objects. Thus, if a background region is assigned an even-numbered label, it is penalized with infinity, i.e., with an infinite cost.
The spatial pairwise term E^S_{i,j} corresponds to the cost, over the interacting region pairs N_S, of the pairwise region depth order (the occlusion matrix O) derived from the motion occlusion cue and the monocular cue. Suppose O_{i,j} = +1, i.e., region i occludes region j; then the label of i should be smaller than that of j, and if i is assigned a label greater than j's, a large penalty is imposed. To this end, following the work of A. Kowdle et al., the following is defined:
E^S_{i,j}(X_i, X_j) = Σ_f E^{S,f}_{i,j}(X_i, X_j)

where E^{S,f}_{i,j} is the spatial pairwise term of the f-th frame image, O^f_{i,j} is the occlusion relation of regions i and j in the f-th frame, w^f_{i,j} is the confidence of the corresponding occlusion relation, β and γ are parameters, and a coplanarity classifier from the research work of A. Kowdle et al. gives the likelihood, judged from the relevant region features in each frame f, that regions i and j are coplanar. The confidence of edges associated with the moving object is 1, while the confidence of other edges is based on the occlusion edge strength used by D. Hoiem et al., so that the linear summation of the pairwise terms over the frames effectively captures the occlusion information of the two region nodes across the whole image sequence. The spatial pairwise term is illustrated in FIG. 6: if region i occludes region j, then i is assigned a label smaller than j's by imposing a large penalty on the entries in the dashed region 1 and zero penalty on the entries in the dashed region 2.
The temporal pairwise term E^T_{i,j} is updated over the temporal edges N_T of the moving object across frames; it penalizes inconsistent labels of the moving object nodes across the image sequence, thereby ensuring the smoothness of the nodes' motion. Since the moving object moves through the scene, its depth level does not change suddenly and sharply, so large label jumps receive a large penalty. The temporal pairwise term is illustrated in FIG. 7: the farther a penalty entry lies from the diagonal, the larger its value, which enforces slow change of the moving object's depth level.
Each of the above three terms alone has weak inference capability, so combining them constitutes a powerful global depth order inference method. The minimization of the function is solved with the TRW-S method proposed by V. Kolmogorov, yielding the depth label arrangement of the final scene regions, i.e., a depth level sequence.
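A hedged sketch of the three energy terms and of the minimization follows; a simple ICM sweep stands in for TRW-S purely to keep the sketch self-contained, and the quadratic temporal penalty and the BIG constant are assumptions of the sketch.

```python
import numpy as np

BIG = 1e9  # stand-in for the "infinite" penalty on even background labels

def unary_cost(is_background, label):
    # Background regions may only take odd labels (1, 3, 5, ...), leaving the
    # even layers free for moving objects between adjacent background layers.
    return BIG if is_background and label % 2 == 0 else 0.0

def spatial_cost(o_ij, li, lj, w=1.0):
    # o_ij = +1 means i occludes j, so i must be closer: its label must be smaller.
    if o_ij == +1 and li >= lj:
        return w * BIG
    if o_ij == -1 and li <= lj:
        return w * BIG
    return 0.0

def temporal_cost(li, lj):
    # Penalty grows away from the diagonal: object labels change slowly in time.
    return float((li - lj) ** 2)

def icm(labels, is_bg, spatial_nbrs, temporal_nbrs, O, L, sweeps=10):
    """Greedy coordinate descent; a TRW-S implementation would replace this."""
    for _ in range(sweeps):
        for i in range(len(labels)):
            costs = []
            for cand in range(1, L + 1):
                c = unary_cost(is_bg[i], cand)
                c += sum(spatial_cost(O[i, j], cand, labels[j]) for j in spatial_nbrs.get(i, []))
                c += sum(temporal_cost(cand, labels[j]) for j in temporal_nbrs.get(i, []))
                costs.append(c)
            labels[i] = 1 + int(np.argmin(costs))
    return labels
```

Here spatial_nbrs and temporal_nbrs are assumed to map each node index to its neighbors in N_S and N_T respectively, and is_bg marks the n background region nodes.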
6. Generalizing the method to multiple moving objects:
the depth level estimation of the image by using one moving object can be popularized to a plurality of moving objects for estimation. The implementation is similar to a moving object problem. Taking 2 moving objects as an example, the schematic diagram is shown in fig. 8, the background scene segmentation is still performed according to the first step, then the moving objects are segmented in the second step, at this time, there are two moving objects, and then the space-time diagram is constructed in the third step, except that at this time, g nodes are added in addition to a nodes, corresponding to the two moving objects, and if there is a shielding relationship between the two moving objects, the two moving objects are also connected by using edges. And fourthly, performing regional occlusion cue reasoning by using the two cues to update the occlusion matrix. And finally, establishing an energy function, and solving by using a TRW-S method under the condition that the form of the energy function is the same as that of the moving object.
Tables 1 and 2 compare the method of the invention with alternatives that use only part of the cues. Note that the larger the four evaluation index values in columns 2 and 3 of each table, the more accurate the depth order inference; the two evaluation indices in the last column, proposed in the studies of M. Kendall and G. E. Noether respectively, evaluate the depth inference of paired regions: the larger the former value and the smaller the latter value, the better the effect. The tables thus show that the method of the invention is superior to the other methods. Here, (I) uses only the motion occlusion cue; (II) uses only the learned monocular cue; (III) uses a simple linear combination of the motion occlusion cue and the monocular cue as the judgment cue; (IV) uses the combined cue of the proposed method but without the temporal edge connections added to enhance the continuity of the motion model; and the last row is the evaluation obtained by adding temporal edges on the basis of (IV), i.e., the full framework of the technical scheme of the invention.
TABLE 1. Accuracy of depth order inference on SET-A (single moving object).
TABLE 2. Accuracy of depth order inference on SET-B (multiple moving objects).
In conclusion, the technical scheme of the invention revisits and summarizes the depth order inference problem, and on that basis addresses and improves previous limitations.
Regarding learning-based approaches that estimate the depth order of a scene from a single image, the improvement here is to estimate from a set of images of a moving object captured by a still camera, with the moving object no longer confined to a person; nothing is known in advance about the moving object or the type of scene, which makes the problem more general. Meanwhile, regarding the sparsity of motion occlusion cues in the research of G. Brostow and A. Schodl and their inability to estimate depth over the whole image area, sparse but effective motion occlusion cues are first used to judge the interacting regions, and monocular cues are then combined to judge the regions not yet decided, so that the unified framework extends effective occlusion hierarchy judgment to the whole scene, which is more reasonable than conventional inference methods. Overall, the depth level segmentation problem is converted into a discrete labeling problem over an image sequence under a spatio-temporal Markov random field (MRF), and the depth level recovery effect is superior to that of current comparable research.
The invention aims to study and realize a depth order inference method for multi-view images of a general scene, i.e., to study and realize depth inference of a background scene from an image sequence of a moving object passing through the scene over different frames. The method is illustrated on partial pictures of the two video data sets published by L. Guan et al., SET-A (single moving object) and SET-B (multiple moving objects), and depth order inference for other scenes can fully be implemented accordingly.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (4)
1. A method for recovering depth order from occlusion relations, comprising:
step 1, background scene segmentation: obtaining a background image with a static camera and over-segmenting it with the Meanshift segmentation method, so as to obtain a background scene segmentation map composed of small superpixel regions;
step 2, moving object segmentation: based on the background scene segmentation map, updating the superpixel image of the scene frame by frame, i.e., determining the superpixel region corresponding to the moving object in each frame of the background scene in which the moving object appears, so as to obtain a segmentation map of the moving object in the background scene;
step 3, constructing a region-level Markov random field: based on the obtained segmentation map of the moving object in the background scene, taking each region in the segmentation map as a node and, if two regions are adjacent, i.e., share an edge, connecting the two corresponding nodes; taking the moving object in each frame captured by the static camera as a node as well; meanwhile, adding temporal edges to connect the moving object nodes of adjacent frames, so as to construct a space-time graph encoding region occlusion relations that contains both spatial information and information about the moving object at different times;
step 4, performing depth order inference on paired regions based on the space-time graph, so as to obtain an occlusion matrix;
and step 5, performing global depth order inference from the occlusion matrix, so as to obtain the depth order relation of the background image.
2. The method for recovering depth order from occlusion relations according to claim 1, wherein determining the superpixel region corresponding to the moving object by background subtraction in the background scene in which the moving object appears in step 2 specifically comprises:
first, building a scene model of the background, in which each pixel follows a Gaussian distribution A_p centered on its mean color value over all frames; given A_p, estimating for each frame the probability that each pixel belongs to the background; if the probability is greater than 90%, taking the pixel as a background pixel, and if it is lower than 10%, taking it as a moving object pixel.
3. The method for recovering depth order from occlusion relations according to claim 1, wherein the depth order inference on paired regions performed in step 4 based on the space-time graph specifically comprises:
based on the superpixel representation of the scene, establishing an occlusion matrix O to record the occlusion relation of each pair of regions i and j, where O_{i,j} ∈ {+1, -1, 0} corresponds to three cases: region i occludes region j, region i is occluded by region j, and no occlusion cue; inferring the depth order of regions from two cues, so that the occlusion matrix O is updated frame by frame;
first, judging the depth order of the moving object and a background region from the motion occlusion cue: obtaining the boundary pixels of the moving object and of the background region from the segmentation of the moving object, thereby inferring whether the moving object is in front of or behind the background region, and then updating the occlusion matrix O at the corresponding position;
then, judging region occlusion with a monocular cue for spatially adjacent regions of the scene that have not yet been updated, and updating the occlusion matrix O at the corresponding positions.
4. The method for recovering depth order from occlusion relations according to claim 3, wherein the global depth order inference performed in step 5 from the occlusion matrix specifically comprises:
assigning a depth label from {1, 2, ..., L} to each superpixel region in the background scene, wherein L is predetermined and a larger label indicates a greater distance from the camera; thereby converting the multi-label segmentation problem into an energy minimization problem based on the space-time graph model, wherein the space-time graph comprises n + F nodes, n corresponding to the n superpixel region nodes of the background scene and F corresponding to the moving object node in each of the F frames of the video, so that the goal is to obtain the depth label arrangement X = {X_1, ..., X_{n+F}};
an energy function based on the MRF space-time graph is defined as

E(X) = Σ_i E_i(X_i) + Σ_{(i,j)∈N_S} E^S_{i,j}(X_i, X_j) + Σ_{(i,j)∈N_T} E^T_{i,j}(X_i, X_j)

wherein E_i represents the cost of assigning a depth order label to a region node, X_i ∈ X, X_j ∈ X; E^S_{i,j} is the spatial pairwise term and N_S denotes the interacting region pairs in the background scene, with

E^S_{i,j}(X_i, X_j) = Σ_f E^{S,f}_{i,j}(X_i, X_j)

wherein E^{S,f}_{i,j} is the spatial pairwise term of the f-th frame image, O^f_{i,j} is the occlusion relation of regions i and j in the f-th frame, w^f_{i,j} is the confidence of the corresponding occlusion relation, β and γ are known coefficients, and P^f(i, j) represents the likelihood, judged from the relevant region features in each frame f, that regions i and j are coplanar; and E^T_{i,j} is the temporal pairwise term, with N_T denoting the temporal edges of the moving object.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610024311.0A | 2016-01-14 | 2016-01-14 | Method of using occlusion relation to recover depth order |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105701823A true CN105701823A (en) | 2016-06-22 |
Family
ID=56226158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610024311.0A Pending CN105701823A (en) | 2016-01-14 | 2016-01-14 | Method of using occlusion relation to recover depth order |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105701823A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7292256B2 (en) * | 2003-06-26 | 2007-11-06 | Canon Kabushiki Kaisha | Optimising compositing calculations for a run of pixels |
CN101322155A (en) * | 2005-12-02 | 2008-12-10 | 皇家飞利浦电子股份有限公司 | Stereoscopic image display method and apparatus, method for generating 3D image data from a 2D image data input and an apparatus for generating 3D image data from a 2D image data input |
CN101390090B (en) * | 2006-02-28 | 2011-11-16 | 微软公司 | Object-level image editing |
US20080175576A1 (en) * | 2007-01-18 | 2008-07-24 | Nikon Corporation | Depth layer extraction and image synthesis from focus varied multiple images |
EP2280359A1 (en) * | 2009-07-31 | 2011-02-02 | EADS Construcciones Aeronauticas, S.A. | Training method and system using augmented reality |
CN102509105A (en) * | 2011-09-30 | 2012-06-20 | 北京航空航天大学 | Hierarchical processing method of image scene based on Bayesian inference |
CN102903110A (en) * | 2012-09-29 | 2013-01-30 | 宁波大学 | Segmentation method for image with deep image information |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112288669A (en) * | 2020-11-08 | 2021-01-29 | 西北工业大学 | Point cloud map acquisition method based on light field imaging |
CN112288669B (en) * | 2020-11-08 | 2024-01-19 | 西北工业大学 | Point cloud map acquisition method based on light field imaging |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20190104 |