
CN115546515A - Depth information acquisition method and device - Google Patents

Depth information acquisition method and device

Info

Publication number
CN115546515A
CN115546515A (application CN202211068190.1A)
Authority
CN
China
Prior art keywords
similarity
image
feature
feature map
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211068190.1A
Other languages
Chinese (zh)
Inventor
张友敏
国显达
黄冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jianzhi Technology Co ltd
Original Assignee
Beijing Jianzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jianzhi Technology Co ltd
Priority to CN202211068190.1A
Publication of CN115546515A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/97 Determining parameters from multiple pictures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30204 Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a depth information acquisition method and device. The method includes the following steps: acquiring a first feature map of a left eye image and a second feature map of a right eye image; for each pixel point of the first feature map, calculating the similarity between that pixel point and all pixel points on the corresponding epipolar line of the second feature map to obtain a similarity matrix, where the similarity matrix reflects the similarity of each pixel point in the first feature map; calculating matching cost values for a plurality of preset positions in the first feature map according to the resolution of the image to be output and the similarity matrix to obtain a similarity feature matrix, where each matching cost value reflects the degree of similarity between a preset position and the corresponding position of the second feature map, the matching cost value is inversely proportional to the degree of similarity, and the similarity feature matrix includes the matching cost values of the plurality of preset positions; and obtaining a depth prediction map according to the similarity feature matrix.

Description

Depth information acquisition method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a depth information obtaining method, apparatus, electronic device, and computer-readable storage medium.
Background
Binocular stereo matching processes left and right eye images captured at the same moment to obtain a disparity map, from which depth is then calculated, thereby estimating three-dimensional structure from two-dimensional images.
Generally, binocular stereo matching matches pixel points between the left and right eye images within a limited search space to obtain the similarities of pixel point pairs, from which depth is then predicted.
In the process of implementing the present application, the inventors found that the prior art has at least the following problem: because matching is performed in a limited search space, only depth predictions within a certain range can be obtained, and the output resolution is limited.
Disclosure of Invention
An embodiment of the present application provides a depth information obtaining method, a depth information obtaining apparatus, an electronic device, and a computer-readable storage medium, which address the problems that depth prediction can only produce results within a certain range and that its resolution is limited.
In order to solve the technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a depth information obtaining method, including: acquiring a first feature map of a left eye image and a second feature map of a right eye image; for each pixel point of the first feature map, calculating the similarity between that pixel point and all pixel points on the corresponding epipolar line of the second feature map to obtain a similarity matrix, where the similarity matrix reflects the similarity of each pixel point in the first feature map; calculating matching cost values for a plurality of preset positions in the first feature map according to the resolution of the image to be output and the similarity matrix to obtain a similarity feature matrix, where each matching cost value reflects the degree of similarity between a preset position and the corresponding position of the second feature map, the matching cost value is inversely proportional to the degree of similarity, and the similarity feature matrix includes the matching cost values of the plurality of preset positions; and obtaining a depth prediction map according to the similarity feature matrix.
In a second aspect, an embodiment of the present application provides a depth information acquiring apparatus, including: a first acquisition module for acquiring a first feature map of the left eye image and a second feature map of the right eye image; a first execution module for calculating, for each pixel point of the first feature map, the similarity between that pixel point and all pixel points on the corresponding epipolar line of the second feature map to obtain a similarity matrix, where the similarity matrix reflects the similarity of each pixel point in the first feature map; a second execution module for calculating matching cost values for a plurality of preset positions in the first feature map according to the resolution of the image to be output and the similarity matrix to obtain a similarity feature matrix, where each matching cost value reflects the degree of similarity between a preset position and the corresponding position of the second feature map, the matching cost value is inversely proportional to the degree of similarity, and the similarity feature matrix includes the matching cost values of the plurality of preset positions; and a second acquisition module for obtaining a depth prediction map according to the similarity feature matrix.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the depth information acquisition method.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to execute the depth information acquisition method.
In this way, in the embodiment of the application, a first feature map of the left eye image and a second feature map of the right eye image are obtained; taking the first feature map as the reference, the similarity between each pixel point in the first feature map and all pixel points on the corresponding epipolar line in the second feature map is calculated, and a similarity matrix is constructed from these similarities; matching cost values of a plurality of preset positions in the first feature map are then calculated based on the resolution of the image to be output and the similarity matrix, a similarity feature matrix is constructed, and a depth prediction map is computed from the similarity feature matrix. In this process, pixel point matching is performed for every pixel point in the feature map of the left eye image, so depth prediction can be performed over the whole image and depth values in any distance range can be obtained; moreover, a preset position can be placed anywhere in the feature map according to the resolution of the image expected to be output, so depth can be estimated at any position and a depth prediction map matching the expected output resolution can be obtained.
Drawings
Fig. 1 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application;
fig. 4 is a block diagram of a depth information acquiring apparatus according to an embodiment of the present application;
fig. 5 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;
fig. 6 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. Moreover, the terms "first", "second", etc. are used in a generic sense and do not limit quantity; for example, a first object can be one or more than one. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The depth information obtaining method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Fig. 1 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application, where the method may include the following steps:
step 101: and acquiring a first feature map of the left eye image and a second feature map of the right eye image.
In the embodiment of the application, the left eye image and the right eye image are captured by cameras at the same moment and have undergone epipolar rectification. Epipolar rectification applies a projective transformation to each of the left and right eye images so that corresponding epipolar lines lie on the same scan line, which speeds up pixel point matching and improves matching accuracy.
Step 102: based on each pixel point of the first feature map, respectively calculating the similarity between that pixel point and all pixel points on the corresponding epipolar line of the second feature map to obtain a similarity matrix, where the similarity matrix reflects the similarity of each pixel point in the first feature map.
Exemplarily, taking the first feature map of the left eye image as the reference and one pixel point as an example, that pixel point is matched in turn against all pixel points on the epipolar line in the second feature map corresponding to the epipolar line on which it lies, and the similarity is calculated for each pair. In the embodiment of the present application, this operation is performed for every pixel point in the first feature map, and all the similarities so obtained are assembled into a similarity matrix, which contains the similarities between all pixel points in the first feature map and all pixel points on the corresponding epipolar lines of the second feature map.
Optionally, the dimensions of the similarity matrix are [W, H, W], where H and W are the length and the width of the first feature map, respectively; illustratively, the length may be the number of pixel points along the long direction of the first feature map, and similarly the width may be the number of pixel points along the wide direction. In these dimensions, [H, W] indexes all the pixel points in the first feature map, while the first dimension W holds the similarities between each such pixel point and the W pixel points on the corresponding epipolar line of the second feature map. Through the similarity matrix, the distribution of similarities over pixel point pairs between the first and second feature maps can be clearly understood.
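The application specifies the dimensions but not a construction; below is a minimal NumPy sketch of one way such a [W, H, W] similarity matrix can be assembled from rectified feature maps, assuming a plain dot-product similarity (the application itself computes the similarities via a cross-attention mechanism, see step 1021 below):

```python
import numpy as np

def epipolar_similarity_matrix(left_feat: np.ndarray, right_feat: np.ndarray) -> np.ndarray:
    """Assemble a [W, H, W] similarity matrix from rectified feature maps.

    left_feat, right_feat: [H, W, C] feature maps of the left/right images.
    After epipolar rectification, the epipolar line of left pixel (y, x) is
    simply row y of the right feature map, so each left pixel is compared
    with all W pixels of its row in the right feature map.
    """
    # sim[d, y, x] = <left(y, x), right(y, d)>: the first dimension holds the
    # W similarities of one left pixel to every pixel on its epipolar line.
    return np.einsum('yxc,ydc->dyx', left_feat, right_feat)
```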
Step 103: calculating matching cost values of a plurality of preset positions in the first feature map according to the resolution of the image to be output and the similarity matrix to obtain a similarity feature matrix, where each matching cost value reflects the degree of similarity between a preset position and the corresponding position of the second feature map, the matching cost value is inversely proportional to the degree of similarity, and the similarity feature matrix includes the matching cost values of the plurality of preset positions.
Exemplarily, given the resolution of the image to be output and the similarity matrix, the matching cost values of a plurality of preset positions in the first feature map can be calculated, and the resulting matching cost values are assembled into the similarity feature matrix. A matching cost value directly reflects the degree of similarity between a preset position and the corresponding position of the second feature map, and is inversely proportional to it: the smaller the matching cost value, the higher the similarity. Since the similarity feature matrix includes the matching cost values of all the preset positions, the distribution of matching costs between the preset positions and the second feature map can be seen clearly.
Step 104: obtaining a depth prediction map according to the similarity feature matrix.
In this embodiment of the present application, a multilayer perceptron (MLP) may be used to operate on the similarity feature matrix to obtain a depth prediction map, whose dimensions equal the resolution of the image to be output.
Optionally, step 101 may comprise the steps of:
step 1011: and performing feature extraction on the acquired left eye image and the acquired right eye image to obtain a first image feature of the left eye image and a second image feature of the right eye image.
In the embodiment of the application, the same feature extraction network can be adopted to extract the features of the left eye image and the right eye image; using the same network means the weights are shared, which greatly reduces the parameter count, lowers computational complexity, and improves efficiency. Illustratively, the resolution of the resulting first image feature and second image feature can be 1/4 of that of the left and right images, which reduces the amount of computation and improves the efficiency of depth information acquisition.
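As an illustration of the weight sharing (the actual feature extraction network is not specified here; the pooling and projection below are hypothetical placeholders), a toy sketch applying one set of parameters to both images to produce 1/4-resolution features:

```python
import numpy as np

def extract_features(img: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy shared-weight extractor: 4x4 average pooling to 1/4 resolution,
    then a 1x1 projection. Only the weight sharing is the point here."""
    H, W, C = img.shape
    pooled = img[:H - H % 4, :W - W % 4].reshape(H // 4, 4, W // 4, 4, C).mean(axis=(1, 3))
    return pooled @ weights  # [H/4, W/4, C_out]

rng = np.random.default_rng(0)
shared_weights = rng.normal(size=(3, 16))  # one parameter set, used by both branches
left_img, right_img = rng.random((64, 64, 3)), rng.random((64, 64, 3))
first_image_feature = extract_features(left_img, shared_weights)
second_image_feature = extract_features(right_img, shared_weights)
```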
Step 1012: performing an attention mechanism operation on the first image feature and the second image feature to obtain the first feature map and the second feature map.
For example, in the embodiment of the application, the attention mechanism may be implemented with a Transformer structure; through the attention mechanism, the first image feature and the second image feature undergo feature fusion, exchange, and similar operations, achieving the effect of optimizing the image features.
Optionally, step 1012 may include the steps of:
step 10121: and performing multiple iterative operations based on an attention mechanism on the first image feature and the second image feature to obtain the first feature map and the second feature map.
Wherein each iteration operation comprises:
and respectively performing self-attention mechanism operation based on the first image characteristic and the second image characteristic to obtain corresponding optimized image characteristics.
And performing cross attention mechanism operation between the first image characteristic and the second image characteristic after the self attention mechanism operation to obtain a corresponding optimized image characteristic.
In the embodiment of the present application, for the processing of the first image feature and the second image feature, a multi-layer optimization may be performed on the first image feature and the second image feature through multiple iterative operations, where each iterative operation may be: and respectively carrying out feature fusion optimization on the first image feature and the second image feature by adopting a self-attention mechanism, and further carrying out feature exchange and fusion between the first image feature and the second image feature by adopting a cross-attention mechanism.
For example, in the embodiment of the present application, the self-attention mechanism may operate using the following formula:
Attention(q, k, v) = softmax(qkᵀ)v
where Attention denotes the attention operation; q, k and v are three new image features obtained through three different fully connected layers, each represented as a matrix; and softmax is the function used to normalize the product qkᵀ.
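A minimal NumPy sketch of this formula (the projection matrices Wq, Wk, Wv stand in for the three fully connected layers and are hypothetical; with feat_q identical to feat_kv this is the self-attention step, and with features from the other image it is the cross-attention step):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(feat_q: np.ndarray, feat_kv: np.ndarray,
              Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Attention(q, k, v) = softmax(q @ k.T) @ v, as in the formula above.

    feat_q, feat_kv: [N, C] flattened image features; Wq, Wk, Wv: [C, C']
    projection matrices standing in for the three fully connected layers.
    """
    q, k, v = feat_q @ Wq, feat_kv @ Wk, feat_kv @ Wv
    return softmax(q @ k.T) @ v  # [N, C']
```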
Optionally, step 10121 may further comprise the operations of:
after the self-attention mechanism operation or the cross-attention mechanism operation is performed, normalization operations are performed on the first image feature and the second image feature respectively.
For example, in each iteration operation, after the self-attention mechanism operation or the cross-attention mechanism operation on the first image feature and the second image feature, the two features can each be normalized to keep the feature distributions stable.
Optionally, step 10121 may further comprise the following operations:
before performing a plurality of iterations based on an attention mechanism on the first image feature and the second image feature, respectively performing normalization operation on the first image feature and the second image feature.
For example, before entering the iterative operation, the first image feature and the second image feature may be first normalized to provide a basis for stable distribution of the features during the iterative operation.
Optionally, in this embodiment of the present application, LayerNorm (layer normalization) may be used to normalize the first image feature and the second image feature; it converges quickly, so normalization completes faster and operational efficiency improves.
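Putting these pieces together, one fusion iteration might look like the following sketch (a minimal NumPy sketch under assumptions: the residual connections and the identity q/k/v projections are conventional Transformer choices not spelled out in the application):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """LayerNorm over the channel dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attend(q_feat: np.ndarray, kv_feat: np.ndarray) -> np.ndarray:
    # identity projections stand in for the fully connected q/k/v layers
    scores = q_feat @ kv_feat.T
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ kv_feat

def fusion_iteration(left: np.ndarray, right: np.ndarray):
    """One iteration: LayerNorm -> self-attention -> LayerNorm -> cross-attention,
    applied symmetrically to the [N, C] left and right image features."""
    left = left + attend(layer_norm(left), layer_norm(left))      # self-attention
    right = right + attend(layer_norm(right), layer_norm(right))  # self-attention
    ln_l, ln_r = layer_norm(left), layer_norm(right)
    left = left + attend(ln_l, ln_r)    # cross-attention: left queries right
    right = right + attend(ln_r, ln_l)  # cross-attention: right queries left
    return left, right
```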
Optionally, step 102 may include the steps of:
step 1021: and performing cross attention mechanism operation between each pixel point of the first characteristic diagram and all pixel points on the corresponding polar line of the second characteristic diagram to obtain a plurality of similarity groups.
Exemplarily, similarity matching between each pixel point of the first feature map and all pixel points on the corresponding epipolar line of the second feature map is performed through a cross-attention mechanism, yielding one similarity group per pixel point of the first feature map; a similarity group contains the similarities between that pixel point and all pixel points on its corresponding epipolar line in the second feature map.
Step 1022: constructing a matching similarity matrix based on the plurality of similarity groups.
Illustratively, the similarity groups can be assembled into a matching similarity matrix, so that the similarity distribution among the pixel points can be understood more intuitively, which facilitates subsequent operations.
Step 1023: carrying out weight configuration on the matching similarity matrix to obtain the similarity matrix.
In the embodiment of the application, performing weight configuration on the matching similarity matrix makes the pixel point pairs with higher similarity easier to identify.
Optionally, step 1023 may include the following steps:
step 10231: and aiming at the similarity of the same similarity group in the matching similarity matrix, the weight of the similarity with the highest similarity value is increased, and the weights of the rest similarities are reduced.
For example, in the weight configuration of this embodiment, weights are configured over the similarities between the same pixel point in the first feature map and the different pixel points on its corresponding epipolar line in the second feature map: the weight of the highest similarity value is raised to the maximum, while the weights of the other similarity values are suppressed to very low values, thereby highlighting the pixel point pair with the highest similarity and providing a basis for subsequent feature sampling.
For example, an optimal transport algorithm may be adopted to optimize the weight configuration on the matching similarity matrix, using a unimodal function as the objective function, so as to find, for each pixel point in the first feature map, the pixel point with the highest similarity to it.
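The application names optimal transport but not a specific solver; entropic-regularized Sinkhorn iteration is a common choice, sketched below (the uniform marginals, iteration count, and regularization strength are assumptions):

```python
import numpy as np

def sinkhorn_weights(match_sim: np.ndarray, n_iters: int = 20, eps: float = 0.1) -> np.ndarray:
    """Sharpen a matching similarity matrix with Sinkhorn iterations so the
    best match in each row gets a high weight and the rest are suppressed.

    match_sim: [N, M] similarities between pixels on a pair of epipolar lines.
    Returns a weight matrix whose rows and columns each sum to (about) 1.
    """
    K = np.exp(match_sim / eps)  # smaller eps -> sharper, more unimodal rows
    u, v = np.ones(K.shape[0]), np.ones(K.shape[1])
    for _ in range(n_iters):     # alternating row/column normalization
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]
```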
Optionally, step 103 may include the steps of:
step 1031: and marking a plurality of preset positions and aggregation areas corresponding to the preset positions in the first feature map according to the resolution of a preset output image.
For example, if the resolution of the image to be output is [OH, OW], then OH×OW preset positions may be marked in the first feature map; optionally, the preset positions may be distributed over the first feature map as a regular array, or marked at random. An aggregation area is then marked around each preset position as its center; the aggregation area may be a circular, directional, or irregular region centered on the preset position, or a region defined by pixel points, for example the 4 pixel points closest to the preset position.
Step 1032: carrying out a weighted operation on the similarities of the pixel points in each aggregation area to obtain the matching cost value corresponding to the preset position.
The pixel points in an aggregation area each have a similarity group, i.e., similarities with the pixel points on the corresponding epipolar line in the second feature map. The similarity groups of these pixel points are combined in a weighted operation according to the distances between the preset position and the pixel points in the aggregation area, yielding the matching cost value of the preset position.
Illustratively, the matching cost value of a preset position is a matrix of dimension [W, 1], where W is the number of similarities in a similarity group, i.e., the width of the feature map.
Exemplarily, in the embodiment of the present application, bilinear interpolation may be adopted to extract the matching cost values of the pixel points, where a pixel point's matching cost values reflect its degrees of similarity with all the pixel points on the corresponding epipolar line of the second feature map; a weighted operation over these values then yields the matching cost value of the preset position.
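A minimal sketch of this sampling step over the [W, H, W] similarity matrix from step 102 (whether the sampled similarities are additionally negated or inverted to become costs is not fixed by the text, so the values are returned as-is):

```python
import numpy as np

def sample_cost(sim: np.ndarray, y: float, x: float) -> np.ndarray:
    """Bilinearly interpolate a length-W matching cost vector at the
    floating-point preset position (y, x) of a [W, H, W] similarity matrix.

    The four nearest pixels around (y, x) act as the aggregation region;
    their similarity groups are linearly combined with bilinear weights.
    """
    _, H, W = sim.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * sim[:, y0, x0]
            + (1 - wy) * wx * sim[:, y0, x1]
            + wy * (1 - wx) * sim[:, y1, x0]
            + wy * wx * sim[:, y1, x1])  # shape [W]
```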
Step 1033: constructing the similarity feature matrix based on the matching cost values of the preset positions.
The matching cost values of all the preset positions are assembled into the similarity feature matrix, whose dimensions are [W, OH×OW], where OH×OW is the number of matching cost values, i.e., the number of preset positions, and W is the number of entries in each matching cost value, one for each candidate position on the corresponding epipolar line of the second feature map.
Optionally, step 104 may include the steps of:
step 1041: and on the basis of a multi-layer perceptron, operating the similarity characteristic matrix to obtain an initial parallax estimation and a parallax residual, wherein the initial parallax estimation comprises minimum matching cost values respectively corresponding to the plurality of preset positions, and the parallax residual is used for correcting the initial parallax residual.
Feature fusion is performed on the similarity feature matrix by a first multilayer perceptron to obtain a probability matrix; the probability matrix inherits the dimensions of the similarity feature matrix and reflects, for each preset position, the probability distribution over its matching cost values. In the embodiment of the application, the disparity value corresponding to the minimum matching cost value (i.e., the highest probability) in the distribution of each preset position can be taken as the disparity estimation result of that position, giving the initial disparity estimation; meanwhile, a second multilayer perceptron predicts a disparity residual from the similarity feature matrix, which serves as a residual optimization of the initial disparity estimation, so that a more accurate depth prediction map can be obtained.
Step 1042: calculating a depth prediction map according to the initial parallax estimation and the parallax residual.
The initial parallax estimation is corrected and optimized by the parallax residual to obtain the final depth prediction map, whose dimensions are [1, OH, OW]; that is, the resolution of the depth prediction map matches the resolution [OH, OW] of the image to be output.
For example, in the embodiment of the present application, the following formula may be adopted to calculate the depth prediction map:
D=argmax(MLPc(Cf))+MLPo(Cf)
where D denotes the depth prediction map, MLPc is the first multilayer perceptron of this embodiment, MLPo is the second multilayer perceptron, Cf is the similarity feature matrix, and argmax is the arg-max function: applied to the probability matrix produced by MLPc, it returns, for each preset position, the disparity with the highest probability, i.e., the best match, corresponding to the minimum matching cost value of that preset position.
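A sketch of this formula with placeholder components (mlp_c and mlp_o are hypothetical callables standing in for the first and second multilayer perceptrons; any functions producing the stated shapes will do):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = 0) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_depth_map(Cf: np.ndarray, mlp_c, mlp_o, oh: int, ow: int) -> np.ndarray:
    """D = argmax(MLPc(Cf)) + MLPo(Cf).

    Cf: [W, OH*OW] similarity feature matrix. mlp_c maps it to a [W, OH*OW]
    probability matrix; mlp_o predicts an [OH*OW] disparity residual.
    """
    prob = softmax(mlp_c(Cf), axis=0)                # probability over W disparity candidates
    initial = np.argmax(prob, axis=0).astype(float)  # initial disparity estimation
    return (initial + mlp_o(Cf)).reshape(1, oh, ow)  # [1, OH, OW] prediction map

# toy usage with random linear maps standing in for the trained MLPs
rng = np.random.default_rng(0)
W_, OH, OW = 32, 8, 8
A, b = rng.normal(size=(W_, W_)), rng.normal(size=(OH * OW,))
D = predict_depth_map(rng.normal(size=(W_, OH * OW)),
                      mlp_c=lambda Cf: A @ Cf,
                      mlp_o=lambda Cf: b,
                      oh=OH, ow=OW)  # -> shape (1, 8, 8)
```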
Fig. 2 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application.
Exemplarily, MLPc in the figure is a first multilayer perceptron in the embodiment of the present application, and is used for optimizing feature fusion of a similarity feature matrix, obtaining a probability matrix, and further obtaining an initial disparity estimation; in the embodiment of the present application, MLPo is a second multi-layer perceptron, which is used to predict a similarity feature matrix to obtain a disparity residual, so as to perform correction optimization on an initial disparity estimation; d in the figure is a depth prediction map in the embodiment of the present application.
Fig. 3 is a flowchart illustrating steps of a depth information obtaining method according to an embodiment of the present application.
For example, operations in the embodiment of the application such as the attention mechanism operations on the image features of the left eye and right eye images, normalization, and pixel point matching can be performed with a Transformer structure. First, after the epipolar-rectified left eye and right eye images are received, image feature extraction is performed on them using the same (i.e., weight-sharing) feature extraction network, yielding left and right feature maps at 1/4 resolution. Feature matching and fusion are then performed with the Transformer structure; specifically, for the left and right feature maps, the following process is repeated N-1 times, taking the left image as an example (the right image is handled in the same way):
first, the feature map is normalized with LayerNorm; the features of the whole image are then fused through a self-attention mechanism (self-attention), after which the features are normalized once more with another LayerNorm; subsequently, a cross-attention mechanism (cross-attention) is adopted to exchange and fuse the features of the left and right images.
After this feature fusion process has been executed N-1 times, the fused left image features are taken as the reference, and the similarity between each pixel point in the left image and all pixel points on the same epipolar line in the right image is calculated through a cross-attention mechanism to obtain a matching similarity matrix. The matching similarity matrix is then optimized with an optimal transport algorithm (Optimal Transport), using a unimodal function as the objective function, so as to find the matching pixel point with the maximum similarity; this yields the similarity matrix, whose dimensions are [W, H, W], where H and W are respectively the length and width of the 1/4-resolution feature map.
After the similarity matrix is obtained, feature sampling is performed at any floating-point coordinate position: using bilinear interpolation, the matching costs of the four neighboring pixel points around the floating-point coordinate are extracted and linearly combined, giving the matching cost value of that coordinate. For example, if the final output resolution of the network is expected to be [OH, OW], then OH×OW floating-point coordinate positions are provided for feature sampling, and a similarity feature matrix of dimensions [W, OH×OW] is obtained.
Further, a multilayer perceptron (MLPc) performs feature fusion on the similarity feature matrix and predicts a probability matrix of dimensions [W, OH×OW]; for each of the OH×OW positions of the second dimension there is a matching cost probability distribution of length W, and the disparity value corresponding to the minimum matching cost value in that distribution is taken as the initial depth/disparity estimation result, i.e., the initial disparity estimation. Another multilayer perceptron (MLPo) predicts a disparity residual from the similarity feature matrix as a residual optimization of the initial result. The final disparity map prediction D, of dimensions [1, OH, OW], is obtained and then output.
Therefore, in the embodiment of the application, with the first feature map of the left eye image as the reference, each pixel point of the first feature map undergoes similarity matching with all pixel points on the corresponding epipolar line in the second feature map; pixel point matching is not restricted to a limited region, so depth prediction can be performed over any distance range in the image.
Fig. 4 is a block diagram of a depth information acquiring apparatus according to an embodiment of the present application, where the apparatus may include:
the first obtaining module 201 is configured to obtain a first feature map of the left eye image and a second feature map of the right eye image.
A first executing module 202, configured to calculate, based on each pixel point of the first feature map, a similarity between each pixel point and all pixel points on an epipolar line corresponding to the second feature map, respectively, to obtain a similarity matrix, where the similarity matrix is used to reflect the similarity of each pixel point in the first feature map.
The second executing module 203 is configured to calculate matching cost values of a plurality of preset positions in the first feature map according to a resolution of a pre-output image and the similarity matrix, to obtain a similarity feature matrix, where the matching cost values are used to reflect a similarity degree between the preset positions and corresponding positions of the second feature map, the matching cost values and the similarity degree are in an inverse proportion relationship, and the similarity feature matrix includes matching cost values of the plurality of preset positions.
And a second obtaining module 204, configured to obtain a depth prediction map according to the similarity feature matrix.
Optionally, the first obtaining module 201 includes:
and the extraction submodule is used for carrying out feature extraction on the acquired left eye image and the acquired right eye image to obtain a first image feature of the left eye image and a second image feature of the right eye image.
And the first operation sub-module is used for performing an attention mechanism operation on the first image feature and the second image feature to obtain the first feature map and the second feature map.
Optionally, the first operation sub-module is specifically configured to perform multiple iterative operations based on an attention mechanism on the first image feature and the second image feature to obtain the first feature map and the second feature map.
Wherein each iteration operation comprises: respectively performing self-attention mechanism operation based on the first image characteristic and the second image characteristic to obtain corresponding optimized image characteristics; and performing cross attention mechanism operation between the first image characteristic and the second image characteristic after the self attention mechanism operation to obtain a corresponding optimized image characteristic.
Optionally, the first operation sub-module is further configured to perform normalization operations on the first image feature and the second image feature respectively after the performing of the self-attention mechanism operation or after the performing of the cross-attention mechanism operation; and before the first image feature and the second image feature are subjected to a plurality of iterative operations based on an attention mechanism, respectively carrying out normalization operation on the first image feature and the second image feature.
Optionally, the first executing module 202 includes:
and the cross attention submodule is used for performing cross attention mechanism operation between each pixel point of the first characteristic diagram and all pixel points on the epipolar line corresponding to the second characteristic diagram to obtain a plurality of similarity groups.
And the first construction submodule is used for constructing a matching similarity matrix based on the plurality of similarity groups.
And the configuration submodule is used for carrying out weight configuration on the matching similarity matrix to obtain the similarity matrix.
Optionally, the configuration sub-module is specifically configured to, for the similarity of the same similarity group in the matching similarity matrix, increase the weight of the similarity with the highest similarity value, and decrease the weights of the remaining similarities.
Optionally, the second executing module 203 includes:
the marking sub-module is used for marking a plurality of preset positions in the first feature map and aggregation areas corresponding to the preset positions respectively according to the resolution of a pre-output image;
the weighted operation submodule is used for carrying out weighted operation on the similarity of each pixel point in the aggregation areas to obtain a matching cost value corresponding to a preset position;
and the second construction submodule is used for constructing the similarity characteristic matrix based on the matching cost values of the preset positions.
Optionally, the second obtaining module 204 includes:
and the second operation sub-module is used for operating the similarity characteristic matrix based on a multi-layer perceptron to obtain an initial parallax estimation and a parallax residual, wherein the initial parallax estimation comprises minimum matching cost values respectively corresponding to the plurality of preset positions, and the parallax residual is used for correcting the initial parallax residual.
And the third operation sub-module is used for calculating a depth prediction map according to the initial parallax estimation and the parallax residual.
To sum up, by performing similarity matching for every pixel point on the feature map of the left eye image, depth can be estimated at any position in the image, realizing depth prediction over any distance range; moreover, preset positions can be marked at arbitrary positions according to the preset resolution, so a depth prediction map consistent with the preset resolution can be obtained. Depth prediction is thus not limited by resolution, and a depth prediction map of any resolution can be output.
The depth information acquiring apparatus in the embodiment of the present application may be an electronic device, or a component in an electronic device such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); it may also be a server, a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, or the like, and the embodiments of the present application are not specifically limited.
The depth information acquiring apparatus according to the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which the embodiment of the present application does not specifically limit.
The depth information acquiring device provided in the embodiment of the present application can implement each process implemented by the method embodiment, and is not described here again to avoid repetition.
Optionally, as shown in fig. 5, an embodiment of the present application further provides an electronic device 300, including a processor 301, a memory 302, and a program or instructions stored in the memory 302 and executable on the processor 301. When executed by the processor 301, the program or instructions implement each process of any one of the above embodiments of the depth information obtaining method and achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 4000 includes, but is not limited to: a radio frequency unit 4001, a network module 4002, an audio output unit 4003, an input unit 4004, a sensor 4005, a display unit 4006, a user input unit 4007, an interface unit 4008, a memory 4009, a processor 4010, and the like.
Those skilled in the art will appreciate that the electronic device 4000 may further include a power supply (e.g., a battery) for supplying power to each component, and the power supply may be logically connected to the processor 4010 through a power management system, so that functions such as managing charging, discharging, and power consumption are implemented through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown, combine some components, or arrange components differently, which is not described here again.
It should be understood that, in the embodiment of the present application, the input unit 4004 may include a graphics processing unit (GPU) 40041 and a microphone 40042, and the graphics processor 40041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 4006 may include a display panel 40061, and the display panel 40061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 4007 includes a touch panel 40071 and other input devices 40072. The touch panel 40071 is also referred to as a touch screen, and may include two parts: a touch detection device and a touch controller. Other input devices 40072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described here. The memory 4009 may be used to store software programs as well as various data, including but not limited to application programs and an operating system. The processor 4010 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 4010.
The embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of any one of the above embodiments of the depth information obtaining method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of any one of the above embodiments of the depth information obtaining method, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-on-chip, a system chip, a chip system, or a system-on-a-chip.
Embodiments of the present application provide a computer program product, where the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the processes of the foregoing depth information obtaining method embodiments, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (18)

1. A depth information acquisition method, comprising:
acquiring a first feature map of a left eye image and a second feature map of a right eye image;
based on each pixel point of the first feature map, respectively calculating the similarity between each pixel point and all pixel points on the corresponding epipolar line of the second feature map to obtain a similarity matrix, wherein the similarity matrix is used for reflecting the similarity of each pixel point in the first feature map;
calculating matching cost values of a plurality of preset positions in the first feature map according to the resolution of a pre-output image and the similarity matrix to obtain a similarity feature matrix, wherein the matching cost values are used for reflecting the similarity degree of the preset positions and the corresponding positions of the second feature map, the matching cost values and the similarity degree are in an inverse proportion relation, and the similarity feature matrix comprises the matching cost values of the plurality of preset positions;
and obtaining a depth prediction map according to the similarity characteristic matrix.
2. The method according to claim 1, wherein the acquiring a first feature map of a left eye image and a second feature map of a right eye image comprises:
performing feature extraction on the acquired left eye image and right eye image to obtain a first image feature of the left eye image and a second image feature of the right eye image;
and performing attention mechanism operation on the first image characteristic and the second image characteristic to obtain the first characteristic diagram and the second characteristic diagram.
3. The method of claim 2, wherein performing the attention mechanism operation on the first image feature and the second image feature to obtain the first feature map and the second feature map comprises:
performing multiple iterative operations based on an attention mechanism on the first image feature and the second image feature to obtain a first feature map and a second feature map;
wherein each iteration operation comprises:
respectively performing self-attention mechanism operation based on the first image characteristic and the second image characteristic to obtain corresponding optimized image characteristics;
and performing cross attention mechanism operation between the first image characteristic and the second image characteristic which are subjected to the self attention mechanism operation to obtain a corresponding optimized image characteristic.
4. The method of claim 3, further comprising, after the performing the self-attention mechanism operation or after the performing the cross-attention mechanism operation:
respectively carrying out normalization operation on the first image characteristic and the second image characteristic;
before the performing a plurality of attention-based iterations of the first image feature and the second image feature, further comprising:
and respectively carrying out normalization operation on the first image characteristic and the second image characteristic.
5. The method according to any one of claims 1 to 4, wherein the calculating, based on each pixel point of the first feature map, a similarity between each pixel point and all pixel points on an epipolar line corresponding to the second feature map to obtain a similarity matrix comprises:
performing cross attention mechanism operation between each pixel point of the first characteristic diagram and all pixel points on the corresponding epipolar line of the second characteristic diagram to obtain a plurality of similarity groups;
constructing a matching similarity matrix based on the plurality of similarity groups;
and carrying out weight configuration on the matching similarity matrix to obtain a similarity matrix.
6. The method of claim 5, wherein the weight configuration of the matching similarity matrix comprises:
and aiming at the similarity of the same similarity group in the matching similarity matrix, the weight of the similarity with the highest similarity value is increased, and the weights of the rest similarities are reduced.
7. The method according to any one of claims 1 to 4, wherein calculating the matching cost values of a plurality of preset positions in the first feature map according to the resolution of the pre-output image and the similarity matrix to obtain a similarity feature matrix comprises:
marking a plurality of preset positions and aggregation areas corresponding to the preset positions in the first feature map according to the resolution of a preset output image;
carrying out weighted operation on the similarity of each pixel point in the aggregation areas to obtain a matching cost value corresponding to a preset position;
and constructing the similarity feature matrix based on the matching cost values of the preset positions.
8. The method according to any one of claims 1 to 4, wherein the obtaining a depth prediction map according to the similarity feature matrix comprises:
on the basis of a multi-layer perceptron, calculating the similarity feature matrix to obtain an initial parallax estimation and a parallax residual, wherein the initial parallax estimation comprises minimum matching cost values respectively corresponding to the preset positions, and the parallax residual is used for correcting the initial parallax estimation;
and calculating to obtain a depth prediction map according to the initial parallax estimation and the parallax residual.
9. A depth information acquisition apparatus characterized by comprising:
the first acquisition module is used for acquiring a first feature map of the left eye image and a second feature map of the right eye image;
the first execution module is used for respectively calculating the similarity between each pixel point and all the pixel points on the corresponding epipolar line of the second feature map based on each pixel point of the first feature map to obtain a similarity matrix, and the similarity matrix is used for reflecting the similarity of each pixel point in the first feature map;
the second execution module is used for calculating matching cost values of a plurality of preset positions in the first feature map according to the resolution of a pre-output image and the similarity matrix to obtain a similarity feature matrix, wherein the matching cost values are used for reflecting the similarity degree of the preset positions and the corresponding positions of the second feature map, the matching cost values and the similarity degree are in an inverse proportion relation, and the similarity feature matrix comprises the matching cost values of the plurality of preset positions;
and the second acquisition module is used for acquiring the depth prediction map according to the similarity characteristic matrix.
10. The apparatus of claim 9, wherein the first acquisition module comprises:
an extraction submodule configured to perform feature extraction on the acquired left eye image and right eye image to obtain a first image feature of the left eye image and a second image feature of the right eye image;
and a first operation submodule configured to perform attention operations on the first image feature and the second image feature to obtain the first feature map and the second feature map.
11. The apparatus according to claim 10, wherein the first operation submodule is specifically configured to perform a plurality of attention-based iterative operations on the first image feature and the second image feature to obtain the first feature map and the second feature map;
wherein each iterative operation comprises:
performing a self-attention operation on each of the first image feature and the second image feature to obtain corresponding optimized image features;
and performing a cross-attention operation between the first image feature and the second image feature that have undergone the self-attention operation to obtain corresponding optimized image features.
12. The apparatus of claim 11, wherein the first operation submodule is further configured to normalize the first image feature and the second image feature after each self-attention operation or cross-attention operation, and to normalize the first image feature and the second image feature before the plurality of attention-based iterative operations are performed.
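The sketch below illustrates the iteration of claims 11 and 12 under assumptions the claims do not fix: single-head dot-product attention, layer-norm-style normalization, and residual connections around each attention operation. Features are normalized before the loop and after every self- or cross-attention step, matching the order recited in claim 12.

```python
import numpy as np

def _softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def _norm(x, eps=1e-6):
    # Layer-norm-style normalization over the channel dimension.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def _attend(q, kv):
    # Single-head dot-product attention; q: (N, C), kv: (M, C).
    return _softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

def refine_features(feat_l, feat_r, iterations=2):
    """Sketch of claims 11-12: normalize both image features, then run
    several iterations of self-attention on each feature followed by
    cross-attention between them, normalizing after each operation.
    feat_l, feat_r: (N, C) flattened image features."""
    feat_l, feat_r = _norm(feat_l), _norm(feat_r)
    for _ in range(iterations):
        feat_l = _norm(feat_l + _attend(feat_l, feat_l))  # self-attention
        feat_r = _norm(feat_r + _attend(feat_r, feat_r))
        new_l = _norm(feat_l + _attend(feat_l, feat_r))   # cross-attention
        new_r = _norm(feat_r + _attend(feat_r, feat_l))
        feat_l, feat_r = new_l, new_r
    return feat_l, feat_r

rng = np.random.default_rng(3)
l, r = refine_features(rng.normal(size=(32, 16)), rng.normal(size=(32, 16)))
print(l.shape, r.shape)  # (32, 16) (32, 16)
```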
13. The apparatus according to any one of claims 9 to 12, wherein the first execution module comprises:
a cross-attention submodule configured to perform a cross-attention operation between each pixel of the first feature map and all pixels on the corresponding epipolar line of the second feature map to obtain a plurality of similarity groups;
a first construction submodule configured to construct a matching similarity matrix based on the plurality of similarity groups;
and a configuration submodule configured to perform weight configuration on the matching similarity matrix to obtain the similarity matrix.
14. The apparatus according to claim 13, wherein the configuration submodule is specifically configured to, for the similarities within a same similarity group in the matching similarity matrix, increase the weight of the similarity with the highest value and decrease the weights of the remaining similarities.
15. The apparatus according to any one of claims 9 to 12, wherein the second execution module comprises:
a marking submodule configured to mark, in the first feature map, a plurality of preset positions and aggregation regions corresponding to the preset positions according to the resolution of the preset output image;
a weighting submodule configured to perform a weighted operation on the similarities of the pixels in each aggregation region to obtain the matching cost value corresponding to each preset position;
and a second construction submodule configured to construct the similarity feature matrix based on the matching cost values of the plurality of preset positions.
16. The apparatus according to any one of claims 9 to 12, wherein the second acquisition module comprises:
a second operation submodule configured to process the similarity feature matrix with a multi-layer perceptron to obtain an initial disparity estimate and a disparity residual, wherein the initial disparity estimate comprises the minimum matching cost values respectively corresponding to the plurality of preset positions, and the disparity residual is used for correcting the initial disparity estimate;
and a third operation submodule configured to calculate the depth prediction map according to the initial disparity estimate and the disparity residual.
17. An electronic device, comprising: a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 8.
18. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 8.
CN202211068190.1A 2022-08-31 2022-08-31 Depth information acquisition method and device Pending CN115546515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211068190.1A CN115546515A (en) 2022-08-31 2022-08-31 Depth information acquisition method and device

Publications (1)

Publication Number Publication Date
CN115546515A 2022-12-30

Family

ID=84724962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211068190.1A Pending CN115546515A (en) 2022-08-31 2022-08-31 Depth information acquisition method and device

Country Status (1)

Country Link
CN (1) CN115546515A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150249839A1 (en) * 2012-09-25 2015-09-03 Nippon Telegraph And Telephone Corporation Picture encoding method, picture decoding method, picture encoding apparatus, picture decoding apparatus, picture encoding program, picture decoding program, and recording media
US20140267243A1 (en) * 2013-03-13 2014-09-18 Pelican Imaging Corporation Systems and Methods for Synthesizing Images from Image Data Captured by an Array Camera Using Restricted Depth of Field Depth Maps in which Depth Estimation Precision Varies
CN109829875A (en) * 2017-11-23 2019-05-31 Samsung Electronics Co Ltd Method and apparatus for estimating disparity
CN113342997A (en) * 2021-05-18 2021-09-03 Chengdu Kuaiyan Technology Co Ltd Cross-image text reading method based on text line matching
CN113256699A (en) * 2021-06-24 2021-08-13 Tencent Technology (Shenzhen) Co Ltd Image processing method, image processing device, computer equipment and storage medium
CN113763269A (en) * 2021-08-30 2021-12-07 Shanghai University of Engineering Science Stereo matching method for binocular images
CN114723801A (en) * 2022-04-20 2022-07-08 Xihua University A method for obtaining multi-scale disparity maps based on stereo matching deep neural network
CN114820323A (en) * 2022-05-17 2022-07-29 Southeast University Multi-scale residual binocular image super-resolution method based on stereo attention mechanism
CN114943785A (en) * 2022-06-17 2022-08-26 China United Network Communications Group Co Ltd Map construction method, apparatus, device and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992200A (en) * 2023-09-04 2023-11-03 Heilongjiang Huida Technology Co Ltd Parallax calculation method, binocular vision system and agricultural unmanned aerial vehicle

Similar Documents

Publication Publication Date Title
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
CN115880555B (en) Target detection method, model training method, device, equipment and medium
CN114066814B (en) A gesture 3D key point detection method for AR device and electronic device
CN111868786B (en) Cross-device monitoring computer vision system
CN113592015B (en) Method and device for positioning and training feature matching network
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN112541902A (en) Similar area searching method, similar area searching device, electronic equipment and medium
US20240257501A1 (en) Feature map generation method and apparatus, storage medium, and computer device
CN114332509A (en) Image processing method, model training method, electronic device and automatic driving vehicle
CN112508996A (en) Target tracking method and device for anchor-free twin network corner generation
CN110717405B (en) Face feature point positioning method, device, medium and electronic equipment
CN114119999B (en) Iterative 6D pose estimation method and device based on deep learning
CN115546515A (en) Depth information acquisition method and device
Vázquez‐Delgado et al. Real‐time multi‐window stereo matching algorithm with fuzzy logic
CN115861667A (en) Self-attention multi-scale pyramid binocular stereo matching method and electronic equipment
CN114881841A (en) Image generation method and device
CN115375740A (en) Pose determination method, three-dimensional model generation method, device, equipment and medium
CN115690845A (en) Motion trail prediction method and device
CN115761133A (en) Training and three-dimensional reconstruction method and device of three-dimensional reconstruction model and storage medium
CN114463409A (en) Method and device for determining image depth information, electronic equipment and medium
CN113706543A (en) Three-dimensional pose construction method and equipment and storage medium
CN113610856A (en) Method and device for training image segmentation model and image segmentation
Meng et al. Geometric-driven structure recovery from a single omnidirectional image based on planar depth map learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination