Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present specification as detailed in the appended claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" depending on the context.
In application fields such as autonomous driving, robotics, digital cities, and virtual/mixed reality, three-dimensional reconstruction technology has attracted wide attention. In recent years, with the continuous development of deep learning, deep learning models for three-dimensional reconstruction have emerged that use neural networks to acquire key information features of point cloud data from a single image or a sequence of images.
However, current deep learning models have relatively weak capability to capture fine-grained image information and may have limitations in processing complex textures, details, edges, and the like, which degrades the modeling results and their accuracy.
To address these problems, the present application provides a method of acquiring key information features for three-dimensional reconstruction, which extracts richer multi-scale features by fusing image features more deeply and effectively, thereby improving the capability to capture fine-grained image information and facilitating the subsequent three-dimensional reconstruction task.
Next, embodiments of the present specification will be described in detail.
As shown in fig. 1, fig. 1 is a flowchart illustrating a method of acquiring key information features for three-dimensional reconstruction according to an exemplary embodiment of the present specification, including the steps of:
Step 101: acquiring a plurality of two-dimensional images of a target scene at different orientations.
For different target scenes, such as indoor environments, buildings, or in-vehicle environments, spatial image information reflecting different angles of the target scene can be obtained through cameras arranged at different orientations of the target scene. Taking an in-vehicle environment as an example of the target scene, spatial image information inside the vehicle can be acquired through cameras at three positions in the vehicle: middle front, left rear, and right rear. The above is merely an example, and the present specification does not limit the positions or the number of the cameras.
In an embodiment, as shown in fig. 2, an encoder is utilized to map the feature information of a plurality of initial images I1, I2, …, Ir captured at different orientations in the target scene into the latent vector space 20, respectively, so as to obtain two-dimensional images P1, P2, …, Pr corresponding to the initial images I1, I2, …, Ir, where r is the total number of initial images captured by the cameras. Any two-dimensional image Pi is a low-dimensional representation of the corresponding initial image Ii. Alternatively, the encoder may be the encoder of a VAE model.
Through this embodiment, redundant information in the initial images can be remarkably reduced; moreover, compressing the images into vector representations in the latent space also reduces the data dimension, thereby helping to accelerate subsequent image inference tasks.
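As a purely illustrative, non-limiting sketch, the encoding step may be written in PyTorch as follows; the channel widths, downsampling factor, and latent width are hypothetical placeholders, and a true VAE encoder would additionally predict a mean and a variance rather than a single deterministic latent map.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Illustrative stand-in for the encoder of fig. 2: maps an initial image Ii to a
    lower-dimensional two-dimensional representation Pi in the latent space."""
    def __init__(self, in_channels: int = 3, latent_channels: int = 8):
        super().__init__()
        # progressive downsampling discards redundant information and reduces the data dimension
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, initial_image: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> latent two-dimensional image Pi of shape (B, 8, H/4, W/4)
        return self.net(initial_image)

# usage: encode r initial images I1..Ir captured at different orientations
encoder = LatentEncoder()
initial_images = [torch.randn(1, 3, 256, 256) for _ in range(3)]   # stand-ins for I1, I2, I3
two_dim_images = [encoder(image) for image in initial_images]      # P1, P2, P3
```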
Step 102: respectively performing n rounds of iterative processing on each two-dimensional image; wherein each round of iterative processing performed for any two-dimensional image includes: performing first fusion processing on m feature maps of different scales extracted from an input image, and outputting a transition map generated by the fusion; the input image of the first round of iterative processing is the two-dimensional image, and the input image of a non-first round of iterative processing is the transition map output by the previous round of iterative processing; n and m are integers not less than 3.
Since the logic of the n rounds of iterative processing performed on the two-dimensional images P1, P2, …, Pr of different orientations is the same, only the processing logic for any one two-dimensional image Pi is described below:
In an embodiment, as shown in fig. 3, n rounds of iterative processing are performed on the two-dimensional image Pi, where the processing logic of each round of iterative processing includes two adjacent steps: feature extraction 3011 and first fusion processing 3012. Feature extraction 3011 extracts m feature maps of different scales from an input image, and the first fusion processing 3012 fuses the extracted m feature maps of different scales to obtain a transition map generated by the fusion.
First, in the first round of iterative processing 301, the input image is the two-dimensional image Pi; the first round of iterative processing 301 extracts feature maps Ti-1, Ti-2, …, Ti-m of m different scales from the two-dimensional image Pi and outputs a transition map D1 generated by the first fusion processing 3012 to the second round of iterative processing 302. Next, in a non-first round of iterative processing, the input image is the transition map output by the previous round; the processing steps are to extract m feature maps of different scales from that transition map and to output the transition map generated by the first fusion processing 3012 as the input image of the next round. Illustratively, in the second round of iterative processing, the input image is the transition map D1 output by the first round of iterative processing 301, and the processing steps are to extract m feature maps of different scales from the transition map D1 and to output a transition map D2 generated by the first fusion processing 3012. And so on: when the input image of the nth round of iterative processing 303 is the transition map Dn-1, the output image after the two steps of feature extraction 3011 and first fusion processing 3012 is the transition map Dn.
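The control flow of the n rounds of iterative processing can be summarized by the following purely illustrative Python sketch; the helper functions extract_multiscale and first_fusion are hypothetical placeholders for the feature extraction 3011 and first fusion processing 3012 steps detailed below.

```python
def iterate(p_i, n_rounds, extract_multiscale, first_fusion):
    """Illustrative only: n rounds of extract-and-fuse for one two-dimensional image Pi."""
    x = p_i                                   # round 1: the input image is the two-dimensional image Pi
    for _ in range(n_rounds):
        feature_maps = extract_multiscale(x)  # m feature maps of different scales (feature extraction 3011)
        x = first_fusion(feature_maps)        # transition map output by this round (first fusion processing 3012)
    return x                                  # transition map Dn output by the final round
```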
Next, two steps of the feature extraction 3011 and the first fusion process 3012 included in each round of iterative processing of fig. 3 are described in detail:
In one embodiment, the feature extraction 3011 step of fig. 3 is addressed as follows. For the input image of each round of iterative processing, whether the two-dimensional image Pi (first round) or the transition map Dj (non-first round), the feature maps Ti-1, Ti-2, …, Ti-m of m different scales can be extracted by the method shown in fig. 4. As shown in fig. 4, a convolutional neural network may be selected to extract m raw feature maps Fi-1, Fi-2, …, Fi-m of different scales from the input image. Alternatively, the raw feature maps of m different scales may be extracted by incrementally increasing the size of the convolution kernel and the number of channels. The extracted raw feature maps Fi-1, Fi-2, …, Fi-m of the m scales may be used directly in the subsequent first fusion processing 3012 step. Alternatively, the extracted raw feature map of each scale may be processed with a self-attention mechanism model to obtain feature maps Ti-1, Ti-2, …, Ti-m with contextual relations, and these m context-aware feature maps of different scales may be used in the subsequent first fusion processing 3012 step.
On the basis of the above embodiment, as shown in fig. 4, an embedding vector Ei of the orientation information corresponding to the image Ii in the target scene may also be obtained, and the embedding vector Ei and the raw feature maps Fi-1, Fi-2, …, Fi-m may be input together into the self-attention mechanism model for processing. The orientation information of the image Ii may be the position of the camera that captured the image; for example, if the target scene is an in-vehicle scene, the position of the camera may be left rear, right rear, or middle front in the vehicle. The orientation information of the image Ii is then converted, by an embedding model, into an embedding vector Ei having the same dimension as the two-dimensional image Pi and is input into the self-attention mechanism model together with the raw feature maps for processing.
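A purely illustrative, non-limiting PyTorch sketch of the feature extraction 3011 step is given below; the kernel sizes (3, 5, 7), strides, channel widths (16, 32, 64), the three camera positions, and the use of nn.MultiheadAttention as the self-attention mechanism model are all hypothetical choices, with the input width assumed to match the latent encoder sketch above.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Illustrative sketch of feature extraction 3011: convolution branches with growing
    kernel size, channel count and downsampling, followed by self-attention per scale
    together with an embedded orientation token."""
    def __init__(self, in_channels: int = 8, num_positions: int = 3):
        super().__init__()
        specs = [(16, 3, 1), (32, 5, 2), (64, 7, 4)]      # (channels, kernel, stride), m = 3 scales
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, c, kernel_size=k, stride=s, padding=k // 2) for c, k, s in specs]
        )
        self.pos_embeds = nn.ModuleList([nn.Embedding(num_positions, c) for c, _, _ in specs])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(c, num_heads=4, batch_first=True) for c, _, _ in specs]
        )

    def forward(self, x: torch.Tensor, position_id: torch.Tensor):
        feature_maps = []
        for branch, embed, attn in zip(self.branches, self.pos_embeds, self.attns):
            f = branch(x)                                  # raw feature map Fi-k
            b, c, h, w = f.shape
            e_i = embed(position_id).unsqueeze(1)          # orientation embedding Ei, shape (B, 1, C)
            tokens = torch.cat([e_i, f.flatten(2).transpose(1, 2)], dim=1)
            out, _ = attn(tokens, tokens, tokens)          # self-attention adds contextual relations
            feature_maps.append(out[:, 1:].transpose(1, 2).reshape(b, c, h, w))
        return feature_maps                                 # [Ti-1, Ti-2, Ti-3]

# usage with a latent image from the encoder sketch above (camera position index 0)
extractor = MultiScaleExtractor()
t_maps = extractor(torch.randn(1, 8, 64, 64), torch.tensor([0]))
```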
For the first fusion processing 3012 step in fig. 3, an example with m = 3 is illustrated.
In one embodiment, as shown in fig. 5, among the m feature maps Ti-1, Ti-2, Ti-3 of different scales extracted from the input image by the method of fig. 4, the feature map Ti-1 of the first scale is converted, by dimension transformation, into a feature map Bi-1 with the same dimension as the feature map Ti-2 of the second scale, and Bi-1 and Ti-2 are added element-wise to obtain a first intermediate fusion feature map Ui-2; the first intermediate fusion feature map Ui-2 is then converted into the dimension of the feature map Ti-3 of the third scale, and the converted map Bi-2 and the feature map Ti-3 of the third scale are added element-wise to obtain the transition map D.
It should be noted that, for the case where m is greater than 3, the processing logic is the same, and this description is not repeated here.
In this embodiment, since the image information captured by feature maps of different scales has different semantic hierarchies and perception ranges, performing dimension transformation and accumulation on the feature maps of different scales balances the contributions of the different scales, providing a richer and more comprehensive visual representation.
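A purely illustrative, non-limiting PyTorch sketch of the first fusion processing for m = 3 is given below; the channel widths (16, 32, 64) follow the extraction sketch above, and the use of a 1x1 convolution plus bilinear interpolation as the dimension transformation is an assumption rather than the only possible choice.

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstFusion(nn.Module):
    """Illustrative sketch: progressively convert each scale to the next scale's
    dimension and add element-wise, yielding the transition map of this round."""
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        # one dimension transformation per adjacent scale pair (scale i -> scale i+1)
        self.transforms = nn.ModuleList(
            [nn.Conv2d(channels[i], channels[i + 1], kernel_size=1) for i in range(len(channels) - 1)]
        )

    def forward(self, feature_maps):                          # [Ti-1, Ti-2, Ti-3]
        fused = feature_maps[0]                               # start from the first-scale map Ti-1
        for i, transform in enumerate(self.transforms):
            target = feature_maps[i + 1]
            resized = F.interpolate(fused, size=target.shape[-2:],
                                    mode="bilinear", align_corners=False)
            fused = transform(resized) + target               # element-wise addition -> Ui-(i+2)
        return fused                                          # transition map D of this round
```

Note that in this sketch the transition map inherits the channel width of the last scale; before the next round reuses the same extractor, a projection back to the extractor's expected input width (omitted here) or a per-round extractor would be needed.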
Step 103: obtaining the transition map output by the final round of iterative processing corresponding to each two-dimensional image, and respectively performing second fusion processing according to the m feature maps of different scales extracted from each obtained transition map, so as to obtain a fused feature map.
In one embodiment, the transition map Dn output by the final round of the n rounds of iterative processing of fig. 3 is obtained, and m feature maps Ti-1, Ti-2, …, Ti-m of different scales are extracted from the transition map Dn according to the feature extraction method shown in fig. 4.
In another embodiment, as shown in fig. 6, m-1 rounds of feature fusion iterative processing are performed on the m feature maps of different scales. First, the feature map Ti-1 of the first scale is converted into the dimension of the feature map Ti-2 of the second scale, and the converted map is fused with the feature map Ti-2 of the second scale to obtain a second intermediate fusion feature map Ui-2; then, the second intermediate fusion feature map Ui-2 is converted into the dimension of the feature map Ti-3 of the third scale, and the converted map is fused with the feature map Ti-3 of the third scale to obtain a third intermediate fusion feature map Ui-3; and so on, m-1 intermediate fusion feature maps Ui-2, Ui-3, …, Ui-m can be obtained. Finally, second fusion processing 50 may be performed according to the obtained m-1 intermediate fusion feature maps and the feature map Ti-1 of the first scale to obtain a fused feature map Ki corresponding to the two-dimensional image Pi. Alternatively, the second fusion processing 50 may convert the acquired m-1 intermediate fusion feature maps Ui-2, Ui-3, …, Ui-m and the feature map Ti-1 of the first scale into the same dimension and then perform element-wise addition.
In one embodiment, before the second fusion processing 50 is performed, each of the intermediate fusion feature maps Ui-2, Ui-3, …, Ui-m and the feature map Ti-1 of the first scale may be processed with a depthwise separable convolution network, so as to increase the receptive field of the network and extract richer semantic information. The processed feature maps are then subjected to the second fusion processing 50 to obtain the fused feature map Ki.
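A purely illustrative, non-limiting PyTorch sketch of the m-1 rounds of feature fusion iterative processing and the second fusion processing 50 is given below; the channel widths, the common output width, and the bilinear resizing are assumptions, and the two-stage depthwise-then-pointwise convolution is one common realization of a depthwise separable convolution.

```python
import torch.nn as nn
import torch.nn.functional as F

def depthwise_separable(channels: int) -> nn.Module:
    # depthwise convolution (groups=channels) followed by a pointwise 1x1 convolution
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
        nn.Conv2d(channels, channels, kernel_size=1),
    )

class SecondFusion(nn.Module):
    """Illustrative sketch: build intermediate fusion maps Ui-2..Ui-m, refine each map
    (and Ti-1) with a depthwise separable convolution, then sum in a common dimension."""
    def __init__(self, channels=(16, 32, 64), out_channels=64):
        super().__init__()
        self.step = nn.ModuleList(
            [nn.Conv2d(channels[j], channels[j + 1], kernel_size=1) for j in range(len(channels) - 1)]
        )
        self.refine = nn.ModuleList([depthwise_separable(c) for c in channels])
        self.to_common = nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in channels])

    def forward(self, feature_maps):                      # [Ti-1, Ti-2, Ti-3] extracted from Dn
        intermediates, current = [], feature_maps[0]
        for j, step in enumerate(self.step):              # m-1 feature-fusion iterations
            target = feature_maps[j + 1]
            current = step(F.interpolate(current, size=target.shape[-2:],
                                         mode="bilinear", align_corners=False)) + target
            intermediates.append(current)                 # Ui-(j+2)
        pieces = [feature_maps[0]] + intermediates        # Ti-1, Ui-2, ..., Ui-m
        size = pieces[-1].shape[-2:]                      # common spatial size for the final sum
        fused = 0
        for piece, refine, proj in zip(pieces, self.refine, self.to_common):
            refined = refine(piece)                       # depthwise separable convolution
            fused = fused + proj(F.interpolate(refined, size=size,
                                               mode="bilinear", align_corners=False))
        return fused                                      # fused feature map Ki
```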
Step 104: acquiring key information features for three-dimensional reconstruction of the target scene according to the fused feature maps respectively corresponding to the two-dimensional images.
In an embodiment, the obtained fused feature maps K1, K2, …, Kr corresponding to the two-dimensional images P1, P2, …, Pr may be respectively input into a multi-layer perceptron to obtain the key information features for three-dimensional reconstruction of the target scene. The key information features may be, for example, the illumination intensity, transparency, and RGB values of the original image.
In another embodiment, the obtained key information features are processed by a decoder and can be restored into a high-resolution image for a subsequent three-dimensional reconstruction task. Alternatively, the decoder may be a decoder of a VAE model.
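A purely illustrative, non-limiting PyTorch sketch of step 104 is given below; the five-channel output split (three RGB channels, one transparency channel, one illumination-intensity channel), the per-pixel application of the multi-layer perceptron, and the transposed-convolution decoder standing in for a VAE decoder are all assumptions.

```python
import torch
import torch.nn as nn

class KeyFeatureHead(nn.Module):
    """Illustrative sketch: a multi-layer perceptron applied per spatial location of the
    fused feature map Ki, producing hypothetical key information features per pixel."""
    def __init__(self, in_channels: int = 64, hidden: int = 128, out_features: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, out_features),   # e.g. 3 RGB values + transparency + illumination intensity
        )

    def forward(self, k_i: torch.Tensor) -> torch.Tensor:
        b, c, h, w = k_i.shape
        tokens = k_i.flatten(2).transpose(1, 2)            # (B, H*W, C)
        return self.mlp(tokens).transpose(1, 2).reshape(b, -1, h, w)

class SimpleDecoder(nn.Module):
    """Stand-in for a VAE-style decoder that restores a higher-resolution image."""
    def __init__(self, in_channels: int = 5, out_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)
```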
Thus, an introduction to a method of acquiring key information features for three-dimensional reconstruction has been completed.
Corresponding to the embodiments of the aforementioned method, the present specification also provides embodiments of the apparatus and the terminal to which it is applied.
As shown in fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present specification. At the hardware level, the device includes a processor 702, an internal bus 704, a network interface 706, a memory 708, and a non-volatile storage 710, and may of course also include other hardware required by the service. One or more embodiments of the present specification may be implemented in a software-based manner, for example by the processor 702 reading a corresponding computer program from the non-volatile storage 710 into the memory 708 and then running it. Of course, in addition to software implementations, one or more embodiments of the present specification do not exclude other implementation manners, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic modules and may also be a hardware or logic device.
As shown in fig. 8, fig. 8 is a block diagram of an apparatus for acquiring key information features for three-dimensional reconstruction according to an exemplary embodiment shown in the present specification. The device can be applied to the electronic equipment shown in fig. 7 to realize the technical scheme of the specification. The device comprises:
an image acquiring unit 802, configured to acquire a plurality of two-dimensional images of a target scene under different orientations;
an aggregation unit 804, configured to respectively perform n rounds of iterative processing on each two-dimensional image; wherein each round of iterative processing performed for any two-dimensional image includes: performing first fusion processing on m feature maps of different scales extracted from an input image, and outputting a transition map generated by the fusion; the input image of the first round of iterative processing is the two-dimensional image, and the input image of a non-first round of iterative processing is the transition map output by the previous round of iterative processing; n and m are integers not less than 3;
a fusion processing unit 806, configured to obtain the transition map output by the final round of iterative processing corresponding to each two-dimensional image, and to respectively perform second fusion processing according to the m feature maps of different scales extracted from each obtained transition map, so as to obtain a fused feature map;
and a key information feature acquiring unit 808, configured to acquire key information features for three-dimensional reconstruction of the target scene according to the fused feature maps respectively corresponding to the two-dimensional images.
The image acquiring unit 802 is specifically configured to map, by using an encoder, feature information of a plurality of initial images captured at different orientations in the target scene into a latent space, so as to obtain two-dimensional images respectively corresponding to the initial images; wherein each two-dimensional image is a low-dimensional representation of the corresponding initial image.
The aggregation unit 804 is specifically configured to extract m raw feature maps of different scales from any input image by using a convolutional neural network, and to process the raw feature map of each scale separately with a self-attention mechanism model to obtain a feature map with contextual relations.
The aggregation unit 804 is specifically configured to obtain an embedding vector of the orientation information corresponding to the image in the target scene, and to input the embedding vector together with the raw feature maps into the self-attention mechanism model for processing.
The aggregation unit 804 is specifically configured to perform the first fusion processing on the m feature maps of different scales, where the first fusion processing includes: performing m-1 rounds of iterative fusion processing on the m feature maps of different scales; the i-th round of iterative fusion processing includes: converting the feature map of the i-th scale into the dimension of the feature map of the (i+1)-th scale, and performing element-wise addition on the converted feature map and the feature map of the (i+1)-th scale, until i = m-1; wherein the initial value of i is 1, and the value of i is increased by 1 after each round of iterative fusion processing.
The fusion processing unit 806 is specifically configured to perform m-1 rounds of feature fusion iterative processing on the m feature maps of different scales extracted from each obtained transition map; the j-th round of feature fusion iterative processing includes: converting the feature map of the j-th scale into the dimension of the feature map of the (j+1)-th scale, and fusing the converted feature map with the feature map of the (j+1)-th scale to obtain a (j+1)-th intermediate fusion feature map; wherein the initial value of j is 1, and the value of j is increased by 1 after each round of feature fusion iterative processing, until j = m-1; and to perform the second fusion processing according to all the obtained intermediate fusion feature maps and the feature map of the first scale.
The fusion processing unit 806 is specifically configured to process each intermediate fusion feature map and the feature map of the first scale by using a depthwise separable convolution network to obtain corresponding processed feature maps, and to perform the second fusion processing on all the processed feature maps.
The implementation process of the functions and roles of each module in the above apparatus is described in detail in the implementation process of the corresponding steps in the above method and will not be repeated here.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.