
CN116630744A - Image generation model training method, image generation device and medium - Google Patents

Image generation model training method, image generation device and medium

Info

Publication number: CN116630744A
Application number: CN202310583770.2A
Authority: CN (China)
Prior art keywords: image, feature vector, points, input image, pixels
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王博瀚, 申世伟, 李家宏
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310583770.2A
Publication of CN116630744A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the technical field of computers, and in particular to an image generation model training method, an image generation device, a medium and an electronic device. The training method comprises the following steps: inputting an input image sample into a model to be trained, obtaining an image feature vector corresponding to the input image sample, obtaining a depth feature vector corresponding to the input image sample, and fusing the two to obtain a multi-modal feature vector corresponding to the input image sample; determining the positions of a plurality of pixels corresponding to a predicted target image, generating virtual rays from a camera focus to the positions of the pixels, and sampling a plurality of points on each virtual ray; predicting color information of the plurality of pixels corresponding to the target image; rendering the plurality of pixels to obtain the predicted target image; and updating the neural network parameters of the model to be trained to obtain the image generation model. The technical solution of the embodiments of the disclosure can solve the problem that new-view-angle images rendered in the prior art have defects.

Description

Image generation model training method, image generation device and medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to an image generation model training method, an image generation model training apparatus, an image generation apparatus, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of software and hardware, computer graphics is increasingly used in various fields, for example in holography, the metaverse, digital twins, and the like. In computer graphics, rendering images from new view angles is a key research direction.
In the related art, two-dimensional features of an input image can be acquired, and a model can be trained under the supervision of these two-dimensional features so as to render new-view-angle images. However, because such schemes train the model with two-dimensional features only, three-dimensional information is lost, and the rendered new-view-angle images suffer from defects such as blurring and artifacts.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide an image generation model training method, an image generation model training device, a computer readable storage medium and an electronic device, which can solve the problem that a new view angle image rendered in the prior art has defects.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to a first aspect of the present disclosure, there is provided an image generation model training method, including: acquiring an input image sample and a target image sample, wherein the input image sample and the target image sample are images captured in the same scene with different camera poses, and the target image sample corresponds to a target camera pose; inputting the input image sample into a model to be trained, acquiring an image feature vector corresponding to the input image sample, acquiring a depth feature vector corresponding to the input image sample, and fusing the image feature vector and the depth feature vector to obtain a multi-modal feature vector corresponding to the input image sample; determining positions of a plurality of pixels corresponding to a predicted target image based on the target camera pose, generating virtual rays from the camera focus to the positions of the pixels, and sampling a plurality of points on each virtual ray, wherein the target camera pose corresponds to the camera focus; determining color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; rendering the plurality of pixels according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and updating neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain an image generation model.
Optionally, based on the foregoing solution, obtaining a depth feature vector corresponding to an input image sample includes: performing depth estimation on an input image sample to obtain a sparse depth feature vector corresponding to the input image sample; determining a depth feature vector corresponding to the input image sample according to the sparse depth feature vector; the depth feature vector corresponding to the input image sample is a dense depth feature vector.
Optionally, based on the foregoing solution, determining color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector includes: determining pixel coordinate values of the pixels, and determining pixel feature vectors corresponding to the pixels according to the multi-modal feature vector; performing coordinate encoding on the spatial coordinate values of the plurality of points sampled on the virtual rays to obtain a plurality of spatial coordinate encoding vectors; determining color information of the plurality of points and density information of the plurality of points according to the pixel feature vectors corresponding to the pixels and the plurality of spatial coordinate encoding vectors; and determining the color information of the plurality of pixels based on the color information of the plurality of points and the density information of the plurality of points.
Optionally, based on the foregoing solution, the virtual ray has a ray direction, and determining color information of a plurality of pixels according to color information of a plurality of points and density information of the plurality of points includes: performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points to obtain a plurality of candidate points; color information for a plurality of pixels is determined based on a plurality of candidate points on the virtual ray.
Optionally, based on the foregoing solution, the plurality of points sampled on the virtual ray have spatial coordinate values, and determining the pixel coordinate value of a pixel includes: converting the spatial coordinate values of the plurality of points sampled on the virtual ray from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points; converting the first candidate coordinate values of the plurality of points from the camera coordinate system to the image coordinate system to obtain a second candidate coordinate value of the pixel; and converting the second candidate coordinate value of the pixel from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel.
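As a non-authoritative illustration of the chain of coordinate conversions described above (world coordinate system to camera coordinate system to image coordinate system to pixel coordinate system), the following Python sketch assumes a standard pinhole camera model with extrinsics (R, t) and an intrinsic matrix K; the function name and array shapes are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def world_to_pixel(points_w, R, t, K):
    """Illustrative sketch: project (N, 3) world-space points to pixel coordinates
    through the camera and image coordinate systems of a pinhole camera."""
    # World -> camera coordinate system (first candidate coordinate values)
    points_c = (R @ points_w.T + t.reshape(3, 1)).T          # (N, 3)
    # Camera -> image coordinate system: perspective division by depth z
    x = points_c[:, 0] / points_c[:, 2]
    y = points_c[:, 1] / points_c[:, 2]
    # Image -> pixel coordinate system: apply focal lengths and principal point
    u = K[0, 0] * x + K[0, 2]
    v = K[1, 1] * y + K[1, 2]
    return np.stack([u, v], axis=-1)                         # (N, 2) pixel coordinate values
```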
According to a second aspect of the present disclosure, there is provided an image generation method, the method comprising: acquiring an input image and a predicted camera pose, wherein the predicted camera pose is the camera pose corresponding to a target image, and the input image and the target image are images captured in the same scene with different camera poses; and inputting the input image into an image generation model to obtain the target image, wherein the image generation model is obtained by the image generation model training method according to any one of the above.
According to a third aspect of the present disclosure, there is provided an image generation model training apparatus, the apparatus comprising: a sample data acquisition unit configured to acquire an input image sample and a target image sample, wherein the input image sample and the target image sample are images captured in the same scene with different camera poses, and the target image sample corresponds to a target camera pose; a feature vector acquisition unit configured to input the input image sample into a model to be trained, acquire an image feature vector corresponding to the input image sample, acquire a depth feature vector corresponding to the input image sample, and fuse the image feature vector and the depth feature vector to obtain a multi-modal feature vector corresponding to the input image sample; a virtual ray generation unit configured to determine positions of a plurality of pixels corresponding to a predicted target image based on the target camera pose, generate virtual rays from the camera focus to the positions of the pixels, and sample a plurality of points on each virtual ray, wherein the target camera pose corresponds to the camera focus; a color information acquisition unit configured to determine color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; a predicted image rendering unit configured to render the plurality of pixels according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and a network parameter updating unit configured to update neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain an image generation model.
Optionally, based on the foregoing solution, the apparatus further includes: the sparse depth feature vector acquisition unit is configured to perform depth estimation on the input image samples to obtain sparse depth feature vectors corresponding to the input image samples; a depth feature vector acquisition unit configured to perform determination of a depth feature vector corresponding to an input image sample from the sparse depth feature vector; the depth feature vector corresponding to the input image sample is a dense depth feature vector.
Optionally, based on the foregoing solution, the apparatus further includes: a pixel feature vector acquisition unit configured to determine pixel coordinate values of the pixels and determine pixel feature vectors corresponding to the pixels according to the multi-modal feature vector; a spatial coordinate encoding vector acquisition unit configured to perform coordinate encoding on the spatial coordinate values of the plurality of points sampled on the virtual ray to obtain a plurality of spatial coordinate encoding vectors; a spatial coordinate encoding vector determination unit configured to determine color information of the plurality of points and density information of the plurality of points according to the pixel feature vectors corresponding to the pixels and the plurality of spatial coordinate encoding vectors; and a first color information determination unit configured to determine the color information of the plurality of pixels based on the color information of the plurality of points and the density information of the plurality of points.
Optionally, based on the foregoing solution, the virtual ray has a ray direction, and the color information of the plurality of pixels is determined according to the color information of the plurality of points and the density information of the plurality of points, and the apparatus further includes: a candidate point determination unit configured to perform volume rendering in a ray direction based on color information of the plurality of points and density information of the plurality of points to obtain a plurality of candidate points; and a second color information determining unit configured to perform determination of color information of a plurality of pixels from a plurality of candidate points on the virtual ray.
Optionally, based on the foregoing solution, the plurality of points sampled on the virtual ray have spatial coordinate values, and the apparatus further includes: a first candidate coordinate value determination unit configured to convert the spatial coordinate values of the plurality of points sampled on the virtual ray from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points; a second candidate coordinate value determination unit configured to convert the first candidate coordinate values of the plurality of points from the camera coordinate system to the image coordinate system to obtain a second candidate coordinate value of the pixel; and a pixel coordinate value determination unit configured to convert the second candidate coordinate value of the pixel from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel.
According to a fourth aspect of the present disclosure, there is provided an image generating apparatus, the apparatus comprising: an input image acquisition unit configured to perform acquisition of an input image and prediction of a camera pose; the method comprises the steps of predicting the pose of a camera as the pose of the camera corresponding to a target image, wherein an input image and the target image are images shot by adopting different camera poses in the same scene; an image generation unit configured to perform inputting of an input image into an image generation model, resulting in a target image; wherein the image generation model is obtained by training the image generation model according to any one of the above.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image generation model training method of the first aspect and the image generation method of the second aspect as in the above-described embodiments.
According to a sixth aspect of the present disclosure, there is provided an electronic device comprising:
a processor; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the image generation model training method of the first aspect and the image generation method of the second aspect as in the above embodiments.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the image generation model training method and the image generation method of any one of the above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
in the image generation model training method provided by the embodiments of the present disclosure, an input image sample and a target image sample can be acquired; the input image sample is input into a model to be trained, an image feature vector corresponding to the input image sample is acquired, a depth feature vector corresponding to the input image sample is acquired, and the image feature vector and the depth feature vector are fused to obtain a multi-modal feature vector corresponding to the input image sample; positions of a plurality of pixels corresponding to a predicted target image are determined based on the target camera pose, virtual rays are generated from the camera focus to the positions of the pixels, and a plurality of points are sampled on each virtual ray; color information of the plurality of pixels corresponding to the predicted target image is determined according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; the plurality of pixels are rendered according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and the neural network parameters of the model to be trained are updated according to the target image sample and the predicted target image to obtain the image generation model. According to the embodiments of the present disclosure, on the one hand, the multi-modal feature vector obtained in the image generation model by fusing the image feature vector and the depth feature vector can carry more information, so that the reconstructed three-dimensional scene has view-angle consistency; on the other hand, virtual rays reaching each pixel of the target image can be generated, and the color information of the plurality of pixels of the target image can be determined according to the plurality of points sampled on the virtual rays, so that the target image is generated. Meanwhile, because the model is trained with both two-dimensional and three-dimensional information, the accuracy of the generated target image can be improved, and defects of the target image such as blurring and artifacts are avoided.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 schematically illustrates a schematic diagram of an exemplary system architecture of an image generation model training method in an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image generation model training method in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of determining depth feature vectors corresponding to input image samples from sparse depth feature vectors in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flowchart for determining color information for a plurality of pixels from color information for a plurality of points and density information for a plurality of points in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart for determining color information for a plurality of pixels from a plurality of candidate points on a virtual ray in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of converting a second candidate coordinate value of a pixel from an image coordinate system to a pixel coordinate system to obtain a pixel coordinate value of the pixel in an exemplary embodiment of the disclosure;
FIG. 7 schematically illustrates a flowchart of fusing an image feature vector and a depth feature vector to obtain a multi-modal feature vector corresponding to an input image sample in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of sampling multiple points on each virtual ray in an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a flowchart for determining color information of a plurality of points and density information of the plurality of points according to pixel feature vectors corresponding to pixels and a plurality of spatial coordinate encoding vectors in an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart for rendering a plurality of pixels according to color information corresponding to the plurality of pixels in an exemplary embodiment of the present disclosure;
FIG. 11 schematically illustrates a schematic diagram of an image generation system in an exemplary embodiment of the present disclosure;
FIG. 12 schematically illustrates a flow chart of inputting an input image into an image generation model to obtain a target image in an exemplary embodiment of the present disclosure;
FIG. 13 schematically illustrates a composition diagram of an image generation model training apparatus in an exemplary embodiment of the present disclosure;
FIG. 14 schematically illustrates a composition diagram of an image generating apparatus in an exemplary embodiment of the present disclosure;
FIG. 15 schematically illustrates a structural schematic diagram of a computer system suitable for use in implementing the electronic device of the exemplary embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more software and/or hardware modules, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which an image generation model training method or image generation method of embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 1000 may include one or more of terminal devices 1001, 1002, 1003, a network 1004, and a server 1005. The network 1004 serves as a medium for providing a communication link between the terminal apparatuses 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 1005 may be a server cluster formed by a plurality of servers.
A user can interact with a server 1005 via a network 1004 using terminal apparatuses 1001, 1002, 1003 to receive or transmit messages or the like. The terminal devices 1001, 1002, 1003 may be various electronic devices having a display screen including, but not limited to, smartphones, tablet computers, portable computers, desktop computers, and the like. In addition, the server 1005 may be a server providing various services.
In one embodiment, the execution subject of the image generation model training method of the present disclosure may be the server 1005. The server 1005 may acquire the input image sample and the target image sample sent by the terminal devices 1001, 1002, 1003; input the input image sample into a model to be trained; acquire an image feature vector corresponding to the input image sample; acquire a depth feature vector corresponding to the input image sample; fuse the image feature vector and the depth feature vector to obtain a multi-modal feature vector corresponding to the input image sample; determine positions of a plurality of pixels corresponding to a predicted target image based on the target camera pose; generate virtual rays from the camera focus to the positions of the pixels and sample a plurality of points on each virtual ray; determine color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; render the plurality of pixels according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and update the neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain the image generation model. In addition, the image generation model training method of the present disclosure may also be executed entirely by the terminal devices 1001, 1002, 1003 or the like, which in that case acquire the input image sample and the target image sample themselves and perform the same steps to obtain the image generation model.
In addition, the image generation model training method of the present disclosure may also be implemented jointly by the terminal devices 1001, 1002, 1003 and the server 1005. For example, the terminal devices 1001, 1002, 1003 may acquire the input image sample and the target image sample and send them to the server 1005, so that the server 1005 inputs the input image sample into the model to be trained, acquires the image feature vector and the depth feature vector corresponding to the input image sample, fuses them to obtain the multi-modal feature vector corresponding to the input image sample, determines the positions of the plurality of pixels corresponding to the predicted target image based on the target camera pose, generates virtual rays from the camera focus to the positions of the pixels, samples a plurality of points on each virtual ray, determines the color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector, renders the plurality of pixels according to the color information corresponding to the plurality of pixels to obtain the predicted target image, and updates the neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain the image generation model.
With the rapid development of software and hardware, computer graphics is increasingly used in various fields, for example in holography, the metaverse, digital twins, and the like. In computer graphics, rendering images from new view angles is a key research direction.
In the related art, two-dimensional features of an input image can be acquired, and a model can be trained under the supervision of these two-dimensional features so as to render new-view-angle images. However, because such schemes train the model with two-dimensional features only, three-dimensional information is lost, and the rendered new-view-angle images suffer from defects such as blurring and artifacts.
According to the image generation model training method provided in the present exemplary embodiment, an input image sample and a target image sample may be acquired; the input image sample is input into a model to be trained, an image feature vector corresponding to the input image sample is acquired, a depth feature vector corresponding to the input image sample is acquired, and the image feature vector and the depth feature vector are fused to obtain a multi-modal feature vector corresponding to the input image sample; positions of a plurality of pixels corresponding to a predicted target image are determined based on the target camera pose, a virtual ray is generated from the camera focus to the position of each pixel, and a plurality of points are sampled on each virtual ray; color information of the plurality of pixels corresponding to the predicted target image is determined according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; the plurality of pixels are rendered according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and the neural network parameters of the model to be trained are updated according to the target image sample and the predicted target image to obtain an image generation model. As shown in FIG. 2, the image generation model training method may include the following steps S210 to S260:
Step S210, acquiring an input image sample and a target image sample; wherein the input image sample and the target image sample are images captured in the same scene with different camera poses, and the target image sample corresponds to a target camera pose;
Step S220, inputting the input image sample into a model to be trained, obtaining an image feature vector corresponding to the input image sample, obtaining a depth feature vector corresponding to the input image sample, and fusing the image feature vector and the depth feature vector to obtain a multi-modal feature vector corresponding to the input image sample;
Step S230, determining the positions of a plurality of pixels corresponding to a predicted target image based on the target camera pose, generating virtual rays from the camera focus to the positions of the pixels, and sampling a plurality of points on each virtual ray; wherein the target camera pose corresponds to the camera focus;
Step S240, determining color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector;
Step S250, rendering the plurality of pixels according to the color information corresponding to the plurality of pixels to obtain the predicted target image;
and Step S260, updating the neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain an image generation model.
In the image generation model training method provided by the embodiments of the present disclosure, an input image sample and a target image sample can be acquired; the input image sample is input into a model to be trained, an image feature vector corresponding to the input image sample is acquired, a depth feature vector corresponding to the input image sample is acquired, and the image feature vector and the depth feature vector are fused to obtain a multi-modal feature vector corresponding to the input image sample; positions of a plurality of pixels corresponding to a predicted target image are determined based on the target camera pose, virtual rays are generated from the camera focus to the positions of the pixels, and a plurality of points are sampled on each virtual ray; color information of the plurality of pixels corresponding to the predicted target image is determined according to the plurality of points sampled on the virtual rays and the multi-modal feature vector; the plurality of pixels are rendered according to the color information corresponding to the plurality of pixels to obtain the predicted target image; and the neural network parameters of the model to be trained are updated according to the target image sample and the predicted target image to obtain the image generation model. According to the embodiments of the present disclosure, on the one hand, the multi-modal feature vector obtained in the image generation model by fusing the image feature vector and the depth feature vector can carry more information, so that the reconstructed three-dimensional scene has view-angle consistency; on the other hand, virtual rays reaching each pixel of the target image can be generated, and the color information of the plurality of pixels of the target image can be determined according to the plurality of points sampled on the virtual rays, so that the target image is generated. Meanwhile, because the model is trained with both two-dimensional and three-dimensional information, the accuracy of the generated target image can be improved, and defects of the target image such as blurring and artifacts are avoided.
Next, steps S210 to S260 of the image generation model training method in the present exemplary embodiment will be described in more detail with reference to the drawings and the embodiments.
Step S210, acquiring an input image sample and a target image sample;
in one example embodiment of the present disclosure, an input image sample may be acquired together with a target image sample. The input image sample and the target image sample are images captured in the same scene with different camera poses, and the target image sample corresponds to a target camera pose. Specifically, the input image sample and the target image sample are images obtained by capturing the same subject from different shooting angles, and images captured from different shooting angles correspond to different camera poses. Specifically, a camera pose may include camera extrinsic parameters and camera intrinsic parameters: the extrinsic parameters may include the camera position, the camera angle, and the like, and the intrinsic parameters may include the camera focal length, the imaging size, and the like.
It should be noted that, the specific parameter type of the camera pose is not particularly limited in the present disclosure.
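For illustration only, a camera pose of the kind described above can be represented by an intrinsic matrix (focal lengths and principal point) together with an extrinsic rotation and translation; the numeric values in the sketch below are placeholders, not values from the disclosure.

```python
import numpy as np

# Intrinsic parameters: focal lengths in pixels (fx, fy) and principal point (cx, cy).
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0            # placeholder values
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsic parameters: camera orientation R and camera position/translation t.
R = np.eye(3)                                            # placeholder orientation
t = np.zeros(3)                                          # placeholder position
pose = np.eye(4)                                         # 4x4 homogeneous camera pose
pose[:3, :3] = R
pose[:3, 3] = t
```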
For example, an input image sample may be obtained by photographing a subject with an input camera pose, and a target image sample may be obtained by photographing a subject with a target camera pose.
The method for generating the input image sample and the target image sample is not particularly limited.
In one example embodiment of the present disclosure, after the target image sample is acquired, the target camera pose corresponding to the target image sample may be acquired. For example, the target camera pose corresponding to the target image sample may be obtained by an SFM (Structure from Motion) algorithm, which recovers the three-dimensional scene structure from motion information.
Step S220, inputting the input image sample into the model to be trained, obtaining an image feature vector corresponding to the input image sample, obtaining a depth feature vector corresponding to the input image sample, and fusing the image feature vector and the depth feature vector to obtain a multi-modal feature vector corresponding to the input image sample;
in an example embodiment of the present disclosure, after the input image sample and the target image sample are obtained through the above steps, and the target camera pose corresponding to the target image sample is obtained, the input image sample may be input into the model to be trained, and the image feature vector corresponding to the input image sample is obtained. Specifically, an image feature vector corresponding to an input image sample may be used to indicate a color of the input image sample, such as RGB (Red Green Blue) features.
It should be noted that, the specific type of the image feature vector corresponding to the input image sample and the specific manner of obtaining the image feature vector corresponding to the input image sample are not limited in this disclosure.
In an example embodiment of the present disclosure, after the input image samples are input into the model to be trained through the above steps, depth feature vectors corresponding to the input image samples may be acquired. In particular, a depth feature vector corresponding to an input image sample may be used to indicate a distance between a point in a scene in the input image sample and a camera.
In an example embodiment of the present disclosure, when obtaining the depth feature vector corresponding to the input image sample, the single-channel depth map estimated for the input image sample may be converted into a three-channel HSV (Hue, Saturation, Value) format and then input into a pre-trained convolutional neural network to extract the depth feature vector.
It should be noted that, the specific manner of obtaining the depth feature vector corresponding to the input image sample is not particularly limited in this disclosure.
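The sketch below illustrates one plausible way to implement the step described above, assuming a torchvision ResNet-18 backbone as the pre-trained convolutional neural network; the channel-replication stand-in for the three-channel (HSV-style) conversion, the function name, and the backbone choice are assumptions, not the disclosed implementation.

```python
import torch
import torchvision

def extract_depth_features(depth):
    """Hypothetical helper: depth is an (H, W) float tensor (single-channel depth map)."""
    d = depth.unsqueeze(0)                                   # (1, H, W)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)           # normalise to [0, 1]
    d3 = d.repeat(3, 1, 1).unsqueeze(0)                      # (1, 3, H, W); stand-in for the
                                                             # three-channel (HSV-style) encoding
    backbone = torchvision.models.resnet18()                 # pre-trained weights assumed in practice
    backbone.fc = torch.nn.Identity()                        # expose the pooled feature vector
    with torch.no_grad():
        return backbone(d3)                                  # (1, 512) depth feature vector
```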
In an example embodiment of the present disclosure, after the image feature vector corresponding to the input image sample and the depth feature vector corresponding to the input image sample are obtained through the above steps, the image feature vector and the depth feature vector may be fused to obtain a multi-modal feature vector corresponding to the input image sample. In particular, the multi-modal feature vector corresponding to the input image sample may be used to indicate both the color of the input image sample and the distance between points in the scene of the input image sample and the camera.
Further, when the image feature vector and the depth feature vector are fused to obtain the multi-modal feature vector corresponding to the input image sample, an attention mechanism may be used for the fusion, where the image feature vector provides semantic and texture information and the depth feature vector provides spatial information. Because the image feature vector and the depth feature vector belong to different modalities, they converge at different speeds during training of the model to be trained; fusing them with an attention mechanism therefore both combines the feature vectors of the two modalities and improves the training speed of the model to be trained.
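A minimal sketch of such attention-based fusion of the two modalities is given below; it uses a standard cross-attention layer with a residual connection, and the module name, feature dimension, and head count are assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of image features (semantics/texture) and depth features (spatial cues)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, depth_feat):
        # img_feat, depth_feat: (B, N, dim) token sequences of the two modalities
        fused, _ = self.attn(query=img_feat, key=depth_feat, value=depth_feat)
        return self.norm(img_feat + fused)          # residual fusion -> multi-modal feature vector
```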
In an example embodiment of the present disclosure, the model to be trained refers to a model built for completing an image generation task, and the image generation model may be obtained by training the model to be trained to complete the image generation task. It should be noted that, the specific structure of the model to be trained is not particularly limited in this disclosure.
The image generation task is to generate a target image from an input image and a predicted camera pose, where the input image and the target image are images captured in the same scene with different camera poses.
It should be noted that the specific manner of fusing the image feature vector and the depth feature vector to obtain the multi-modal feature vector corresponding to the input image sample is not particularly limited in the present disclosure.
In one example embodiment of the present disclosure, feature learning across multiple camera poses may be performed with the multi-modal feature vectors corresponding to a plurality of input image samples. For example, the multi-modal feature vectors corresponding to the plurality of input image samples may be input into a multi-view feature fusion module to perform feature learning across the multiple camera poses.
Step S230, determining the positions of a plurality of pixels corresponding to a predicted target image based on the pose of the target camera, generating virtual rays from the focus of the camera to the positions of the pixels, and sampling a plurality of points on each virtual ray;
in one example embodiment of the present disclosure, after determining the target camera pose corresponding to the target image sample through the above steps, the positions of the plurality of pixels corresponding to the predicted target image may be determined based on the target camera pose. Specifically, after determining the pose of the target camera, the shooting position corresponding to the predicted target image to be generated may be determined, and the positions of the plurality of pixels on the predicted target image may be determined accordingly.
The specific manner of determining the positions of the plurality of pixels corresponding to the prediction target image is not particularly limited in the present disclosure.
In one example embodiment of the present disclosure, after determining the locations of the plurality of pixels corresponding to the prediction target image, a virtual ray may be generated from the camera focus to the locations of the respective pixels, and a plurality of points may be sampled on the respective virtual rays. Wherein, the target camera pose corresponds to the camera focus. Specifically, the target camera pose refers to a camera pose of a virtual camera in a virtual environment, the virtual camera corresponds to a camera focus, virtual rays can be emitted from the camera focus of the virtual camera to a plurality of pixels corresponding to a predicted target image, and a plurality of points are sampled on each virtual image.
For example, if the length (number of pixels) and width (number of pixels) of the prediction target image are H and W, respectively, it is necessary to emit h×w virtual rays from the focal point of the virtual camera to a plurality of pixels corresponding to the prediction target image and sample a plurality of points on each virtual image.
Further, when a plurality of points are sampled on each virtual ray, the plurality of points can be uniformly sampled on the virtual ray; alternatively, multiple points may be non-uniformly sampled on the virtual ray.
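The ray construction and point sampling described above can be sketched as follows, assuming a pinhole camera with intrinsics K and a camera-to-world matrix c2w whose +z axis points along the viewing direction; the near/far bounds, sample count, and uniform (linspace) sampling are placeholder assumptions, and non-uniform sampling could be substituted.

```python
import torch

def generate_rays_and_samples(H, W, K, c2w, near=0.1, far=6.0, n_samples=64):
    """Hypothetical helper: K is a 3x3 intrinsic matrix, c2w a 4x4 camera-to-world
    matrix, both torch tensors; one ray is cast per pixel of the H x W target image."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Direction from the camera focus through each pixel (pinhole model, +z forward).
    dirs = torch.stack([(i - K[0, 2]) / K[0, 0],
                        (j - K[1, 2]) / K[1, 1],
                        torch.ones_like(i)], dim=-1)
    rays_d = dirs @ c2w[:3, :3].T                            # rotate directions into world space
    rays_o = c2w[:3, 3].expand(rays_d.shape)                 # every ray starts at the camera focus
    t_vals = torch.linspace(near, far, n_samples)            # uniform depths along each ray
    points = rays_o[..., None, :] + rays_d[..., None, :] * t_vals[:, None]
    return rays_o, rays_d, t_vals, points                    # points: (H, W, n_samples, 3)
```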
Step S240, determining color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector;
in an example embodiment of the present disclosure, after a virtual ray is generated from the camera focus to the position of each pixel, a plurality of points are sampled on each virtual ray, and the image feature vector is fused with the depth feature vector to obtain the multi-modal feature vector corresponding to the input image sample through the above steps, the color information of the plurality of pixels corresponding to the predicted target image may be determined according to the plurality of points sampled on the virtual rays and the multi-modal feature vector. Specifically, the plurality of points sampled on a virtual ray may be used to indicate three-dimensional information, the multi-modal feature vector may be used to indicate two-dimensional information, and the color information of the plurality of pixels corresponding to the predicted target image may be determined by combining the three-dimensional information and the two-dimensional information.
Specifically, each virtual ray corresponds to one pixel on the predicted target image; the color information of that pixel may be determined according to the plurality of points on the virtual ray and the multi-modal feature vector, and the color information of the plurality of pixels corresponding to the predicted target image may be obtained in the same manner.
Note that the specific manner of determining the color information of the plurality of pixels corresponding to the predicted target image according to the plurality of points sampled on the virtual rays and the multi-modal feature vector is not particularly limited in the present disclosure.
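As an illustrative sketch only, the combination of three-dimensional point information and the two-dimensional multi-modal feature can be realized with an MLP that takes a positionally encoded point together with the corresponding pixel feature vector and outputs per-point color and density; the layer sizes and the 63-dimensional frequency encoding of xyz are assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn

def positional_encoding(xyz, n_freqs=10):
    """Frequency encoding of 3D coordinates: 3 + 3*2*n_freqs = 63 dims for n_freqs=10."""
    feats = [xyz]
    for k in range(n_freqs):
        feats += [torch.sin((2.0 ** k) * xyz), torch.cos((2.0 ** k) * xyz)]
    return torch.cat(feats, dim=-1)

class ConditionedRadianceMLP(nn.Module):
    """Maps an encoded point plus the pixel's multi-modal feature to colour and density."""
    def __init__(self, pos_dim=63, feat_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pos_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)                              # density of the point
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())   # colour of the point

    def forward(self, xyz, pixel_feat):
        h = self.net(torch.cat([positional_encoding(xyz), pixel_feat], dim=-1))
        return self.rgb_head(h), torch.relu(self.sigma_head(h))
```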
Step S250, rendering a plurality of pixels according to color information corresponding to the pixels to obtain a prediction target image;
in an example embodiment of the present disclosure, after the color information of the plurality of pixels of the predicted target image is obtained through the above steps, the plurality of pixels may be rendered according to the color information corresponding to the plurality of pixels to obtain the predicted target image. Specifically, the color information corresponding to each pixel may be used to render that pixel, and after the rendering of the plurality of pixels is completed, the predicted target image is obtained, that is, the image obtained based on the target camera pose.
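The per-pixel rendering step can be illustrated with the standard volume-rendering (alpha-compositing) formulation below, which accumulates per-point colors weighted by density along each ray; this is a generic formulation, not necessarily the exact rendering used by the disclosure.

```python
import torch

def volume_render(rgb, sigma, t_vals):
    """rgb: (..., n_samples, 3), sigma: (..., n_samples), t_vals: (n_samples,) sample depths."""
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:1], 1e10)])      # last interval ~ infinity
    alpha = 1.0 - torch.exp(-sigma * deltas)                             # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans                                              # contribution of each point
    return (weights[..., None] * rgb).sum(dim=-2)                        # (..., 3) pixel colour
```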
And step S260, updating the neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain an image generation model.
In an example embodiment of the present disclosure, after the predicted target image is obtained through the above steps, the neural network parameters of the model to be trained may be updated according to the target image sample and the predicted target image to obtain the image generation model. Specifically, the image generation model can be used for generating a target image from an input image and a target camera pose, where the target image refers to the image that would be captured with the target camera pose.
In one example embodiment of the present disclosure, the model to be trained may include a plurality of hidden layers, and the hidden layers may include convolution layers, normalization layers, excitation layers, and the like. The feature vectors corresponding to the input image sample are sequentially input into the plurality of hidden layers of the model to be trained to obtain hidden layer calculation results, and the predicted target image is obtained from the hidden layer calculation results.
In an example embodiment of the present disclosure, after the predicted target image is obtained through the above steps, the neural network parameters of the model to be trained may be updated according to the target image sample and the predicted target image to obtain the image generation model. In particular, the predicted target image indicates the predicted image generated based on the target camera pose. The predicted target image is a predicted value; the corresponding true value, that is, the target image sample (label), may be acquired, which indicates the real image captured with the target camera pose. The predicted target image (predicted value) and the target image sample (true value) may then be compared to obtain a prediction difference between them, and the neural network parameters of the model to be trained may be updated according to the prediction difference, so as to obtain the image generation model.
Specifically, the neural network parameters of the model to be trained may include a model layer number, a feature vector channel number, a learning rate, and the like, and when the neural network parameters of the model to be trained are updated according to the prediction difference, the model layer number, the feature vector channel number, and the learning rate of the model to be trained may be updated to train the image generation model.
In one example embodiment of the present disclosure, the neural network parameters of the model to be trained may be updated by a back propagation algorithm, and after training is completed, an image generation model is obtained.
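A minimal training-step sketch is shown below; the model(input_image, target_pose) interface, the mean-squared-error loss, and the optimiser choice are assumptions used only to illustrate comparing the predicted target image with the target image sample and back-propagating the difference.

```python
import torch

def train_step(model, optimizer, input_image, target_pose, target_image):
    """Hypothetical training step for the model to be trained."""
    pred = model(input_image, target_pose)                      # predicted target image
    loss = torch.nn.functional.mse_loss(pred, target_image)     # prediction difference
    optimizer.zero_grad()
    loss.backward()                                             # back-propagate the difference
    optimizer.step()                                            # update neural network parameters
    return loss.item()
```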
It should be noted that, the specific manner of updating the neural network parameters of the model to be trained according to the target image sample and the predicted target image is not particularly limited in the present disclosure.
In an example embodiment of the present disclosure, the neural network parameters of the model to be trained may be updated according to the target image sample and the predicted target image, and the model to be trained is determined to be the image generation model when the model to be trained satisfies the convergence condition. Specifically, the fact that the model to be trained meets the convergence condition means that the model to be trained is high in prediction accuracy and can be applied. For example, the convergence condition may include a number of exercises, such as ending the exercises after the model to be trained has been trained N times; for another example, the convergence condition may include a training period, such as ending training after the model to be trained has been trained for a period of T.
It should be noted that, the specific content of the convergence condition is not particularly limited, and the training process of the model to be trained can be better controlled by applying the convergence condition to the model, so that the problem of excessive training of the neural network is avoided, and the training efficiency of the model to be trained is improved.
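The following sketch illustrates how the two example convergence conditions above (a fixed number of training iterations N, or a training time budget T) could control a training loop; the threshold values and the step function passed in are hypothetical.

```python
import time

def train_until_converged(model, optimizer, batches, step_fn,
                          max_steps=100_000, max_seconds=None):
    # step_fn performs one parameter update, e.g. the train_step sketch above.
    start = time.time()
    for step, batch in enumerate(batches, start=1):
        step_fn(model, optimizer, *batch)
        if step >= max_steps:
            break                          # convergence condition: trained N times
        if max_seconds is not None and time.time() - start >= max_seconds:
            break                          # convergence condition: training period T elapsed
    return model
```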
In an example embodiment of the present disclosure, when the model to be trained is trained, the number of training scenes in the training set may be increased to promote the generalization of the trained image generation model; a plurality of input image samples may also be used, each corresponding to its own input camera pose. Increasing the number of input image samples facilitates the reconstruction of a more realistic three-dimensional scene and provides more prior knowledge, thereby improving the accuracy of the trained image generation model.
In an example embodiment of the present disclosure, depth estimation may be performed on an input image sample to obtain a sparse depth feature vector corresponding to the input image sample, and the depth feature vector corresponding to the input image sample is determined according to the sparse depth feature vector; the depth feature vector corresponding to the input image sample is a dense depth feature vector. Referring to fig. 3, determining a depth feature vector corresponding to an input image sample according to the sparse depth feature vector may include the following steps S310 to S320:
Step S310, performing depth estimation on an input image sample to obtain a sparse depth feature vector corresponding to the input image sample;
in an example embodiment of the present disclosure, after the input image samples are input into the model to be trained through the above steps, depth estimation may be performed on the input image samples to obtain sparse depth feature vectors corresponding to the input image samples. Specifically, depth estimation refers to estimating, from the input image sample, the distance of each pixel from the shooting source; it indicates the distance from the viewpoint (the camera that captured the image) to the objects in the input image sample.
A sparse depth feature vector corresponding to an input image sample is one in which most of the stored elements are equal to zero.
Further, sparse depth feature vectors corresponding to input image samples may be used to indicate depth values of portions of pixels in the input image samples.
For example, depth estimation may be performed by a depth prediction method with a full convolution residual network, so as to obtain sparse depth feature vectors corresponding to input image samples.
In an example embodiment of the present disclosure, when performing depth estimation on an input image sample to obtain a sparse depth feature vector corresponding to the input image sample, a structure-from-motion (SFM) algorithm may be used to perform sparse depth estimation to obtain the sparse depth feature vector corresponding to the input image sample.
It should be noted that the specific manner of depth estimation is not particularly limited in this disclosure.
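As an illustration of how an SFM-style sparse reconstruction could be turned into a sparse depth map for one input image sample, the following sketch projects recovered 3D points into the image plane and writes their depths; the inputs (world-space points, rotation R, translation t, intrinsics K) are assumptions.

```python
import numpy as np

def sparse_depth_map(points_world, R, t, K, height, width):
    # Most pixels remain zero, matching the sparse depth described above.
    depth = np.zeros((height, width), dtype=np.float32)
    for p in points_world:
        p_cam = R @ p + t
        if p_cam[2] <= 0:
            continue                                   # point lies behind the camera
        u = int(round(K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]))
        v = int(round(K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]))
        if 0 <= u < width and 0 <= v < height:
            depth[v, u] = p_cam[2]                     # depth value only at this pixel
    return depth
```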
Step S320, determining depth feature vectors corresponding to the input image samples according to the sparse depth feature vectors;
in an example embodiment of the present disclosure, after the sparse depth feature vector corresponding to the input image sample is obtained through the above steps, the depth feature vector corresponding to the input image sample may be determined according to the sparse depth feature vector. The depth feature vector corresponding to the input image sample is a dense depth feature vector. Specifically, the sparse depth feature vector may be depth-complemented to obtain a depth feature vector corresponding to the input image sample. For example, dense depth estimation may be performed on sparse depth feature vectors to obtain depth feature vectors corresponding to input image samples. In the depth feature vector (dense depth feature vector), a small fraction of the elements of the stored vector are equal to zero.
Further, the depth feature vector corresponding to the input image sample may be used to indicate depth values of a plurality of pixels in the input image sample.
Note that, the specific manner of determining the depth feature vector corresponding to the input image sample according to the sparse depth feature vector is not particularly limited in the present disclosure.
In an example embodiment of the present disclosure, sparse depth feature vectors corresponding to input image samples may be input into a pre-trained depth completion network to obtain depth feature vectors corresponding to the input image samples.
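A minimal sketch of this completion step is given below; `completion_net` is a placeholder for whatever pre-trained depth completion network is used, and the two-channel (depth plus validity mask) input convention is an assumption.

```python
import torch

def densify_depth(sparse_depth: torch.Tensor, completion_net: torch.nn.Module) -> torch.Tensor:
    # sparse_depth: (1, 1, H, W), zero wherever no depth value was recovered.
    validity_mask = (sparse_depth > 0).float()          # marks pixels with a known depth
    dense_depth = completion_net(torch.cat([sparse_depth, validity_mask], dim=1))
    return dense_depth                                   # (1, 1, H, W), few zero elements
```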
Through the steps S310 to S320, the depth estimation may be performed on the input image sample to obtain a sparse depth feature vector corresponding to the input image sample, and the depth feature vector corresponding to the input image sample is determined according to the sparse depth feature vector. According to the embodiment of the disclosure, the sparse depth feature vector of the input image sample can be converted into the dense depth feature vector, so that the referenceability of the depth feature vector in model training can be improved, the generalization of the model is improved, and the accuracy of the generated target image is further improved.
In an example embodiment of the present disclosure, a pixel coordinate value of a pixel may be determined, a pixel feature vector corresponding to the pixel may be determined according to a multi-mode feature vector, a plurality of spatial coordinate encoding vectors may be obtained by performing coordinate encoding on spatial coordinate values of a plurality of points obtained by up-sampling a virtual ray, color information of the plurality of points and density information of the plurality of points may be determined according to the pixel feature vector corresponding to the pixel and the plurality of spatial coordinate encoding vectors, and color information of the plurality of pixels may be determined according to the color information of the plurality of points and the density information of the plurality of points. Referring to fig. 4, determining color information of a plurality of pixels according to color information of a plurality of points and density information of a plurality of points may include the following steps S410 to S440:
Step S410, determining pixel coordinate values of pixels, and determining pixel feature vectors corresponding to the pixels according to the multi-modal feature vectors;
in one example embodiment of the present disclosure, after the multi-modal feature vector corresponding to the input image sample is obtained through the above steps and the plurality of points are up-sampled on each virtual ray, the pixel coordinate values of the pixels may be determined. Specifically, in the above-described step, the position of each pixel on the prediction target image has been determined, that is, the pixel coordinate value of each pixel can be determined, and the plurality of pixels can be distinguished by the pixel coordinate value of each pixel.
In one example embodiment of the present disclosure, after determining the pixel coordinate values of each pixel, a pixel feature vector corresponding to the pixel may be determined according to the multi-modal feature vector. Specifically, the multi-modal feature vector may be used to indicate a color of the input image sample and a distance between a point in the scene in the input image sample and the camera, and after determining the pixel coordinate values of each pixel, the multi-modal feature vector of the input image sample and the pixel coordinate values of the pixels may be combined to determine a pixel feature vector corresponding to each pixel.
In an example embodiment of the present disclosure, after obtaining the pixel coordinate values of the pixels and the multi-modal feature vectors, the pixel feature vectors corresponding to the pixels may be obtained by bilinear interpolation.
It should be noted that, the specific manner of determining the pixel feature vector corresponding to the pixel according to the multi-modal feature vector is not particularly limited in this disclosure.
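A minimal sketch of this bilinear sampling step is shown below, assuming the multi-modal feature vector is laid out as a (1, C, H, W) feature map; the normalisation to [-1, 1] follows the convention of torch.nn.functional.grid_sample.

```python
import torch
import torch.nn.functional as F

def sample_pixel_features(feature_map: torch.Tensor, pixel_xy: torch.Tensor) -> torch.Tensor:
    # feature_map: (1, C, H, W) multi-modal feature map of one input image sample
    # pixel_xy:    (N, 2) pixel coordinate values (x, y), possibly fractional
    _, C, H, W = feature_map.shape
    N = pixel_xy.shape[0]
    grid = pixel_xy.float().clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0       # x normalised to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0       # y normalised to [-1, 1]
    grid = grid.view(1, N, 1, 2)                         # grid_sample expects (1, H_out, W_out, 2)
    sampled = F.grid_sample(feature_map, grid, mode='bilinear', align_corners=True)
    return sampled.squeeze(-1).squeeze(0).t()            # (N, C) pixel feature vectors
```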
Step S420, coordinate encoding is carried out on the space coordinate values of a plurality of points obtained by up-sampling the virtual rays to obtain a plurality of space coordinate encoding vectors;
in an example embodiment of the present disclosure, after the plurality of points are upsampled on the virtual ray through the above steps, the spatial coordinate values of the plurality of points upsampled on the virtual ray may be coordinate-encoded to obtain a plurality of spatial coordinate encoded vectors. Specifically, after coordinate encoding is performed on the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray, the spatial positions of the plurality of points can be indicated by a plurality of spatial coordinate encoding vectors, and the spatial positions of the plurality of points can reflect the high-dimensional information of the pixels corresponding to the plurality of points.
It should be noted that, the specific manner of coordinate encoding the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray to obtain the plurality of spatial coordinate encoded vectors is not particularly limited in the present disclosure.
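One plausible form of this coordinate encoding is the sinusoidal frequency encoding commonly used with neural radiance fields, sketched below; the number of frequency bands is a hypothetical choice and the disclosure does not fix the exact scheme.

```python
import torch

def encode_coordinates(points: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    # points: (N, 3) spatial coordinate values of points sampled on a virtual ray
    freqs = 2.0 ** torch.arange(num_freqs, dtype=points.dtype)        # 1, 2, 4, ...
    scaled = points.unsqueeze(-1) * freqs                              # (N, 3, num_freqs)
    encoded = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return encoded.reshape(points.shape[0], -1)                        # (N, 6 * num_freqs)
```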
Step S430, determining color information of a plurality of points and density information of the plurality of points according to the pixel feature vectors and the plurality of space coordinate coding vectors corresponding to the pixels;
In an example embodiment of the present disclosure, after the pixel feature vector and the plurality of spatial coordinate encoding vectors corresponding to the pixel are obtained through the above steps, color information of the plurality of points and density information of the plurality of points may be determined according to the pixel feature vector and the plurality of spatial coordinate encoding vectors corresponding to the pixel. Specifically, the pixel feature vector corresponding to the pixel may be used to indicate pixel information (such as a position and the like) of the pixel, and the spatial coordinate encoding vector may be used to indicate spatial positions of a plurality of points corresponding to the pixel, where color information of each point on the virtual ray and density information of each point may be determined by the pixel feature vector corresponding to the pixel and the plurality of spatial coordinate encoding vectors corresponding to the virtual ray. Wherein color information of the points can be used to indicate the color of the points, and density information of the points can be used to indicate the distance between each point and a pixel on the virtual ray.
Note that, the specific manner of determining the color information of the plurality of points and the density information of the plurality of points according to the pixel feature vector and the plurality of space coordinate encoding vectors corresponding to the pixels is not particularly limited in the present disclosure.
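For illustration, the sketch below shows one way such a mapping could be realised: a small MLP that takes the pixel feature vector concatenated with each spatial coordinate encoding vector and outputs per-point colour and density; the layer sizes and activations are assumptions.

```python
import torch
import torch.nn as nn

class PointColorDensityHead(nn.Module):
    def __init__(self, feat_dim: int, enc_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # 3 colour channels + 1 density
        )

    def forward(self, pixel_feature: torch.Tensor, coord_encoding: torch.Tensor):
        # pixel_feature: (P, feat_dim) repeated per sampled point; coord_encoding: (P, enc_dim)
        out = self.mlp(torch.cat([pixel_feature, coord_encoding], dim=-1))
        color = torch.sigmoid(out[:, :3])              # color information of the points
        density = torch.relu(out[:, 3])                # density information of the points
        return color, density
```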
In step S440, the color information of the plurality of pixels is determined according to the color information of the plurality of points and the density information of the plurality of points.
In one example embodiment of the present disclosure, after the color information of the plurality of points and the density information of the plurality of points are obtained through the above steps, the color information of the plurality of pixels may be determined according to the color information of the plurality of points and the density information of the plurality of points. Specifically, the color information of the pixel corresponding to the virtual ray on which the points lie is determined in this way, and the color information of the pixels corresponding to the other virtual rays can be obtained in the same way; that is, the color information of a plurality of pixels is determined according to the color information of a plurality of points and the density information of a plurality of points. In particular, each virtual ray corresponds to one pixel on the prediction target image, and the multiple points on the virtual ray may be mapped onto that pixel so that its color information can be determined.
The specific manner of determining the color information of the plurality of pixels according to the color information of the plurality of points and the density information of the plurality of points is not particularly limited in the present disclosure.
Through the above steps S410 to S440, the pixel coordinate values of the pixels may be determined, the pixel feature vectors corresponding to the pixels may be determined according to the multi-mode feature vectors, the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray may be coordinate-encoded to obtain a plurality of spatial coordinate encoded vectors, the color information of the plurality of points and the density information of the plurality of points may be determined according to the pixel feature vectors corresponding to the pixels and the plurality of spatial coordinate encoded vectors, and the color information of the plurality of pixels may be determined according to the color information of the plurality of points and the density information of the plurality of points. According to the embodiment of the disclosure, the color information of the pixel corresponding to the virtual ray can be determined through the color information and the density information of the plurality of points on the virtual ray, and the accuracy of the generated target image can be improved through introducing the three-dimensional information through the plurality of points on the virtual ray.
In an example embodiment of the present disclosure, a plurality of candidate points may be obtained by performing volume rendering along a ray direction according to color information of a plurality of points and density information of the plurality of points, and color information of a plurality of pixels may be determined according to the plurality of candidate points on a virtual ray. Referring to fig. 5, determining color information of a plurality of pixels according to a plurality of candidate points on a virtual ray may include the following steps S510 to S520:
step S510, performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points to obtain a plurality of candidate points;
in an example embodiment of the present disclosure, after color information of a plurality of points corresponding to each virtual ray and density information of a plurality of points corresponding to each virtual ray are obtained through the above steps, volume rendering may be performed along a ray direction according to the color information of a plurality of points and the density information of a plurality of points to obtain a plurality of candidate points. Specifically, the ray direction may be a direction from the focal point to the position of the pixel, or the ray direction may be a direction from the position of the pixel to the focal point, and multiple points on the virtual ray may be sequentially volume-rendered along the ray direction to obtain multiple candidate points, where the candidate points may be used to indicate a state that rendering of multiple points on the virtual ray corresponding to the pixel is completed.
For example, there are n points on each virtual ray, n calculations are required to determine n candidate points.
In step S520, color information of a plurality of pixels is determined according to a plurality of candidate points on the virtual ray.
In one example embodiment of the present disclosure, after rendering the plurality of points on each virtual ray by the above steps to obtain candidate points, color information of a plurality of pixels may be determined according to the plurality of candidate points on the virtual ray. Specifically, a plurality of candidate points on the virtual ray may be superimposed, and color information of a pixel corresponding to the virtual ray may be obtained according to a pixel feature vector corresponding to the pixel, where the color information of the pixel may be used to indicate a color to be represented by the pixel.
For example, if the size (number of pixels) of the prediction target image is (H, W), the color information of one pixel may be determined at a time according to the plurality of candidate points on the corresponding virtual ray; performing this computation H×W times determines the color information of the plurality of pixels.
Specifically, the color information of a pixel can be determined by the following expression, where C(r) is the color information of the pixel, r is a virtual ray, d is the ray direction, c is the color information of a point on the virtual ray, σ is the density information of a point on the virtual ray, tn is the nearest point along the ray direction of the virtual ray, tf is the farthest point along the ray direction of the virtual ray, and T(t) is the cumulative transmittance up to a point t along the ray direction of the virtual ray; in terms of these quantities the expression is the volume-rendering integral:

C(r) = \int_{t_n}^{t_f} T(t)\, \sigma(r(t))\, c(r(t), d)\, dt, \quad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\, ds\right)
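A numerical sketch of this volume rendering along one virtual ray is given below: the per-point colours and densities are alpha-composited in the ray direction to yield the colour information of the corresponding pixel; the handling of the last sample spacing is an assumption.

```python
import torch

def render_pixel_color(colors: torch.Tensor, densities: torch.Tensor, t_values: torch.Tensor):
    # colors: (n, 3), densities: (n,), t_values: (n,) sample positions from tn to tf
    deltas = torch.cat([t_values[1:] - t_values[:-1],
                        torch.full((1,), 1e10, dtype=t_values.dtype)])
    alphas = 1.0 - torch.exp(-densities * deltas)                      # per-point opacity
    ones = torch.ones(1, dtype=alphas.dtype)
    transmittance = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10]), dim=0)[:-1]
    weights = transmittance * alphas                                    # T(t) * alpha per point
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                  # (3,) pixel colour
```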
Through the steps S510 to S520, a plurality of candidate points may be obtained by performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points, and the color information of the plurality of pixels may be determined according to the plurality of candidate points on the virtual ray. According to the embodiment of the invention, the multiple points on the virtual ray can be subjected to volume rendering, the color information of the pixel is determined according to the multiple candidate points after the volume rendering, and the color information of the pixel to be finally displayed can be determined through the display states of the multiple candidate points, so that the accuracy of the generated target image is improved.
In one example embodiment of the present disclosure, spatial coordinate values of a plurality of points obtained by up-sampling a virtual ray may be converted from a world coordinate system to a camera coordinate system to obtain first candidate coordinate values of the plurality of points, the first candidate coordinate values of the plurality of points may be converted from the camera coordinate system to an image coordinate system to obtain second candidate coordinate values of pixels, and the second candidate coordinate values of the pixels may be converted from the image coordinate system to the pixel coordinate system to obtain pixel coordinate values of the pixels. Referring to fig. 6, converting the second candidate coordinate value of the pixel from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel may include the following steps S610 to S630:
Step S610, converting the spatial coordinate values of the plurality of points obtained by up-sampling the virtual rays from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points;
step S620, converting the first candidate coordinate values of the points from the camera coordinate system to the image coordinate system to obtain second candidate coordinate values of the pixels;
in step S630, the second candidate coordinate value of the pixel is converted from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel.
In an example embodiment of the present disclosure, after the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray are obtained through the above steps, the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray may be converted from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points, then the first candidate coordinate values of the plurality of points are converted from the camera coordinate system to the image coordinate system to obtain second candidate coordinate values of the pixels, and finally the second candidate coordinate values of the pixels are converted from the image coordinate system to the pixel coordinate system to obtain pixel coordinate values of the pixels. Specifically, the world coordinate system refers to a three-dimensional coordinate system having a photographing object as an origin, the camera coordinate system refers to a three-dimensional coordinate system having a camera as an origin at the time of image photographing, the image coordinate system refers to a two-dimensional coordinate system having a center point of an image as an origin, and the pixel coordinate system refers to a two-dimensional coordinate system having a certain specific pixel (for example, an upper left corner pixel) in the image as an origin.
In particular, points on the virtual ray may be mapped into one pixel within a plane. Based on this, the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray are converted from the world coordinate system to the pixel coordinate system, and the pixel coordinate values of the pixel are determined.
Specifically, the spatial coordinate values of a plurality of points may be converted from the world coordinate system to the pixel coordinate system by the pose of the target camera, and the pixel coordinate values of the pixel are determined.
Through the steps S610 to S630, the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray may be converted from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points, the first candidate coordinate values of the plurality of points may be converted from the camera coordinate system to the image coordinate system to obtain second candidate coordinate values of the pixels, and the second candidate coordinate values of the pixels may be converted from the image coordinate system to the pixel coordinate system to obtain pixel coordinate values of the pixels.
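For illustration, the chain of coordinate transformations can be sketched as below for a single point, assuming a pinhole camera with extrinsics R, t and intrinsics K; these inputs are assumptions and the actual conversion used may differ.

```python
import numpy as np

def world_to_pixel(p_world: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray) -> np.ndarray:
    # World coordinate system -> camera coordinate system (first candidate coordinate value).
    p_cam = R @ p_world + t
    # Camera coordinate system -> image coordinate system (second candidate coordinate value):
    # perspective division by the depth along the optical axis.
    x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]
    # Image coordinate system -> pixel coordinate system: apply focal lengths and principal point.
    u = K[0, 0] * x + K[0, 2]
    v = K[1, 1] * y + K[1, 2]
    return np.array([u, v])                              # pixel coordinate value of the pixel
```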
In an exemplary embodiment of the present disclosure, as shown in fig. 7, a flowchart of fusing an image feature vector and a depth feature vector to obtain a multi-modal feature vector corresponding to an input image sample is shown. Specifically, S710, an image feature vector corresponding to an input image sample is obtained; s720, performing depth estimation on the input image sample to obtain a sparse depth feature vector corresponding to the input image sample; s730, determining a depth feature vector corresponding to the input image sample according to the sparse depth feature vector; s740, fusing the image feature vector and the depth feature vector to obtain a multi-mode feature vector corresponding to the input image sample; s750, fusing the multi-mode feature vectors.
In one example embodiment of the present disclosure, as shown in FIG. 8, there is a flow chart of sampling multiple points on each virtual ray. Specifically, S810, an input image sample and a target image sample are obtained; s820, acquiring a target camera pose corresponding to a target image sample through an SFM algorithm; s830, generating virtual rays from the focus of the camera to the positions of the pixels; s840, sampling a plurality of points on each virtual ray.
In an exemplary embodiment of the present disclosure, as shown in fig. 9, a flowchart is provided for determining color information of a plurality of points and density information of the plurality of points according to a pixel feature vector corresponding to a pixel and a plurality of spatial coordinate coding vectors. Specifically, S910 converts the spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray from the world coordinate system to the pixel coordinate system; s920, determining a pixel characteristic vector corresponding to the pixel through bilinear interpolation; s930, performing coordinate coding on the space coordinate values of a plurality of points obtained by up-sampling the virtual rays to obtain a plurality of space coordinate coding vectors; s940, determining color information of a plurality of points and density information of the plurality of points according to the pixel characteristic vector corresponding to the pixel and the plurality of space coordinate coding vectors.
In an exemplary embodiment of the present disclosure, as shown in fig. 10, a flowchart of rendering a plurality of pixels according to color information corresponding to the plurality of pixels to obtain a prediction target image is shown. Specifically, S1010, color information of a plurality of points and density information of a plurality of points are obtained; s1020, performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points to obtain a plurality of candidate points, and determining the color information of a plurality of pixels according to the plurality of candidate points on the virtual ray; and S1030, rendering the pixels according to the color information corresponding to the pixels to obtain a prediction target image.
In one example embodiment of the present disclosure, as shown in fig. 11, a schematic diagram of an image generation system including an input preprocessing module, a depth extraction module, and a multimodal fusion module is provided.
Specifically, the input preprocessing module comprises a camera pose estimation unit, a three-dimensional reconstruction unit, a three-dimensional virtual ray generation unit and a three-dimensional space point sampling unit. Wherein: the camera pose estimation unit is used for acquiring the pose of the target camera corresponding to the target image sample; the three-dimensional reconstruction unit is used for acquiring the pose of the target camera corresponding to the target image sample through the SFM algorithm; the three-dimensional virtual ray generation unit is used for generating a virtual ray from the camera focus to the position of each pixel; and the three-dimensional space point sampling unit is used for sampling a plurality of points on each virtual ray.
Specifically, the depth extraction module comprises a sparse depth extraction unit, a depth feature extraction unit and a depth completion unit. Wherein: the sparse depth extraction unit is used for performing depth estimation on the input image sample to obtain a sparse depth map corresponding to the input image sample; the depth feature extraction unit is used for extracting the sparse depth feature vector corresponding to the input image sample; and the depth completion unit is used for determining the depth feature vector corresponding to the input image sample according to the sparse depth feature vector.
Specifically, the multi-mode fusion module comprises an image feature extraction unit, a multi-modal attention unit, a multi-view feature fusion unit and a target view query unit. Wherein: the image feature extraction unit is used for obtaining the image feature vector corresponding to the input image sample; the multi-modal attention unit is used for fusing the image feature vector and the depth feature vector to obtain the multi-modal feature vector corresponding to the input image sample; the multi-view feature fusion unit is used for performing feature learning across multiple camera poses through the multi-modal feature vectors corresponding to a plurality of input image samples, and determining the pixel feature vector corresponding to a pixel according to the multi-modal feature vectors; and the target view query unit is used for querying the features corresponding to the target camera pose.
The image generation system further includes: the nerve radiation field unit is used for determining color information of a plurality of points and density information of the plurality of points according to the pixel characteristic vector corresponding to the pixel and the plurality of space coordinate coding vectors; and the three-dimensional volume rendering unit is used for performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points to obtain a plurality of candidate points.
In one example embodiment of the present disclosure, an input image and a predicted camera pose may be acquired, and the input image may be input into the image generation model to obtain a target image; the image generation model is obtained through the image generation model training method described above. Referring to fig. 12, inputting the input image into the image generation model to obtain the target image may include the following steps S1210 to S1220:
step S1210, obtaining an input image and predicting the pose of a camera;
in step S1220, the input image is input into the image generation model to obtain the target image.
In one example embodiment of the present disclosure, an input image and a predicted camera pose may be acquired. The predicted camera pose is the camera pose corresponding to the target image, and the input image and the target image are images shot by adopting different camera poses in the same scene. Specifically, the input image and the predicted camera pose may be input into the image generation model to generate the target image, and the camera pose corresponding to the target image is the predicted camera pose.
For example, if the shooting object is a refrigerator, the input image is an image captured directly in front of the refrigerator, and the predicted camera pose is to the left side of the refrigerator, the input image and the predicted camera pose can be input into the image generation model to obtain a target image, which is an image captured from the left side of the refrigerator.
For example, when determining the predicted camera pose, the predicted camera pose may be generated by means of virtual camera rotation, adjusting the virtual camera depth of field, virtual camera screw control, virtual camera left-right movement, etc.
It should be noted that the present disclosure is not limited in particular to the specific type of predicting the pose of the camera.
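A hypothetical inference call is sketched below: a single input image and a user-chosen predicted camera pose (for example obtained by rotating or moving a virtual camera) are fed to the trained image generation model; the tensor shapes and pose representation are assumptions.

```python
import torch

@torch.no_grad()
def generate_target_image(model, input_image: torch.Tensor, predicted_pose: torch.Tensor):
    # input_image: (1, 3, H, W); predicted_pose: 4x4 camera-to-world matrix of the target view
    model.eval()
    return model(input_image, predicted_pose)            # predicted target image, (1, 3, H, W)
```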
Through the above steps S1210 to S1220, the input image and the predicted camera pose may be acquired, and the input image may be input into the image generation model to obtain the target image.
In the image generation model training method provided by the embodiment of the disclosure, an input image sample and a target image sample can be acquired, the input image sample is input into a model to be trained, an image feature vector corresponding to the input image sample is acquired, a depth feature vector corresponding to the input image sample is acquired, the image feature vector and the depth feature vector are fused to obtain a multi-mode feature vector corresponding to the input image sample, the positions of a plurality of pixels corresponding to a predicted target image are determined based on the pose of a target camera, virtual rays are generated from the focus of the camera to the positions of the pixels, a plurality of points are sampled on each virtual ray, color information of a plurality of pixels corresponding to the predicted target image is determined according to the plurality of points obtained by the sampling on the virtual rays and the multi-mode feature vector, the pixels are rendered according to the color information corresponding to the plurality of pixels, so as to obtain the predicted target image, and neural network parameters of the model to be trained are updated according to the target image sample and the predicted target image, so as to obtain the image generation model. According to the embodiment of the disclosure, on one hand, in the image generation model, the multi-mode feature vector obtained by fusing the image feature vector and the depth feature vector can carry more information, so that the reconstructed three-dimensional scene has viewing-angle consistency; on the other hand, virtual rays reaching each pixel on the target image can be generated, and color information of a plurality of pixels of the target image is determined according to a plurality of points obtained by up-sampling the virtual rays, so that the generation of the target image is realized, and meanwhile, the model is trained by adopting both two-dimensional information and three-dimensional information, so that the accuracy of the generated target image can be improved, and defects of the target image such as blurring and artifacts are avoided.
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
In addition, in an exemplary embodiment of the present disclosure, an image generation model training apparatus is also provided. Referring to fig. 13, an image generation model training apparatus 1300 includes: a sample data acquisition unit 1310, a feature vector acquisition unit 1320, a virtual ray generation unit 1330, a color information acquisition unit 1340, a predicted image rendering unit 1350, and a network parameter update unit 1360.
Wherein the sample data acquisition unit is configured to perform acquisition of an input image sample and a target image sample; the input image sample and the target image sample are images obtained by shooting different camera poses in the same scene, and the target image sample corresponds to the target camera pose; the feature vector acquisition unit is configured to input an input image sample into a model to be trained, acquire an image feature vector corresponding to the input image sample, acquire a depth feature vector corresponding to the input image sample, and fuse the image feature vector and the depth feature vector to obtain a multi-mode feature vector corresponding to the input image sample; a virtual ray generation unit configured to perform determination of positions of a plurality of pixels corresponding to a predicted target image based on a pose of a target camera, generate virtual rays from a camera focus to positions of the pixels, and sample a plurality of points on each virtual ray; wherein, the pose of the target camera corresponds to a camera focus; a color information acquisition unit configured to perform determination of color information of a plurality of pixels corresponding to the prediction target image from a plurality of points obtained by up-sampling the virtual ray and the multi-modal feature vector; a predicted image rendering unit configured to perform rendering of a plurality of pixels according to color information corresponding to the plurality of pixels to obtain a predicted target image; and the network parameter updating unit is configured to update the neural network parameters of the model to be trained according to the target image sample and the predicted target image so as to obtain an image generation model.
Optionally, based on the foregoing solution, the apparatus further includes: the sparse depth feature vector acquisition unit is configured to perform depth estimation on the input image samples to obtain sparse depth feature vectors corresponding to the input image samples; a depth feature vector acquisition unit configured to perform determination of a depth feature vector corresponding to an input image sample from the sparse depth feature vector; the depth feature vector corresponding to the input image sample is a dense depth feature vector.
Optionally, based on the foregoing solution, the apparatus further includes: a pixel feature vector acquisition unit configured to perform determination of pixel coordinate values of pixels, and determine pixel feature vectors corresponding to the pixels according to the multi-modal feature vectors; a space coordinate code vector acquisition unit configured to perform coordinate coding on space coordinate values of a plurality of points obtained by up-sampling the virtual ray to obtain a plurality of space coordinate code vectors; a space coordinate code vector determination unit configured to perform determination of color information of a plurality of points and density information of the plurality of points from the pixel feature vector corresponding to the pixel and the plurality of space coordinate code vectors; and a first color information determining unit configured to perform determination of color information of the plurality of pixels based on the color information of the plurality of dots and the density information of the plurality of dots.
Optionally, based on the foregoing solution, the virtual ray has a ray direction, and the color information of the plurality of pixels is determined according to the color information of the plurality of points and the density information of the plurality of points, and the apparatus further includes: a candidate point determination unit configured to perform volume rendering in a ray direction based on color information of the plurality of points and density information of the plurality of points to obtain a plurality of candidate points; and a second color information determining unit configured to perform determination of color information of a plurality of pixels from a plurality of candidate points on the virtual ray.
Optionally, based on the foregoing solution, the plurality of points obtained by up-sampling the virtual ray have spatial coordinate values, and the apparatus further includes: a first candidate coordinate value determining unit configured to perform conversion of spatial coordinate values of the plurality of points obtained by up-sampling the virtual ray from the world coordinate system to the camera coordinate system to obtain first candidate coordinate values of the plurality of points; a second candidate coordinate value determining unit configured to perform conversion of first candidate coordinate values of the plurality of points from the camera coordinate system to the image coordinate system to obtain second candidate coordinate values of the pixels; and a pixel coordinate value determining unit configured to perform conversion of the second candidate coordinate value of the pixel from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel.
Since each functional module of the image generation model training apparatus of the exemplary embodiment of the present disclosure corresponds to a step of the foregoing image generation model training method exemplary embodiment, for details not disclosed in the embodiment of the apparatus of the present disclosure, please refer to the foregoing image generation model training method embodiment of the present disclosure.
In addition, in an exemplary embodiment of the present disclosure, an image generating apparatus is also provided. Referring to fig. 14, an image generating apparatus 1400 includes: an input image acquisition unit 1410 and an image generation unit 1420.
Wherein the input image acquisition unit is configured to perform acquisition of an input image and a predicted camera pose; the predicted camera pose is the camera pose corresponding to a target image, and the input image and the target image are images shot by adopting different camera poses in the same scene; the image generation unit is configured to perform inputting of the input image into an image generation model to obtain the target image; wherein the image generation model is obtained by the image generation model training method according to any one of the above.
Since each functional module of the image generating apparatus of the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the image generating method described above, for details not disclosed in the embodiment of the apparatus of the present disclosure, please refer to the embodiment of the image generating method described above of the present disclosure.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
In addition, in the exemplary embodiment of the disclosure, an electronic device capable of implementing the image generation model training method is also provided.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 1500 according to such an embodiment of the present disclosure is described below with reference to fig. 15. The electronic device 1500 shown in fig. 15 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 15, the electronic device 1500 is embodied in the form of a general purpose computing device. The components of electronic device 1500 may include, but are not limited to: the at least one processing unit 1510, the at least one storage unit 1520, a bus 1530 connecting the different system components (including the storage unit 1520 and the processing unit 1510), and a display unit 1540.
Wherein the storage unit stores program code that is executable by the processing unit 1510 such that the processing unit 1510 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the "exemplary method" of the present specification. For example, the processing unit 1510 may perform step S210 as shown in fig. 2, acquiring an input image sample and a target image sample; the input image sample and the target image sample are images obtained by shooting different camera poses in the same scene, and the target image sample corresponds to the target camera pose; step S220, inputting an input image sample into a model to be trained, obtaining an image feature vector corresponding to the input image sample, obtaining a depth feature vector corresponding to the input image sample, and fusing the image feature vector and the depth feature vector to obtain a multi-mode feature vector corresponding to the input image sample; step S230, determining the positions of a plurality of pixels corresponding to a predicted target image based on the pose of the target camera, generating virtual rays from the focus of the camera to the positions of the pixels, and sampling a plurality of points on each virtual ray; wherein, the pose of the target camera corresponds to a camera focus; step S240, determining color information of a plurality of pixels corresponding to the predicted target image according to a plurality of points obtained by up-sampling the virtual rays and the multi-mode feature vector; step S250, rendering a plurality of pixels according to color information corresponding to the pixels to obtain a prediction target image; and step S260, updating the neural network parameters of the model to be trained according to the target image sample and the predicted target image to obtain an image generation model.
Alternatively, step S1210 shown in fig. 12 may also be performed to acquire an input image and predict the camera pose; in step S1220, the input image is input into the image generation model to obtain the target image.
As another example, the electronic device may implement the various steps shown in fig. 2 and 12.
The storage unit 1520 may include readable media in the form of volatile memory units such as Random Access Memory (RAM) 1521 and/or cache memory 1522, and may further include Read Only Memory (ROM) 1523.
The storage unit 1520 may also include a program/utility 1524 having a set (at least one) of program modules 1525, such program modules 1525 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1530 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1500 may also communicate with one or more external devices 1570 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1500, and/or any device (e.g., router, modem, etc.) that enables the electronic device 1500 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1550. Also, the electronic device 1500 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, for example, the Internet, through a network adapter 1560. As shown, the network adapter 1560 communicates with other modules of the electronic device 1500 over the bus 1530. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1500, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g., a memory, comprising instructions executable by a processor of an apparatus to perform the above method. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction which, when executed by a processor, implements the image generation model training method or the image generation method in the above-described embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of training an image generation model, the method comprising:
acquiring an input image sample and a target image sample; the input image sample and the target image sample are images obtained by shooting different camera poses in the same scene, and the target image sample corresponds to the target camera pose;
inputting the input image sample into a model to be trained, acquiring an image feature vector corresponding to the input image sample, acquiring a depth feature vector corresponding to the input image sample, and fusing the image feature vector and the depth feature vector to obtain a multi-mode feature vector corresponding to the input image sample;
Determining positions of a plurality of pixels corresponding to a predicted target image based on the pose of the target camera, generating virtual rays from the camera focus to the positions of the pixels, and sampling a plurality of points on each virtual ray; wherein the target camera pose corresponds to a camera focus;
determining color information of a plurality of pixels corresponding to the prediction target image according to a plurality of points obtained by up-sampling the virtual ray and the multi-mode feature vector;
rendering a plurality of pixels according to color information corresponding to the pixels so as to obtain a prediction target image;
and updating the neural network parameters of the model to be trained according to the target image sample and the prediction target image so as to obtain an image generation model.
2. The method of claim 1, wherein the obtaining depth feature vectors corresponding to the input image samples comprises:
performing depth estimation on the input image sample to obtain a sparse depth feature vector corresponding to the input image sample;
determining a depth feature vector corresponding to the input image sample according to the sparse depth feature vector; the depth feature vector corresponding to the input image sample is a dense depth feature vector.
3. The method according to claim 1, wherein determining color information of a plurality of pixels corresponding to the prediction target image according to the plurality of points sampled on the virtual ray and the multi-modal feature vector includes:
determining pixel coordinate values of the pixels, and determining pixel feature vectors corresponding to the pixels according to the multi-mode feature vectors;
coordinate encoding is carried out on the space coordinate values of a plurality of points obtained by up-sampling the virtual rays to obtain a plurality of space coordinate encoding vectors;
determining color information of the plurality of points and density information of the plurality of points according to the pixel characteristic vector corresponding to the pixel and the plurality of space coordinate coding vectors;
and determining color information of a plurality of pixels according to the color information of the plurality of points and the density information of the plurality of points.
4. A method according to claim 3, wherein the virtual ray has a ray direction, and wherein determining the color information of the plurality of pixels from the color information of the plurality of points and the density information of the plurality of points comprises:
performing volume rendering along the ray direction according to the color information of the plurality of points and the density information of the plurality of points to obtain a plurality of candidate points;
And determining color information of a plurality of pixels according to the plurality of candidate points on the virtual ray.
5. The method of claim 3, wherein the plurality of points sampled on the virtual ray have spatial coordinate values, and wherein the determining the pixel coordinate values of the pixel comprises:
converting the spatial coordinate values of the plurality of points obtained by up-sampling the virtual rays from a world coordinate system to a camera coordinate system to obtain first candidate coordinate values of the plurality of points;
converting the first candidate coordinate values of the points from a camera coordinate system to an image coordinate system to obtain second candidate coordinate values of the pixels;
and converting the second candidate coordinate value of the pixel from the image coordinate system to the pixel coordinate system to obtain the pixel coordinate value of the pixel.
6. An image generation method, the method comprising:
acquiring an input image and predicting the pose of a camera; the input image and the target image are images obtained by shooting different camera poses in the same scene;
inputting the input image into an image generation model to obtain a target image; the image generation model is trained by the image generation model training method according to any one of claims 1 to 5.
7. An image generation model training apparatus, comprising:
a sample data acquisition unit configured to perform acquisition of an input image sample and a target image sample; the input image sample and the target image sample are images obtained by shooting different camera poses in the same scene, and the target image sample corresponds to the target camera pose;
the feature vector acquisition unit is configured to input the input image sample into a model to be trained, acquire an image feature vector corresponding to the input image sample, acquire a depth feature vector corresponding to the input image sample, and fuse the image feature vector and the depth feature vector to obtain a multi-mode feature vector corresponding to the input image sample;
a virtual ray generation unit configured to perform determination of positions of a plurality of pixels corresponding to a predicted target image based on the target camera pose, generate virtual rays from the camera focus to the positions of the pixels, and sample a plurality of points on each of the virtual rays; wherein the target camera pose corresponds to a camera focus;
a color information acquisition unit configured to perform determination of color information of a plurality of pixels corresponding to the prediction target image according to a plurality of points obtained by up-sampling the virtual ray and the multi-modal feature vector;
A predicted image rendering unit configured to perform rendering of a plurality of the pixels according to color information corresponding to the plurality of the pixels to obtain a predicted target image;
and the network parameter updating unit is configured to update the neural network parameters of the model to be trained according to the target image sample and the prediction target image so as to obtain an image generation model.
8. An image generating apparatus, comprising:
an input image acquisition unit configured to perform acquisition of an input image and prediction of a camera pose; the input image and the target image are images obtained by shooting different camera poses in the same scene;
an image generation unit configured to perform input of the input image into an image generation model, resulting in a target image; the image generation model is trained by the image generation model training method according to any one of claims 1 to 5.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the image generation model training method of any one of claims 1 to 5 or the image generation method of claim 6.
10. A computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the image generation model training method of any one of claims 1 to 5 or the image generation method of claim 6.
CN202310583770.2A 2023-05-22 2023-05-22 Image generation model training method, image generation device and medium Pending CN116630744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310583770.2A CN116630744A (en) 2023-05-22 2023-05-22 Image generation model training method, image generation device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310583770.2A CN116630744A (en) 2023-05-22 2023-05-22 Image generation model training method, image generation device and medium

Publications (1)

Publication Number Publication Date
CN116630744A (en) 2023-08-22

Family

ID=87612825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310583770.2A Pending CN116630744A (en) 2023-05-22 2023-05-22 Image generation model training method, image generation device and medium

Country Status (1)

Country Link
CN (1) CN116630744A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117544829A (en) * 2023-10-16 2024-02-09 支付宝(杭州)信息技术有限公司 Video generation method and device

Similar Documents

Publication Publication Date Title
JP7373554B2 (en) Cross-domain image transformation
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN115690382B (en) Training method of deep learning model, and method and device for generating panorama
US20220358675A1 (en) Method for training model, method for processing video, device and storage medium
CN115272565A (en) Reconstruction method and electronic device of a three-dimensional model of a head
CN117274446A (en) Scene video processing method, device, equipment and storage medium
CN115272575B (en) Image generation method and device, storage medium and electronic equipment
CN117011137A (en) Image stitching method, device and equipment based on RGB similarity feature matching
CN114937125B (en) Reconstructable metric information prediction method, reconstructable metric information prediction device, computer equipment and storage medium
Pintore et al. Deep scene synthesis of Atlanta-world interiors from a single omnidirectional image
Wang et al. Efficient video portrait reenactment via grid-based codebook
CN116630744A (en) Image generation model training method, image generation device and medium
CN115565039A (en) A new view synthesis method for monocular input dynamic scenes based on self-attention mechanism
CN118505878A (en) Three-dimensional reconstruction method and system for single-view repetitive object scene
CN115100360B (en) Image generation method and device, storage medium and electronic equipment
CN115527011A (en) Navigation method and device based on three-dimensional model
CN115439610A (en) Model training method, training device, electronic equipment and readable storage medium
CN117474956B (en) Light field reconstruction model training method and related equipment based on motion estimation attention
CN116681818B (en) New view angle reconstruction method, training method and device of new view angle reconstruction network
CN118474323B (en) Three-dimensional image, three-dimensional video, monocular view, training data set generation method, training data set generation device, storage medium, and program product
CN117241065B (en) Video plug-in frame image generation method, device, computer equipment and storage medium
WO2024007968A1 (en) Methods and system for generating an image of a human
Svidovskii et al. Attention-based models in self-supervised monocular depth estimation
WO2024118464A1 (en) Systems and methods for automatic generation of three-dimensional models from two-dimensional images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination