
CN114445562A - Three-dimensional reconstruction method and device, electronic device and storage medium - Google Patents

Three-dimensional reconstruction method and device, electronic device and storage medium

Info

Publication number
CN114445562A
CN114445562A (application number CN202210147676.8A)
Authority
CN
China
Prior art keywords
dimensional
point cloud
image
dimensional reconstruction
target object
Prior art date
Legal status
Pending
Application number
CN202210147676.8A
Other languages
Chinese (zh)
Inventor
王坤 (Wang Kun)
林纯泽 (Lin Chunze)
Current Assignee
Beijing Datianmian White Sugar Technology Co ltd
Original Assignee
Beijing Datianmian White Sugar Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Datianmian White Sugar Technology Co ltd filed Critical Beijing Datianmian White Sugar Technology Co ltd
Priority to CN202210147676.8A
Publication of CN114445562A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a three-dimensional reconstruction method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a two-dimensional image to be processed, and inputting the two-dimensional image to be processed into a three-dimensional reconstruction network for processing to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, wherein the three-dimensional reconstruction result includes a three-dimensional point cloud that corresponds to the two-dimensional image to be processed and carries pose information. In the embodiments of the disclosure, the three-dimensional reconstruction network is trained on a training data set that includes sample data of both the front and the back of the target object, so that a three-dimensional point cloud carrying the corresponding pose information can be accurately obtained for an image of the target object captured at any angle.

Description

Three-dimensional reconstruction method and device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a three-dimensional reconstruction method and apparatus, an electronic device, and a storage medium.
Background
Three-dimensional Reconstruction (3D Reconstruction) refers to establishing a mathematical model of a three-dimensional object that is suitable for computer representation and processing. As a key technology for building, inside a computer, a virtual reality that expresses the objective world, three-dimensional reconstruction has a wide impact on many fields such as medical cosmetology, autonomous driving, virtual reality, behavior analysis, animation, and social entertainment.
In the three-dimensional reconstruction task of a target object, for example three-dimensional reconstruction of a human face, a correct face model can be obtained by regressing coefficients of a reference expression basis; a face model can also be obtained by regressing a dense point cloud; further ways of three-dimensional reconstruction are being explored, for example, a three-dimensional implicit representation can be generated directly by a generative adversarial network. Although the ways of three-dimensional reconstruction are gradually expanding, the related art only focuses on the accuracy of regressing the frontal face pose and the frontal face shape and expression.
Disclosure of Invention
The present disclosure provides a three-dimensional reconstruction technical solution.
According to an aspect of the present disclosure, there is provided a three-dimensional reconstruction method including: acquiring a two-dimensional image to be processed; inputting the two-dimensional image to be processed into a three-dimensional reconstruction network for processing to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, wherein the three-dimensional reconstruction result comprises a three-dimensional point cloud that corresponds to the two-dimensional image to be processed and carries pose information; the three-dimensional reconstruction network is obtained through training on a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side represents that the face of the target object faces a shooting device, and the back side represents that the face of the target object faces away from the shooting device.
In a possible implementation manner, the sample data includes at least two of first sample data on the front side and the back side of the target object, second sample data on the back side of the target object, and third sample data on the front side of the target object and having an expression, and at least part of shooting scenes of images in the first sample data, the second sample data, and the third sample data are different.
In one possible implementation, the training process of the three-dimensional reconstruction network includes: and inputting the first sample data, the second sample data and the third sample data into the three-dimensional reconstruction network according to a preset proportion, and training the three-dimensional reconstruction network.
In one possible implementation, the third sample data is largest in the preset ratio.
In one possible implementation, the first sample data includes: a first sample image and a first three-dimensional point cloud acquired by a depth camera, and/or a first sample image and a second three-dimensional point cloud acquired by a monocular camera; the first three-dimensional point cloud is generated according to the first sample image acquired by the depth camera; the first sample image acquired through the monocular camera comprises three color channel information, and the second three-dimensional point cloud is generated based on the labeling information of the key points in the first sample image acquired through the monocular camera.
In one possible implementation, the second sample data includes: a second sample image of the back of the target object acquired by a camera array with calibration parameters, and a three-dimensional point cloud corresponding to the second sample image.
In one possible implementation, the loss function used in the training process of the three-dimensional reconstruction network includes at least one of the following: a first loss function indicating the reconstruction error of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network, a second loss function indicating the smoothness of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network, a third loss function indicating the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network, and a fourth loss function indicating the expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network;
wherein the first loss function is determined based on the predicted three-dimensional point cloud and the three-dimensional point cloud in the training data set; the second loss function is determined based on the distance from each point in the predicted three-dimensional point cloud to its first-order neighbor points and the distance from each correspondingly ordered point in the three-dimensional point cloud in the training data set to its first-order neighbor points; the third loss function is determined based on the predicted three-dimensional point cloud of the two-dimensional sample image and the predicted three-dimensional point cloud of the two-dimensional sample image after perturbation; the fourth loss function is determined based on a projection of the predicted three-dimensional point cloud and the two-dimensional sample image in the training data set.
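A minimal sketch of how these four losses could be written, assuming ordered (batch, K, 3) point clouds with a one-to-one point correspondence and an externally supplied projection function; all function and variable names are illustrative and not taken from the patent:

```python
import torch.nn.functional as F

def reconstruction_loss(pred, gt):
    # First loss: per-point error between predicted and ground-truth point clouds,
    # which are assumed to share the same point ordering.
    return F.l1_loss(pred, gt)

def smoothness_loss(pred, gt, neighbor_idx):
    # Second loss: the distance from each point to its first-order neighbors should
    # match the same distances in the ground-truth cloud.
    # neighbor_idx: (K, M) long tensor of first-order neighbor indices per point.
    pred_d = (pred[:, :, None, :] - pred[:, neighbor_idx, :]).norm(dim=-1)
    gt_d = (gt[:, :, None, :] - gt[:, neighbor_idx, :]).norm(dim=-1)
    return F.l1_loss(pred_d, gt_d)

def stability_loss(model, image, perturbed_image):
    # Third loss: predictions for a sample image and its perturbed copy should agree.
    return F.l1_loss(model(image), model(perturbed_image))

def expression_loss(pred, keypoint_idx, keypoints_2d, project):
    # Fourth loss: project selected vertices of the predicted cloud into the image
    # plane (via the assumed `project` callable) and compare them with the annotated
    # 2D key points of the sample image.
    return F.l1_loss(project(pred)[:, keypoint_idx, :], keypoints_2d)
```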
In one possible implementation, the three-dimensional reconstruction network includes an encoding network including a depthwise separable convolution layer and a decoding network including a fully-connected layer.
In one possible implementation, the method further includes evaluating an accuracy of the three-dimensional reconstruction network based on an evaluation function; the evaluation function comprises at least one of a first evaluation function indicating the expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network and a second evaluation function indicating the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network;
wherein the first evaluation function is determined based on the projection of the predicted three-dimensional point cloud and the two-dimensional sample image; the second evaluation function is determined based on predicted three-dimensional point clouds respectively corresponding to at least two adjacent frames of a static video of the target object.
In a possible implementation manner, the two-dimensional image to be processed includes a human head image, the three-dimensional reconstruction network is configured to reconstruct the two-dimensional image including the human head image to obtain a three-dimensional reconstruction result of the human head, and the three-dimensional point cloud in the three-dimensional reconstruction result includes a human head point cloud carrying pose information and/or expression information.
According to an aspect of the present disclosure, there is provided a three-dimensional reconstruction apparatus including: an acquisition module for acquiring a two-dimensional image to be processed; a three-dimensional reconstruction module for inputting the two-dimensional image to be processed into a three-dimensional reconstruction network for processing to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, wherein the three-dimensional reconstruction result comprises a three-dimensional point cloud that corresponds to the two-dimensional image to be processed and carries pose information; the three-dimensional reconstruction network is obtained through training on a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side represents that the face of the target object faces a shooting device, and the back side represents that the face of the target object faces away from the shooting device.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, the acquired to-be-processed two-dimensional image can be input into a three-dimensional reconstruction network for processing, so as to obtain a three-dimensional reconstruction result of the to-be-processed two-dimensional image, wherein the three-dimensional reconstruction result can include a three-dimensional point cloud carrying pose information corresponding to the to-be-processed two-dimensional image; the three-dimensional reconstruction network is obtained through training of a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side shows that the face of the target object faces a shooting device, and the back side shows that the face of the target object faces away from the shooting device.
By this method, training the three-dimensional reconstruction network with a training data set that includes sample data of both the front side and the back side of the target object helps improve the training effect, so that the trained three-dimensional reconstruction network has high precision and high applicability: it can process an image of the target object at any angle and accurately obtain the three-dimensional point cloud of the target object carrying the corresponding pose information. Moreover, for a front image in which the face of the target object faces the shooting device, the accuracy of the pose information, face shape and expression of the three-dimensionally reconstructed target object (the three-dimensional point cloud output by the three-dimensional reconstruction network) is improved; for a back image in which the face of the target object faces away from the shooting device, the accuracy of the pose information of the three-dimensionally reconstructed target object is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow chart of a three-dimensional reconstruction method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a three-dimensional point cloud in accordance with an embodiment of the present disclosure.
Fig. 3 illustrates a schematic diagram of a first sample data rendering effect according to an embodiment of the present disclosure.
Fig. 4 illustrates a schematic diagram of another first sample data rendering effect according to an embodiment of the present disclosure.
Fig. 5 illustrates a schematic diagram of a second sample data rendering effect according to an embodiment of the present disclosure.
Fig. 6 illustrates a schematic diagram of a third sample data rendering effect according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a three-dimensional reconstruction network structure according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of test results of a three-dimensional reconstruction network according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of a three-dimensional reconstruction apparatus according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of an electronic device 800 according to an embodiment of the disclosure.
Fig. 11 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a three-dimensional reconstruction method according to an embodiment of the present disclosure, which may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling a computer-readable instruction stored in a memory. Alternatively, the method may be performed by a server.
As shown in fig. 1, a three-dimensional reconstruction method according to an embodiment of the present disclosure may include:
in step S1, a two-dimensional image to be processed is acquired;
in step S2, the two-dimensional image to be processed is input into a three-dimensional reconstruction network for processing, so as to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, where the three-dimensional reconstruction result includes a three-dimensional point cloud carrying pose information corresponding to the two-dimensional image to be processed.
The three-dimensional reconstruction network is obtained through training of a training data set, the training data set comprises sample data of the front side and the back side of a target object, and the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image.
Wherein the front side represents that the face of the target object faces the photographing apparatus, and the back side represents that the face of the target object faces away from the photographing apparatus. For example, the front face may indicate that the face of the target object is oriented at an angle of 0 to 90 degrees and 270 to 360 degrees with respect to the photographing direction, and the rear face may indicate that the face of the target object is oriented at an angle of 90 to 270 degrees with respect to the photographing direction. It should be understood that 0-90 degrees, 90-270 degrees, and 270-360 degrees are merely examples, and embodiments of the present disclosure are not limited to specific ranges of angles corresponding to the front and back surfaces.
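For illustration only (this helper is not part of the patent; it simply encodes the example angle ranges above), a front/back decision based on the face orientation angle could look like:

```python
def front_or_back(yaw_degrees: float) -> str:
    # Example ranges from the text: 0-90 and 270-360 degrees count as front,
    # 90-270 degrees count as back.
    yaw = yaw_degrees % 360.0
    return "front" if (yaw <= 90.0 or yaw >= 270.0) else "back"
```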
In one possible implementation, the three-dimensional reconstruction method of the embodiment of the present disclosure may perform three-dimensional reconstruction on a target object included in the two-dimensional image to be processed, where the target object may include a body part of a human body, such as a human head, i.e., an object whose front (the face) has definite semantic parts (e.g., eye corners, mouth corners, the nose tip, etc.) and whose back (e.g., the back of the head) lacks such definite semantic parts. Assuming that the target object is a human head, the to-be-processed two-dimensional image may include a human head image, the three-dimensional reconstruction network is configured to reconstruct the two-dimensional image including the human head image to obtain a three-dimensional reconstruction result of the human head, and the three-dimensional point cloud in the three-dimensional reconstruction result includes a human head point cloud carrying pose information and/or expression information.
The three-dimensional reconstruction result obtained by the embodiments of the disclosure can be applied to interactive scenes based on Augmented Reality (AR) such as movies, games and virtual social networks; the pose information carried by the three-dimensional reconstruction result helps to judge the relative position of the target object and virtual information (for example, a virtual scene, a virtual object, etc.) in reality, thereby realizing the combination of, or interaction between, the target object and the virtual information. For another example, in an AR outfit-changing application, the display angle and position of the virtual clothes can be adjusted according to the three-dimensional reconstruction result of the target object, thereby improving the outfit-changing effect. In AR special effect generation applications, the three-dimensional reconstruction of the target object facilitates combining virtual special effects with the target object, such as virtual makeup processing, face-to-animal-face special effects, face sticker special effects, and the like. The application scenario of the three-dimensional reconstruction result is not limited by the present disclosure.
It should be understood that the target object of the embodiments of the present disclosure is not limited to a body part of a human body (e.g., a human head), and any object having definite semantic parts on the front side and lacking definite semantic parts on the back side may be used as the target object. For convenience of description, the following embodiments of the disclosure all take the human head as an example of the target object; the case where the target object is another object may be flexibly extended according to the following embodiments of the disclosure, which is not limited by the present disclosure.
In one possible implementation manner, a two-dimensional image to be processed may be acquired in step S1, and the manner of acquiring the two-dimensional image to be processed is also different for different execution subjects; for example, in the case that the execution subject is a terminal device, a camera installed in the terminal device may be called to obtain a to-be-processed two-dimensional image including a target object in real time, an album of the terminal device may also be called to obtain the to-be-processed two-dimensional image, and the to-be-processed two-dimensional image sent by other devices may also be received; in the case where the execution subject is a server, the server may receive a to-be-processed two-dimensional image transmitted by another device, and may call the to-be-processed two-dimensional image stored in a database connected to the server.
Moreover, when the three-dimensional reconstruction method provided by the embodiment of the present disclosure is applied to different scenes, the two-dimensional image to be processed may be obtained differently.
For example, assuming that the target object is a human head, in the case of applying the human head reconstruction method to a game, an image including the human head of a game player may be acquired by an image acquisition device installed in the game device, or an image including the human head of the game player may be selected from a photo album in the game device, and the acquired image including the human head of the game player may be taken as a two-dimensional image to be processed.
For another example, when the human head reconstruction method is applied to a live broadcast scene, a video frame image including a human head may be determined from a plurality of frame video frame images included in a video stream acquired by a live broadcast device; and taking the video frame image containing the human head as a two-dimensional image to be processed. Here, the two-dimensional image to be processed may have, for example, a plurality of frames; the multiple frames of to-be-processed two-dimensional images may be obtained by sampling multiple frames of video frame images in the video stream.
In one possible implementation, the two-dimensional image to be processed may be a front image of the target object or a back image of the target object, which is not limited in the present disclosure.
For example, in the case where the target object is a human head, the front means that the face of the target object faces the photographing apparatus; for example, a frontal image may be an image in which the angle between the face orientation of the target object and the photographing direction (the direction between the face and the image capturing apparatus) is between 0 and 90 degrees or between 270 and 360 degrees, the photographed rotation angle is small (for example, a frontal face image), and the target object in the image has definite semantic parts (e.g., eye corners, mouth corners, the nose tip, etc.). The back means that the face of the target object faces away from the photographing apparatus; for example, a back image may be an image in which the angle between the face orientation of the target object and the photographing direction is between 90 and 270 degrees, the photographed target object has a large rotation angle (e.g., an image of the back of the head), and the target object region in the image includes only a few definite semantic parts (e.g., eye corners, mouth corners, the nose tip, etc.), or even none at all. The front image and the back image in the embodiments of the present disclosure are not limited to the numerical ranges given above.
In a possible implementation manner, in step S2, the to-be-processed two-dimensional image acquired in step S1 may be input into a three-dimensional reconstruction network for processing, so as to obtain a three-dimensional reconstruction result of the to-be-processed two-dimensional image.
The three-dimensional reconstruction network can be trained in advance and used for performing three-dimensional reconstruction on a target object in a two-dimensional image to be processed and determining a three-dimensional reconstruction result of the target object, wherein the three-dimensional reconstruction result can comprise a three-dimensional point cloud which is corresponding to the two-dimensional image to be processed and carries pose information. The pre-trained three-dimensional reconstruction network may include at least one of: convolutional Neural Networks (CNN), Back Propagation (BP), and Backbone Neural Networks (Backbone Networks).
When determining the three-dimensional reconstruction network structure, for example, a Backbone Network of the three-dimensional reconstruction network may be determined first as the main framework of the three-dimensional reconstruction network. For example, the backbone network may include at least one of the following: an encoding network (Encode), a decoding network (Decode), the Inception network, the ResNet variant ResNeXt, the Inception variant Xception, a Squeeze-and-Excitation network (SENet), the lightweight network MobileNet, and the lightweight network ShuffleNet.
Illustratively, when the three-dimensional reconstruction network includes a convolutional neural network, the lightweight network MobileNet can be used as the basic model of the convolutional neural network; other network structures are added on the basis of MobileNet to form the convolutional neural network, and the formed convolutional neural network is trained. In this process, MobileNet is used as a part of the convolutional neural network; since MobileNet has a small size and a high data processing speed, the training speed is higher. Meanwhile, the three-dimensional reconstruction network obtained through training also has the advantages of small size and high data processing speed, and is more suitable for deployment on embedded devices.
Here, the network structure of the three-dimensional reconstruction network described above is merely an example; the specific construction mode and structure of the network structure may be determined according to actual situations, and are not described herein again, and the above examples also do not constitute limitations on the embodiments of the present disclosure.
In a possible implementation manner, the three-dimensional reconstruction network may be obtained by training a training data set, each sample data in the data set may be input into the three-dimensional reconstruction network separately to train the three-dimensional reconstruction network, or a plurality of sample data may be input into the three-dimensional reconstruction network in batches in a batch (batch) data manner to train the three-dimensional reconstruction network, which is not limited in this disclosure.
For example, assuming that the target object is a human head, the training data set may include sample data of the human head at multiple angles, and each sample data may include a two-dimensional sample image of the human head at a certain angle and a three-dimensional point cloud corresponding to the two-dimensional sample image.
The sample data of the human head at multiple angles may include sample data of the front and back of the human face, for example, the front represents that the face of the human face faces the shooting device, and may have all or most of the definite semantic parts (e.g., the canthus, the corner of the mouth, the tip of the nose, the sides of the ears, etc.), such as the face of the human head, the back represents that the face of the human face faces the shooting device, and there are no or only a few definite semantic parts, such as the back of the head, of the human head.
The two-dimensional sample image can be a two-dimensional image shot by image acquisition equipment with a photographing function, such as a camera, a video camera, a scanner, a mobile phone, a tablet computer and the like, in a natural scene (in-the-world). The natural scene is any scene in the real world (e.g., work, study, and life), and includes, for example, two-dimensional images captured in an environment such as a park, an office, a school, and a mall, and the background of the two-dimensional images captured in the natural scene may be various.
The two-dimensional sample image can also be a two-dimensional image taken in an experimental scene by a camera array (Camera Array) composed of multiple cameras at different spatial positions. The experimental scene includes an artificially arranged laboratory, where the background of a two-dimensional image shot in the scene is set artificially, for example, to a single color (e.g., pure white). The camera array can collect multiple frames of images in one shot, and the multiple cameras at different spatial positions can be used to collect images from different viewing angles. When the distance between the cameras is relatively large, the whole camera array can be regarded as a Multiple-Center-of-Projection Camera, and multi-view information of the target object (e.g., a human head) can be obtained.
The two-dimensional sample image may also be a two-dimensional image with depth information captured by a capturing device with a depth image capturing function, such as a TOF (Time of Flight) camera or a structured light camera, i.e., an image including information about the distance from the capturing device to the surface of the target object. For example, the two-dimensional sample image may be an RGB-D depth image, which may include three color channel information of red (R), green (G), and blue (B), and depth (D) channel information related to the distance from the viewpoint to the surface of the scene object.
It should be understood that the two-dimensional sample image may be an image captured in different capturing scenes by using different capturing devices, the capturing devices may include a depth camera, a camera array, a monocular camera, a video camera, a scanner, a mobile phone, a tablet computer, and the like, the capturing scenes may include any natural scene in the real world, an artificially arranged experimental scene, scenes in different time periods and different light, and the like, and the present disclosure does not limit the capturing devices, capturing angles, and capturing scenes of the two-dimensional sample image.
Here, the three-dimensional point cloud may represent a three-dimensional model of the target object. Fig. 2 shows a schematic diagram of a three-dimensional point cloud according to an embodiment of the disclosure. As shown in fig. 2, in the case that the target object is a human head, the three-dimensional point cloud may include coordinate values of a plurality of vertices of the human head surface in a pre-constructed three-dimensional coordinate system (a coordinate system serving as a reference). The three-dimensional point cloud may be a three-dimensional point cloud with a topological structure, and a three-dimensional mesh (3D mesh) formed by connecting the plurality of vertices, together with the coordinate values of the plurality of vertices, may be used to represent the three-dimensional model of the human face. Fig. 2 shows three-dimensional point clouds of different densities; the larger the number of points included in the three-dimensional point cloud, the finer the represented three-dimensional model of the human face.
It is assumed that the three-dimensional point cloud corresponding to the two-dimensional image to be processed may include K points, i.e., (x_i, y_i, z_i), i ∈ [1, K], and each point in the three-dimensional point cloud may carry pose information. For example, a preset point cloud of the target object (e.g., a human head) without pose information under preset coordinates (i.e., face orientation and shooting angle both 0) may include coordinate values expressed as (x'_i, y'_i, z'_i), i ∈ [1, K]. After pose transformations such as scaling, rotation and translation, (x'_i, y'_i, z'_i) can coincide with the coordinate values (x_i, y_i, z_i) contained in the predicted three-dimensional point cloud of the human head, i.e., (x_i, y_i, z_i) = scale * rotation * (x'_i, y'_i, z'_i) + translation. Therefore, the three-dimensional point cloud output by the three-dimensional reconstruction network carries pose information.
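A minimal numpy sketch of this pose relation, under the assumption that the rotation is given as a 3x3 matrix (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def apply_pose(preset_cloud, scale, rotation, translation):
    # preset_cloud: (K, 3) points (x'_i, y'_i, z'_i) in the preset coordinate system
    # rotation:     (3, 3) rotation matrix; translation: (3,) vector; scale: scalar
    # Returns the (K, 3) posed points (x_i, y_i, z_i) = scale * rotation * p' + translation.
    return scale * preset_cloud @ rotation.T + translation
```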
According to the embodiments of the disclosure, training the three-dimensional reconstruction network with a training data set that includes sample data of both the front side and the back side of the target object improves the training effect, so that the trained three-dimensional reconstruction network has high precision and high applicability: a user can input an image of the target object from any direction (including, for example, any angle) into the three-dimensional reconstruction network and accurately obtain the three-dimensional point cloud of the target object carrying the corresponding pose information. Moreover, for a front image in which the face of the target object faces the shooting device (for example, an image shot with the angle between the face orientation and the shooting direction between 0 and 90 degrees or between 270 and 360 degrees), the accuracy of the pose information, face shape and expression of the three-dimensionally reconstructed target object (the three-dimensional point cloud output by the three-dimensional reconstruction network) can be improved; for a back image in which the face of the target object faces away from the shooting device (an image shot with the angle between the face orientation and the shooting direction between 90 and 270 degrees), the accuracy of the pose information of the three-dimensionally reconstructed target object is improved.
The following takes the human head as an example of a target object, and exemplarily describes the three-dimensional reconstruction method according to the embodiment of the present disclosure from a training phase and a prediction phase of a three-dimensional reconstruction network, respectively.
In the training phase of the three-dimensional reconstruction network, a training data set of the three-dimensional reconstruction network may be prepared first.
The process of training the three-dimensional reconstruction network on the training data set is the process in which the three-dimensional reconstruction network learns the training data set and finds the relation between the two-dimensional sample images and the three-dimensional point clouds (the three-dimensional models corresponding to the target objects in the two-dimensional sample images) in the training data set; the trained three-dimensional reconstruction network (the three-dimensional reconstruction network after learning) can then make a decision on an input human head image at any angle and output a three-dimensional point cloud carrying pose information. Therefore, the training data set is an influence factor of the reconstruction effect of the three-dimensional reconstruction network: the better the training data set, the better the performance of the three-dimensional reconstruction network and the more accurate the obtained three-dimensional reconstruction result.
In one possible implementation, the training data set may include a plurality of types of sample data, where the sample data includes at least two of first sample data of a front side and a back side of the target object, second sample data of the back side of the target object, and third sample data of the front side of the target object and having an expression, where at least some shooting scenes of images in the first sample data, the second sample data, and the third sample data are different, and the shooting scenes may include a natural scene (a scene in the real world), an experimental scene (a scene arranged artificially), scenes in different time periods, scenes in different illuminances (brightness), and the like, and the specific shooting scenes are not limited by the present disclosure.
For example, in the case that the target object is a human head, the training data set may include human head sample data of various poses, various shooting scenes, and various expressions, for example, the training data set may include first sample data of a front side and a back side of the human head in a natural scene, second sample data of the back side of the human head in an experimental scene, and third sample data of the front side of the human head with expressions in a natural scene.
For each sample data in the training data set, a two-dimensional sample image and a three-dimensional point cloud corresponding to the two-dimensional sample image may be included, for example, the first sample data may include a first sample image and a three-dimensional point cloud corresponding to the first sample image on the front and back of the target object in a natural scene; the second sample data can comprise a second sample image on the back of the target object in the experimental scene and a corresponding three-dimensional point cloud; the third sample data can comprise a sample image with expression on the front side of the target object in a natural scene and a corresponding three-dimensional point cloud.
Each two-dimensional sample image may include a plurality of labeled key points for characterizing human head/face features, such as a plurality of key points characterizing the facial features, cheekbones and eyebrows. The key points included in the two-dimensional sample image may be labeled manually in an interactive manner, or labeled by a key point detection method, for example: Active Shape Models (ASM), Active Appearance Models (AAM), Cascaded Pose Regression (CPR), and the like. The key points on the two-dimensional sample image help improve the efficiency with which the three-dimensional reconstruction network generates the three-dimensional point cloud.
In this way, the training data set can include a plurality of types of sample data, so that the balance of the training data set and the difference of the samples in the training data set can be improved, and the adaptability and the accuracy of the three-dimensional reconstruction network can be improved.
The first sample data can be used to improve the diversity of sample data scenes, which helps improve the adaptability of the three-dimensional reconstruction network. The related art neglects the situation that the shape and accuracy of the reconstructed human head model (three-dimensional point cloud) are not high enough at large angles or even from the back, and the second sample data helps improve the learning effect of the three-dimensional reconstruction network on reconstructing the back of the human head. The third sample data includes multiple expressions of the human face (neutral, mouth open, eyes closed, etc.), which helps the three-dimensional reconstruction network learn expressions such as opening and closing the eyes and opening the mouth.
In one possible implementation, the first sample data includes: the first sample image and the first three-dimensional point cloud acquired by the depth camera, and/or the first sample image and the second three-dimensional point cloud acquired by the monocular camera.
The first three-dimensional point cloud is generated according to the first sample image acquired by the depth camera; the first sample image acquired through the monocular camera comprises three color channel information, and the second three-dimensional point cloud is generated based on the labeling information of the key points in the first sample image acquired through the monocular camera.
Fig. 3 is a schematic diagram illustrating a rendering effect of first sample data according to an embodiment of the present disclosure, and as shown in fig. 3, the first sample data may include: a first sample image having three color channel information and one depth channel information, and a first three-dimensional point cloud corresponding thereto.
The first sample image with three color channel information and one depth channel information, namely the RGB-D depth image, includes not only the color value of each point in the picture to be shot, but also the distance value from each point in the picture to be shot to the vertical plane where the depth camera is located. For example, for an RGB-D depth image, the value of each point may be represented as (R, G, B, D), RGB indicates three color channel information, R represents red channel information, G represents green channel information, B represents blue channel information, and D indicates depth channel information.
The multi-frame RGB-D depth image (first sample image) collected by the depth camera can be subjected to three-dimensional reconstruction to generate a first three-dimensional point cloud. For example, a reconstructed point cloud of a target object (e.g., a human head) and a pose transformation relation of each frame of first sample image corresponding to the reconstructed point cloud may be obtained by using a KinectFusion method (a three-dimensional reconstruction technique based on an RGB-D depth image) according to multiple frames of first sample images, and under the supervision of the reconstructed point cloud, a first three-dimensional point cloud carrying pose information corresponding to each frame of first sample image may be obtained by using Principal Component Analysis (PCA), and the first sample image and the corresponding first three-dimensional point cloud may be combined in pairs to obtain the first sample data as shown in fig. 3.
Compared with the related art, in which the pose information carried by the three-dimensional point cloud is determined by registering the current RGB-D depth image with the depth image of the previous frame, the KinectFusion method calculates the pose information by registering the current RGB-D depth image with an image obtained by projecting the reconstructed point cloud, so using the KinectFusion method improves the accuracy of the first sample data. Fig. 3 shows first sample data of the back of the human head at 4 different angles, and the first three-dimensional point cloud corresponding to each first sample image accurately describes the pose of each back-facing human head.
Fig. 4 illustrates a schematic diagram of another first sample data rendering effect according to an embodiment of the present disclosure. As shown in fig. 4, the first sample data may also include: a first sample image having three color channel information and a corresponding second three-dimensional point cloud.
The first sample image may also be a back and/or front image of a human head in a natural scene captured by an image capture device (e.g., a camera, a mobile phone, etc.). Human head point cloud data serving as a reference standard may be preset; then the key point detection method described above may be used to generate annotation information containing a plurality of (for example, 10) key points in the head region of the first sample image. The preset head point cloud data is projected onto the first sample image, the projection points of the key points of the preset human head point cloud data in the first sample image are fitted to the key points of the first sample image, and the head pose information of each first sample image is thereby estimated, as sketched below. Based on the pose information of the first sample images, each first sample image and the corresponding head point cloud (second three-dimensional point cloud) carrying the pose information are combined in pairs to obtain the first sample data. Fig. 4 shows first sample data of the human head at 4 different angles, and the second three-dimensional point cloud corresponding to each first sample image accurately describes the pose of each back and/or front human head.
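The text does not name the fitting algorithm; one common way to estimate a head pose from matched 3D key points of the preset point cloud and their annotated 2D positions is a PnP solve, sketched below under the assumptions of a known camera matrix and no lens distortion (function and variable names are illustrative):

```python
import cv2
import numpy as np

def estimate_head_pose(preset_keypoints_3d, annotated_keypoints_2d, camera_matrix):
    # preset_keypoints_3d:    (N, 3) key points picked from the preset head point cloud
    # annotated_keypoints_2d: (N, 2) annotated key points in the first sample image
    ok, rvec, tvec = cv2.solvePnP(
        preset_keypoints_3d.astype(np.float32),
        annotated_keypoints_2d.astype(np.float32),
        camera_matrix.astype(np.float32),
        None,  # assume no lens distortion
    )
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
    return ok, rotation, tvec          # head pose in camera coordinates
```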
In one possible implementation, the second sample data includes: a second sample image of the back of the target object acquired by a camera array with calibration parameters, and a three-dimensional point cloud corresponding to the second sample image.
The calibration parameters can be used to determine the relationship between the three-dimensional geometric position of a certain point on the surface of the target object in the three-dimensional space and the corresponding point in the image captured by the capturing device (e.g., camera array), i.e., the mapping from the world coordinates to the pixel coordinates.
The calibration parameters may include intrinsic parameters and extrinsic parameters. The extrinsic parameters are used to determine the relation between the geometric position of a point on the surface of the target object in real-world three-dimensional space and its corresponding point in the coordinates of the shooting device (e.g., the camera array), i.e., the mapping from world coordinates to the coordinates of the shooting device, and may include the camera position, the camera rotation angle, etc. The intrinsic parameters are used to determine how a point on the surface of the target object in the coordinates of the shooting device passes through the lens of the shooting device and becomes a pixel in the image through pinhole imaging and electronic transformation, i.e., the mapping from the coordinates of the shooting device to the image coordinates, and may include parameters related to the characteristics of the shooting device itself, such as focal length and pixel size.
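A minimal sketch of what these calibration parameters describe, assuming a simple pinhole model with known extrinsics (R, t) and intrinsics (fx, fy, cx, cy); the helper name and parameterization are assumptions, not from the patent:

```python
import numpy as np

def project_point(p_world, R, t, fx, fy, cx, cy):
    # Extrinsic step: world coordinates -> shooting-device (camera) coordinates.
    p_cam = R @ p_world + t
    # Intrinsic step: pinhole projection from camera coordinates to pixel coordinates.
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    return np.array([u, v])
```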
Fig. 5 illustrates a schematic diagram of a second sample data rendering effect according to an embodiment of the present disclosure. As shown in fig. 5, the second sample data may include: and a second sample image and a corresponding three-dimensional point cloud which are acquired by the camera array with calibration parameters in an experimental scene.
The second sample image can be multiple frames of images of the back of the head acquired in an experimental scene by a camera array with calibration parameters. A three-dimensional point cloud generated from a front image of the human head can be used as a preset three-dimensional point cloud; the relative pose between the preset three-dimensional point cloud and each second sample image is obtained from the calibration parameters, and the second sample image and the three-dimensional point cloud carrying the pose information are combined in pairs to obtain the second sample data.
Fig. 6 illustrates a schematic diagram of a third sample data rendering effect according to an embodiment of the present disclosure. As shown in fig. 6, the third sample data may include: and the front surface of the face collected in a natural scene can be provided with a third sample image with various expressions and a corresponding three-dimensional point cloud.
The third sample image may also be a frontal image of a human head in a natural scene captured by an image capturing device (e.g., a camera, a mobile phone, etc.), which may include multiple angles (frontal face, side face, etc.) and multiple expressions (neutral, mouth open, eyes closed, etc.), and may cover a variety of scenes in real life. The key points of each frame of the third sample image can be determined using the key point detection method described above. The shape and expression coefficients of the third sample image are fitted under the supervision of the key points using Principal Component Analysis (PCA), the three-dimensional point cloud corresponding to each third sample image is acquired, and the third sample image and the corresponding three-dimensional point cloud carrying the pose information are combined in pairs to obtain the third sample data.
After the training data set is determined, the three-dimensional reconstruction network may be trained from the training data set. The training data set may include at least two of first sample data, second sample data, and third sample data, and at least two of the first sample data, the second sample data, and the third sample data may be input into the three-dimensional reconstruction network according to a preset ratio to train the three-dimensional reconstruction network.
Considering that the trained three-dimensional reconstruction network performs best when the first sample data, the second sample data and the third sample data are all input into the three-dimensional reconstruction network for training, the following explains the training process of the three-dimensional reconstruction network by taking the case where all three kinds of sample data are input into the three-dimensional reconstruction network as an example.
In one possible implementation, the training process of the three-dimensional reconstruction network includes: and inputting the first sample data, the second sample data and the third sample data into the three-dimensional reconstruction network according to a preset proportion, and training the three-dimensional reconstruction network.
For example, for the first sample data, the second sample data, and the third sample data included in the training data set, different quantities of the first sample data, the second sample data, and the third sample data may be selected according to a preset ratio (e.g., 0.4:0.6:1), and the selected first sample data, second sample data, and third sample data are input into the three-dimensional reconstruction network to train the three-dimensional reconstruction network.
When the proportion is different, the emphasis on the three-dimensional reconstruction network training is also different. When the proportion of the first sample data is large, the anti-interference capability of the trained three-dimensional reconstruction network to other background parts outside the human head in the image to be processed is stronger. When the proportion of the second sample data is large, the trained three-dimensional reconstruction network has strong capability of acquiring the three-dimensional point cloud corresponding to the human head at the back or large angle (for example, the angle between the face orientation and the shooting direction is between 90 and 270 degrees) in the image to be processed, and can improve the accuracy of the three-dimensional point cloud pose corresponding to the human head at the back (for example, the back of the head image); when the proportion of the third sample data is large, the three-dimensional reconstruction network obtained through training has strong capability of acquiring three-dimensional point clouds corresponding to the front face (such as the angle between the face orientation and the shooting direction is 0-90 degrees and 270-360 degrees) in the image to be processed, and can improve the accuracy of the shape, the expression and the pose of the three-dimensional point cloud corresponding to the front face (such as the face image).
In order to improve the training efficiency, in the training process of the three-dimensional reconstruction network, a plurality of sample data can be input into the three-dimensional reconstruction network in batches in a batch (batch) data mode, and the first sample data, the second sample data and the third sample data can be combined according to a preset proportion (for example, 0.4:0.6:1) in the same batch (batch), so that the three-dimensional reconstruction network can simultaneously obtain the information of the front image and the back image of the human head.
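As an illustration only (the dataset objects, batch size and helper name below are assumptions, not from the patent), one way to assemble such a mixed batch at the preset ratio could be:

```python
import random

def make_batch(first_data, second_data, third_data, batch_size=64, ratio=(0.4, 0.6, 1.0)):
    # Split the batch according to the preset ratio; the third (frontal, expressive)
    # sample data receives the remaining, i.e. largest, share.
    total = sum(ratio)
    n1 = round(batch_size * ratio[0] / total)
    n2 = round(batch_size * ratio[1] / total)
    n3 = batch_size - n1 - n2
    batch = (random.sample(first_data, n1)
             + random.sample(second_data, n2)
             + random.sample(third_data, n3))
    random.shuffle(batch)  # mix front and back samples within the same batch
    return batch
```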
Further, in order to enable the three-dimensional reconstruction network to learn the expressions of the front face (for example, smiling, mouth opening and eye closing) in all orientations, the proportion of the third sample data (the front-face expression data) in the input data of each batch may be increased; for example, the proportion of the third sample data may be the largest in the preset proportion.
By the above method, the accuracy of the shape, expression and pose of the obtained three-dimensional point cloud corresponding to the front face and the accuracy of the pose of the three-dimensional point cloud corresponding to the back of the head are both improved.
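For illustration only, the following is a minimal sketch of how one training batch might be assembled so that the three kinds of sample data follow a preset proportion such as 0.4:0.6:1. The present disclosure does not specify an implementation; the dataset variables, the batch size and the use of Python's random module are assumptions made for this example.

import random

def compose_batch(first_data, second_data, third_data, ratio=(0.4, 0.6, 1.0), batch_size=48):
    """Draw samples from the three sample pools so that their counts follow `ratio`."""
    total = sum(ratio)
    counts = [int(batch_size * r / total) for r in ratio]
    counts[2] = batch_size - counts[0] - counts[1]  # give the remainder to the third (front, expression) data
    batch = (
        random.sample(first_data, counts[0])     # front/back samples from a natural scene
        + random.sample(second_data, counts[1])  # back-of-head samples from an experimental scene
        + random.sample(third_data, counts[2])   # front samples with expression
    )
    random.shuffle(batch)  # mix the three sample types within the same batch
    return batch

In this sketch the third sample data receives the largest share of each batch, consistent with the emphasis on front-face expressions described above.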
In one possible implementation, the three-dimensional reconstruction network includes an encoding network including a depth-level separable convolutional layer and a decoding network including a fully-connected layer.
Fig. 7 shows a schematic diagram of a three-dimensional reconstruction network structure according to an embodiment of the present disclosure. As shown in fig. 7, the three-dimensional reconstruction network includes an encoding network (encode) and a decoding network (decode).
In order to improve the data processing efficiency of the three-dimensional reconstruction network, the encoding network may adopt the lightweight network MobileNet introduced above and may include depth-level separable convolution layers, which encode the input two-dimensional image data into an R-dimensional latent variable M_R (for example, R = 256), which may be expressed as M_R = [m_1, m_2, …, m_R].
The decoding network may include three fully-connected layers, and is used for decoding the R-dimensional latent variable M_R output by the encoding network into K-dimensional xyz coordinates (for example, K = 3060), which are output as the three-dimensional point cloud Cloud_K of the three-dimensional reconstruction network. For example, Cloud_K may be expressed as three matrices: [x_1, x_2, …, x_k], [y_1, y_2, …, y_k] and [z_1, z_2, …, z_k]. That is, the coordinate value (x_i, y_i, z_i), i ∈ [1, K], contained in the predicted human-head three-dimensional point cloud and corresponding to any one of the K different human head parts is split into its coordinate values x_i, y_i and z_i on the x-axis, y-axis and z-axis, which are output as the i-th elements of the three different matrices; the coordinate values (x_i, y_i, z_i) in the output three-dimensional point cloud have a preset arrangement order.
The first fully-connected layer may output the x-axis coordinate values [x_1, x_2, …, x_k] of the three-dimensional point cloud in a preset coordinate system, the second fully-connected layer may output the y-axis coordinate values [y_1, y_2, …, y_k], and the third fully-connected layer may output the z-axis coordinate values [z_1, z_2, …, z_k].
By the method, the three-dimensional reconstruction network can be determined, and the three-dimensional reconstruction network adopting the coding and decoding structure is simpler and more efficient.
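For illustration only, the following is a simplified sketch of the encoder-decoder structure described above, assuming a PyTorch implementation (which the present disclosure does not specify): a stack of depth-level separable convolution layers standing in for the MobileNet-style encoding network that produces an R-dimensional latent vector, followed by three fully-connected layers that output the x-, y- and z-coordinate vectors of the point cloud. The channel counts are illustrative, and num_points = 1020 reflects the assumption that K = 3060 coordinate values correspond to 1020 points with 3 axes each.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ReconstructionNet(nn.Module):
    def __init__(self, latent_dim=256, num_points=1020):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            DepthwiseSeparableConv(32, 64, stride=2),
            DepthwiseSeparableConv(64, 128, stride=2),
            DepthwiseSeparableConv(128, 256, stride=2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(256, latent_dim),          # R-dimensional latent variable M_R
        )
        # three fully-connected heads: one each for the x, y and z coordinate vectors
        self.fc_x = nn.Linear(latent_dim, num_points)
        self.fc_y = nn.Linear(latent_dim, num_points)
        self.fc_z = nn.Linear(latent_dim, num_points)

    def forward(self, image):
        m = self.encoder(image)
        x, y, z = self.fc_x(m), self.fc_y(m), self.fc_z(m)
        return torch.stack([x, y, z], dim=-1)    # (batch, K points, 3) point cloud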
After the sample image is input into the three-dimensional reconstruction network, the three-dimensional reconstruction network can perform feature learning on the sample image and output three-dimensional point cloud for predicting the sample image; the predicted three-dimensional point cloud output by the three-dimensional reconstruction network can be compared with the three-dimensional point cloud corresponding to the sample image in the sample data to determine the loss of the three-dimensional reconstruction network, the loss can be used for measuring the performance (such as accuracy) of the three-dimensional reconstruction network in generating the three-dimensional point cloud, and the network parameters of the three-dimensional reconstruction network can be adjusted based on the loss, so that the performance of the three-dimensional reconstruction network is further optimized.
In one possible implementation, the loss function used in the training process of the three-dimensional reconstruction network includes at least one of the following: a first loss function indicative of a predicted three-dimensional point cloud reconstruction error of the three-dimensional reconstruction network output, a second loss function indicative of the predicted three-dimensional point cloud smoothness of the three-dimensional reconstruction network output, a third loss function indicative of the predicted three-dimensional point cloud stability of the three-dimensional reconstruction network output, a fourth loss function indicative of the predicted three-dimensional point cloud expression accuracy of the three-dimensional reconstruction network output;
wherein the first loss function is determined based on the predicted three-dimensional point cloud and the three-dimensional point cloud in the training dataset; the second loss function is determined based on a distance between each point in the predicted three-dimensional point cloud to a first-order neighbor point and a distance between each point of a corresponding sequence of points in the three-dimensional point cloud in the training dataset to a first-order neighbor point; the third loss function is determined based on the predicted three-dimensional point cloud of the two-dimensional sample image and the predicted three-dimensional point cloud of the two-dimensional sample image after the perturbation; the fourth loss function is determined based on a projection of the predicted three-dimensional point cloud and the two-dimensional sample image in the training dataset.
For example, the first sample image, the second sample image, and the third sample image may be input to a three-dimensional reconstruction network, and a predicted three-dimensional point cloud of the first sample image, a predicted three-dimensional point cloud of the second sample image, and a predicted three-dimensional point cloud of the third sample image may be obtained.
And training the three-dimensional reconstruction network by utilizing the three-dimensional point cloud corresponding to the first sample image in the first sample data, the predicted three-dimensional point cloud of the first sample image, the three-dimensional point cloud corresponding to the second sample image in the second sample data, the predicted three-dimensional point cloud of the second sample image, the three-dimensional point cloud corresponding to the third sample image in the third sample data and the predicted three-dimensional point cloud of the third sample image until a training end condition is met, and obtaining the trained three-dimensional reconstruction network. The training end condition may be network parameter convergence of the three-dimensional reconstruction network, or may be that the training frequency of the three-dimensional reconstruction network reaches a preset threshold, which is not limited by the present disclosure.
The loss of the three-dimensional reconstruction network may be determined based on the difference between the predicted three-dimensional point cloud and the three-dimensional point cloud of the sample image. The three-dimensional reconstruction network is trained using this loss, with the training direction being the direction in which the loss decreases, so that the predicted three-dimensional point cloud obtained by the three-dimensional reconstruction network when processing the sample image is sufficiently close to the three-dimensional point cloud of a real human head.
At least one loss function, such as a first loss function through a fourth loss function, may be set, and the loss of the three-dimensional reconstruction network may be determined based on one loss function, or a weighted sum of at least two loss functions.
A first loss function is determined based on the predicted three-dimensional point cloud and the three-dimensional point cloud in the training dataset, the first loss function being operable to indicate a three-dimensional point cloud error of the three-dimensional point cloud output by the three-dimensional reconstruction network corresponding to the two-dimensional sample image. For example, the first loss function may be an L2 norm loss function, also known as a Mean Square Error (MSE) loss function, which may be expressed as:
MSE = (1/n) · Σ_{i=1}^{n} (y_i − f(x_i))²   (1)
wherein x_i represents the i-th sample image, which may be any sample image in the training data set, for example a first sample image in the first sample data, a second sample image in the second sample data, or a third sample image in the third sample data; y_i represents the ground-truth value of the i-th sample image, i.e. the three-dimensional point cloud corresponding to the sample image in the sample data, which may be the three-dimensional point cloud corresponding to the first sample image in the first sample data, the three-dimensional point cloud corresponding to the second sample image in the second sample data, or the three-dimensional point cloud corresponding to the third sample image in the third sample data; f(x_i) represents the predicted three-dimensional point cloud output by the three-dimensional reconstruction network; n is the number of samples used in the training process of the three-dimensional reconstruction network; and MSE represents the loss of the first loss function.
Through the first loss function, the regression loss of the three-dimensional point cloud can be monitored, and the accuracy of outputting the three-dimensional point cloud by the three-dimensional reconstruction network is improved.
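As an illustration of formula (1), the first loss term could be computed as follows, assuming the predicted and labelled point clouds are stored as (batch, K, 3) tensors; this is a minimal sketch, not an implementation given in the disclosure.

import torch

def point_cloud_mse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # mean over samples, points and axes of the squared coordinate error
    return ((gt - pred) ** 2).mean()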
A second loss function is determined based on the distance from each point in the predicted three-dimensional point cloud to its first-order neighbor points and the distance from each point of the corresponding point sequence in the three-dimensional point cloud in the training data set to its first-order neighbor points, and is used for indicating the smoothness of the three-dimensional point cloud output by the three-dimensional reconstruction network. For example, the second loss function may be a Laplacian (Laplace) loss function: according to the neighbor relationship, the difference is calculated between the distance from each point in the predicted three-dimensional point cloud output by the three-dimensional reconstruction network to its first-order neighbor points and the distance from the point of the corresponding point sequence in the three-dimensional point cloud in the sample data (i.e., the gold standard) to its first-order neighbor points. The smaller this distance difference, the smoother the three-dimensional point cloud output by the three-dimensional reconstruction network.
For example, assume that the left three-dimensional model in fig. 2 is a predicted three-dimensional point cloud output by the three-dimensional reconstruction network, where the point at the square represents the point P_i with point sequence number i in the predicted three-dimensional point cloud, and the points at the triangles represent the first-order neighbor points P_{i+1}, P_{i+2}, P_{i+3}, P_{i-1}, P_{i-2}, P_{i-3} of P_i. The distance L1 from point P_i to its first-order neighbor points may be the mean of the distances from P_i to each of its first-order neighbor points, i.e.: L1 = (|P_{i+1} − P_i| + |P_{i+2} − P_i| + |P_{i+3} − P_i| + |P_{i-1} − P_i| + |P_{i-2} − P_i| + |P_{i-3} − P_i|) / 6. Correspondingly, the three-dimensional point cloud in the sample data has a point P'_i with the same point sequence number i, whose first-order neighbor points are P'_{i+1}, P'_{i+2}, P'_{i+3}, P'_{i-1}, P'_{i-2}, P'_{i-3}; the distance L2 from P'_i to its first-order neighbor points is computed in the same way, i.e.: L2 = (|P'_{i+1} − P'_i| + |P'_{i+2} − P'_i| + |P'_{i+3} − P'_i| + |P'_{i-1} − P'_i| + |P'_{i-2} − P'_i| + |P'_{i-3} − P'_i|) / 6. The distance difference between L1 and L2 is then calculated.
By the second loss function, the smoothness of the three-dimensional point cloud regression can be improved.
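A sketch of how the Laplace smoothness term might be computed is given below. The (K, 6) neighbor index table and the tensor shapes are assumptions made for illustration; they rely on the preset point ordering of the output cloud so that corresponding point sequences in the predicted and labelled clouds can be compared.

import torch

def laplace_loss(pred, gt, neighbors):
    # pred, gt: (batch, K, 3) point clouds; neighbors: (K, 6) long tensor of first-order neighbor indices
    def mean_neighbor_distance(cloud):
        nb = cloud[:, neighbors, :]                     # (batch, K, 6, 3) neighbor coordinates
        center = cloud.unsqueeze(2)                     # (batch, K, 1, 3)
        return (nb - center).norm(dim=-1).mean(dim=-1)  # (batch, K) mean distance L to neighbors
    # difference between L1 (predicted cloud) and L2 (gold-standard cloud), averaged over all points
    return (mean_neighbor_distance(pred) - mean_neighbor_distance(gt)).abs().mean()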
The third loss function is determined based on the predicted three-dimensional point cloud of the two-dimensional sample image and the predicted three-dimensional point cloud of the perturbed two-dimensional sample image, and is used for indicating the stability of the three-dimensional point cloud output by the three-dimensional reconstruction network. It can be expressed as:
Loss3 = (1/n) · Σ_{i=1}^{n} ‖f(x_i + Δ) − f(x_i) − Δ‖²   (2)
In formula (2), x_i represents the i-th sample image, which may be a first sample image in the first sample data, a second sample image in the second sample data, or a third sample image in the third sample data; Δ represents the perturbation in the UV direction (the horizontal and vertical directions of the image); f(x_i) represents the predicted three-dimensional point cloud of the corresponding sample image output by the three-dimensional reconstruction network; f(x_i + Δ) represents the predicted three-dimensional point cloud, output by the three-dimensional reconstruction network, of the corresponding UV-perturbed sample image; n represents the number of samples used in the training process of the three-dimensional reconstruction network; and Loss3 represents the loss of the third loss function.
In this way, by perturbing the sample image in the UV direction and recording the magnitude of the perturbation, the deviation of the three-dimensional point cloud predicted by the three-dimensional reconstruction network is kept consistent with the applied perturbation; the smaller the value Loss3 of the third loss function, the higher the stability of the three-dimensional point cloud output by the three-dimensional reconstruction network.
Through the third loss function, the stability of the three-dimensional point cloud predicted by the three-dimensional reconstruction network is improved, and meanwhile, the robustness of the three-dimensional reconstruction network to the input sample image is improved.
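The stability term of formula (2) might be implemented along the following lines. Representing the recorded UV perturbation as an expected displacement delta_xyz of the point cloud, and supplying the perturbed images separately, are assumptions made for this sketch rather than details of the disclosure.

import torch

def stability_loss(model, images, shifted_images, delta_xyz):
    # images / shifted_images: the same samples before and after the recorded UV perturbation
    # delta_xyz: (batch, 1, 3) displacement of the cloud implied by the recorded perturbation
    pred = model(images)                  # predicted cloud for the original image
    pred_shifted = model(shifted_images)  # predicted cloud for the UV-perturbed image
    # penalise any deviation of the prediction change from the applied perturbation
    return ((pred_shifted - pred - delta_xyz) ** 2).mean()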
A fourth loss function is determined based on the projection of the predicted three-dimensional point cloud and the two-dimensional sample image in the training dataset, the fourth loss function is used for indicating the expression accuracy of the three-dimensional point cloud output by the three-dimensional reconstruction network, and can be expressed as:
Loss4 = (1/n) · Σ_{i=1}^{n} ‖Project(f(x_i)) − x_i‖²   (3)
In formula (3), Project represents a projection function, which may be used to reduce the dimensionality of the target object by projecting the target object in three-dimensional space onto a two-dimensional plane; specific projection modes may include orthogonal projection, perspective projection, and the like. x_i represents the i-th sample image, which may be a first sample image in the first sample data, a second sample image in the second sample data, or a third sample image in the third sample data; f(x_i) represents the predicted three-dimensional point cloud of the corresponding sample image output by the three-dimensional reconstruction network; n represents the number of samples used in the training process of the three-dimensional reconstruction network; and Loss4 represents the loss of the fourth loss function.
The predicted three-dimensional point cloud output by the three-dimensional reconstruction network is projected onto the sample image through the projection function, and the deviation of the projection from the sample image, i.e., the loss Loss4 of the fourth loss function, is compared. In order to improve efficiency and reduce the amount of computation, only the key points of the predicted three-dimensional point cloud may be projected onto the sample image through the projection function, and the loss function may be set to supervise the projected points against the key points in the sample image. The key points may be key points characterizing human head/face features, for example a plurality of key points characterizing the five sense organs, the zygomatic bones and the brow bones; the number of key points is not limited by the present disclosure.
The fourth loss function helps enhance the three-dimensional reconstruction network's ability to learn facial expressions from the image, covering expressions such as opening and closing the eyes and opening and closing the mouth.
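A sketch of the key-point projection supervision is shown below, using orthographic projection for simplicity. The landmark index list kp_idx and the two-dimensional ground-truth landmarks gt_kp2d are assumptions about how the supervision data is organised, not details given in the disclosure.

import torch

def orthographic_project(points_3d):
    # drop the depth axis: (batch, M, 3) -> (batch, M, 2)
    return points_3d[..., :2]

def projection_loss(pred_cloud, kp_idx, gt_kp2d):
    # pred_cloud: (batch, K, 3); kp_idx: (M,) indices of landmark points within the cloud
    # gt_kp2d: (batch, M, 2) labelled key points in the sample image
    proj = orthographic_project(pred_cloud[:, kp_idx, :])
    return ((proj - gt_kp2d) ** 2).mean()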
After the three-dimensional reconstruction network is trained, the three-dimensional point cloud prediction can be performed on the two-dimensional image to be processed by using the trained three-dimensional reconstruction network.
In order to evaluate the quality of the trained three-dimensional reconstruction network, the quality of the predicted three-dimensional point cloud output by the trained three-dimensional reconstruction network can be measured through an evaluation function.
In one possible implementation, the accuracy of the three-dimensional reconstruction network is evaluated based on an evaluation function; the evaluation function comprises at least one of a first evaluation function indicating the expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network and a second evaluation function indicating the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network.
Wherein the first merit function is determined based on the projection of the predicted three-dimensional point cloud and the two-dimensional sample image; the second evaluation function is determined based on predicted three-dimensional point clouds respectively corresponding to at least two adjacent frames of the static video of the target object.
For example, with reference to the training data set above, a test data set may be set up for verifying the performance of the final trained three-dimensional reconstruction network. In order to improve the evaluation effect, the test data set may also include first test data of the front and back of the target object in a natural scene, second test data of the back of the target object in an experimental scene, and third test data of the front and back of the target object with an expression in a natural scene. Each test data comprises a two-dimensional test image of the target object and a three-dimensional point cloud corresponding to the two-dimensional test image. The specific manner of obtaining the test data set may refer to the training data set above, and this disclosure is not repeated here.
During evaluation, a two-dimensional test image in the test data set is input into the three-dimensional reconstruction network to obtain the three-dimensional point cloud predicted from the test image; the predicted three-dimensional point cloud is then compared with the three-dimensional point cloud corresponding to the test image in the test data, and the distance (error) between the two is taken as the evaluation result, with the three-dimensional point cloud in the test data set serving as the evaluation standard. The closer the predicted three-dimensional point cloud is to the three-dimensional point cloud in the test data set serving as the evaluation standard, the better the reconstruction effect of the three-dimensional reconstruction network and the more accurate the predicted three-dimensional point cloud that can be obtained.
The evaluation function may include a first evaluation function indicating the expression accuracy of the three-dimensional point cloud output by the three-dimensional reconstruction network. For example, the three-dimensional key points in the predicted three-dimensional point cloud may be projected onto a two-dimensional plane by perspective projection, and the quality of the three-dimensional reconstruction network may be measured by calculating the distance between the projected two-dimensional key points and the standard key points in the test image; the smaller the value output by the first evaluation function, the higher the prediction accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network. The function indicating the expression accuracy of the three-dimensional point cloud output by the three-dimensional reconstruction network may take the form of the fourth loss function given in formula (3), which is not repeated here.
The accuracy of the trained three-dimensional reconstruction network can be effectively evaluated through the first evaluation function.
The merit function may also include a second merit function indicating the stability of the three-dimensional point cloud output by the three-dimensional reconstruction network, which may be expressed as:
error = mean( (1/n) · Σ_{j=1}^{n} ‖pre_{i+1,j} − pre_{i,j}‖ / norm_distance )   (4)
In formula (4), the test data may be a captured static video, for example a video in which the back of the head faces the camera lens and the person remains still, and may include a plurality of consecutive test images. pre_{i+1} and pre_i represent the n key points regressed by the three-dimensional reconstruction network from the (i+1)-th frame and the i-th frame respectively; for example, n = 106 may correspond to 106 key points characterizing the five sense organs, the zygomatic bones and the brow bones, and the value of n is not limited in the present disclosure. norm_distance is the distance from the mouth to the eyes in the predicted three-dimensional point cloud (the regression result of the three-dimensional reconstruction network) corresponding to the first frame image in the video, and mean represents the average of the evaluation results over the plurality of sample images. The smaller the value error output by the second evaluation function, the more stable the three-dimensional point cloud predicted by the three-dimensional reconstruction network.
Through the second evaluation function, the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network for each frame of a video is measured by comparing the difference between the three-dimensional point clouds regressed from at least two adjacent frames (such as the previous and following frames), so the stability of the three-dimensional reconstruction network can be effectively evaluated. For example, for a video in which the back of the head faces the camera lens and remains still, the difference between the regression results of the previous and following frames can be compared through the second evaluation function, accurately measuring the jitter of the back-of-head reconstruction.
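A sketch of the second evaluation function of formula (4) is given below, assuming the key points regressed from T consecutive frames are stacked into one tensor and that a helper mouth_eye_distance returning the mouth-to-eye distance of the first frame is available; both of these are assumptions made for illustration.

import torch

def stability_error(frame_keypoints, mouth_eye_distance):
    # frame_keypoints: (T, n, 3) key points regressed from T consecutive frames (e.g. n = 106)
    norm_distance = mouth_eye_distance(frame_keypoints[0])            # scalar from the first frame
    diffs = (frame_keypoints[1:] - frame_keypoints[:-1]).norm(dim=-1)  # (T-1, n) frame-to-frame movement
    return (diffs / norm_distance).mean()                              # smaller value means more stable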
Fig. 8 shows a schematic diagram of test results of a three-dimensional reconstruction network according to an embodiment of the present disclosure. As shown in fig. 8, the two graphs on the left show the test effect of the three-dimensional reconstruction network when the input test image is a front face image; the two graphs on the right show the test effect when the input test image is a back-of-head image. The key points in fig. 8 are the projection points of the predicted key points in the three-dimensional point cloud onto the two-dimensional image, and the projection points can describe features such as the five sense organs, the zygomatic bones and the brow bones.
In the related art, the regressed three-dimensional model (three-dimensional point cloud) focuses on the front face, emphasizing the accuracy of the frontal face pose, face shape and expression, while the accuracy of the face pose and face shape at large angles or even from the back is neglected. In contrast, the embodiments of the present disclosure can obtain the three-dimensional point cloud of the head at large angles and from the back, yielding a complete three-dimensional point cloud of the head, so that normal shape and expression changes can be expressed for front images while the pose for back images remains correct.
A user may input a head image from any angle, and the three-dimensional reconstruction network of the embodiments of the present disclosure can obtain a corresponding three-dimensional point cloud with accurate pose information, a closely fitting shape and an accurate expression (for front views). In practical applications, the method is simple to implement, low in cost and fast to run; when a user wants to locate the head at an arbitrary angle and add decorative special effects such as hair clips, the special effect is not lost even when the head faces away from the lens.
According to the embodiment of the disclosure, the three-dimensional reconstruction network obtained by training through the training data set comprising the sample data of the front and back of the target object can accurately obtain the three-dimensional point cloud carrying the corresponding pose information for the input image of the target object in all directions. For the front image of the shooting equipment with the face of the target object facing to, the accuracy of pose information, the shape and the expression of the face of the target object (three-dimensional point cloud output by a three-dimensional reconstruction network) after three-dimensional reconstruction is improved; for the back image of the shooting equipment with the face of the target object facing back, the accuracy of the pose information of the target object after three-dimensional reconstruction is improved.
It is understood that the above-mentioned embodiments of the method of the present disclosure can be combined with each other to form a combined embodiment without departing from the principle logic, which is limited by the space, and the detailed description of the present disclosure is omitted. Those skilled in the art will appreciate that in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their function and possibly their inherent logic.
In addition, the present disclosure also provides a three-dimensional reconstruction apparatus, an electronic device, a computer-readable storage medium, and a program, which can be used to implement any one of the three-dimensional reconstruction methods provided by the present disclosure, and the corresponding technical solutions and descriptions and corresponding descriptions in the methods section are omitted for brevity.
Fig. 9 shows a block diagram of a three-dimensional reconstruction apparatus according to an embodiment of the present disclosure, as shown in fig. 9, the apparatus including:
an obtaining module 91, configured to obtain a two-dimensional image to be processed;
the three-dimensional reconstruction module 92 is configured to input the to-be-processed two-dimensional image into a three-dimensional reconstruction network for processing, so as to obtain a three-dimensional reconstruction result of the to-be-processed two-dimensional image, where the three-dimensional reconstruction result includes a three-dimensional point cloud carrying pose information corresponding to the to-be-processed two-dimensional image;
the three-dimensional reconstruction network is obtained through training of a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side represents that the face of the target object faces the shooting equipment, and the back side represents that the face of the target object faces away from the shooting equipment.
In a possible implementation manner, the sample data includes at least two of first sample data on the front and back of the target object, second sample data on the back of the target object, and third sample data on the front of the target object and with an expression, and at least part of shooting scenes of images in the first sample data, the second sample data, and the third sample data are different.
In one possible implementation, the training process of the three-dimensional reconstruction network includes: and inputting the first sample data, the second sample data and the third sample data into the three-dimensional reconstruction network according to a preset proportion, and training the three-dimensional reconstruction network.
In one possible implementation, the proportion of the third sample data is the largest in the preset proportion.
In one possible implementation, the first sample data includes: a first sample image and a first three-dimensional point cloud acquired by a depth camera, and/or a first sample image and a second three-dimensional point cloud acquired by a monocular camera; the first three-dimensional point cloud is generated according to the first sample image acquired by the depth camera; the first sample image acquired through the monocular camera comprises three color channel information, and the second three-dimensional point cloud is generated based on the labeling information of the key points in the first sample image acquired through the monocular camera.
In one possible implementation, the second sample data includes: a second sample image of the back of the target object acquired by a camera array with calibrated parameters, and a three-dimensional point cloud corresponding to the second sample image.
In one possible implementation, the loss function used in the training process of the three-dimensional reconstruction network includes at least one of the following: a first loss function indicative of a predicted three-dimensional point cloud reconstruction error of the three-dimensional reconstruction network output, a second loss function indicative of the predicted three-dimensional point cloud smoothness of the three-dimensional reconstruction network output, a third loss function indicative of the predicted three-dimensional point cloud stability of the three-dimensional reconstruction network output, a fourth loss function indicative of the predicted three-dimensional point cloud expression accuracy of the three-dimensional reconstruction network output;
wherein the first loss function is determined based on the predicted three-dimensional point cloud and the three-dimensional point cloud in the training dataset; the second loss function is determined based on a distance between each point in the predicted three-dimensional point cloud to a first-order neighbor point and a distance between each point of a corresponding sequence of points in the three-dimensional point cloud in the training dataset to a first-order neighbor point; the third loss function is determined based on the predicted three-dimensional point cloud of the two-dimensional sample image and the predicted three-dimensional point cloud of the two-dimensional sample image after the perturbation; the fourth loss function is determined based on a projection of the predicted three-dimensional point cloud and the two-dimensional sample image in the training dataset.
In one possible implementation, the three-dimensional reconstruction network includes an encoding network including a depth-level separable convolutional layer and a decoding network including a fully-connected layer.
In a possible implementation manner, the apparatus further includes an evaluation module, configured to evaluate an accuracy of the three-dimensional reconstruction network based on an evaluation function; the evaluation function comprises at least one of a first evaluation function indicating the expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network and a second evaluation function indicating the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network;
wherein the first merit function is determined based on the projection of the predicted three-dimensional point cloud and the two-dimensional sample image; the second evaluation function is determined based on predicted three-dimensional point clouds respectively corresponding to at least two adjacent frames of the static video of the target object.
In a possible implementation manner, the two-dimensional image to be processed includes a human head image, the three-dimensional reconstruction network is configured to reconstruct the two-dimensional image including the human head image to obtain a three-dimensional reconstruction result of the human head, and the three-dimensional point cloud in the three-dimensional reconstruction result includes a human head point cloud carrying pose information and/or expression information.
The method has a specific technical association with the internal structure of the computer system and can solve the technical problem of how to improve hardware computing efficiency or execution effect (including reducing the amount of data storage, reducing the amount of data transmission, increasing hardware processing speed, and the like), thereby obtaining the technical effect of improving the internal performance of the computer system in conformity with the laws of nature.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 10 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 10, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (Wi-Fi), a second generation mobile communication technology (2G), a third generation mobile communication technology (3G), a fourth generation mobile communication technology (4G), a long term evolution of universal mobile communication technology (LTE), a fifth generation mobile communication technology (5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
The disclosure relates to the field of augmented reality, and aims to detect or identify relevant features, states and attributes of a target object by means of various visual correlation algorithms by acquiring image information of the target object in a real environment, so as to obtain an AR effect combining virtual and reality matched with specific applications. For example, the target object may relate to a face, a limb, a gesture, an action, etc. associated with a human body, or a marker, a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application can not only relate to interactive scenes such as navigation, explanation, reconstruction, virtual effect superposition display and the like related to real scenes or articles, but also relate to special effect treatment related to people, such as interactive scenes such as makeup beautification, limb beautification, special effect display, virtual model display and the like. The detection or identification processing of the relevant characteristics, states and attributes of the target object can be realized through the convolutional neural network. The convolutional neural network is a network model obtained by performing model training based on a deep learning framework.
Fig. 11 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 11, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface-based operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A method of three-dimensional reconstruction, the method comprising:
acquiring a two-dimensional image to be processed;
inputting the two-dimensional image to be processed into a three-dimensional reconstruction network for processing to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, wherein the three-dimensional reconstruction result comprises a three-dimensional point cloud which is corresponding to the two-dimensional image to be processed and carries pose information;
the three-dimensional reconstruction network is obtained through training of a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side represents that the face of the target object faces a shooting device, and the back side represents that the face of the target object faces away from the shooting device.
2. The method of claim 1, wherein the sample data comprises at least two of first sample data of the front and back of the target object, second sample data of the back of the target object, and third sample data of the front of the target object with expression, and at least part of shooting scenes of images in the first sample data, the second sample data and the third sample data are different.
3. The method of claim 2, wherein the training process of the three-dimensional reconstruction network comprises:
and inputting the first sample data, the second sample data and the third sample data into the three-dimensional reconstruction network according to a preset proportion, and training the three-dimensional reconstruction network.
4. The method of claim 3, wherein the proportion of the third sample data is the largest in the preset proportion.
5. The method of claim 2, wherein the first sample data comprises: a first sample image and a first three-dimensional point cloud acquired by a depth camera, and/or a first sample image and a second three-dimensional point cloud acquired by a monocular camera;
the first three-dimensional point cloud is generated according to the first sample image acquired by the depth camera;
the first sample image acquired through the monocular camera comprises three color channel information, and the second three-dimensional point cloud is generated based on the labeling information of the key points in the first sample image acquired through the monocular camera.
6. The method of claim 2, wherein the second sample data comprises: a second sample image of the back of the target object acquired by a camera array with calibrated parameters, and a three-dimensional point cloud corresponding to the second sample image.
7. The method according to any one of claims 1-6, wherein the loss function used in the three-dimensional reconstruction network training process comprises at least one of:
a first loss function indicating a reconstruction error of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network; a second loss function indicating a smoothness of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network; a third loss function indicating a stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network; and a fourth loss function indicating an expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network;
wherein the first loss function is determined based on the predicted three-dimensional point cloud and the three-dimensional point cloud in the training data set;
the second loss function is determined based on the distance from each point in the predicted three-dimensional point cloud to its first-order neighbor points and the distance from each correspondingly ordered point in the three-dimensional point cloud in the training data set to its first-order neighbor points;
the third loss function is determined based on the predicted three-dimensional point cloud of the two-dimensional sample image and the predicted three-dimensional point cloud of the perturbed two-dimensional sample image;
the fourth loss function is determined based on a projection of the predicted three-dimensional point cloud and the two-dimensional sample image in the training data set.
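The four loss terms of claim 7 can be sketched in PyTorch as below; the first-order neighbor indexing, the image perturbation, and the projection helper project_fn are simplifying assumptions, and the patent's exact weighting of the terms is not reproduced here.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_cloud, gt_cloud):
    # First loss: per-point error between predicted and ground-truth point clouds.
    return F.l1_loss(pred_cloud, gt_cloud)

def smoothness_loss(pred_cloud, gt_cloud, neighbor_idx):
    # Second loss: compare each point's distance to a first-order neighbor in the
    # prediction with the same distance in the correspondingly ordered ground truth.
    d_pred = (pred_cloud - pred_cloud[:, neighbor_idx]).norm(dim=-1)
    d_gt = (gt_cloud - gt_cloud[:, neighbor_idx]).norm(dim=-1)
    return F.l1_loss(d_pred, d_gt)

def stability_loss(cloud_original, cloud_perturbed):
    # Third loss: predictions for a sample image and its perturbed copy should agree.
    return F.mse_loss(cloud_original, cloud_perturbed)

def expression_loss(pred_cloud, keypoints_2d, project_fn):
    # Fourth loss: project the predicted point cloud into the image plane and
    # compare with 2D keypoints taken from the sample image.
    return F.l1_loss(project_fn(pred_cloud), keypoints_2d)
```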
8. The method of any one of claims 1-7, wherein the three-dimensional reconstruction network comprises an encoding network and a decoding network, the encoding network comprises depthwise separable convolutional layers, and the decoding network comprises fully connected layers.
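Claim 8's encoder/decoder split can be sketched as follows, assuming a MobileNet-style encoder in which each depthwise separable block is a 3x3 depthwise convolution (groups equal to the input channel count) followed by a 1x1 pointwise convolution, and a fully connected decoding network that regresses the point cloud and pose. All layer widths, the stride schedule and the point count are illustrative assumptions; depthwise separable convolutions are typically chosen here because they keep the encoder lightweight.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    # Depthwise 3x3 convolution (groups=in_ch) followed by a 1x1 pointwise convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class ReconNet(nn.Module):
    def __init__(self, num_points=1220):
        super().__init__()
        self.num_points = num_points
        self.encoder = nn.Sequential(                    # encoding network
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            depthwise_separable(32, 64, stride=2),
            depthwise_separable(64, 128, stride=2),
            depthwise_separable(128, 256, stride=2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Sequential(                    # fully connected decoding network
            nn.Linear(256, 512), nn.ReLU(inplace=True),
            nn.Linear(512, num_points * 3 + 6),          # point cloud + pose parameters
        )

    def forward(self, image):
        out = self.decoder(self.encoder(image))
        cloud = out[:, : self.num_points * 3].view(-1, self.num_points, 3)
        pose = out[:, self.num_points * 3:]
        return cloud, pose
```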
9. The method according to any one of claims 1-8, further comprising evaluating an accuracy of the three-dimensional reconstruction network based on an evaluation function;
the evaluation function comprises at least one of a first evaluation function indicating the expression accuracy of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network and a second evaluation function indicating the stability of the predicted three-dimensional point cloud output by the three-dimensional reconstruction network;
wherein the first evaluation function is determined based on the projection of the predicted three-dimensional point cloud and the two-dimensional sample image;
the second evaluation function is determined based on predicted three-dimensional point clouds respectively corresponding to at least two adjacent frames of a video of the target object in a static state.
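The two evaluation functions of claim 9 can be sketched as follows, assuming expression accuracy is measured as a mean 2D projection error against annotated keypoints and stability as the mean frame-to-frame drift of predictions over a video in which the target object stays still; both helpers and the use of mean distances are assumptions.

```python
import torch

def expression_accuracy(projected_keypoints, annotated_keypoints):
    # First evaluation function: mean pixel distance between the projected
    # predicted point cloud and the keypoints labelled on the sample image.
    return (projected_keypoints - annotated_keypoints).norm(dim=-1).mean()

def stability_score(clouds_per_frame):
    # Second evaluation function: mean drift between predictions for adjacent
    # frames; clouds_per_frame is a list of (N, 3) tensors from a static video.
    drifts = [
        (clouds_per_frame[i + 1] - clouds_per_frame[i]).norm(dim=-1).mean()
        for i in range(len(clouds_per_frame) - 1)
    ]
    return torch.stack(drifts).mean()
```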
10. The method according to any one of claims 1 to 9, wherein the two-dimensional image to be processed comprises a human head image, the three-dimensional reconstruction network is configured to reconstruct the two-dimensional image comprising the human head image to obtain a three-dimensional reconstruction result of the human head, and the three-dimensional point cloud in the three-dimensional reconstruction result comprises a human head point cloud carrying pose information and/or expression information.
11. A three-dimensional reconstruction apparatus, comprising:
the acquisition module is used for acquiring a two-dimensional image to be processed;
the three-dimensional reconstruction module is used for inputting the two-dimensional image to be processed into a three-dimensional reconstruction network for processing, to obtain a three-dimensional reconstruction result of the two-dimensional image to be processed, wherein the three-dimensional reconstruction result comprises a three-dimensional point cloud that corresponds to the two-dimensional image to be processed and carries pose information;
wherein the three-dimensional reconstruction network is obtained by training on a training data set, the training data set comprises sample data of the front side and the back side of a target object, the sample data comprises a two-dimensional sample image of the target object and a three-dimensional point cloud corresponding to the two-dimensional sample image, the front side indicates that the face of the target object faces a shooting device, and the back side indicates that the face of the target object faces away from the shooting device.
12. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 10.
13. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 10.
CN202210147676.8A 2022-02-17 2022-02-17 Three-dimensional reconstruction method and device, electronic device and storage medium Pending CN114445562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210147676.8A CN114445562A (en) 2022-02-17 2022-02-17 Three-dimensional reconstruction method and device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210147676.8A CN114445562A (en) 2022-02-17 2022-02-17 Three-dimensional reconstruction method and device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114445562A true CN114445562A (en) 2022-05-06

Family

ID=81373256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210147676.8A Pending CN114445562A (en) 2022-02-17 2022-02-17 Three-dimensional reconstruction method and device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114445562A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724379A (en) * 2021-07-08 2021-11-30 中国科学院空天信息创新研究院 Three-dimensional reconstruction method, device, equipment and storage medium
WO2023221163A1 (en) * 2022-05-16 2023-11-23 中国科学院深圳先进技术研究院 Animal behavior reconstruction system and method, and apparatus and storage medium
CN115018994A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Three-dimensional reconstruction method, system, equipment and storage medium based on two-dimensional image
CN115100360A (en) * 2022-07-28 2022-09-23 中国电信股份有限公司 Image generation method and device, storage medium and electronic equipment
CN115100360B (en) * 2022-07-28 2023-12-01 中国电信股份有限公司 Image generation method and device, storage medium and electronic equipment
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115775024B (en) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN118294472A (en) * 2023-11-24 2024-07-05 张江国家实验室 Uncertainty method and device for evaluating measurement
CN117807434A (en) * 2023-12-06 2024-04-02 中国信息通信研究院 Communication data set processing method and device

Similar Documents

Publication Publication Date Title
CN111783986B (en) Network training method and device, and gesture prediction method and device
CN114445562A (en) Three-dimensional reconstruction method and device, electronic device and storage medium
CN109977847B (en) Image generation method and device, electronic equipment and storage medium
CN109889724B (en) Image blurring method and device, electronic equipment and readable storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
US20110148868A1 (en) Apparatus and method for reconstructing three-dimensional face avatar through stereo vision and face detection
CN109840917B (en) Image processing method and device and network training method and device
CN112991381B (en) Image processing method and device, electronic equipment and storage medium
CN111401230B (en) Gesture estimation method and device, electronic equipment and storage medium
CN111062981A (en) Image processing method, device and storage medium
CN110706339B (en) Three-dimensional face reconstruction method and device, electronic equipment and storage medium
CN109859857A (en) Mask method, device and the computer readable storage medium of identity information
CN115690382A (en) Training method of deep learning model, and method and device for generating panorama
CN108701355A (en) GPU optimizes and the skin possibility predication based on single Gauss online
WO2023155532A1 (en) Pose detection method, apparatus, electronic device, and storage medium
CN113379896A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
KR20230027237A (en) Reconstruction of 3D object models from 2D images
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
CN114387445A (en) Object key point identification method and device, electronic equipment and storage medium
WO2023051356A1 (en) Virtual object display method and apparatus, and electronic device and storage medium
CN114445753A (en) Face tracking recognition method and device, electronic equipment and storage medium
CN117274383A (en) Viewpoint prediction method and device, electronic equipment and storage medium
CN113822798B (en) Method and device for training generation countermeasure network, electronic equipment and storage medium
CN112613447B (en) Key point detection method and device, electronic equipment and storage medium
CN113837933A (en) Network training and image generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination