CN119814945A - Image synthesis method and image synthesis system
Abstract
An image synthesis solution is provided that corrects the proportional relationship between a person and a 3D virtual scene by adjusting the position of a virtual camera in the virtual world so that it matches the position of the real-world camera that captures the person.
Description
Technical Field
The present invention relates to image processing technology, and more particularly, to an image synthesizing method and an image synthesizing system.
Background
There are many video-related software applications on the market, such as video conferencing software (e.g., Zoom, Microsoft Teams, and Google Meet) and live streaming software (e.g., Twitch, YouTube Live, and Open Broadcaster Software (OBS)). Beyond the basic image transmission functions, many auxiliary image processing features have been developed, such as background blurring, auto-focusing, face detection, and background replacement. The conventional background replacement function uses a 2D picture as the replacement target. Its advantages are that users can select a favorite background picture for customization and that the user's actual surroundings are hidden, preserving personal privacy. Its disadvantages are that the person, being a 3D entity, blends poorly with a 2D background picture, and that the 2D background picture cannot change dynamically (for example, the viewing angle cannot move and the person's relative position in the background cannot change), so the result looks static.
In view of these advantages and disadvantages, an application that uses a 3D virtual scene as the background is expected to be in demand, because it can blend the person into the background and produce a more engaging visual effect than an application using a 2D picture as the background. Users may even customize the virtual scene, for example with scenes from their favorite games, which greatly improves the user experience.
Technologies involved in developing an application with a 3D virtual scene as the background include real-time background removal, virtual scene creation, conversion from a 3D scene to a 2D image, and superposition of the person image and the virtual scene. One technical difficulty is adapting the size of the person to the scene. If this problem is ignored or handled improperly, the situations shown in fig. 1A and 1B occur. Fig. 1A and 1B show two composite images with an improper person-to-scene ratio: the person 10A in fig. 1A is too large relative to the background scene, whereas the person 10B in fig. 1B is too small relative to the background scene. Both composite images look very abrupt to the user.
Therefore, an image synthesis solution that can overcome the above technical difficulty is needed.
Disclosure of Invention
Embodiments of the present disclosure provide an image synthesizing method implemented by a computer system. The method includes obtaining a person image of a person photographed by a photographing device, generating a 3D virtual scene based on a plurality of virtual scene resources, setting a reference object having the same real size as the person in the 3D virtual scene, setting a first virtual camera to face the reference object in the 3D virtual scene, determining an ideal position of the first virtual camera based on a first distance between the photographing device and the person when the person image was photographed, moving the first virtual camera to the ideal position, projecting the 3D virtual scene onto a virtual scene layer using the first virtual camera located at the ideal position, separating a person layer from the person image using an image segmentation model, superimposing the person layer onto the virtual scene layer to generate a pair of superimposed layers, projecting the pair of superimposed layers onto a 2D image using a second virtual camera, and rendering the 2D image to obtain a composite image.
In one embodiment, the method further includes calculating a first visual size of the person in the person image using an object recognition model, and calculating the real size based on the first visual size, the first distance, and Field of View (FoV) parameters of the photographing device.
In one embodiment, after determining the ideal position of the first virtual camera and before moving the first virtual camera to the ideal position, the method further includes projecting the 3D virtual scene onto a virtual scene image, calculating a second visual size of the reference object in the virtual scene image using a ray casting (raycasting) algorithm, and determining a correction amount by which the first virtual camera must move relative to the ideal position based on the first visual size, the second visual size, and the first distance. The step of moving the first virtual camera to the ideal position further comprises moving the first virtual camera according to the correction amount.
In one embodiment, the step of determining the correction amount by which the first virtual camera must move relative to the ideal position further comprises determining the correction amount based on the first visual size, the second visual size, the first distance, and an offset.
In one embodiment, the method further comprises obtaining the first distance from the photographing device, the photographing device being a depth camera.
In one embodiment, the method further includes estimating the first distance using a depth estimation model based on the image of the person.
In one embodiment, the method further comprises identifying a gesture in the person image using a gesture recognition model, determining whether the gesture maps to a specified operation, and, if the gesture is determined to map to a specified operation, adjusting the ideal position of the first virtual camera according to the specified operation.
In one embodiment, the virtual scene resources include Mesh (Mesh) resources, texture (Texture) resources, shader (Shader) resources, and Material (Material) resources.
In one embodiment, the computer system includes a camera device.
In an embodiment, the computer system is connected to a mobile device, and the mobile device includes a photographing device.
Embodiments of the present disclosure further provide an image synthesis system, which includes a storage device and a processing device. The processing device loads a program from the storage device to perform the steps of obtaining a person image of a person captured by a photographing device, generating a 3D virtual scene based on a plurality of virtual scene resources, setting a reference object having the same real size as the person in the 3D virtual scene, setting a first virtual camera so that the first virtual camera faces the reference object in the 3D virtual scene, determining an ideal position of the first virtual camera based on a first distance between the photographing device and the person when the person image was captured, moving the first virtual camera to the ideal position, projecting the 3D virtual scene onto a virtual scene layer using the first virtual camera located at the ideal position, separating a person layer from the person image using an image segmentation model, superimposing the person layer onto the virtual scene layer to generate a pair of superimposed layers, projecting the pair of superimposed layers onto a 2D image using a second virtual camera, and rendering the 2D image to obtain a composite image.
The image synthesis solution provided by the present disclosure adjusts the position of the virtual camera so that it matches the real-world camera position, thereby correcting the proportions, overcoming the problem of size adaptation between the person and the 3D virtual scene, and allowing the composite image to present a more natural and coordinated visual effect. The person in the composite image no longer appears abruptly oversized or undersized, and the harmony and naturalness of the image are significantly improved.
Drawings
The disclosure will be better understood from the following description of exemplary embodiments taken in conjunction with the accompanying drawings. Further, it should be understood that in the flow diagrams of the present disclosure, the order of execution of the blocks may be changed, and/or some blocks may be changed, eliminated, or combined.
Fig. 1A shows a composite image in which the proportion of the person relative to the background scene is too large.
Fig. 1B shows a composite image in which the proportion of the person relative to the background scene is too small.
Fig. 2A is a schematic diagram illustrating one scenario in the real world where a person image of a person is captured with a camera according to an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating one scenario of capturing a 3D virtual scene with a virtual camera in a virtual world, according to an embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating an image synthesizing method according to an embodiment of the disclosure.
Fig. 4A is a flowchart illustrating the calculation of the real size according to a preferred embodiment of the present disclosure.
Fig. 4B is a schematic diagram illustrating the magnitudes required for the calculation of the real size according to a preferred embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a procedure for determining the correction amount of the first virtual camera according to an embodiment of the disclosure.
Fig. 6A is a system block diagram illustrating an image synthesizing system according to an embodiment of the present disclosure.
Fig. 6B is a system block diagram illustrating an image synthesizing system according to another embodiment of the present disclosure.
Wherein reference numerals are as follows:
10A person
10B person
200 Person
201 Video camera
202 Person image
D1 first distance
210 Reference object
211 Virtual camera
212 3D virtual scene
D2 second distance
300 Image synthesis method
S301-S309 steps
S401-S402 steps
410 Person
411 Photographing device
412 Person image
S501-S503 steps
600A image synthesizing system
601 Photographing device
602 Storage device
603 Processing device
604 Display device
611 Central Processing Unit (CPU)
612 Graphic Processing Unit (GPU)
613 Neural Network Processing Unit (NPU)
600B image synthesizing system
605 Mobile device
Detailed Description
The following description lists various embodiments of the invention and is not intended to limit the scope of the invention. The actual scope of the invention is defined by the claims.
In the various embodiments listed below, the same or similar elements or components will be denoted by the same reference numerals.
The numerical designations such as "first," "second," and the like in the description and in the claims are for convenience of description only and do not have a sequential order relative to one another.
It should be noted that while the term "size" is often used loosely to describe the overall extent of an object in space, references herein to "size" refer specifically to the magnitude (measure) of the object in a particular direction, such as length, width, height, depth, or diameter.
The following description of embodiments of the apparatus or system applies also to embodiments of the method and vice versa.
In 3D imaging technology, a virtual camera system is typically used to capture and render a 3D scene. The virtual camera system simulates a camera or an eye in conventional photography and converts a 3D scene into a 2D image, so that the viewer can view the 3D scene in 2D form and even interact with it. Technically, the virtual camera in the virtual camera system does not photograph a real-world scene like an ordinary camera; instead, it uses a 3D projection technique to map a scene captured from the virtual world onto a 2D image.
3D projection techniques generally include orthographic projection and perspective projection. In orthographic projection, the size of an object in the image is not affected by the distance between the camera and the object: the object keeps the same size in the image no matter how far it is from the camera. In perspective projection, the distance between the camera and the object does affect the size of the object in the image; specifically, the closer an object is to the camera, the larger it appears. To reflect the perspective changes caused by distance in the real world, both ordinary cameras and virtual cameras generally adopt perspective projection.
Accordingly, the person-to-scene mismatch shown in fig. 1A and 1B is caused by unbalanced shooting distances between the ordinary camera in the real world and the virtual camera in the virtual world. In fig. 1A, the person 10A is too large relative to the background scene because the camera capturing the person is relatively close to the person while the virtual camera captures the virtual scene from relatively far away. In fig. 1B, the person 10B is too small relative to the background scene because the camera capturing the person is relatively far from the person while the virtual camera captures the virtual scene from relatively close.
The image synthesis solution proposed by the present disclosure corrects the proportional relationship between the person and the 3D virtual scene by adjusting the position of the virtual camera in the virtual world to match the position of the camera in the real world. The underlying principle is that if two target objects of the same real size have the same visual size in the frames captured by two cameras (whether ordinary or virtual), the distance between each camera and the target object it captures must be the same. Conversely, if the distances between the two cameras and their respective target objects are the same, and the real sizes of the two target objects are also the same, the two target objects will have the same visual size in the images captured by the two cameras.
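The following minimal Python sketch (not part of the original disclosure; the focal length and sizes are assumed numbers) illustrates this principle with a simple pinhole-camera model, in which visual size is proportional to real size and inversely proportional to distance.

```python
# Pinhole-model illustration of the principle above (all numbers are assumptions).
def visual_size_px(real_size_cm: float, distance_cm: float, focal_length_px: float) -> float:
    # Visual size in pixels = focal length (px) * real size / distance.
    return focal_length_px * real_size_cm / distance_cm

FOCAL_PX = 800.0  # assumed focal length expressed in pixels

person_px   = visual_size_px(16.5, 120.0, FOCAL_PX)  # person, real camera at 120 cm
ref_far_px  = visual_size_px(16.5, 240.0, FOCAL_PX)  # same-size reference object, virtual camera at 240 cm
ref_same_px = visual_size_px(16.5, 120.0, FOCAL_PX)  # same-size reference object, virtual camera at 120 cm

print(person_px, ref_far_px, ref_same_px)  # 110.0 55.0 110.0 -> equal distances give equal visual sizes
```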
The above image synthesis solution will be described below with reference to fig. 2A and 2B.
Fig. 2A is a schematic diagram illustrating a real-world scenario in which a person image 202 of a person 200 is captured with a camera 201, according to an embodiment of the present disclosure. As shown in fig. 2A, the distance between the camera 201 and the person 200 when the person image 202 is captured is hereinafter referred to as the "first distance" and denoted D1. The size of the person 200 in the real world, hereinafter referred to as the "real size", is W cm. The size of the person 200 in the person image 202 captured by the camera 201, hereinafter referred to as the "first visual size", is X pixels. It should be appreciated that although the real size W cm of the person 200 appears equal to the first visual size X pixels in fig. 2A, the two are entirely different concepts and are not directly comparable magnitudes. The real size is the magnitude of the person 200 in a specific direction in the real world, for example the head width of the person 200. The first visual size is the number of pixels occupied by the person 200 (e.g., its head width) in the person image 202 or on the screen.
Fig. 2B is a schematic diagram illustrating a virtual-world scenario in which a 3D virtual scene 212 is captured with a virtual camera 211, according to an embodiment of the present disclosure. As shown in fig. 2B, a reference object 210 is configured in the 3D virtual scene 212. The items in the 3D virtual scene 212 may be drawn with 3D drawing tools such as Autodesk Maya, Blender, Autodesk 3ds Max, or Cinema 4D. These drawing tools can all simulate real-world scale, ensuring that the objects and scenes in the 3D virtual scene 212 have sizes and proportions consistent with the real world. Thus, in embodiments of the present disclosure, the reference object 210 may be set to have the same real size (e.g., the width of the reference object 210 itself) as the person 200 in the real world, namely W centimeters. Further, the virtual camera 211 may be arranged to face the reference object 210.
Since the reference object 210 and the person 200 have the same real size, if the distance between the virtual camera 211 and the reference object 210 (hereinafter referred to as the "second distance") is equal to the first distance between the camera 201 and the person 200, then the size of the reference object 210 in the frame onto which the virtual camera 211 projects the 3D virtual scene 212 (hereinafter referred to as the "second visual size") will also be equal to the first visual size of the person 200 in the person image 202. In that case, the proportion between the person 200 and the 3D virtual scene 212 is harmonious, and the image synthesis result does not look like fig. 1A or fig. 1B.
However, in the example of fig. 2B, it is assumed that the second distance D2 between the virtual camera 211 and the reference object 210 is greater than the first distance D1 between the camera 201 and the person 200. In other words, the shooting distance of the virtual camera 211 is longer than that of the camera 201. As a result, the second visual size Y pixels of the reference object 210 will be smaller than the first visual size X pixels of the person 200. In that case, the proportion of the person 200 relative to the 3D virtual scene 212 is too large, and the image synthesis result may resemble fig. 1A.
According to an embodiment of the present disclosure, the problem described above with respect to fig. 2B may be solved by shortening the second distance between the virtual camera 211 and the reference object 210 until it is equal or substantially equal to the first distance D1. The first distance D1 may be obtained by estimating the depth of each pixel of the person image 202 with a depth estimation model, or by measuring the depth of the person 200 with a depth camera serving as the camera 201.
Fig. 3 is a flowchart illustrating an image synthesizing method 300 according to an embodiment of the disclosure. As shown in fig. 3, the method 300 may include steps S301-S309.
In step S301, a person image of a person captured by a photographing device is acquired. Taking fig. 2A as an example, a person image 202 of a person 200 captured by a camera 201 is obtained.
In step S302, a 3D virtual scene is generated based on the virtual scene resources, and a reference object having the same real size as the character is set in the 3D virtual scene. Taking fig. 2B as an example, a 3D virtual scene 212 is generated based on a plurality of virtual scene resources, and a reference object 210 having the same real size W cm as the person 200 in fig. 2A is configured in the 3D virtual scene 212.
In one embodiment, step S302 may be implemented by calling an Application Programming Interface (API) provided by a drawing engine. The drawing engine may be, for example, Unity, Unreal Engine, OpenGL, DirectX, Vulkan, or Metal, although the disclosure is not so limited.
In one embodiment, the virtual scene resources used in generating the 3D virtual scene in step S302 may include Mesh resources, Texture resources, Shader resources, and Material resources. A mesh resource is composed of a plurality of vertices, edges, and faces, defining the size, shape, and topology of a 3D object (including the reference object). A texture resource is a 2D image or graphic applied to the surface of a 3D object, typically stored as an image file (e.g., JPEG, PNG, or TGA), containing information such as color texture, normal vector texture, and environment mapping, used to add color, texture, and visual effects to the object surface. A shader resource is a program that controls the effects of illumination, shading, reflection, and refraction of light on the surface of an object and the resulting visual appearance. A material resource is a collection of physical properties of a material (e.g., metal, plastic, wood) used by an object in the virtual scene to define the object's optical properties, such as color, reflectance, and transparency.
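As an illustration only (the class and field names below are assumptions, not the patent's or any particular engine's API), the four resource types could be grouped roughly as follows before being handed to a drawing engine:

```python
# Minimal sketch of the virtual scene resource types; names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MeshResource:
    vertices: List[Tuple[float, float, float]]  # 3D positions at real-world scale
    faces: List[Tuple[int, int, int]]           # vertex indices defining each triangle

@dataclass
class MaterialResource:
    texture_file: str                           # e.g. a JPEG/PNG/TGA color texture
    shader_name: str                            # shader controlling lighting, shading, reflection, refraction
    reflectance: float = 0.5                    # optical properties of the material
    transparency: float = 0.0

@dataclass
class SceneObject:
    mesh: MeshResource
    material: MaterialResource
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
```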
In step S303, a first virtual camera is set to face a reference object in the 3D virtual scene. Taking fig. 2B as an example, the virtual camera 211 is configured such that the virtual camera 211 faces the reference object 210 in the 3D virtual scene 212.
In step S304, an ideal position of the first virtual camera is determined based on a first distance between the photographing device and the person when photographing the image of the person. Taking fig. 2A and 2B as an example, the ideal position of the virtual camera 211 is determined based on the first distance D1 between the camera 201 and the person 200 when the person image 202 is captured.
In one embodiment, the ideal position is a position at the first distance from the reference object. In another embodiment, it is desirable that the person and the reference object are not placed at exactly the same position but are offset by some reasonable amount, so the ideal position may be set at a distance equal to the first distance plus an offset from the reference object. The offset may be a predetermined value or may be set by the user, although the disclosure is not limited thereto.
In step S305, the first virtual camera is moved to the ideal position, and the 3D virtual scene is projected onto the virtual scene layer using the first virtual camera located at the ideal position. In other words, the 3D virtual scene captured by the first virtual camera from the virtual world is mapped onto the 2D virtual scene layer using 3D projection techniques based on parameters associated with the first virtual camera.
In step S306, an image segmentation model is used to separate the person layer from the person image. The image segmentation model may be implemented using various existing machine learning algorithms for segmenting different objects (e.g., persons) in an image, such as U-Net, DeepLab, Mask R-CNN, HRNet, or ENet. The training process of the image segmentation model includes obtaining label data (for example, manually annotating segmentation results or collecting open-source label data), selecting a loss function, and configuring an optimization algorithm; various existing supervised learning methods may be adopted, but the disclosure is not limited thereto. In addition, the image segmentation model may be trained locally, or it may be trained on another computer device (e.g., a server) and then obtained via a network (e.g., downloaded from the cloud), a storage medium (e.g., an external hard disk), or another communication interface (e.g., USB), but the disclosure is not limited thereto.
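A minimal sketch of this separation step, assuming a hypothetical `segmentation_model` callable that returns a per-pixel person-probability map (any of the algorithms listed above could play this role), might look as follows:

```python
import numpy as np

def extract_person_layer(person_image_rgb: np.ndarray, segmentation_model) -> np.ndarray:
    """Return an RGBA person layer whose alpha channel is the person mask."""
    prob_map = segmentation_model(person_image_rgb)   # HxW probabilities in [0, 1] (assumed interface)
    alpha = (prob_map > 0.5).astype(np.uint8) * 255   # binarize the mask into an alpha channel
    return np.dstack([person_image_rgb, alpha])       # HxWx4 person layer
```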
In step S307, the person layer is superimposed onto the virtual scene layer to generate a pair of superimposed layers. Superimposing the person layer and the virtual scene layer may involve various image composition techniques such as alpha blending, mask blending, and ray casting, so that the composite result looks vivid, natural, and well integrated.
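For instance, the alpha blending mentioned above can be sketched as follows (a simplified composite that assumes both layers share the same resolution; alignment and scaling are omitted):

```python
import numpy as np

def overlay_layers(person_layer_rgba: np.ndarray, scene_layer_rgb: np.ndarray) -> np.ndarray:
    """Composite the RGBA person layer over the RGB virtual scene layer (step S307)."""
    alpha = person_layer_rgba[..., 3:4].astype(np.float32) / 255.0
    person_rgb = person_layer_rgba[..., :3].astype(np.float32)
    scene_rgb = scene_layer_rgb.astype(np.float32)
    blended = alpha * person_rgb + (1.0 - alpha) * scene_rgb   # standard alpha blending
    return blended.astype(np.uint8)
```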
In step S308, the pair of superimposed layers is projected onto the 2D image using a second virtual camera. In other words, the pair of superimposed layers captured by the second virtual camera are mapped onto the 2D image using 3D projection techniques based on parameters associated with the second virtual camera.
In step S309, the 2D image is rendered to obtain a composite image. This step may further involve ray tracing, shadow generation, ambient lighting, and texture application. The present disclosure is not limited to any specific rendering implementation; different rendering methods and parameter configurations may be selected according to practical requirements.
In one embodiment, the real size (i.e., the size of the person in the real world) required for generating the reference object in the 3D virtual scene in step S302 may be a value input by the user. In another embodiment, an average human head width, for example 16.5 cm, may be used as the real size. Since the average head width varies with sex (the average head width of an adult male is about 16.5-18.5 cm and that of an adult female is about 15-16.5 cm), a sex recognition model may be used to determine the sex of the person in the person image and thereby determine the real size. However, besides sex, actual head width also varies with factors such as ethnicity, geographic region, and individual differences, so the accuracy of the real size cannot be guaranteed in this way, which may affect the image synthesis result.
In a preferred embodiment, the real size of the person may instead be calculated from magnitudes that can be obtained from the person image. This preferred embodiment is described with reference to fig. 4A and 4B.
Fig. 4A is a flowchart illustrating the calculation of the real size according to the preferred embodiment of the present disclosure. As shown in fig. 4A, the calculation of the real size includes steps S401 and S402. Corresponding to fig. 4A, fig. 4B is a schematic diagram showing the magnitudes required for the calculation of the real size according to the preferred embodiment of the present disclosure. Fig. 4A and fig. 4B should be read together for a better understanding of the preferred embodiment.
In step S401, a first visual size of the person 410 in the person image 412 is calculated based on the person image 412 using the object recognition model. In the example of fig. 4B, the first visual size is X pixels.
The object recognition model can be divided into two stages: feature extraction, and object classification and localization. In the feature extraction stage, the object recognition model extracts features of the person image 412. The features may represent attributes or characteristics of the person image 412, for example in the form of feature vectors, feature tensors, or feature maps. The features extracted in the feature extraction stage are used as inputs to the object classification and localization stage. In the object classification and localization stage, classification and localization are performed on the extracted features to identify the person 410 and its position and extent in the person image 412. The position and extent of the person 410 may be described by a bounding box, i.e., a rectangular box in the person image 412 that just encloses the person 410. Thus, the size (e.g., length or width) of the bounding box is the first visual size of the person 410 in the person image 412.
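As a small illustration (the detector interface is a hypothetical stand-in for any of the object recognition models discussed below), the first visual size can be read directly from the bounding box:

```python
def first_visual_size_px(person_image, detect_person) -> float:
    """Return the bounding-box width of the detected person in pixels (hypothetical detector interface)."""
    x_min, y_min, x_max, y_max = detect_person(person_image)  # assumed to return one bounding box
    return float(x_max - x_min)                               # bounding-box width = first visual size X
```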
The object recognition model may be implemented based on one or more machine learning algorithms and combinations thereof. For example, the model as a whole may be implemented using a region-based convolutional neural network (R-CNN) or its variants (e.g., Fast R-CNN, Mask R-CNN), or a YOLO (You Only Look Once) family algorithm. Besides the convolution layers and pooling layers of a convolutional neural network (CNN), the feature extraction stage may also be implemented with non-neural-network techniques such as the Viola-Jones object detection framework, the scale-invariant feature transform (SIFT), or the histogram of oriented gradients (HOG). The object classification and localization stage may be implemented by the fully connected layers of a convolutional neural network, or by an existing machine learning algorithm such as a support vector machine (SVM) or Joint Bayesian.
In one embodiment, labeled data in the ground-truth training dataset used during the training phase of the object recognition model may be collected by manually annotating the persons in multiple person images. In another embodiment, the labeled data may be gathered from an open-source dataset, such as the Pascal VOC dataset or the Common Objects in Context (COCO) dataset. In a further embodiment, the amount of labeled data may be extended using a generative adversarial network (GAN).
In one embodiment, a loss function such as mean squared error (MSE), mean absolute error (MAE), or cross-entropy may be used during the training phase of the object recognition model to calculate a loss value indicating the difference between the output of the object recognition model and the labeled data. Further, an optimizer may be used to iteratively adjust the parameters of the object recognition model (e.g., the weights of the neural network layers) so as to minimize the loss value and thereby optimize the model. The optimizer may be implemented using algorithms such as gradient descent (GD), stochastic gradient descent (SGD), or adaptive moment estimation (Adam). By repeatedly performing the training cycle of result feedback and parameter updating, the loss value gradually decreases until it converges to a minimum.
In addition, the object recognition model may be trained on the local side, or may be trained on other computer devices (e.g. a server) and then obtained via a network (e.g. downloaded from the cloud), a storage medium (e.g. an external hard disk) or other communication interfaces (e.g. USB), but the disclosure is not limited thereto.
In step S402, the real size is calculated based on the first visual size, the first distance, and the Field of View (FoV) parameters of the photographing device 411. In the example of fig. 4B, the first visual size is X pixels, the first distance is D1, and the real size to be calculated is W cm. Fig. 4B also depicts the maximum viewing angle θ of the photographing device 411, the included angle ρ formed by the rays from the photographing device 411 to the person 410, and the maximum image width M pixels of the photographing device 411; the maximum viewing angle θ and the maximum width M may be included in the FoV parameters of the photographing device 411 or derived from its other FoV parameters.
As can be seen from fig. 4B, the mathematical relationship between the real size W and the included angle ρ can be written as < formula one >, as follows:
< formula one >
W = tan(ρ) × D1 × 2
Further, since the viewing angle is proportional to the projected width, the mathematical relationship between the included angle ρ, the maximum viewing angle θ, the first visual size X, and the maximum width M can be written as < formula two >, as follows:
< formula two >
ρ = (θ / 2) × (X / M)
Finally, combining < formula one > and < formula two >, < formula three > can be obtained as follows:
< formula three >
W = tan((θ / 2) × (X / M)) × D1 × 2
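Combining the three formulas, a short sketch of step S402 might read as follows (this follows the relationships as written above and is not code from the patent):

```python
import math

def real_size_cm(x_px: float, d1_cm: float, theta_rad: float, m_px: float) -> float:
    """Compute the real size W from the first visual size X, the first distance D1,
    and the FoV parameters (maximum viewing angle theta, maximum image width M)."""
    rho = (theta_rad / 2.0) * (x_px / m_px)   # <formula two>: angle proportional to projected width
    return 2.0 * d1_cm * math.tan(rho)        # <formula one>: W = tan(rho) * D1 * 2
```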
In one embodiment, a procedure for determining a correction amount for the first virtual camera is further included between step S304 and step S305 in fig. 3; its detailed steps are described with reference to fig. 5.
Fig. 5 is a flowchart illustrating a procedure for determining the correction amount of the first virtual camera according to an embodiment of the disclosure. As shown in fig. 5, the procedure for determining the correction amount of the first virtual camera may include steps S501-S503.
In step S501, the 3D virtual scene is projected onto a virtual scene image using the first virtual camera. In other words, the 3D virtual scene captured by the first virtual camera from the virtual world is mapped onto a 2D virtual scene image using a 3D projection technique based on the parameters associated with the first virtual camera. It should be understood that, since step S501 occurs before step S305, the first virtual camera has not yet moved to the ideal position; therefore the camera parameters differ from those used in the projection of step S305, and the projected virtual scene image also differs from the virtual scene layer projected in step S305.
In step S502, a second visual size of the reference object in the virtual scene image is calculated using a ray casting (raycasting) algorithm. Taking fig. 2B as an example, the second visual size is Y pixels.
The ray casting algorithm simulates the line of sight of the first virtual camera, that is, rays emitted from the first virtual camera, and then determines whether the reference object is within the first virtual camera's view by checking whether the rays intersect the reference object. When a ray intersects the reference object, the projected position of the reference object in the virtual scene image and its size, namely the second visual size, can be calculated.
In step S503, a correction amount by which the first virtual camera must move relative to the ideal position is determined based on the first visual size, the second visual size, and the first distance. Taking fig. 2A and 2B as an example, the first visual size is X pixels, the second visual size is Y pixels, the first distance is D1, and the correction amount to be determined is D1 - D2. A positive correction amount indicates that the shooting distance of the first virtual camera should be lengthened, and a negative correction amount indicates that it should be shortened.
The mathematical relationship between the first visual size X, the second visual size Y, the first distance D1, and the second distance D2 can be written as < formula four >, as follows:
< formula four >
X / Y = D2 / D1
Based on < formula four >, the correction amount D1 - D2 may be further written as < formula five >, as follows:
< formula five >
D1 - D2 = D1 × (1 - X / Y) = D1 × (Y - X) / Y
After the correction amount determination procedure is completed, in step S305 the first virtual camera is moved according to the correction amount so as to reach the ideal position. Taking fig. 2B as an example, the virtual camera 211 may be moved by the correction amount D1 - D2 so that the virtual camera 211 reaches the ideal position, that is, the position at shooting distance D1.
In one embodiment, it is desirable that the person and the reference object are not placed at exactly the same position but are offset by some reasonable amount. The correction amount therefore also takes this offset into account: more specifically, the offset is added to the result of < formula five > to obtain the correction amount. The offset may be a predetermined value or may be set by the user, although the disclosure is not limited thereto.
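A compact sketch of this correction step (following < formula four > and < formula five > above, with the optional offset; not code from the patent) is:

```python
def correction_amount(x_px: float, y_px: float, d1: float, offset: float = 0.0) -> float:
    """Distance the first virtual camera should move relative to its current position."""
    d2 = d1 * x_px / y_px          # <formula four>: X / Y = D2 / D1
    return (d1 - d2) + offset      # positive: lengthen the shooting distance; negative: shorten it
```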
In one embodiment, the first distance between the photographing device and the person when capturing the person image, for example the first distance D1 in fig. 2A or fig. 4B, may be obtained from the photographing device itself, the photographing device being a depth camera such as a Time-of-Flight (ToF) camera, a structured light camera, or a stereoscopic vision camera capable of sensing the depth information of objects (including the person).
In an embodiment, the first distance between the photographing device and the person when capturing the person image may be estimated from the person image using a depth estimation model. The depth estimation model may be implemented using a convolutional-neural-network-based algorithm, although the disclosure is not limited in this regard. The training process of the depth estimation model includes obtaining label data (for example, manually annotating depths or collecting open-source label data), selecting a loss function, and configuring an optimization algorithm; various existing supervised learning methods may be adopted, but the disclosure is not limited thereto. In addition, the depth estimation model may be trained locally, or it may be trained on another computer device (e.g., a server) and then obtained via a network (e.g., downloaded from the cloud), a storage medium (e.g., an external hard disk), or another communication interface (e.g., USB), but the disclosure is not limited thereto.
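A hedged sketch of this estimation, assuming a hypothetical monocular `depth_model` that returns a per-pixel depth map and reusing the person mask from the segmentation step, might be:

```python
import numpy as np

def estimate_first_distance(person_image_rgb: np.ndarray, depth_model, person_mask: np.ndarray) -> float:
    """Estimate D1 as the median predicted depth over the person's pixels (assumed interfaces)."""
    depth_map = depth_model(person_image_rgb)      # HxW depth values, assumed in the same unit as D1
    person_depths = depth_map[person_mask > 0]     # keep only depths belonging to the person
    return float(np.median(person_depths))         # median is robust to stray background pixels
```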
In an embodiment, the image synthesis method provided by the present disclosure further includes using a gesture recognition model to recognize a gesture in the person image and then determining whether the gesture maps to a specified operation. If the gesture maps to a specified operation, the ideal position of the first virtual camera is adjusted according to that operation. For example, a thumb-pointing-left gesture may indicate that the user (i.e., the person captured by the photographing device) wants to move the first virtual camera to the left (i.e., move the ideal position to the left), and a thumb-pointing-right gesture may indicate that the user wants to move the first virtual camera to the right (i.e., move the ideal position to the right). The mapping between gestures and operations may be specified in advance and recorded in a mapping table. In response to the user making a gesture recorded in the mapping table, the corresponding operation is applied to the first virtual camera.
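The mapping table might be sketched as follows (gesture labels, step size, and coordinate convention are assumptions for illustration only):

```python
# Hypothetical gesture-to-operation mapping table; each operation adjusts the
# ideal position (x, y, z) of the first virtual camera.
GESTURE_OPERATIONS = {
    "thumb_left":  lambda pos: (pos[0] - 10.0, pos[1], pos[2]),  # move the ideal position to the left
    "thumb_right": lambda pos: (pos[0] + 10.0, pos[1], pos[2]),  # move the ideal position to the right
}

def apply_gesture(gesture_label: str, ideal_position):
    operation = GESTURE_OPERATIONS.get(gesture_label)
    return operation(ideal_position) if operation else ideal_position  # unmapped gestures leave the position unchanged
```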
The gesture recognition model may be implemented using various existing machine learning algorithms, such as convolutional neural networks, recurrent neural networks (RNN), long short-term memory networks (LSTM), support vector machines, decision trees, or random forests, although the disclosure is not limited in this regard. The training process of the gesture recognition model includes obtaining label data (for example, manually annotating gestures or collecting open-source label data), selecting a loss function, and configuring an optimization algorithm; various existing supervised learning methods may be adopted, but the disclosure is not limited thereto. In addition, the gesture recognition model may be trained locally, or it may be trained on another computer device (e.g., a server) and then obtained via a network (e.g., downloaded from the cloud), a storage medium (e.g., an external hard disk), or another communication interface (e.g., USB), but the disclosure is not limited thereto.
Fig. 6A is a system block diagram illustrating an image synthesis system 600A according to an embodiment of the disclosure. As shown in fig. 6A, the system 600A may include a photographing device 601, a storage device 602, a processing device 603, and a display device 604. The system 600A may itself be a personal computer (e.g., a desktop or notebook computer) or a server computer running an operating system (e.g., Windows, macOS, Linux, or Unix), or a mobile device such as a tablet or smartphone, although the disclosure is not so limited.
The photographing device 601 may include a photographic lens for capturing images; the lens may be an ordinary optical lens or an infrared lens, and the present disclosure is not limited as to the type and number of lenses. The photographic lens may be movable (e.g., rotatable to capture images at different angles) or fixed (e.g., non-rotatable, capturing images only at a fixed angle). In one embodiment, the photographing device 601 may be a depth camera, such as a time-of-flight camera, a structured light camera, or a stereoscopic vision camera, capable of sensing the depth information of objects (including persons) and providing the first distance to the person when the person image is captured.
The storage device 602 may include volatile memory (e.g., random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM)) and/or non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, non-volatile random access memory (NVRAM)), in any one or more devices (e.g., a hard disk drive (HDD), solid state drive (SSD), or optical disc) and combinations thereof, although the disclosure is not limited in this respect. In various embodiments of the present disclosure, the storage device 602 stores a program implementing the image synthesis method 300 and its various embodiments, as well as data required or generated during execution of the program, such as the aforementioned person image, virtual scene resources, 3D virtual scene, virtual scene layer, person layer, superimposed layers, composite image, and the various machine learning models that may be used.
The processing device 603 may include any one or more general-purpose or special-purpose processors for executing instructions, and combinations thereof. The processing device 603 loads the aforementioned program from the storage device 602 to perform the image synthesis method 300 and its various embodiments. The processing device 603 may include a Central Processing Unit (CPU) 611 and a Graphics Processing Unit (GPU) 612. The GPU 612 is an electronic circuit designed specifically for computer graphics operations and image processing, and is therefore more efficient at such workloads than the general-purpose CPU 611. Thus, in various embodiments of the present disclosure, tasks may be assigned according to the characteristics of the CPU 611 and GPU 612: for example, tasks that obtain data or communicate with other devices may be assigned to the CPU 611, while tasks related to computer graphics operations and image processing may be assigned to the GPU 612.
In an embodiment, the processing device 603 may further include a Neural Network Processing Unit (NPU) 613 optimized specifically for deep learning. The NPU 613 may outperform the GPU 612 when running the image segmentation model, object recognition model, depth estimation model, and/or gesture recognition model described above. Thus, in this embodiment, tasks involving these machine learning models may be assigned to the NPU 613.
The processing device 603 may be coupled to the display device 604 to transmit the composite image obtained by performing the image synthesis method 300 to the display device 604, so that the composite image is displayed on the display device 604. The display device 604 may be any device for displaying visual information, such as an LCD display, an LED display, an OLED display, or a plasma display, although the disclosure is not limited in this regard.
Fig. 6B is a system block diagram illustrating an image synthesis system 600B according to another embodiment of the disclosure. Compared with the system 600A, the system 600B includes the same storage device 602, processing device 603, and display device 604, so the detailed description of these hardware elements is not repeated. The system 600B differs from the system 600A in that it does not necessarily include the photographing device 601 but is instead connected to an external mobile device 605.
The mobile device 605 may be any type of smartphone or tablet computer that carries a photographing device and therefore has an image capturing function. In this embodiment, the person image is captured by the mobile device 605 and provided to the system 600B.
The image synthesis solution provided by the present disclosure adjusts the position of the virtual camera so that it matches the real-world camera position, thereby correcting the proportions, overcoming the problem of size adaptation between the person and the 3D virtual scene, and allowing the composite image to present a more natural and coordinated visual effect. The person in the composite image no longer appears abruptly oversized or undersized, and the harmony and naturalness of the image are significantly improved.
The above paragraphs are described in various aspects. It should be apparent that the teachings herein may be implemented in a variety of ways and that any particular architecture or functionality disclosed in the examples is merely representative. It will be understood by those skilled in the art, based on the teachings herein, that each aspect disclosed herein may be implemented independently or more than two aspects may be implemented in combination.
Although the present disclosure has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, but rather may be modified or altered somewhat by those skilled in the art without departing from the spirit and scope of the present disclosure.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311314687.1A CN119814945A (en) | 2023-10-11 | 2023-10-11 | Image synthesis method and image synthesis system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119814945A true CN119814945A (en) | 2025-04-11 |