CN114648472B

CN114648472B - Image fusion model training method, image generation method and device

Info

Publication number: CN114648472B
Application number: CN202011504084.4A
Authority: CN
Inventors: 周永翔; 干宏华; 杨蕊; 吴增德
Original assignee: Zhejiang Public Information Industry Co ltd
Current assignee: Zhejiang Public Information Industry Co ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2025-08-05
Anticipated expiration: 2040-12-18
Also published as: CN114648472A

Abstract

The present disclosure relates to a training method for an image fusion model, an image generation method, and an apparatus thereof. A training method for an image fusion model based on a neural network is provided, comprising: receiving M input images of a specific scene, where M is an integer greater than or equal to 3; generating a three-dimensional global grid of the scene based on the M input images; selecting one input image from the M input images as a reference image; using M-1 non-reference images from the M input images to generate n mosaic images for the perspective of the reference image, where n is an integer greater than or equal to 2 and n is less than or equal to M-1; inputting the three-dimensional global grid and the n mosaic images as training images into a fusion model to generate a predicted image with the same perspective as the reference image; calculating an error between the predicted image and the reference image using a cost function; and adjusting a fusion weight of the image fusion model using the error to reduce the error.

Description

Training method of image fusion model, image generation method and device thereof

Technical Field

The present disclosure relates generally to training methods for image fusion models, image generation methods, devices and media therefor.

Background

There is an increasing demand for real-time, realistic, easy to capture 3D content suitable for freehand, interactive navigation. In the case where images of a plurality of views (or viewpoints) of a scene have been obtained, it is desirable to easily obtain images of different views from the plurality of views.

Disclosure of Invention

The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its purpose is to present some concepts related to the disclosure in a simplified form as a prelude to the more detailed description that is presented later.

According to one aspect of the disclosure, a training method of an image fusion model based on a neural network is provided, comprising receiving M input images of a specific scene, wherein M is an integer greater than or equal to 3, generating a three-dimensional global grid of the scene based on the M input images, selecting one of the M input images as a reference image, generating n mosaic images for a view angle of the reference image using M-1 non-reference images of the M input images, wherein n is an integer greater than or equal to 2 and n is less than or equal to M-1, inputting the three-dimensional global grid and the n mosaic images as training images into the fusion model, generating a predicted image of the same view angle as that of the reference image, calculating an error between the predicted image and the reference image using a cost function, and adjusting a fusion weight of the image fusion model using the error to reduce the error.

According to one aspect of the present disclosure, there is provided an image generation method including receiving L input images of a specific scene, where L is an integer greater than or equal to 2, generating a three-dimensional global grid of the scene based on the L input images, selecting a new view angle different from a view angle of the L input images, generating n mosaic images for the new view angle using the L input images, where n is an integer greater than or equal to 2 and n is less than or equal to L, and inputting the three-dimensional global grid and the n mosaic images into an image fusion model obtained according to the above method, generating a predicted image of the new view angle.

According to another aspect of the present disclosure, there is provided a training apparatus of a neural network-based image fusion model, including a memory having instructions stored thereon, and a processor configured to execute the instructions stored on the memory to perform the above-described training method of the neural network-based image fusion model.

According to another aspect of the present disclosure, there is provided an image generating apparatus including a memory having instructions stored thereon, and a processor configured to execute the instructions stored on the memory to perform the above-described image generating method.

According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the method according to any of the above aspects of the present disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.

The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:

Fig. 1 shows one example of a scenario to which the present disclosure is to be applied.

Fig. 2 shows an example of an image and its three-dimensional global grid.

Fig. 3 shows an example of a generation flow of a mosaic image.

Fig. 4 is a schematic diagram for explaining mosaic generation of different meshes of a mosaic image.

Fig. 5 is a schematic diagram for explaining mosaic generation of different meshes of a mosaic image.

Fig. 6 illustrates a flowchart example of a training method of a neural network-based image fusion model according to the present disclosure.

Fig. 7 illustrates an example of a convolutional neural network architecture according to the present disclosure.

Fig. 8 shows an example of a prediction process of a new view image.

Fig. 9 illustrates an example of a training and fusion process of a neural network-based image fusion model, according to one embodiment of the present disclosure.

FIG. 10 illustrates an exemplary configuration of a computing device in which embodiments according to the present disclosure may be implemented.

Detailed Description

The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various example embodiments of the disclosure. The following description includes various details to aid in understanding, but these are to be considered merely examples and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are only intended to provide a clear and consistent understanding of the present disclosure. In addition, descriptions of well-known structures, functions and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the spirit and scope of the present disclosure.

In three-dimensional event live broadcast, three-dimensional real-time video monitoring, three-dimensional tourism and other scenes, the fusion of images is often required. Taking live video broadcasting as an example, as shown in fig. 1, after capturing real-time video by cameras around a football field, it is desirable to fuse three-dimensional video images according to the viewing angle of a client and transmit the fused three-dimensional video images to the client.

To this end, the present disclosure proposes an image fusion technique. For example, for the application scenario shown in fig. 1, by the fusion technique of the present disclosure, a football game is watched as if it were in it, and the spectator can watch the game from any angle, can run in the virtual scenario, and the images seen are different as they are in different positions in three-dimensional space.

A training method of an image fusion model based on a neural network according to one embodiment of the present disclosure includes receiving M input images of a specific scene, where M is an integer greater than or equal to 3, generating a three-dimensional global grid of the scene based on the M input images, selecting one of the M input images as a reference image, generating n mosaic images for a view angle of the reference image using M-1 non-reference images of the M input images, where n is an integer greater than or equal to 2 and n is less than or equal to M-1, inputting the three-dimensional global grid and the n mosaic images as training images into the image fusion model, generating a prediction image of the same view angle as the view angle of the reference image, calculating an error between the prediction image and the reference image using a cost function, and adjusting a fusion weight of the image fusion model using the error to reduce the error.

In one embodiment, the method may further include iterating the steps of generating the predicted image, calculating the error, and adjusting the fusion weight until the error is less than a predetermined value or the number of iterations reaches a predetermined number.

In the step of receiving M input images of a particular scene, after a set of images is collected, the present disclosure determines spatial and geometric relationships (SFM) of objects by movement of a camera and generates a three-dimensional global grid. Fig. 2 shows an example of an image and its three-dimensional global grid. Alternatively, in generating the three-dimensional global grid, the image depth may also be calculated by multi-view stereo vision (MVS) and a local depth map is built.

In order to generate an image of a new view from an image photographed from an existing view, the present disclosure needs to generate n mosaic images in addition to the three-dimensional global grid shown in fig. 3. n is an integer greater than or equal to 2 and n is less than or equal to M-1.

In one embodiment, the step of generating n mosaic images for the view angles of the reference images may include, for each of the three-dimensional global grids, calculating the weights of the M-1 non-reference images on that grid, selecting the n non-reference images with higher weights, obtaining warped projections of the n non-reference images on that grid, and generating the n mosaic images using the warped projections of each grid, wherein the pixels at each grid of a first mosaic image of the n mosaic images are obtained by warping projections of the pixels corresponding to that grid of the non-reference image with the highest weight on that grid, the pixels at each grid of a second mosaic image of the n mosaic images are obtained by warping projections of the pixels corresponding to that grid of the second highest weight non-reference image on that grid, and so on.

Fig. 3 shows an example of a generation flow of a mosaic image. Fig. 4 is a schematic diagram for explaining mosaic generation of different meshes of a mosaic image.

In the following description, n is 4 (i.e., four mosaic images are obtained) as an example. However, n may be set to any value greater than 1.

In fig. 4, it is assumed that five images of camera 1, camera 2, camera 3, camera 4, and camera 5 have now been obtained. We need to calculate the four mosaic images with the highest weights for the new view x.

The triangular mesh m (denoted as t _m) is one of many meshes in the three-dimensional mesh established through the above procedure. As an example, a cosine value of an angle between a normal of a lens of a camera that obtains a specific image and a normal of a grid in a three-dimensional global grid may be used as a weight of the specific image on the grid. For example, a mosaic with the highest priority (weight) may be selected according to the magnitude of the weight W _cm, which is the cosine of the angle between the normal line of the lens of camera c and the normal line of t _m. However, other parameters may be used as the weight W _cm.

Taking triangular grid m of fig. 4 as an example, in the case of using cosine of the included angle as a weight, the weight of camera 1 at grid m > the weight of camera 2 at grid m > the weight of camera 3 at grid m > the weight of camera 4 at grid m > the weight of camera 5 at grid m, i.e. W _1m>W_2m>W_3m>W_4m>W_5m. Also taking triangular mesh n as an example, W _3n>W_4n>W_5n>W_2n>W_1n.

In the new view direction x, if the projection of the pixels of the image shot by the camera c in the grid m area is p _cm, and then the calculated weights are combined, four mosaic images with highest weights can be obtained. The projections of grid m and grid n on these four mosaic images are shown in fig. 5. The m-triangular meshes and n-triangular meshes of the mosaic image of the first priority (i.e. the first mosaic image) are respectively projected by pixel distortions of the triangular meshes corresponding to the camera 1 and the camera 3, the m-triangular meshes and n-triangular meshes of the mosaic image of the second priority (i.e. the second mosaic image) are respectively projected by pixel distortions of the triangular meshes corresponding to the camera 2 and the camera 4, and so on.

The foregoing procedure gives a global three-dimensional grid of three-dimensional images, and 4 mosaic images of high priority, totaling 5 images. The neural network based image fusion model is then trained with these 5 images as input. In the present disclosure, the image fusion model may be based on any neural network and any combination of neural networks, including forward neural networks, recurrent neural networks, convolutional neural networks, deep belief networks, generate countermeasure networks, and so forth. Hereinafter, a Convolutional Neural Network (CNN) will be described as an example only.

The training process is shown in fig. 6. First, a global grid is generated from M images of a particular scene at one time, which is used multiple times for the next training. And then selecting M-1 images from the M images to generate four mosaic images, and inputting the four mosaic images and the global grid into a forward neural network together to obtain a predicted image of a new view angle. The neural network includes image fusion weights. Initially, these image fusion weights may have any initial value. After the predicted image of the new view is obtained, a training loss is calculated from the difference between the predicted image and the reference image. Then, the image fusion weight of the neural network model is adjusted through the backward neural network (or back propagation).

The foregoing procedure is merely a training procedure of the deep neural network, and a specific convolutional neural network architecture example will be described below, and the architecture includes a left image contraction path and a right image expansion path as shown in fig. 7. Note that as previously described, the present disclosure may employ any neural network other than convolutional neural networks.

In the example shown in fig. 7, the systolic path may employ a typical convolutional neural network architecture. For example, the following operations may be performed:

(1) The image is subjected to a 3x3 convolution operation and then a ReLU operation.

(2) The first step is repeatedly performed on the result of the operation of step (1).

(3) Downsampling is further achieved by a max-pooling operation with a span of 2 and a window size of 2x 2.

(4) Returning to step (1), a plurality of similar operations are performed.

The right image expansion path in fig. 7 may take a similar step as the contraction path, but with an increased number of feature channels. The upsampling of the extension path consists of the following parts:

(1) A window size of 2x2 is convolved up operation which reduces the number of channels.

(2) A first 3x3 convolution operation is performed, followed by a ReLU operation.

(3) And (3) repeatedly executing the step (2).

(4) Returning to step (1), a plurality of similar operations are performed.

Since some edge pixels are lost in each convolution operation, for example, the input image with the number of pixels 572x572 is processed as described above, and the output resolution is 388x388, and 92 pixels are clipped around the image. Needless to say, the above-described pixel resolution and number of clipped pixels are merely examples, and any other suitable resolution and number of clipped pixels may be employed by the present disclosure.

The new view image prediction process of the forward neural network is shown in fig. 8, and the relevant parameters are exemplified by the image size shown in fig. 7 (the input image size is 572x572, and the output image is 388x 388):

(1) After the input image is processed by the method disclosed by the disclosure, a global grid map and 4 mosaic images with highest priority (weight) are obtained, and 5 images are obtained in total.

(2) These 5 images are then input into a CNN neural network (i.e., an image fusion model based on the neural network) that contains the image fusion weights.

(3) A pixel-level weighted sum is made for the 5 images input. The method comprises the following steps:

a. The ith row and jth column values of the mth frame of the image fusion weights are r _mij, where 0< m <6, (0, 0) < (i, j) < (388)

B. The ith row and jth column values of the mth frame of the cropped image are c _mij, where 0< m <6, (0, 0) < (i, j) < (388)

C. The pixel value of the ith row and jth column of the predicted image is

Where k=n+1.

(4) Finally, obtaining the image prediction result of the new visual angle.

The three-dimensional image training process and depth fusion process are shown in fig. 9. The upper left-hand corner plus dashed box region is a new view image prediction process depicted in fig. 8 that yields a prediction result via a neural network (e.g., a forward neural network). For M input images of a complete training process for a particular scene, the complete training and fusion process is shown in fig. 9:

(1) First, building global grid from M images

(2) Selecting 1 image as original reference image, and taking visual angle of the image as new visual angle

(3) Generating, for example, 4 highest priority mosaic images from the remaining M-1 images

(4) With the global grid of step (1) and the 4 mosaic images of step (3) as inputs, an image prediction result of a new view in step (2) is generated by an image prediction process such as a convolutional neural network shown in fig. 8. There is a predetermined number of cropping pixels around each of the predicted result images compared to 5 input images.

(5) And (2) clipping the preset number of pixels in the step (4) from the periphery of the original reference image in the step (2) to obtain a clipped reference image.

(6) And (3) inputting the image prediction result in the step (4) and the cut reference image in the step (5) into a cost function to obtain an error.

(7) And adjusting the weight of the neural network through the reverse weight adjustment process of the convolutional neural network so as to reduce the prediction error.

Obviously, as previously described, an iterative process may be performed for the above steps to continually adjust the neural network weights, thereby continually reducing errors.

As described above, in the case where the neural network is a convolutional neural network, the peripheries of the prediction image and the reference image may be clipped by a predetermined number of pixels as compared to the input image, and the error may be calculated using the clipped prediction image and reference image.

In one embodiment, the training method may further include training the neural network-based image fusion model multiple times by changing an image serving as a reference image among the M input images, or using different M input images.

Further, the disclosure may also include an image generation method including receiving L input images of a particular scene, where L is an integer greater than or equal to 2, generating a three-dimensional global grid of the scene based on the L input images, selecting a new view angle different from a view angle of the L input images, generating n mosaic images for the new view angle using the L input images, where n is an integer greater than or equal to 2 and n is less than or equal to L, and inputting the three-dimensional global grid and the n mosaic images into an image fusion model obtained by the method as described above, generating a predicted image of the new view angle.

Fig. 10 illustrates an exemplary configuration of a computing device 1200 capable of implementing embodiments in accordance with the present disclosure.

Computing device 1200 is an example of a hardware device that can employ the above aspects of the disclosure. Computing device 1200 may be any machine configured to perform processing and/or calculations. Computing device 1200 may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a Personal Data Assistant (PDA), a smart phone, an in-vehicle computer, or a combination thereof.

As shown in fig. 10, computing device 1200 may include one or more elements that may be connected to or in communication with a bus 1202 via one or more interfaces. The bus 1202 may include, but is not limited to, an industry standard architecture (Industry Standard Architecture, ISA) bus, a micro channel architecture (Micro Channel Architecture, MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus. Computing device 1200 may include, for example, one or more processors 1204, one or more input devices 1206, and one or more output devices 1208. The one or more processors 1204 may be any kind of processor and may include, but are not limited to, one or more general purpose processors or special purpose processors (such as special purpose processing chips). The processor 1204 may be configured to implement, for example, a training method or an image generation method as described above. Input device 1206 may be any type of input device capable of inputting information to a computing device, and may include, but is not limited to, a mouse, keyboard, touch screen, microphone, and/or remote controller. The output device 1208 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers.

The computing device 1200 may also include or be connected to a non-transitory storage device 1214, which non-transitory storage device 1214 may be any storage device that is non-transitory and that may enable data storage, and may include, but is not limited to, disk drives, optical storage devices, solid state memory, floppy disks, flexible disks, hard disks, magnetic tape, or any other magnetic medium, compact disk or any other optical medium, cache memory and/or any other memory chip or module, and/or any other medium from which a computer may read data, instructions, and/or code. Computing device 1200 may also include Random Access Memory (RAM) 1210 and Read Only Memory (ROM) 1212. The ROM 1212 may store programs, utilities or processes to be executed in a non-volatile manner. The RAM 1210 may provide volatile data storage and stores instructions related to the operation of the computing device 1200. The computing device 1200 may also include a network/bus interface 1216 coupled to the data link 1218. The network/bus interface 1216 can be any kind of device or system capable of enabling communication with external equipment and/or networks, and can include, but is not limited to, modems, network cards, infrared communication devices, wireless communication devices, and/or chipsets (such as bluetooth ^TM devices, 802.11 devices, wiFi devices, wiMax devices, cellular communication facilities, etc.).

The present disclosure may be implemented as any combination of apparatuses, systems, integrated circuits, and computer programs on a non-transitory computer readable medium. One or more processors may be implemented as an Integrated Circuit (IC), application Specific Integrated Circuit (ASIC), or large scale integrated circuit (LSI), system LSI, super LSI, or ultra LSI assembly that performs some or all of the functions described in this disclosure.

The present disclosure includes the use of software, applications, computer programs, or algorithms. The software, application, computer program or algorithm may be stored on a non-transitory computer readable medium to cause a computer, such as one or more processors, to perform the steps described above and depicted in the drawings. For example, one or more memories may store software or algorithms in executable instructions and one or more processors may associate a set of instructions to execute the software or algorithms to provide various functions in accordance with the embodiments described in this disclosure.

The software and computer programs (which may also be referred to as programs, software applications, components, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural, object-oriented, functional, logical, or assembly or machine language. The term "computer-readable medium" refers to any computer program product, apparatus or device, such as magnetic disks, optical disks, solid state memory devices, memory, and Programmable Logic Devices (PLDs), for providing machine instructions or data to a programmable data processor, including computer-readable media that receives machine instructions as a computer-readable signal.

By way of example, computer-readable media can comprise Dynamic Random Access Memory (DRAM), random Access Memory (RAM), read Only Memory (ROM), electrically erasable read only memory (EEPROM), compact disk read only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired computer-readable program code in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer or general purpose or special purpose processor. Disk or disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.

The present disclosure provides a holographic image generation apparatus based on, for example, a convolutional neural network. By re-projecting a set of input images from different views to a new view, a weighted mosaic list from the different views is created. The best candidate set for each pixel is then selected and input into the CNN model for fusion. Multiple images through different scenes are used as training data sets. For a certain scene, one image is used as a reference view, and the image is generated by fusion of other images through a CNN neural network, so that training is realized.

The subject matter of the present disclosure is provided as examples of apparatuses, systems, methods, and programs for performing the features described in the present disclosure. Other features or variations in addition to those described above are contemplated. It is contemplated that the implementation of the components and functions of the present disclosure may be accomplished with any emerging technology that may replace any of the above-described implementation technologies.

In addition, the foregoing description provides examples without limiting the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, replace, or add various procedures or components as appropriate. For example, features described with respect to certain embodiments may be combined in other embodiments.

In addition, in the description of the present disclosure, the terms "first," "second," "third," etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

1. A training method of an image fusion model based on a neural network comprises the following steps:

receiving M input images of a specific scene, wherein M is an integer greater than or equal to 3;

generating a three-dimensional global grid of the scene based on the M input images;

selecting one of the M input images as a reference image;

Generating n mosaic images for the view angle of the reference image using M-1 non-reference images in the M input images, comprising calculating the weight of the M-1 non-reference images on each grid in the three-dimensional global grid, selecting n non-reference images with higher weight, obtaining the distortion projection of the n non-reference images on the grid, generating the n mosaic images using the distortion projection of each grid, wherein the pixel at each grid in the first mosaic image in the n mosaic images is obtained by performing the distortion projection on the pixel corresponding to the grid of the non-reference image with highest weight on the grid, the pixel at each grid in the second mosaic image in the n mosaic images is obtained by performing the distortion projection on the pixel corresponding to the grid of the non-reference image with second higher weight on the grid, and so on, wherein n is an integer greater than or equal to 2 and n is equal to or less than M-1;

Inputting the three-dimensional global grid and n mosaic images as training images into the image fusion model to generate a predicted image with the same viewing angle as that of the reference image;

calculating an error between the predicted image and the reference image using a cost function, and

And adjusting the fusion weight of the fusion model by using the error to reduce the error.

2. Training method according to claim 1, wherein cosine values of angles between the normal of the lens of the camera obtaining the non-reference image and the normal of the grid in the three-dimensional global grid are used as weights of the non-reference image on the grid.

3. The training method of claim 1, wherein the neural network is a convolutional neural network, and

The prediction image and the reference image are clipped by a predetermined number of pixels around compared to the input image, and the error is calculated using the clipped prediction image and reference image.

4. A training method according to claim 3, wherein the predicted image is represented by the following formula:

Where o _ij denotes a pixel value of an ith row and a jth column of the predicted image, k=n+1, c _mij denotes a pixel value of an ith row and a jth column of an mth image of the n mosaic images and a three-dimensional global grid after clipping the predetermined number of pixels, And representing fusion weights corresponding to the ith row and the jth column of the mth image in the cut three-dimensional global grid and the n mosaic images.

5. The training method according to claim 1, wherein the steps of generating the predicted image, calculating the error, and adjusting the fusion weight are iterated until the error is smaller than a predetermined value or the number of iterations reaches a predetermined number.

6. An image generation method, comprising:

receiving L input images of a specific scene, wherein L is an integer greater than or equal to 2;

Generating a three-dimensional global grid of the scene based on the L input images;

selecting a new view angle different from the view angles of the L input images;

generating n mosaic images for the new view using the L input images, where n is an integer greater than or equal to 2 and n is less than or equal to L;

Inputting the three-dimensional global grid and n mosaic images into an image fusion model obtained by the method according to any one of claims 1 to 5, generating a predicted image of the new view.

7. A training device of an image fusion model based on a neural network, comprising:

A memory having instructions stored thereon, and

A processor configured to execute instructions stored on the memory to perform the method of any one of claims 1 to 5.

8. An image generating apparatus comprising:

A memory having instructions stored thereon, and

A processor configured to execute instructions stored on the memory to perform the method of claim 6.

9. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-6.