KR102587233B1 - 360 RGBD image synthesis from a sparse set of images with narrow field-of-view

Info

Publication number: KR102587233B1
Application number: KR1020220159773A
Authority: KR
Inventors: 박인규; 김수지
Original assignee: 인하대학교 산학협력단
Priority date: 2022-11-24
Filing date: 2022-11-24
Publication date: 2023-10-10
Anticipated expiration: 2042-11-24
Also published as: KR102587233B9

Abstract

A method and device for synthesizing a 360 RGBD image from a small number of narrow field-of-view RGBD images are disclosed. An image synthesis method according to an embodiment may include generating a field-of-view image by estimating a field of view (FoV) relative to a panoramic image using a field-of-view estimation network, and generating a panoramic image from the generated field-of-view image using a panorama generation network.

Description

360 RGBD image synthesis from a small number of narrow field of view RGBD images {360 RGBD IMAGE SYNTHESIS FROM A SPARSE SET OF IMAGES WITH NARROW FIELD-OF-VIEW}

The description below relates to an image synthesis method and device for synthesizing a 360 RGBD image from a small number of narrow field-of-view RGBD images.

Interest in virtual reality and augmented reality, immersive media technologies, has recently been rising rapidly. 3D scene understanding is a key element in these fields, and depth estimation is one of its most fundamental research areas. A depth image expresses the depth information of a 3D space on a 2D plane; because it contains the 3D structure of the space, it is used in various 3D vision fields such as view synthesis, 3D modeling, autonomous driving, and robotics.

When a scene is captured with a conventional narrow-FoV camera, the resulting image covers only part of the scene, with a significant portion of the surroundings missing. Capturing an omnidirectional scene requires setting up multiple cameras and processing many images, which is costly. As an alternative, an omnidirectional image can be acquired with a small number of fisheye cameras, each with a wide FoV of 180° or more. Because a fisheye lens follows a spherical model, the planar model of a conventional pinhole camera is difficult to apply directly, so a spherical projection image such as an equirectangular projection image may be used. However, such 360° projection images can suffer from distortion at the boundaries and the poles, and deep learning networks that reflect the characteristics of 360° projection images have recently been proposed.

Depth estimation has been studied for a long time: traditional depth estimation from stereo images, more recent monocular depth estimation with deep learning networks, and methods that combine the two. Most of these techniques target monocular images with a narrow FoV. In addition, unlike general image datasets, high-quality 360° datasets acquired with narrow-FoV sensors are scarce. For depth in particular, the acquired depth images are often sparse: they are highly precise at the pixels where depth was measured, but are not suitable for training deep learning networks. For this reason, research has been conducted on completing sparse depth images into dense ones. Cameras are also sometimes rearranged to capture stereo images, but it is difficult to maintain a consistent baseline this way and the cost of the work is high.

[Prior art document]

Korean Patent Laid-Open Publication No. 10-2020-0095112

An image synthesis method and device for synthesizing a 360 RGBD image from a small number of narrow field-of-view RGBD images are provided.

Provided is an image synthesis method of a computer device including at least one processor, the method comprising: generating, by the at least one processor, a field-of-view image by estimating a field of view (FoV) relative to a panoramic image using a field-of-view estimation network; and generating, by the at least one processor, a panoramic image from the generated field-of-view image using a panorama generation network.

According to one aspect, the panorama generation network may include a U-Net-based generative adversarial network.

According to another aspect, the panorama generation network may be trained using LSGAN (Least Squares GAN (Generative Adversarial Network)) as the adversarial loss function.

According to another aspect, the loss function of the panorama generation network may include a first loss function for the RGB of an RGBD (Red Green Blue Depth) network and a second loss function for the depth of the RGBD network.

According to another aspect, the first loss function may be determined using a first adversarial loss function for the RGB of the RGBD network, a pixel loss function between an image generated by the generator for the RGB of the RGBD network and the corresponding ground-truth image, a perceptual loss objective function for the RGB of the RGBD network, and a Fréchet distance loss function measured between the generated image and the ground-truth image.

According to another aspect, the second loss function may be determined using a second adversarial loss function for the depth of the RGBD network, a pixel loss function between an image generated by the generator for the depth of the RGBD network and the corresponding ground-truth image, a perceptual loss objective function for the depth of the RGBD network, and a Fréchet distance loss function measured between the generated image and the ground-truth image.

According to another aspect, the panorama generation network may share the features of images obtained by applying a binary mask, covering the portion other than the ground-truth region of the input image, to the image generated by the generator for the RGB of the RGBD network and to the image generated by the generator for the depth of the RGBD network, and the output of the last layer of the panorama generation network may be concatenated channel-wise to the last block of the RGBD network and passed to the decoder of the RGBD network.

Provided is a computer program stored on a computer-readable recording medium, which is combined with a computer device to cause the computer device to execute the method.

Provided is a computer-readable recording medium on which a program for causing a computer device to execute the method is recorded.

Provided is a computer device comprising at least one processor implemented to execute computer-readable instructions, wherein the at least one processor generates a field-of-view image by estimating a field of view (FoV) relative to a panoramic image using a field-of-view estimation network, and generates a panoramic image from the generated field-of-view image using a panorama generation network.

An image synthesis method and device for synthesizing a 360 RGBD image from a small number of narrow field-of-view RGBD images can be provided.

Figure 1 is a diagram showing an example of the overall structure of an RGBD generation network according to an embodiment of the present invention.
Figure 2 is a diagram showing an example of the feature sharing module of an RGBD generation network according to an embodiment of the present invention.
Figure 3 is a diagram showing an example of samples from the refined dataset (3D60 dataset samples).
Figure 4 is a diagram showing an example of qualitative evaluation results for the generated RGBD images.
Figure 5 is a diagram showing an example of 3D point cloud results for the generated RGBD images.
Figure 6 is a diagram showing an example of a comparison of results before and after using the Fréchet distance loss function, applied to networks with and without the fusion module.
Figure 7 is a diagram showing an example of the verification results for the Fréchet distance.
Figure 8 is a diagram showing an example of a qualitative comparison of RGB image results with existing techniques.
Figure 9 is a diagram showing an example of a qualitative comparison of depth image results with existing techniques.
Figure 10 is a block diagram showing an example of a computer device according to an embodiment of the present invention.
Figure 11 is a flowchart showing an example of an image synthesis method according to an embodiment of the present invention.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. In the following description and the accompanying drawings, detailed descriptions of well-known functions and configurations that may unnecessarily obscure the gist of the present invention are omitted.

A depth image represents distance information in 3D space on a 2D plane and is useful in various 3D vision research. Many existing depth estimation studies mainly use narrow FoV (Field of View) images, and thus estimate depth for images in which a significant portion of the whole scene is missing. Embodiments of the present invention provide a method and device for simultaneously generating 360° omnidirectional RGBD images from a small number of narrow-FoV images. For example, an image generation model based on a generative adversarial network may be provided that estimates the FoV relative to the whole panoramic image from as few as four non-overlapping images and simultaneously generates a 360° color image and depth image; sharing the features of the two modalities yields mutually complementary results. In addition, a network that reflects the spherical characteristics of 360° images is constructed, which improves performance.

More specifically, in view of the existing problems, embodiments of the present invention may provide a convolutional neural network (CNN)-based deep learning network that generates RGBD images, and a framework that generates a 360° RGBD panoramic image from n (for example, four) non-overlapping narrow-FoV images. The embodiments show that the network generalizes to the 3D60 360° indoor dataset and, through the interaction of the two modalities, achieves better results than an existing single network. In addition, features that take 360° image characteristics into account are extracted to compute the Fréchet distance, and a loss function based on it improves restoration performance.

The main contributions of the embodiments of the present invention are summarized as follows.

·An existing outdoor-image-based network is generalized using the 3D60 indoor dataset.

·Color (RGB) and depth (D) images are generated simultaneously using a U-Net-based generative adversarial network.

·The features of the two RGBD modalities are shared, yielding mutually complementary performance.

·A Fréchet distance loss function using features that reflect 360° image characteristics yields improved results.

1. Related research

1-1. Dense depth estimation for spherical models

As with depth estimation on conventional narrow-FoV images, depth estimation on 360° images includes traditional stereo-based techniques and deep learning-based monocular techniques. 360SD-Net is a deep learning-based depth estimation network for spherical stereo images; it uses stereo images captured with cameras in a top-bottom arrangement together with polar angle images that address the distortion of spherical panoramas, and proposes a learnable shifting filter for building the cost volume. OmniDepth is a deep learning-based monocular depth estimation study that proposes an autoencoder-style network accounting for the characteristics of 360° images. Distortion in a spherical panorama grows toward the poles; for this structure, the network is designed with dilated convolutions for global context and 1×1 convolutions for spatial correlation. BiFuse is also a deep learning-based monocular depth estimation study; to mitigate the boundary problems of the equirectangular and cubemap formats of spherical panoramas, it proposes a module that fuses the two formats, and it proposes spherical padding for the cubemap format to reduce the distortion of spherical panoramic images. In depth completion research for dense depth estimation, a network was proposed that uses a feature sharing module and generates a gray-scale image as an auxiliary output to assist depth image generation. In embodiments of the present invention, the network is designed on the basis of using the fused features of the two modalities.

1-2. Mask-based convolutional neural networks

Applying a mask to an input image and filling in the damaged region is a long-standing research topic that can be used in tasks such as image editing. Among traditional approaches, diffusion-based techniques use a reference image and fill the region with information propagated from the surrounding area, while patch-based techniques fill the empty region with pixel information taken from the undamaged region of the input image. These methods depend on the input image, lack consistency in the overall structure, and have difficulty generating semantic information. With the recent development of deep learning networks, generative models are widely used in mask-based image generation research because, unlike the earlier methods, they can generate images that are consistent over the whole structure, including semantic information, without depending on the input image. StructureFlow is a deep learning-based network that applies randomly generated binary masks to an image and generates the missing pixels; it proposes restoring the whole image by first generating an edge-preserved smooth image for structural information and then generating detailed texture information based on it. MED is an encoder-decoder-based image generation network that applies multi-scale kernels to fuse the texture information of shallow layers with the structural information of deep layers, then equalizes the result and adds it to the decoder, in order to restore irregular mask regions.

Although many depth estimation studies on 360° images have been conducted, studies that take conventional narrow-FoV images as input, or that generate an omnidirectional image from non-overlapping input images, are rare. In addition, most image generation studies use irregular binary masks scattered across the image, whereas embodiments of the present invention use large block-shaped masks, presenting a network for the harder condition of generating images with little surrounding information.

2. Embodiments of the present invention

The network according to embodiments of the present invention may consist of a network that estimates the FoV relative to a 360° image and a panorama generation network that generates a 360° image from the estimated FoV image. The relative FoV estimation network and the panorama generation network are trained separately. A mask is applied to the result image produced by the FoV estimation step, the result is converted to the equirectangular projection format, and the panorama is generated from that input. The panorama generation result reflects the interaction of RGB and depth through shared features.

Figure 1 is a diagram showing an example of the overall structure of an RGBD generation network according to an embodiment of the present invention, and Figure 2 is a diagram showing an example of the feature sharing module of an RGBD generation network according to an embodiment of the present invention.

2-1. Relative FoV estimation

The step of estimating the FoV relative to the whole panorama is performed before panorama generation, and may be modeled as estimating the size occupied within the whole panoramic image rather than an absolute angle. Images in four horizontal viewing directions may be used as input. The bottleneck layer may output 256 classes corresponding to FoV angles and perform classification, with the cross-entropy loss as the objective function of the classification task. The decoder layers may generate a mask image padded according to the FoV angle, with the L1 distance to the ground-truth mask as the objective function. The existing network may be extended and applied to RGBD images. For the next step, panorama generation, experiments were conducted under the assumption that the FoV estimation network estimates the FoV accurately, using inputs with an FoV of 60° (mask ratio of about 57%), the ratio at which high-quality images are hardest to generate among the mask ratios commonly used in recent related work. Mask-based image generation studies generate masks of between 0% and 60% and report evaluation results by ratio: the larger the mask ratio, the larger the area to be generated and the lower the performance. In addition, in an experiment that configured the mask in a random form, a grid form, and a block form similar to that of the embodiments of the present invention, the random mask configuration (75% ratio) gave the best results, while the block mask configuration (50% ratio) gave the lowest performance of the mask types.
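The two objectives of the FoV estimation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the network of the embodiments: the function names, the flat-list mask representation, and the balancing factor `mask_weight` are hypothetical; only the 256 FoV classes, the cross-entropy loss, and the L1 distance to the ground-truth mask come from the description.

```python
import math

NUM_FOV_CLASSES = 256  # number of FoV classes stated in the description

def cross_entropy(logits, target_class):
    """Cross-entropy of one sample: -log softmax(logits)[target_class]."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_class]

def l1_distance(pred_mask, gt_mask):
    """Mean absolute difference between two masks given as flat lists."""
    return sum(abs(p - g) for p, g in zip(pred_mask, gt_mask)) / len(gt_mask)

def fov_estimation_loss(logits, target_class, pred_mask, gt_mask, mask_weight=1.0):
    # mask_weight is a hypothetical balancing factor, not taken from the source.
    return cross_entropy(logits, target_class) + mask_weight * l1_distance(pred_mask, gt_mask)

# Toy input: uninformative logits and a 4-pixel mask with one wrong pixel.
uniform_logits = [0.0] * NUM_FOV_CLASSES
loss = fov_estimation_loss(uniform_logits, 42, [0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

With uniform logits the classification term equals log(256), the loss of a classifier that has learned nothing, which gives a convenient reference value when monitoring training.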

2-2. Panorama generation

The panorama generation network may be a U-Net-based generative adversarial network. During training, LSGAN (Least Squares GAN (Generative Adversarial Network)) is used as the adversarial loss function, and the adversarial loss functions L_adv1 and L_adv2 for the respective RGBD networks are given by Equation 1 and Equation 2 below.

[Equation 1]

[Equation 2]

G_rgb, D_rgb, G_depth, and D_depth may be the generators and discriminators of the RGBD network, respectively. I_i and R_i denote the inputs of G_rgb and G_depth, each with its corresponding generated image, and I_p and R_p may be the ground-truth images for the generated images.
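For orientation, the least-squares adversarial objective in its commonly used form can be sketched as below. This is an assumption-laden illustration, not a reproduction of Equation 1 or Equation 2: the target labels (1 for real, 0 for generated), the 0.5 scaling, and the function names are the usual LSGAN conventions chosen here, and the discriminator scores are passed in as plain lists.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """0.5 * mean((D(real) - 1)^2) + 0.5 * mean(D(fake)^2)."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(d_fake):
    """0.5 * mean((D(fake) - 1)^2): the generator pushes fake scores toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)

# Toy discriminator scores for two ground-truth and two generated panoramas.
d_loss = lsgan_discriminator_loss([0.9, 1.1], [0.2, -0.2])
g_loss = lsgan_generator_loss([0.2, -0.2])
```

The same pair of functions would be applied once to the RGB branch (G_rgb, D_rgb) and once to the depth branch (G_depth, D_depth).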

The L1 loss may be used as the pixel loss function between the generated image and the ground-truth image. L_pix1 and L_pix2, the pixel loss functions of the RGBD network, may be expressed as Equation 3 and Equation 4 below.

[Equation 3]

[Equation 4]

In addition, a pre-trained VGG (Visual Geometry Group) network may be used for the perceptual loss objective function in order to generate realistic images. L_vgg1 and L_vgg2, the perceptual loss objective functions of the RGBD network, may be expressed as Equation 5 and Equation 6 below. VGG(·) may be the i-th feature vector extracted from the VGG network.

[Equation 5]

[Equation 6]

To extract features that reflect the characteristics of 360° images, a model according to embodiments of the present invention may be trained, and the pre-trained model may then be used to extract the features of the last layer of the latent space of the module in which the RGBD features are shared. A longitudinal invariant feature and a latitudinal equivariant feature may then be extracted, expressed as Equation 7 and Equation 8, respectively, where c, w, and h denote the channel, width, and height. Unlike the VGG loss objective function, which improves similarity by computing a feature-map distance for a single image, this measures the distance between the distribution of the generated image dataset and that of the ground-truth image dataset; because the last layer of the fusion module is used, the similarity of the distributions of the fused RGBD feature vectors is reflected.

[Equation 7]

[Equation 8]
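One way to obtain a longitude-invariant feature is to average a feature map over its width, since a rotation of the panorama about the vertical axis is a circular shift along the width of an equirectangular map. The sketch below illustrates only that property; treating width-averaging as the longitudinal invariant feature is an assumption made here, Equation 7 is not reproduced, and the latitudinal equivariant feature is not sketched. The function names and the nested-list layout `[channel][height][width]` are hypothetical.

```python
def longitude_pooled(feature):
    """Average a [channel][height][width] feature map over width (longitude)."""
    return [[sum(row) / len(row) for row in channel] for channel in feature]

def roll_longitude(feature, shift):
    """Circularly shift every row along the width axis (a yaw rotation of the panorama)."""
    rolled = []
    for channel in feature:
        shifted_rows = []
        for row in channel:
            k = shift % len(row)
            shifted_rows.append(row[-k:] + row[:-k] if k else list(row))
        rolled.append(shifted_rows)
    return rolled

# One channel, two latitude rows, four longitude columns.
feature = [[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]]
pooled = longitude_pooled(feature)
pooled_after_rotation = longitude_pooled(roll_longitude(feature, 1))
```

The pooled result keeps one value per channel and latitude row, so it still distinguishes the poles from the equator while ignoring the arbitrary choice of where the panorama is cut.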

The Fréchet distance may then be measured and used as a loss function. It is the Fréchet distance between the Gaussian with mean and covariance (m, C) of the ground-truth images and the Gaussian with mean and covariance (m', C') of the generated images, and may be expressed as Equation 9 below, where m, C and m', C' are the mean and covariance of the ground-truth images and of the generated images, respectively. The Fréchet distance loss function may be normalized by dividing by the value measured at the 0th iteration of training, as in Equation 10 below.

[Equation 9]

[Equation 10]
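The Fréchet distance between two Gaussians has a well-known closed form, usually written in squared form as ||m - m'||² + Tr(C + C' - 2(CC')^(1/2)). The NumPy sketch below computes that expression and the normalization by the value at iteration 0; it is an illustration under assumptions, not a reproduction of Equation 9 or Equation 10. The function names are hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of CC', which is one of several possible implementations.

```python
import numpy as np

def gaussian_stats(features):
    """Mean vector and covariance matrix of an (n_samples, dim) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(m, C, m2, C2):
    """||m - m2||^2 + Tr(C + C2 - 2 (C C2)^(1/2)) for two Gaussians."""
    # Tr((C C2)^(1/2)) is the sum of the square roots of the eigenvalues of C C2,
    # which are real and non-negative when C and C2 are positive semi-definite.
    eigvals = np.linalg.eigvals(C @ C2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((m - m2) ** 2) + np.trace(C) + np.trace(C2) - 2.0 * trace_sqrt)

def normalized_fd_loss(fd_now, fd_at_iteration_0):
    """Divide by the distance measured at the 0th training iteration."""
    return fd_now / fd_at_iteration_0

# Toy check: unit-covariance Gaussians whose means differ by (3, 4).
fd = frechet_distance(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2))
```

Normalizing by the initial value makes the loss start at 1 regardless of the feature scale, which keeps its weight comparable with the other loss terms.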

Accordingly, the overall loss functions L_rgb and L_depth of the panorama generation network are given by Equation 11 and Equation 12 below, where λ₁, λ₂, and λ₃ may be constant factors.

[Equation 11]

[Equation 12]

For the images generated by each RGBD network, the features of the images obtained by applying a binary mask, corresponding to the portion other than the ground-truth region of the input image, are shared; each feature may be expressed as Equation 13 and Equation 14 below.

[Equation 13]

[Equation 14]

F() may be an encoder, and M may be the binary mask corresponding to the largest FoV class. Each layer of the RGBD network performs a pixel-wise sum, which may be expressed as Equation 15 and Equation 16 below. The last layer f_s may be concatenated channel-wise to the last block of each RGBD network and passed to each decoder; the connection function in the equations may be a channel-wise concatenation function. The proposed feature sharing module is shown in Figure 2.

[Equation 15]

[Equation 16]
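The data flow of the feature sharing module (mask, encode, pixel-wise sum, channel-wise concatenation) can be sketched in NumPy as follows. This is a shape-level illustration under assumptions: the encoders are replaced by a toy stand-in, arrays use a hypothetical (channels, height, width) layout, all function names are invented here, and the sum is shown for a single layer rather than for every layer of Equations 15 and 16.

```python
import numpy as np

def share_features(rgb_generated, depth_generated, mask, encode_rgb, encode_depth):
    """Mask both generated images, encode them, and sum the features element-wise."""
    f_rgb = encode_rgb(rgb_generated * mask)        # mask broadcasts over channels
    f_depth = encode_depth(depth_generated * mask)
    return f_rgb + f_depth                          # pixel-wise sum of the two modalities

def concat_to_decoder(last_block, shared):
    """Channel-wise concatenation of a branch's last block with the shared feature."""
    return np.concatenate([last_block, shared], axis=0)

def toy_encoder(x):
    # Hypothetical stand-in for a learned encoder F(): averages over channels.
    return x.mean(axis=0, keepdims=True)

mask = np.zeros((1, 4, 4))
mask[:, :, 2:] = 1.0                                # 1 marks the region to be generated
rgb = np.ones((3, 4, 4))
depth = 2.0 * np.ones((1, 4, 4))
shared = share_features(rgb, depth, mask, toy_encoder, toy_encoder)
decoder_input = concat_to_decoder(np.zeros((8, 4, 4)), shared)
```

Summing keeps the shared feature the same size as each branch's feature, while concatenation lets each decoder weigh its own features against the shared ones.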

2-3. Experimental results

From a total of 21,600 samples in the indoor datasets Matterport3D, Stanford-3D, and SunCG, 3,446 samples unsuitable for training, in which the damaged region of the depth image exceeds a threshold, are removed. The remaining 18,154 samples are split into a training set (80%) and a test set (20%).
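The sample counts and the 80/20 split above can be checked with a short sketch. The counts are taken from the description; the shuffling, the seed, the rounding of the cut point, and the function names are assumptions made for illustration, and the damage threshold is left as a parameter because the description does not state its value.

```python
import random

TOTAL_SAMPLES = 21_600
REMOVED_SAMPLES = 3_446
kept = TOTAL_SAMPLES - REMOVED_SAMPLES  # samples remaining after filtering

def damaged_fraction(depth_pixels, invalid_value=0.0):
    """Fraction of depth pixels without a valid measurement (encoding is hypothetical)."""
    return sum(1 for d in depth_pixels if d == invalid_value) / len(depth_pixels)

def is_usable(depth_pixels, max_damaged_fraction):
    """Keep a sample only if its damaged area stays at or below the chosen threshold."""
    return damaged_fraction(depth_pixels) <= max_damaged_fraction

def split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle indices reproducibly and cut them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]

train_ids, test_ids = split_indices(kept)
```

With this rounding the split gives 14,523 training samples and 3,631 test samples.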

Figure 3 shows examples of samples from the refined dataset (3D60 dataset samples). In Figure 3, rows 1 and 2 are RGBD image samples, and row 3 is a map visualizing the damaged regions determined from the depth images.

The networks are trained with the Adam (Adaptive Moment Estimation) optimizer with learning rate α = 0.0002, β₁ = 0.5, β₂ = 0.99, and a batch size of 2, on an NVIDIA RTX A6000 GPU. The Matterport3D, Stanford3D, and SunCG datasets were converted to cube-map format so that four faces could be used for training, and each of the three datasets was trained separately.
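To make the stated hyperparameters concrete, here is a minimal sketch of a single Adam update for one scalar parameter, using the standard bias-corrected formulation. The learning rate and β values are those given above; `eps = 1e-8` is a common default and an assumption, as the source does not state it.

```python
import math
from typing import Tuple


def adam_step(param: float, grad: float, m: float, v: float, t: int,
              lr: float = 0.0002, beta1: float = 0.5, beta2: float = 0.99,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """Return the updated (param, m, v) after step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First update from zero-initialized moments.
new_param, m1, v1 = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient magnitude.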

For quantitative evaluation, PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) were measured to assess the similarity of the generated images, and absolute difference (Abs Diff), absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean square error (RMS), root mean square log error (RMS log), and relative accuracy (δ) were measured to assess the similarity of the depth images. Evaluation was performed on FoV 60° input, which corresponds to the widest-area mask ratio commonly used in recent related work.
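The source names these metrics without giving formulas. The sketch below uses the definitions commonly found in the depth estimation literature, applied to flattened pixel lists; the δ threshold of 1.25 and the PSNR peak value of 1.0 are assumptions. SSIM is omitted because it needs windowed statistics.

```python
import math
from typing import Dict, Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for intensities in [0, max_value]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def depth_metrics(pred: Sequence[float], target: Sequence[float],
                  threshold: float = 1.25) -> Dict[str, float]:
    """Common depth error metrics; all depths must be strictly positive."""
    n = len(pred)
    pairs = list(zip(pred, target))
    return {
        "abs_diff": sum(abs(p - t) for p, t in pairs) / n,
        "abs_rel": sum(abs(p - t) / t for p, t in pairs) / n,
        "sq_rel": sum((p - t) ** 2 / t for p, t in pairs) / n,
        "rms": math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n),
        "rms_log": math.sqrt(
            sum((math.log(p) - math.log(t)) ** 2 for p, t in pairs) / n),
        # Fraction of pixels whose ratio to the ground truth is below the threshold.
        "delta": sum(1 for p, t in pairs if max(p / t, t / p) < threshold) / n,
    }


metrics = depth_metrics(pred=[2.0, 4.0], target=[2.0, 2.0])
rgb_psnr = psnr(pred=[0.5, 0.5], target=[0.5, 0.6])
```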

For qualitative evaluation, the network using the proposed fusion block was compared with a network in which the RGB and depth branches are built separately without the fusion block, and the results before and after applying the Fréchet distance loss function were compared. Point clouds were also generated from the synthesized RGBD images and compared.

Figure 4 shows an example of qualitative evaluation results for the generated RGBD images. It qualitatively compares the results of the network according to an embodiment of the present invention with those of a network that does not use the feature sharing module. In Figure 4, rows 1 and 3 show color images, and rows 2 and 4 show depth image samples. The results of the network according to the embodiments of the present invention reproduce both the overall layout and the fine details well.

Figure 5 shows an example of 3D point cloud results for the generated RGBD images. The 3D point cloud results are compared in Figure 5, which makes it possible to examine the 3D layout and the effect of noise, both of which are difficult to compare qualitatively in the generated depth images alone.

Table 1 shows quantitative evaluation results for the generated depth images, comparing the network before and after the feature sharing module is used.

[Table 1]

Tables 2 and 3 show the results before and after using the Fréchet distance loss function. Table 2 shows the RGBD image evaluation results without the Fréchet distance loss, and Table 3 shows the RGBD image evaluation results with the Fréchet distance loss applied.

[Table 2]

[Table 3]

For both the RGB and depth images, the images generated by the proposed network are better than those of the network without the feature sharing module, and they are also better than the results obtained before the Fréchet distance loss function was used.

Figure 6 shows an example comparing the results before and after using the Fréchet distance loss function, applied to the networks before and after using the fusion module. In Figure 6, rows 1 and 2 and rows 3 and 4 are each RGBD image sample results.

To validate the Fréchet distance, features were extracted using a pre-trained instance of the proposed model, and Gaussian blur and salt-and-pepper noise were applied to the panoramic images at four levels, after which the Fréchet distance was measured.
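The source does not give the formula used here. As a simplified sketch, the following computes the squared Fréchet distance between two sets of feature vectors under the assumption that each set is Gaussian with a diagonal covariance, in which case the distance reduces to the squared difference of means plus the squared difference of standard deviations. A full-covariance version would need a matrix square root; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_std(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) standard deviation."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
            for d in range(dim)]
    return means, stds


def frechet_distance_diag(feats_a: Sequence[Sequence[float]],
                          feats_b: Sequence[Sequence[float]]) -> float:
    """Squared Frechet distance between two diagonal-covariance Gaussians."""
    mu_a, sd_a = mean_and_std(feats_a)
    mu_b, sd_b = mean_and_std(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return mean_term + cov_term


reference = [[0.0, 0.0], [2.0, 2.0]]   # stand-in for ground-truth features
disturbed = [[1.0, 1.0], [5.0, 5.0]]   # stand-in for features of a disturbed image
distance = frechet_distance_diag(reference, disturbed)
```

In a validation like the one described, the distance would be recomputed for each disturbance level and expected to grow with the level.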

Figure 7 shows an example of the validation results for the Fréchet distance. For both the RGB and depth images, the distance between the feature distribution of the ground-truth images and that of the generated images increases as the disturbance level rises. This confirms that when features are extracted with the proposed network and the Fréchet distance is computed, blur and noise are captured as expected. (1)-(4) denote the disturbance levels.

Three additional configurations of the proposed feature sharing module were also tested: without residual blocks, with only one residual block, and with direct connection. Table 4 shows the quantitative evaluation results of the three configurations for the network trained without the Fréchet distance loss.

[Table 4]

Overall, each of the three feature sharing configurations gives better results than using no feature sharing module. For comparison with other models on RGB image generation, an inpainting model and an existing model, the w/o Fusion network, were used. Both models were retrained under the same random-FoV condition, and quantitative and qualitative evaluation was performed at FoV 60°. For the depth image comparison, the trained depth estimation model, the inpainting model retrained under the same conditions, and the existing model were compared.

Tables 5 and 6 compare the mean PSNR and SSIM over 1,759 Matterport3D samples. Table 5 compares the RGB image generation results with existing techniques, and Table 6 compares the depth image generation results with existing techniques.

[Table 5]

[Table 6]

FIG. 8 shows an example of a qualitative comparison of RGB image results with existing techniques, and FIG. 9 shows an example of a qualitative comparison of depth image results with existing techniques. For depth images, the top and bottom parts were excluded from evaluation so that the results could be compared with other models. In addition, the depth image generated from the ground-truth RGB image as input was compared with the depth image generated from the RGB image of the proposed model as input.

Embodiments of the present invention can provide a generative adversarial network (GAN) based network that simultaneously generates RGB and depth (RGBD) images from a small number of images. The generative model that shares features between the two modalities was confirmed, quantitatively and qualitatively, to outperform a single network, and applying the Fréchet distance loss function that reflects 360° image features improved performance further. An ablation study on the feature sharing module also confirmed higher performance than the existing single network, with the proposed structure performing well among the tested configurations. Compared with existing RGBD image generation models, the proposed approach was likewise confirmed to perform better both qualitatively and quantitatively. Unlike existing generative models for panoramic images, it simultaneously generates high-quality RGBD images from a small number of non-overlapping images to which a high-ratio mask is applied, reflects 360° features, and yields mutually complementary RGB and depth results. Through high-quality RGBD image generation it can contribute to complex 3D scene reconstruction.

The image synthesis device according to embodiments of the present invention may be implemented by at least one computer device. A computer program according to an embodiment of the present invention may be installed and run on the computer device, and the computer device may perform the image synthesis method according to embodiments of the present invention under the control of the running computer program. The computer program may be stored in a computer-readable recording medium so that, in combination with the computer device, it causes the computer to execute the image synthesis method.

FIG. 10 is a block diagram showing an example of a computer device according to an embodiment of the present invention, and FIG. 11 is a flowchart showing an example of an image synthesis method according to an embodiment of the present invention. As shown in FIG. 10, the computer device 1000 may include a memory 1010, a processor 1020, a communication interface 1030, and an input/output (I/O) interface 1040. The memory 1010 is a computer-readable recording medium and may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive. A permanent mass storage device such as a ROM or a disk drive may also be included in the computer device 1000 as a separate permanent storage device distinct from the memory 1010. An operating system and at least one program code may be stored in the memory 1010. These software components may be loaded into the memory 1010 from a computer-readable recording medium separate from the memory 1010. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In another embodiment, the software components may be loaded into the memory 1010 through the communication interface 1030 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 1010 of the computer device 1000 based on a computer program installed from files received through a network 1060.

The processor 1020 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 1020 by the memory 1010 or the communication interface 1030. For example, the processor 1020 may be configured to execute received instructions according to program code stored in a recording device such as the memory 1010.

The communication interface 1030 may provide a function for the computer device 1000 to communicate with other devices (for example, the storage devices described above) through the network 1060. For example, requests, instructions, data, files, and the like that the processor 1020 of the computer device 1000 generates according to program code stored in a recording device such as the memory 1010 may be transmitted to other devices through the network 1060 under the control of the communication interface 1030. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer device 1000 via the network 1060 and the communication interface 1030. Signals, instructions, and data received through the communication interface 1030 may be passed to the processor 1020 or the memory 1010, and files may be stored in a storage medium (the permanent storage device described above) that the computer device 1000 may further include.

The input/output interface 1040 may be a means for interfacing with an input/output (I/O) device 1050. For example, input devices may include a microphone, keyboard, or mouse, and output devices may include a display or speaker. As another example, the input/output interface 1040 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. At least one of the input/output devices 1050 may be configured as a single device together with the computer device 1000. For example, as in a smartphone, a touch screen, microphone, speaker, and the like may be included in the computer device 1000.

In other embodiments, the computer device 1000 may include fewer or more components than those shown in FIG. 10. However, most conventional components need not be explicitly illustrated. For example, the computer device 1000 may be implemented to include at least some of the input/output devices 1050 described above, or may further include other components such as a transceiver or a database.

The image synthesis method according to this embodiment may be performed by the computer device 1000 that implements the image synthesis device. The processor 1020 of the computer device 1000 may be implemented to execute control instructions according to the code of the operating system or the code of at least one computer program contained in the memory 1010. The processor 1020 may control the computer device 1000, according to control instructions provided by code stored in the computer device 1000, so that the computer device 1000 performs steps 1110 and 1120 of the method of FIG. 11.

In step 1110, the computer device 1000 may generate a field of view image by estimating a field of view (FoV) relative to the panoramic image using a field of view estimation network. As already described, field of view estimation is performed before panorama generation and may be modeled as the problem of estimating the size that the input occupies within the whole panoramic image, rather than an absolute angle. Images from n (for example, n = 4) horizontal viewing directions may be used as input, and the bottleneck layer of the field of view estimation network may output 256 classes corresponding to field of view angles and perform classification. A cross-entropy loss function may be used as the objective for the classification task, and the decoder layers may generate a mask image with padding added according to the field of view angle, for which an L1 distance to the ground-truth mask may be used as the objective. An existing network may be extended and applied to RGBD images.
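The two objectives named above can be sketched as follows: a cross-entropy loss over the 256 FoV classes and an L1 distance between a predicted mask and the ground-truth mask. Averaging the L1 distance over pixels and the helper names are assumptions; how the two terms are weighted against each other is not stated in this section and is left out.

```python
import math
from typing import Sequence

NUM_FOV_CLASSES = 256  # number of FoV classes stated in the text


def cross_entropy(logits: Sequence[float], target_class: int) -> float:
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_class]


def l1_mask_distance(pred_mask: Sequence[Sequence[float]],
                     gt_mask: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between a predicted and a ground-truth mask."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


# Uninformative logits: the loss equals log(256).
uniform_loss = cross_entropy([0.0] * NUM_FOV_CLASSES, target_class=3)
mask_loss = l1_mask_distance([[1.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]])
```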

In step 1120, the computer device 1000 may generate a panoramic image from the generated field of view image using a panorama generation network. In one embodiment, the panorama generation network may include a U-Net based generative adversarial network, and the panorama generation network may be pre-trained using LSGAN (Least Squares Generative Adversarial Network) as the adversarial loss function.
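For reference, the least-squares adversarial objective can be sketched as below, using the commonly used LSGAN label choice (1 for real, 0 for fake). This is the general formulation, not a reproduction of Equations 1 and 2, which are outside this section; the function names and sample scores are illustrative.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores: Sequence[float],
                             fake_scores: Sequence[float]) -> float:
    """Push discriminator outputs toward 1 on real images and 0 on generated ones."""
    real_term = _mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = _mean([s ** 2 for s in fake_scores])
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores: Sequence[float]) -> float:
    """Push discriminator outputs on generated images toward 1."""
    return 0.5 * _mean([(s - 1.0) ** 2 for s in fake_scores])


d_loss = lsgan_discriminator_loss(real_scores=[1.0, 0.8], fake_scores=[0.0, 0.2])
g_loss = lsgan_generator_loss(fake_scores=[0.0, 0.2])
```

Compared with the sigmoid cross-entropy GAN loss, the squared error keeps penalizing samples that are classified correctly but lie far from the target value.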

Meanwhile, the loss function of the panorama generation network may include a first loss function for RGB of the RGBD (Red Green Blue Depth) network and a second loss function for depth of the RGBD network. For example, the first loss function may correspond to L_rgb in Equation 11, and the second loss function may correspond to L_depth in Equation 12. As shown in Equation 11, the first loss function may be determined using a first adversarial loss function for RGB of the RGBD network (for example, L_adv1 in Equation 1), a pixel loss function between the image generated by the RGB generator of the RGBD network and the corresponding ground-truth image (for example, L_pix1 in Equation 3), a perceptual loss objective function for RGB of the RGBD network (for example, L_vgg1 in Equation 5), and a Fréchet distance loss function measured between the generated image and the ground-truth image (for example, L_d1 in Equation 11). Likewise, as shown in Equation 12, the second loss function may be determined using a second adversarial loss function for depth of the RGBD network (for example, L_adv2 in Equation 2), a pixel loss function between the image generated by the depth generator of the RGBD network and the corresponding ground-truth image (for example, L_pix2 in Equation 4), a perceptual loss objective function for depth of the RGBD network (for example, L_vgg2 in Equation 6), and a Fréchet distance loss function measured between the generated image and the ground-truth image (for example, L_d2 in Equation 12).

In addition, as described with reference to the feature sharing module of FIG. 2 and Equations 13 to 16, the panorama generation network may share the features of images obtained by applying a binary mask, covering the portion other than the ground-truth region of the input image, to the image generated by the RGB generator of the RGBD network and to the image generated by the depth generator of the RGBD network. The output of the last layer of the panorama generation network may be channel-concatenated with the last block of the RGBD network and passed to the decoder of the RGBD network.

The system or device described above may be implemented with hardware components or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those skilled in the art will appreciate that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium may store a computer-executable program persistently, or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and other devices configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited examples and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

An image synthesis method of a computer device including at least one processor, the method comprising:
generating, by the at least one processor, a field of view image by estimating a field of view (FoV) relative to a panoramic image using a field of view estimation network; and
generating, by the at least one processor, a panoramic image from the generated field of view image using a panorama generation network,
wherein a loss function of the panorama generation network includes a first loss function for RGB of an RGBD (Red Green Blue Depth) network and a second loss function for depth of the RGBD network.

The image synthesis method of claim 1,
wherein the panorama generation network includes a U-Net based generative adversarial network.

The image synthesis method of claim 1,
wherein the panorama generation network is trained using LSGAN (Least Squares Generative Adversarial Network) as an adversarial loss function.

(Deleted)

The image synthesis method of claim 1,
wherein the first loss function is determined using a first adversarial loss function for RGB of the RGBD network, a pixel loss function between an image generated by the RGB generator of the RGBD network and the corresponding ground-truth image, a perceptual loss objective function for RGB of the RGBD network, and a Fréchet distance loss function measured between the generated image and the ground-truth image.

The image synthesis method of claim 1,
wherein the second loss function is determined using a second adversarial loss function for depth of the RGBD network, a pixel loss function between an image generated by the depth generator of the RGBD network and the corresponding ground-truth image, a perceptual loss objective function for depth of the RGBD network, and a Fréchet distance loss function measured between the generated image and the ground-truth image.

An image synthesis method of a computer device including at least one processor, the method comprising:
generating, by the at least one processor, a field of view image by estimating a field of view (FoV) relative to a panoramic image using a field of view estimation network; and
generating, by the at least one processor, a panoramic image from the generated field of view image using a panorama generation network,
wherein the panorama generation network shares features of images obtained by applying a binary mask, covering the portion other than the ground-truth region of the input image, to an image generated by the RGB generator of an RGBD network and to an image generated by the depth generator of the RGBD network, and
wherein an output of the last layer of the panorama generation network is channel-concatenated with the last block of the RGBD network and passed to a decoder of the RGBD network.

A computer program coupled to a computer device and stored in a computer-readable recording medium to execute the method of any one of claims 1 to 3 or 5 to 7 on the computer device.

A computer-readable recording medium recording a computer program for executing the method of any one of claims 1 to 3 or 5 to 7 on a computer device.

A computer device comprising:
at least one processor implemented to execute computer-readable instructions,
wherein the at least one processor is configured to:
generate a field of view image by estimating a field of view (FoV) relative to a panoramic image using a field of view estimation network; and
generate a panoramic image from the generated field of view image using a panorama generation network,
wherein a loss function of the panorama generation network includes a first loss function for RGB of an RGBD (Red Green Blue Depth) network and a second loss function for depth of the RGBD network.