KR102655867B1

KR102655867B1 - Method for image outpainting and apparatus using the same

Info

Publication number: KR102655867B1
Application number: KR1020210143620A
Authority: KR
Inventors: 김보형; 홍천산; 김범석; 박성연
Original assignee: 주식회사 씨앤에이아이
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2024-04-09
Anticipated expiration: 2041-10-26
Also published as: WO2023075082A1; KR20230059430A

Abstract

According to the present invention, an image outpainting method is provided. In a state in which a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN) exists, the FEN includes a first encoder and a first decoder, and the CPN includes a second encoder and a second decoder, the method comprises: a step in which a computing device inputs an input image to the first encoder of the FEN to generate a first bottleneck, and inputs the first bottleneck to the first decoder to obtain a FEN result having a size expanded relative to the input image; a step in which the computing device merges the FEN result with an extended input image that includes the input image and has an expanded area, inputs the merged result to the second encoder of the CPN to generate a second bottleneck, and inputs a feature value obtained using the second bottleneck to the second decoder; and a step in which the computing device, using the second decoder trained through a first loss function and a second loss function, obtains an output image that matches the input image and includes an image whose area is expanded by the expanded size.

Description

Image outpainting method and device using the same {METHOD FOR IMAGE OUTPAINTING AND APPARATUS USING THE SAME}

The present invention relates to an image outpainting method. In a state in which a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN) exists, the FEN includes a first encoder and a first decoder, and the CPN includes a second encoder and a second decoder, the method comprises: a step in which a computing device inputs an input image to the first encoder of the FEN to generate a first bottleneck, and inputs the first bottleneck to the first decoder to obtain a FEN result having a size expanded relative to the input image; a step in which the computing device merges the FEN result with an extended input image that includes the input image and has an expanded area, inputs the merged result to the second encoder of the CPN to generate a second bottleneck, and inputs a feature value obtained using the second bottleneck to the second decoder; and a step in which the computing device, using the second decoder trained through a first loss function, a second loss function, and a Conditional GAN Loss function, obtains an output image that matches the input image and includes an image whose area is expanded by the expanded size.

Image outpainting is a technique that continues an image beyond its borders, filling in the region outside a given image while taking the image's context into account. It is important that the content of the generated region remain consistent with the original input.

In the bio industry, a class imbalance problem arises from imbalanced data. Solving it requires generating abnormal data, and outpainting is one of the techniques used for this purpose. By extracting only the lesion from a normal endoscopic image and using it as the input to a deep learning network model, a variety of endoscopic images that fit the lesion context could be produced.

The present inventors therefore propose an image outpainting method and an apparatus using the same.

An object of the present invention is to solve all of the problems described above.

Another object of the present invention is to generate an expanded image while maintaining the context of a small original image.

A further object of the present invention is to generate a variety of images based on an original image.

The characteristic configuration of the present invention for achieving the objects described above and realizing the characteristic effects described below is as follows.

According to one aspect of the present invention, an image outpainting method is provided. In a state in which a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN) exists, the FEN includes a first encoder and a first decoder, and the CPN includes a second encoder and a second decoder, the method comprises: a step in which a computing device inputs an input image to the first encoder of the FEN to generate a first bottleneck, and inputs the first bottleneck to the first decoder to obtain a FEN result having a size expanded relative to the input image; a step in which the computing device merges the FEN result with an extended input image that includes the input image and has an expanded area, inputs the merged result to the second encoder of the CPN to generate a second bottleneck, and inputs a feature value obtained using the second bottleneck to the second decoder; and a step in which the computing device, using the second decoder trained through a first loss function, a second loss function, and a Conditional GAN Loss function, obtains an output image that matches the input image and includes an image whose area is expanded by the expanded size.

According to another aspect of the present invention, a computing device for performing image outpainting is provided. In a state in which a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN) exists, the FEN includes a first encoder and a first decoder, and the CPN includes a second encoder and a second decoder, the computing device comprises: a communication unit; a database; and a processor that inputs an input image to the first encoder of the FEN to generate a first bottleneck, inputs the first bottleneck to the first decoder to obtain a FEN result having a size expanded relative to the input image, merges the FEN result with an extended input image that includes the input image and has an expanded area, inputs the merged result to the second encoder of the CPN to generate a second bottleneck, inputs a feature value obtained using the second bottleneck to the second decoder, and, using the second decoder trained through a first loss function, a second loss function, and a Conditional GAN Loss function, obtains an output image that matches the input image and includes an image whose area is expanded by the expanded size.

The present invention provides the following effects.

The present invention can generate an expanded image while maintaining the context of a small original image.

In addition, the present invention can generate a variety of images based on an original image.

Figure 1 is a diagram showing an algorithm flow according to an embodiment of the present invention.
Figure 2 is a diagram showing a sub-pixel convolution process according to an embodiment of the present invention.
Figure 3 is a diagram showing the Context Normalization formula according to an embodiment of the present invention.
Figure 4 is a diagram showing a CPN decoder according to an embodiment of the present invention.
Figure 5 is a diagram showing the formulas of the first loss function and the second loss function according to an embodiment of the present invention.
Figure 6 is a diagram illustrating a visualization of the second loss function according to an embodiment of the present invention.

The detailed description of the present invention below refers to the accompanying drawings, which show, by way of example, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions across the various aspects.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it.

The computing device of the present invention may perform image outpainting using a deep learning network. The present invention may be used mainly in the bio industry and may include a technique of extracting only a lesion (e.g., a cancer region) from a normal endoscopic image and generating multiple endoscopic images based on it. This is discussed below.

The computing device may include a communication unit, a processor, a database, and the like.

The communication unit of the computing device may be implemented with various communication technologies. That is, Wi-Fi, WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), HSPA (High Speed Packet Access), Mobile WiMAX, WiBro, LTE (Long Term Evolution), Bluetooth, infrared communication (IrDA, Infrared Data Association), NFC (Near Field Communication), Zigbee, wireless LAN technology, and the like may be applied. In addition, when the device is connected to the Internet to provide a service, it may follow TCP/IP, the standard protocol for information transmission on the Internet.

Figure 1 is a diagram showing an algorithm flow according to an embodiment of the present invention.

First, the algorithm used in the present invention corresponds to a Semantic Regeneration Network (SRN), and this network may include a Feature Expansion Network (FEN, 100) and a Context Prediction Network (CPN, 200).

As shown in Figure 1, the FEN 100 may include a first encoder 110 and a first decoder 130, and the CPN 200 may include a second encoder 210 and a second decoder 240. In other words, both the FEN 100 and the CPN 200 adopt an encoder-decoder model.

The processor of the computing device may input the input image 10 to the first encoder 110 of the FEN 100 to generate the first bottleneck 120. The first bottleneck 120 may correspond to the result of the dilated convolution performed in the first encoder 110.

Next, the computing device may input the first bottleneck 120 to the first decoder 130 to obtain a FEN result 20 having a size expanded relative to the input image 10. The FEN result 20 is a kind of feature map and may contain more pixels than the input image 10.

Because the first decoder 130 contains more layers than the first encoder 110, the FEN result 20 may have a size expanded relative to the input image 10.

When an input image 10 of size (h, w, c) passes through the FEN 100, the resulting image corresponding to the FEN result 20 may have a size of (r1·h, r2·w, c'). That is, while passing through the FEN 100, the feature sizes may be upsampled by k-nearest neighbor.
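The size relationship above can be illustrated with a minimal sketch. It assumes that the upsampling is plain nearest-neighbor interpolation with integer factors r1 and r2, and it keeps the channel count unchanged, whereas the FEN described here may also change c to c'. The function names and the sample values are hypothetical.

```python
def upsample_nearest(feature, r1, r2):
    """Nearest-neighbor upsampling of an (h, w, c) nested list by integer factors."""
    h, w = len(feature), len(feature[0])
    # Each output pixel (i, j) copies the source pixel (i // r1, j // r2).
    return [[list(feature[i // r1][j // r2]) for j in range(w * r2)]
            for i in range(h * r1)]


def shape(feature):
    """Return (height, width, channels) of a nested-list feature map."""
    return (len(feature), len(feature[0]), len(feature[0][0]))


# Hypothetical 2x3 input with a single channel.
x = [[[1], [2], [3]],
     [[4], [5], [6]]]
y = upsample_nearest(x, r1=2, r2=2)
print(shape(x), "->", shape(y))
```

With r1 = r2 = 2 the (2, 3, 1) input becomes a (4, 6, 1) output, which matches the (r1·h, r2·w) relationship stated in the text.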

Figure 2 is a diagram showing a sub-pixel convolution process according to an embodiment of the present invention.

The computing device may perform sub-pixel convolution in a specific layer 131 included in the first decoder 130.

Sub-pixel convolution is one of the upsampling techniques. Unlike conventional upsampling, which requires a large amount of computation, it can be processed in parallel, so computation can be fast. In addition, whereas earlier upsampling techniques proceed through a single layer, sub-pixel convolution passes through multiple layers and can therefore achieve higher accuracy.
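The rearrangement step at the heart of sub-pixel convolution can be sketched as follows. This is only an illustration of the general technique, in which a convolution produces c·r·r channels that are then rearranged into an image r times larger in each spatial dimension. The convolution itself is omitted, and the channel ordering used here is an assumption, not something stated in this document.

```python
def pixel_shuffle(x, r):
    """Rearrange an (h, w, c*r*r) nested list into (h*r, w*r, c)."""
    h, w, ch = len(x), len(x[0]), len(x[0][0])
    assert ch % (r * r) == 0, "channel count must be divisible by r*r"
    c = ch // (r * r)
    out = [[[0] * c for _ in range(w * r)] for _ in range(h * r)]
    for i in range(h * r):
        for j in range(w * r):
            for k in range(c):
                # Assumed channel layout: index = k*r*r + (i % r)*r + (j % r).
                src = k * r * r + (i % r) * r + (j % r)
                out[i][j][k] = x[i // r][j // r][src]
    return out


# Hypothetical (1, 1, 4) feature map: one pixel with four channels, r = 2.
x = [[[0, 1, 2, 3]]]
y = pixel_shuffle(x, 2)
print(y)
```

Every output pixel is computed independently of the others, which is consistent with the remark above that the operation lends itself to parallel processing.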

The FEN 100 of the present invention may correspond to a preliminary stage for the next network, the CPN 200.

The computing device may input the FEN result 20 generated by the FEN 100 to the second encoder 210 of the CPN 200 to generate the second bottleneck 220. The second bottleneck 220 may likewise correspond to the result of the dilated convolution performed in the second encoder 210.

However, before inputting the FEN result 20 to the CPN 200, the computing device may merge the FEN result 20 with an extended input image 40 that includes the input image 10 and has an expanded area, and may input the merged result to the CPN 200.

The extended input image 40 contains the pixels of the input image 10 unchanged and is expanded to the same size as the FEN result 20, and the expanded area may have pixel values of '0'. In other words, it consists of the unchanged pixels of the input image 10 together with an expanded area whose pixel values are '0'.

Because the FEN result 20 has been altered from the pixels of the input image 10 and does not readily represent the image content, the computing device of the present invention may merge the extended input image 40 and the FEN result 20 channel-wise, thereby adding the relevance of the input image 10 to the FEN result 20.

Ultimately, the computing device may merge the extended input image 40 and the FEN result 20 channel-wise and input the merged result to the second encoder 210 of the CPN 200.
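The zero-filled extension and the channel-wise merge described above can be sketched as follows. The placement offsets `top` and `left`, the function names, and the toy sizes are hypothetical, since the text does not state where the input image sits inside the extended image.

```python
def extend_with_zeros(image, out_h, out_w, top, left):
    """Place an (h, w, c) image inside a zero-filled (out_h, out_w, c) canvas."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    assert top + h <= out_h and left + w <= out_w, "image must fit in the canvas"
    out = [[[0] * c for _ in range(out_w)] for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            out[top + i][left + j] = list(image[i][j])  # original pixels kept as is
    return out


def concat_channels(a, b):
    """Channel-wise merge of two feature maps with the same height and width."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "spatial sizes must match"
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


image = [[[9]]]                                              # hypothetical 1x1x1 input
fen_result = [[[1, 2] for _ in range(3)] for _ in range(3)]  # hypothetical 3x3x2 result
extended = extend_with_zeros(image, 3, 3, top=1, left=1)
merged = concat_channels(extended, fen_result)
print(merged[1][1], merged[0][0])
```

In this toy case the merged map has 1 + 2 = 3 channels per pixel, and only the center pixel carries a nonzero value from the original input.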

Figure 3 is a diagram showing the Context Normalization formula according to an embodiment of the present invention.

The computing device may perform a Context Normalization process on the second bottleneck 220. Context Normalization is a process that helps the region generated through the CPN 200 remain consistent with the input image 10.

For reference, the Context Normalization corresponds to the formula shown in Figure 3. The n(·) operator normalizes the expanded part and converts it back to an ordinary image, borrowing the normalization formula of the input image 10 to do so. This can also be confirmed from Figure 1, which shows the FEN result 20 being used in the CN (Context Normalization) process.
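The formula of Figure 3 is not reproduced in this text, so the following is only a sketch of one plausible reading: the features of the expanded (unknown) region are normalized and then rescaled with the mean and standard deviation of the known input region. The blending weight `rho`, the function names, and the single-channel flat-list representation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def transfer_statistics(unknown, known, eps=1e-5):
    """Normalize `unknown` features, then give them the mean/std of `known`."""
    mu_u, sd_u = fmean(unknown), pstdev(unknown)
    mu_k, sd_k = fmean(known), pstdev(known)
    return [(v - mu_u) / (sd_u + eps) * sd_k + mu_k for v in unknown]


def context_normalize(unknown, known, rho=0.5):
    """Blend the statistics-transferred features with the originals (rho is assumed)."""
    moved = transfer_statistics(unknown, known)
    return [rho * m + (1 - rho) * v for m, v in zip(moved, unknown)]


known = [10.0, 12.0, 14.0, 16.0]   # hypothetical features of the known input region
unknown = [0.0, 1.0, 2.0, 3.0]     # hypothetical features of the expanded region
moved = transfer_statistics(unknown, known)
print(fmean(moved), pstdev(moved))
```

After the transfer, the expanded-region features share the mean of the known region and, up to the small `eps` term, its standard deviation, which is one way to keep generated content statistically consistent with the input.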

The computing device may perform the CN (Context Normalization) process on the second bottleneck 220 and input the resulting feature value 230 to the second decoder 240. For reference, the second decoder 240 of the present invention may be trained through the first loss function and the second loss function.

Figure 4 is a diagram showing the second decoder of the CPN according to an embodiment of the present invention.

The feature value 230 may be input to the second decoder 240 that has been trained through the first loss function and the second loss function. Accordingly, training based on the first loss function and the second loss function included in the second decoder 240 must be performed first, which is discussed below.

As shown in Figure 4, the second decoder 240 may include a plurality of upscale (or upsample) modules, and may receive as input Z0 (described later), which is the feature value 230 obtained using the second bottleneck 220, and a conditional vector (conditional code) derived from the second encoder 210.

For reference, each upscale module may include Deconv, Leaky ReLU, and AdaIN operations.
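Two of the operations named above can be sketched on a single feature channel. The AdaIN form used here is the commonly cited one, in which a channel is normalized to zero mean and unit variance and then rescaled with an externally supplied scale and bias. Deriving that scale and bias from a latent vector, and the slope value for Leaky ReLU, are assumptions, since the text does not specify them. The deconvolution step is omitted.

```python
from statistics import fmean, pstdev


def adain(channel, scale, bias, eps=1e-5):
    """Adaptive instance normalization of one channel given as a flat list."""
    mu, sd = fmean(channel), pstdev(channel)
    return [scale * (v - mu) / (sd + eps) + bias for v in channel]


def leaky_relu(channel, slope=0.2):
    """Leaky ReLU; the slope of 0.2 is a hypothetical choice."""
    return [v if v >= 0 else slope * v for v in channel]


features = [1.0, 2.0, 3.0, 4.0]              # hypothetical channel values
styled = adain(features, scale=2.0, bias=5.0)
print(fmean(styled), pstdev(styled))
print(leaky_relu([-1.0, 2.0]))
```

After AdaIN the channel has a mean equal to `bias` and a standard deviation close to `scale`, so whatever supplies those two numbers controls the statistics of the features at that resolution.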

The loss function computation in the training process requires a plurality of CN results (feature values, Z0). To this end, the present invention may obtain a plurality of CN result values (Z0) from a plurality of image patches (input images). The image patches need not all come from a single image; they may come from different images (e.g., the wing of a bird image, the wheel of a car image). That is, a different CN result value (feature value, Z0) may be obtained from each of the different image patches (input images 10).

The plurality of Z0 may correspond to latent vectors, and they may be designated as the first Z0, the second Z0, … the K-th Z0. For example, four feature values (Z0, 230) may be derived from four image patches (input images), and these may be designated as the first Z0, the second Z0, the third Z0, and the fourth Z0, respectively.

The computing device may apply MLPs (multilayer perceptrons) to each of the derived first Z0, second Z0, third Z0, and fourth Z0, and may then perform a plurality of affine transformations (A1, A2, A3, … Ak) to obtain Z1, Z2, Z3, Z4, … Zk. The results of the affine transformations may be referred to as latent vectors.

Specifically, because there are a plurality of Z0 (e.g., the first Z0, the second Z0, the third Z0, the fourth Z0, …), each Z0 is operated on by each of the affine transformations (A1, A2, A3, … Ak). As a result, a plurality of Z1 (the first Z1, the second Z1, the third Z1, the fourth Z1, …), a plurality of Z2 (the first Z2, the second Z2, the third Z2, the fourth Z2, …), … and a plurality of Zk (the first Zk, the second Zk, the third Zk, the fourth Zk, …) may be obtained, and the obtained results may be referred to as latent vectors.

According to an embodiment of the present invention, the computing device may also proceed with the conditional vector (conditional code) set to a fixed variable. This is possible because, in the present invention, the network is trained on one lesion and outpainting is then performed based on that lesion.

The computing device may input Z1 (latent vector) and the conditional vector (c) to the first upscale module 241 to obtain 4x4 features. The computing device may then input the 4x4 features and Z2 (latent vector) to the second upscale module 242 to obtain 8x8 features, and may input the 8x8 features and Z3 (latent vector) to the third upscale module 243 to obtain 16x16 features.

That is, the computing device passes the input sequentially through the plurality of upscale modules and may increase the number of pixels of the obtained image (Z1, Z2, Z3, … Zk) each time it passes through one of them. Once the last upscale module has been passed, the computing device may obtain 128x128 features. Depending on the case, the size of the final features may differ.
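The progression of feature sizes can be tabulated with a short sketch. It assumes that every upscale module doubles the spatial size, which is consistent with the 4x4, 8x8, and 16x16 stages named in the text but is not stated for the intermediate stages up to 128x128. The function name is hypothetical.

```python
def feature_sizes(first=4, last=128):
    """List the side length produced by each upscale module, assuming doubling."""
    sizes = []
    n = first
    while n <= last:
        sizes.append(n)
        n *= 2
    return sizes


sizes = feature_sizes()
for index, n in enumerate(sizes, start=1):
    # Module k consumes the latent vector Zk and yields n x n features.
    print(f"upscale module {index}: Z{index} -> {n}x{n} features")
```

Under this doubling assumption, reaching 128x128 features from a 4x4 first stage takes six upscale modules and therefore six latent vectors, Z1 through Z6.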

The computing device may perform a convolution operation on each of the plurality of features (4x4 features, 8x8 features, 16x16 features, … 128x128 features) and generate an image corresponding to the size of each. That is, a 4x4 image 21 is generated from the 4x4 features, and an 8x8 image 22 is generated from the 8x8 features.

Specifically, the first Z0, the second Z0, the third Z0, and the fourth Z0, which are the CN (Context Normalization) results (feature values 230), are obtained from four image patches (input images 10), respectively. Applying the affine transformation (A1) to them yields Z1 (latent vector), which is divided into the first Z1, the second Z1, the third Z1, and the fourth Z1. For each of these, 4x4 features may be obtained through the first upscale module 241, and ultimately four 4x4 images 21 may be generated. The same applies to the other latent vectors (Z2, Z3, Z4), as described later.

For reference, the latent vectors Z1, Z2, Z3, Z4, … Zk obtained by performing the plurality of affine transformations (A1, A2, A3, … Ak) may each emphasize a different region. That is, Z1 = A1(Z0), Z2 = A2(Z0), … Zk = Ak(Z0).
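The relation Zk = Ak(Z0) can be sketched with plain matrix arithmetic. Each affine transformation is represented here as a hypothetical (matrix, bias) pair with made-up values; in the described method these parameters would be learned, and the MLP step that precedes the affine transformations is omitted.

```python
def affine(matrix, bias, z):
    """Apply an affine map: matrix @ z + bias, using plain lists."""
    return [sum(m * v for m, v in zip(row, z)) + b
            for row, b in zip(matrix, bias)]


z0 = [1.0, 2.0]                                   # hypothetical latent vector Z0
A1 = ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5])      # hypothetical parameters of A1
A2 = ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])       # hypothetical parameters of A2

z1 = affine(*A1, z0)   # Z1 = A1(Z0)
z2 = affine(*A2, z0)   # Z2 = A2(Z0)
print(z1, z2)
```

Because each Ak has its own parameters, the same Z0 yields a different latent vector for each upscale module, which is what allows each resolution to emphasize a different region.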

For example, assuming that the input image 10 is a human face, Z1 may emphasize the eyes, and the 4x4 features corresponding to Z1 may likewise display an image centered on the eyes. Similarly, Z2 may emphasize the nose, and the 8x8 features corresponding to Z2 may display an image centered on the nose. The emphasized regions are not set in advance but may correspond to results that are determined as training is repeated. In addition, in nxn features, the larger n is, the more Z corresponds to refining details, and the smaller n is, the more it corresponds to drawing the overall structure.

As described above, Z1, Z2, Z3, Z4, … Zk each emphasize a different part, and the present invention can thereby render more accurate details in the output image.

For convenience of explanation, the images obtained from successive upscale modules may be referred to in order as a first image and a second image. For example, the image generated from the 4x4 features may be the first image 21 and the image generated from the 8x8 features the second image 22; equally, the image generated from the 8x8 features may be the first image 22 and the image generated from the 16x16 features the second image 23. That is, the first and second images are named in order of size, and the first loss function is described below using this convention.

FIG. 5 is a diagram showing the formulas of the first loss function and the second loss function according to an embodiment of the present invention.

Regarding the first loss function, the computing device may downsample the second image (e.g., 16x16 features) to the size of the first image (e.g., 8x8 features), so that the two images have the same size. For reference, as described above, a latent vector is generated at each layer of the second decoder 240, and a plurality of images may be generated, one for each of the plurality of latent vectors (Z0) based on the plurality of image patches (input images 10).

Here, the first loss function may be computed between two images: the image (e.g., based on the second Z2) that corresponds to a specific image patch (e.g., the leg portion of the bird image) among the plurality of images at the 8x8 feature scale, and the image (e.g., based on the second Z2) that corresponds to the same image patch among the plurality of images at the 16x16 feature scale. In other words, the first loss function is computed between features obtained from the same image patch.

In addition, the computing device may downsample the N x N image passing through the i-th layer into an N/2 x N/2 image, making it the same size as the N/2 x N/2 image passing through the (i-1)-th layer.

Next, the computing device may minimize the distance between the downsampled second image and the first image using a Euclidean distance function. The formula for the first loss function (L_disent) is shown in FIG. 5(a). As shown there, S denotes downsampling and d denotes the Euclidean distance function, and the part modified at each step can be independent.

With the first loss function, L_disent is computed through the distance function (the Euclidean distance function) and optimized toward 0. That is, even as the upscale modules (241, 242, 243, 244, …) run and generate new features (the second image), the difference from the previous features (the first image) is kept within a certain range.
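
The downsample-then-compare step can be sketched as follows. This is an illustration rather than the formula of FIG. 5(a): it assumes S is 2x2 average pooling and sums the Euclidean distance d over consecutive scales, and the names `downsample_2x`, `euclidean`, and `disentangle_loss` are hypothetical.

```python
import math
from typing import List

Image = List[List[float]]  # a square single-channel image as nested lists


def downsample_2x(image: Image) -> Image:
    """Halve each side by averaging non-overlapping 2x2 blocks (assumed S)."""
    half = len(image) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(half)
        ]
        for r in range(half)
    ]


def euclidean(a: Image, b: Image) -> float:
    """Euclidean distance d between two images of identical size."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def disentangle_loss(images: List[Image]) -> float:
    """Sum d(S(larger), smaller) over consecutive scales, small to large."""
    return sum(
        euclidean(downsample_2x(larger), smaller)
        for smaller, larger in zip(images, images[1:])
    )


first = [[1.0, 2.0], [3.0, 4.0]]                       # 2x2 "first image"
second = [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0],
          [3.0, 3.0, 4.0, 4.0], [3.0, 3.0, 4.0, 4.0]]  # 4x4 "second image"
loss = disentangle_loss([first, second])
```

Here the 4x4 image is an exact enlargement of the 2x2 image, so the loss is zero; any detail added at the larger scale that changes the block averages would raise it.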

This keeps the second image generated by an upscale module consistent with the existing first image and, ultimately, keeps the image generated by outpainting consistent with the input image.

In addition, a conditional GAN loss may be applied at each layer of the second decoder 240 together with L_disent (the first loss function).

The second decoder 240 produces a fake image. During training, this fake image must be compared with a real image to express a loss. For reference, the real image may be the actual full image (x) from which the input image 10 is taken; referring to FIG. 1, the 'bird' image corresponds to the real image (x). The conditional GAN loss may correspond to the formula below.

Referring to the above formula, a downsampling operation is applied to x, the original image of the image patch (input image 10), and an operation is applied to the n x n features. The conditional GAN loss can then be obtained by evaluating n discriminators over the downsampling operation and the operation on the n x n features.

The formula combines a term for the fake image and a term for the real image. It expresses an optimization in which the result is driven to 0 for the fake image (the operation on the n x n features) and to 1 for the real image (the downsampling operation on x).

The discriminator in the formula carries an index i. Because the second decoder 240 of the present invention has a plurality of layers (e.g., six: 4x4, 8x8, 16x16, 32x32, 64x64, 128x128; see FIG. 4), there may likewise be a plurality of values of i (e.g., six). Accordingly, six discriminators may be used.
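
The per-scale discriminator objective can be sketched as follows. The formula referred to above is not reproduced in this text, so the sketch assumes the standard binary cross-entropy form of the GAN discriminator loss, summed over the scales; the function name and the scores are hypothetical.

```python
import math
from typing import Sequence


def discriminator_loss(
    real_scores: Sequence[float], fake_scores: Sequence[float]
) -> float:
    """Sum the binary cross-entropy GAN loss over all scales.

    real_scores[i] is D_i applied to the real image x downsampled to scale i;
    fake_scores[i] is D_i applied to the generated image at scale i.
    Both are assumed to be probabilities strictly between 0 and 1.
    """
    if len(real_scores) != len(fake_scores):
        raise ValueError("need one real and one fake score per discriminator")
    return sum(
        -math.log(real) - math.log(1.0 - fake)
        for real, fake in zip(real_scores, fake_scores)
    )


# Six scales, as in the 4x4 ... 128x128 example; the scores are made up.
good = discriminator_loss([0.9] * 6, [0.1] * 6)  # D separates real from fake
poor = discriminator_loss([0.5] * 6, [0.5] * 6)  # D cannot tell them apart
```

Under this assumed form, the loss falls as each discriminator pushes its output toward 1 on real images and toward 0 on fake images, which matches the optimization described above.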

For reference, the first loss function described above and the second loss function described below are computed on the loss of the generator, whereas the conditional GAN loss is computed on the loss of the discriminator.

In addition, the computing device may attach a residual connection to each layer of the second decoder 240 when applying the formula of FIG. 5(a). Because the part modified at each layer (e.g., the eyes, the nose) then changes independently, the detail expressed at each layer can differ.

Next, the second loss function is described below.

FIG. 6 is a diagram visualizing the second loss function according to an embodiment of the present invention.

Training only on input images (image patches) may fail to secure image diversity. Therefore, during training, the ground truth image may be fed to the CPN 200 together with the image patch to compute the second loss function. That is, the ground truth image may be input directly to the CPN 200, without passing through the FEN 100, to generate a representation vector. The image patch of the ground truth and the ground truth image are then fed together into the second decoder 240 to derive the second loss function (L_Ndiv). The expression in FIG. 5(b) may correspond to the second loss function (L_Ndiv). In FIG. 5(b), k may denote a layer, n the number of layers, N the image size in pixels, D the discriminator, and z a latent vector.

Specifically, during training the computing device of the present invention obtains a plurality of feature values (Z0, 230) from a plurality of image patches (input images 10) and derives a plurality of latent vectors (Z1, Z2, …) from them. Feeding these into a specific upscale module therefore yields a plurality of images, which may be assumed to include a 1-1 image (e.g., based on the first Z2) and a 1-2 image (e.g., based on the second Z2).

For example, the specific upscale module 242 may output 8x8 features, from which a corresponding plurality of images 22 may be generated. These images 22 may be regarded as the 1-1 image (e.g., based on the first Z2) and the 1-2 image (e.g., based on the second Z2).

Only the upscale module 242, which yields 8x8 features, is described above as an example, but the same applies to the other upscale modules (241, 243, etc.). As shown in FIG. 4, each layer of the second decoder 240 can generate its own plurality of images.

The second loss function (L_Ndiv) may correspond to a formula that promotes the generation of diverse images by maximizing the distance between the 1-1 image and the 1-2 image. In other words, a latent vector (Z1, Z2, etc.) is generated at each layer of the second decoder 240, and the plurality of images generated from that latent vector (z) may correspond to the 1-1 image and the 1-2 image.

Here, the second loss function (L_Ndiv) may aim to minimize the distance between latent vectors (z) while maximizing the pixel distance between the images generated from them (the 1-1 image and the 1-2 image). In addition, because the ground truth image is input during training, an image may be generated from it at each layer of the second decoder 240, and this image may also be treated as the 1-1 image or the 1-2 image.

By setting the goal of the second loss function to maximize the distance between the image generated from the ground truth and the ordinary latent vector (z) (so that the two endpoints of the bidirectional arrow in FIG. 6 are as far apart as possible), diversity of the generated images can be secured.

That is, a plurality of images (including the image generated from the ground truth) are generated from a specific latent vector (z), and the pixel distance between those images is maximized so that diverse images are generated.

For example, suppose Z1 concentrates on the eyes and the upscale module 241 produces 4x4 features and a plurality of images (a 1-1 image, a 1-2 image, and so on). Among them, the 1-1 image may concentrate on generating large, round blue eyes, while the 1-2 image may concentrate on generating long, narrow brown eyes. For the other latent vectors Z2, Z3, and so on, images that concentrate on generating different parts may likewise be obtained.

As a result, diversity of the generated output images can be secured while the second loss function is optimized (driven toward 0).
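
One way to express a loss that approaches 0 as the generated images move apart is a ratio of latent distance to pixel distance. The sketch below assumes that ratio form, which is not taken from FIG. 5(b); the names `l2` and `diversity_loss` and the stabilizing constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flat vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity_loss(
    z_a: Sequence[float], z_b: Sequence[float],
    img_a: Sequence[float], img_b: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Assumed ratio form: latent distance divided by image (pixel) distance.

    The value shrinks toward 0 as the two generated images move apart
    relative to how far apart their latent vectors are.
    """
    return l2(z_a, z_b) / (l2(img_a, img_b) + eps)


z_a, z_b = [0.0, 0.0], [0.3, 0.4]  # latent distance 0.5
similar = diversity_loss(z_a, z_b, [0.0, 0.0], [0.1, 0.0])
diverse = diversity_loss(z_a, z_b, [0.0, 0.0], [3.0, 4.0])
```

With the same pair of latent vectors, the pair of images that differ more in pixel space gives the smaller loss, so minimizing it rewards diverse outputs.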

The computing device may combine the first loss function (L_disent) and the second loss function (L_Ndiv) as a linear weighted sum (loss function = a(L_disent) + b(L_Ndiv)), balancing accuracy and diversity, and may train the second decoder 240 accordingly.
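
The linear weighted sum can be written directly. In this sketch the weights a and b are hyperparameters whose values the description does not give; the defaults and the example loss values are placeholders.

```python
def total_loss(l_disent: float, l_ndiv: float,
               a: float = 1.0, b: float = 1.0) -> float:
    """Linear weighted sum a * L_disent + b * L_Ndiv for the second decoder."""
    return a * l_disent + b * l_ndiv


# The same two loss values under two hypothetical weightings.
accuracy_heavy = total_loss(0.2, 0.8, a=2.0, b=0.5)   # emphasizes L_disent
diversity_heavy = total_loss(0.2, 0.8, a=0.5, b=2.0)  # emphasizes L_Ndiv
```

Raising a relative to b penalizes inconsistency between scales more strongly, while raising b penalizes a lack of diversity more strongly.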

Using the second decoder 240 trained through the first loss function and the second loss function, the computing device may obtain an output image 30 that matches the input image 10 and includes an image whose area is expanded by the expanded size.

The amount of expansion may be determined by the layers of the first encoder 110 and the first decoder 130 included in the FEN 100. In some cases, the computing device may apply a larger expansion to a person's leg and arm regions than to the face region, so that in the outpainted output image the generated image containing the leg and arm regions is larger than the generated image containing the face region.

Meanwhile, the present invention may perform outpainting with an image patch containing a lesion, such as a photograph of a cancer, as the input image 10. In this case, the computing device may train a separate determination module to decide in advance whether the patch shows a lesion such as cancer. Various algorithms, such as a CNN, RNN, or GAN, may be used for the determination module.

As a result, the computing device may perform the FEN 100 and CPN 200 processes only for image patches that the determination module judges to be lesions, and may skip them for image patches not judged to be lesions.

In addition, a purpose (e.g., skin cancer, breast cancer, liver cancer, stomach cancer) may be set for the computing device in advance. The computing device may use the determination module to check whether an image patch (input image) corresponds to the set purpose (e.g., breast cancer) and perform the FEN 100 and CPN 200 processes only when it does.
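
The gating described above can be sketched as a small wrapper. The classifier and the outpainting pipeline are passed in as callables; `toy_classifier` and `toy_outpaint` are stand-ins for the trained determination module and the FEN/CPN pipeline, and all names are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

Patch = TypeVar("Patch")
Output = TypeVar("Output")


def outpaint_if_relevant(
    patch: Patch,
    classify: Callable[[Patch], str],
    outpaint: Callable[[Patch], Output],
    targets: Sequence[str],
) -> Optional[Output]:
    """Run the FEN/CPN pipeline only when the patch matches a target label."""
    label = classify(patch)
    if label not in targets:
        return None  # skip FEN and CPN entirely
    return outpaint(patch)


# Stand-ins for the trained determination module and the SRN pipeline.
def toy_classifier(patch: dict) -> str:
    return patch["label"]


def toy_outpaint(patch: dict) -> str:
    return "outpainted:" + patch["name"]


results = [
    outpaint_if_relevant(p, toy_classifier, toy_outpaint,
                         targets=["breast cancer"])
    for p in (
        {"name": "p1", "label": "breast cancer"},
        {"name": "p2", "label": "normal"},
    )
]
```

Only the first patch matches the configured purpose, so only it is passed on to the outpainting step.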

For the above processes, reference may be made to Wide-Context Semantic Image Extrapolation (2019, Yi Wang …) and Nested Scale-Editing for Conditional Image Synthesis (2020, Lingzhi Zhang …).

The embodiments of the present invention described above may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above; the claims set out below, and everything modified equally or equivalently to those claims, fall within the scope of the spirit of the present invention.

10: Input image
20: FEN result
21, 22, 23, 24: Images generated during the calculation process of the second decoder
30: Output image
40: Extended input image
100: Feature Expansion Network (FEN)
110: first encoder
120: first bottleneck
130: first decoder
200: Context Prediction Network (CPN)
210: second encoder
220: second bottleneck
230: Feature value
240: second decoder
241, 242, 243, 244: Upscale modules included in the second decoder

Claims

An image outpainting method,
in a state in which there is a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN), the FEN including a first encoder and a first decoder, and the CPN including a second encoder and a second decoder, the method comprising:
(a) generating, by a computing device, a first bottleneck by inputting an input image into the first encoder of the FEN, and obtaining a FEN result having a size larger than the input image by inputting the first bottleneck into the first decoder;
(b) merging, by the computing device, the FEN result with an extended input image that includes the input image and has an extended area, inputting the merged result into the second encoder of the CPN to generate a second bottleneck, and inputting a feature value obtained using the second bottleneck into the second decoder; and
(c) obtaining, by the computing device, using the second decoder trained through a first loss function and a second loss function, an output image that matches the input image and includes an image whose area is expanded by the expanded size,
wherein:
a plurality of different input images are input to the first encoder,
the second decoder includes a plurality of upscale modules,
the first loss function is applied between a plurality of images obtained by sequentially passing one of the plurality of input images through the plurality of upscale modules,
the second loss function is applied between a plurality of result values corresponding to each of the plurality of input images for one of the plurality of upscale modules, and
the second decoder is trained through a linear weighted sum of the first loss function and the second loss function.

The method of claim 1, wherein
a context normalization result, which is the feature value obtained using the second bottleneck, and a condition vector derived from the second encoder are input to the second decoder and pass sequentially through the plurality of upscale modules included in the second decoder, and
the number of pixels of the obtained image increases each time it passes through one of the plurality of upscale modules.

The method of claim 2, wherein,
when the images obtained from the plurality of upscale modules are referred to in order as a first image and a second image,
the first loss function corresponds to a formula that downsamples the second image and minimizes the distance between the downsampled second image and the first image, so that the difference between the second image and the first image stays within a certain range even as the upscale modules are applied.

The method of claim 2, wherein,
when a plurality of images are obtained from a specific upscale module and the plurality of images include a 1-1 image and a 1-2 image,
the second loss function corresponds to a formula that promotes the generation of diverse images by maximizing the distance between the 1-1 image and the 1-2 image.

A computing device for performing image outpainting,
in a state in which there is a Semantic Regeneration Network (SRN) including a Feature Expansion Network (FEN) and a Context Prediction Network (CPN), the FEN including a first encoder and a first decoder, and the CPN including a second encoder and a second decoder, the computing device comprising:
a communication unit;
a database; and
a processor that inputs an input image into the first encoder of the FEN to generate a first bottleneck, inputs the first bottleneck into the first decoder to obtain a FEN result having a size larger than the input image, merges the FEN result with an extended input image that includes the input image and has an extended area, inputs the merged result into the second encoder of the CPN to generate a second bottleneck, inputs a feature value obtained using the second bottleneck into the second decoder, and obtains, using the second decoder trained through a first loss function and a second loss function, an output image that matches the input image and includes an image whose area is expanded by the expanded size,
wherein:
a plurality of different input images are input to the first encoder,
the second decoder includes a plurality of upscale modules,
the first loss function is applied between a plurality of images obtained by sequentially passing one of the plurality of input images through the plurality of upscale modules,
the second loss function is applied between a plurality of result values corresponding to each of the plurality of input images for one of the plurality of upscale modules, and
the second decoder is trained through a linear weighted sum of the first loss function and the second loss function.