KR101435772B1

KR101435772B1 - GPU virtualizing system

Info

Publication number: KR101435772B1
Application number: KR1020130071605A
Authority: KR
Inventors: 이재진; 김정현
Original assignee: 서울대학교산학협력단
Priority date: 2013-06-21
Filing date: 2013-06-21
Publication date: 2014-08-29
Anticipated expiration: 2033-06-21

Abstract

The present invention relates to a variable-number convertible GPU virtualization system. A first aspect of the present invention provides a GPU virtualization system composed of a plurality of cluster nodes divided into a master node, which includes a GPU virtual device driver that receives GPU code and transfers it to a slave node, and a slave node, which includes a GPU virtual server that receives the GPU code and causes it to be executed on real GPUs. The GPU virtual device driver includes: an emulator that converts GPU code for N real GPUs (N is a natural number), received from OpenCL (Open Computing Language), an open general-purpose parallel computing framework, into a form for delivery to M virtual GPUs (M is a natural number equal to or different from N); and a dispatcher that delivers the GPU code converted by the emulator to the M virtual GPUs.
A second aspect of the present invention provides a GPU virtualization system that performs GPU virtualization for GPU parallel processing, comprising: a master node, which is the node on which a user runs GPU code written for M virtual GPUs; and a slave node that communicates with the master node to receive the GPU code and has N real GPUs to perform GPU operations. The master node includes an analyzer that computes the distribution of the data used by the GPU code on the basis of the work-item size and the work-group size.
Accordingly, by presenting a way to combine or redistribute real GPUs through virtualization, the invention provides virtual GPUs whose memory size and computing performance do not exist in any physical GPU, so that larger programs can be run more easily and quickly.

Description

GPU VIRTUALIZING SYSTEM

The present invention relates to a variable-number convertible GPU virtualization system and, more specifically, to a GPU virtualization system that presents a way to combine or redistribute real GPUs through virtualization.

GPU stands for Graphics Processing Unit. It is a graphics processor developed to relieve the central processing unit (hereinafter, CPU) of the computation and processing load of complex graphics work.

Recently, however, technology that uses the parallel processing capability of the GPU for complex computation has been under active development.

Accordingly, many supercomputers are using GPUs to raise their processing speed. In line with this trend, methods of adding virtualization to GPUs are also being studied so that GPUs can be used more efficiently, and are being developed to relieve the CPU of the burden of complex graphics processing.

As one example, technology that uses the parallel processing capability of the GPU for complex computation is being actively developed, and many recent supercomputers use GPUs to raise their processing speed. Algorithms that parallelize well and are computationally intensive can therefore achieve high performance on GPUs.

Accordingly, the field further calls for technology that uses GPU virtualization to partition a GPU into smaller units, so that a virtual desktop environment can be configured or each virtual machine can be shown as having its own GPU.

[Related Technical Literature]

1. Method and apparatus for compiling and executing an application using virtualization in a heterogeneous system using CPU and GPU (Patent Application No. 10-2010-0093327)

2. Virtual GPU (Patent Application No. 10-2012-0078202)

Meanwhile, the background art described above is technical information that the inventors possessed for deriving the present invention or acquired in the course of deriving it, and is not necessarily publicly known art disclosed to the general public before the filing of the present application.

The present invention is intended to solve the above problems, and its object is to provide a GPU virtualization system that can virtualize N real GPUs as M GPUs so that users can perform parallel programming easily and intuitively.

In addition, a variable-number convertible GPU virtualization system according to another embodiment of the present invention is intended to provide a GPU virtualization system that partitions GPUs to suit user demand and thereby raises GPU utilization.

However, the objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above objects, a variable-number convertible GPU virtualization system according to a first embodiment of the present invention is composed of a plurality of cluster nodes divided into a master node, which includes a GPU virtual device driver that receives GPU code and transfers it to a slave node, and a slave node, which includes a GPU virtual server that receives the GPU code and causes it to be executed on real GPUs. The GPU virtual device driver may include: an emulator that converts GPU code for M virtual GPUs (M is a natural number), received from an open general-purpose parallel computing framework, into a form for delivery to N real GPUs (N is a natural number equal to or different from M); and a dispatcher that delivers the GPU code converted by the emulator to the N real GPUs.

The dispatcher may store the locations of the N real GPUs and deliver the converted GPU code on the basis of the stored locations.

The GPU virtual device driver may further include an analyzer that analyzes, on the basis of the work-item size and the work-group size, whether the data used by the GPU code can be partitioned across the real GPU memories.

When the data used by the GPU code can be partitioned, the analyzer causes the partitioned GPU code to be delivered to the N real GPUs through the emulator and the dispatcher. When the data cannot be partitioned, the GPU code that cannot be partitioned may be executed using either a shared virtual memory scheme or execution on the CPU.

When the data used by the GPU code is partitioned across the N real GPUs, the analyzer may divide the work-group size by N and distribute the parts to the N real GPUs for execution.

When the data used by the GPU code is not partitioned across the N real GPUs, the analyzer may use the shared virtual memory provided by the MMUs of the real GPUs to handle the buffer regions that the N real GPUs read or write in an overlapping manner.

The master node may be the node on which a program is run by a user, and the slave node may be a node that has the N real GPUs and performs GPU operations.

The GPU virtual device driver may support CUDA and OpenCL.

Communication between the GPU virtual server and the GPU virtual device driver preferably uses an interconnection network connecting the master node and the slave node.

To achieve the above objects, a variable-number convertible GPU virtualization system according to a second embodiment of the present invention includes: a master node, which is the node that runs GPU code written for M virtual GPUs; and one or more slave nodes that communicate with the master node to receive the GPU code and have N real GPUs to perform GPU operations. The master node may include an analyzer that computes the distribution of the data used by the GPU code on the basis of the work-item size and the work-group size.

The system preferably further includes: an emulator configured to convert the GPU code for the N real GPUs according to the computation performed by the analyzer; and a dispatcher configured to provide the converted GPU code to each of the N GPUs.

By presenting a way to combine or redistribute real GPUs through virtualization, the variable-number convertible GPU virtualization system according to an embodiment of the present invention provides virtual GPUs whose memory size and computing performance do not exist in any physical GPU, so that larger programs can be run more easily and quickly.

In addition, the variable-number convertible GPU virtualization system according to another embodiment of the present invention can raise GPU utilization by partitioning GPUs to suit user demand.

Furthermore, the variable-number convertible GPU virtualization system according to another embodiment of the present invention can virtualize N real GPUs as M GPUs, so that users can perform parallel programming easily and intuitively.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a diagram for explaining the basic structure of a real GPU used in a variable-number convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the virtual device driver layer in a variable-number convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 3 is a diagram showing the GPU virtual device driver in a variable-number convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining an example of a GPU virtualization technique in a variable-number convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining how a variable-number convertible GPU virtualization system according to an embodiment of the present invention is implemented when the data accessed by the virtualized GPU cannot be partitioned.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention below, detailed descriptions of related well-known functions or configurations will be omitted where they would unnecessarily obscure the gist of the invention.

In this specification, when one component 'transmits' data or a signal to another component, this means that the component may transmit the data or signal to the other component directly, or may transmit it via at least one further component.

FIG. 1 is a diagram for explaining the basic structure of the real GPU (10) used in a variable-number convertible GPU virtualization system according to an embodiment of the present invention. Referring to FIG. 1, the real GPU (10) consists of a number of processing elements (PEs) (11) and a GPU memory (12). Much as in ordinary CPU computation, the PEs (11) of the real GPU (10) read data from the GPU memory (12), process it, and store it back in the GPU memory (12).

The memory management unit (MMU) (13) is provided to implement virtual memory, like the MMU of a CPU. MMUs (13) are being implemented in recent GPUs to support virtualization of the real GPU (10). Variable-number convertible GPU virtualization according to the present invention is described in detail below both for the case where the MMU (13) is present and for the case where it is not.

FIG. 2 is a diagram showing the virtual device driver layer in a variable-number convertible GPU virtualization system according to an embodiment of the present invention. Referring to FIG. 2, the cluster nodes of the variable-number convertible GPU virtualization system are divided into a master node (100a) and a slave node (100b).

Here, the master node (100a) is the node on which the user runs GPU code, and includes hardware (100a-1), an operating system (100a-2), a GPU virtual device driver (100a-3), CUDA (100a-4), OpenCL (100a-5), a CUDA App. (100a-6), and an OpenCL App. (100a-7).

The slave node (100b) is the node that receives the GPU code from the master node (100a) and performs the actual computation, and includes hardware (100b-1), an operating system (100b-2), a GPU device driver (100b-3), CUDA (100b-4), OpenCL (100b-5), and a GPU virtual server (100b-6).

In a cluster environment divided into the master node (100a) and the slave node (100b), the user first connects to the master node (100a) and runs GPU code, whereupon the GPU virtual device driver (100a-3) of the master node (100a) presents a virtual GPU to the user. The GPU virtual device driver (100a-3) also receives the user's GPU code and transmits it to the slave node (100b) over the interconnection network (100c).

The slave node (100b) then receives the user's GPU code from the master node (100a) through the GPU virtual server (100b-6). The received GPU code is passed to the real GPU device driver (100b-3) and can be executed on the N real GPUs (10) (N is a natural number).

FIG. 3 is a diagram showing the GPU virtual device driver (100a-3) in a variable-number convertible GPU virtualization system according to an embodiment of the present invention.

Referring to FIG. 3, the GPU virtual device driver (100a-3) includes an emulator (110a-3), a dispatcher (120a-3), and an analyzer (130a-3).

First, the emulator (110a-3) may be configured as the component that communicates directly with a parallel computing framework, for example CUDA (Compute Unified Device Architecture) (100a-4) or OpenCL (Open Computing Language) (100a-5), an open general-purpose parallel computing framework.

OpenCL (100a-5) passes the GPU code to the emulator (110a-3), and the emulator (110a-3) converts the received GPU code into a form that can be delivered to the real GPUs (10).

For example, when N real GPUs (10) (N is a natural number) are virtualized as one virtual GPU, the emulator (110a-3) converts the GPU code delivered to the one virtual GPU into a form suited to the N real GPUs (10).

Next, the dispatcher (120a-3) delivers the GPU code converted by the emulator (110a-3) to the real GPUs (10). More specifically, because the N real GPUs (10) are distributed across the cluster nodes, the dispatcher (120a-3) remembers the location of each real GPU (10) and delivers commands to the corresponding GPU.
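The dispatcher's bookkeeping can be pictured as a lookup table from GPU identifier to cluster location. The following is a minimal sketch only, not the patented implementation: the names `GpuLocation`, `Dispatcher`, `register`, and `dispatch`, the node names, and the layout of three nodes with two GPUs each are all assumptions introduced for illustration, and the network send is replaced by an in-memory log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GpuLocation:
    """Where one real GPU lives in the cluster (hypothetical fields)."""
    node: str          # slave node name, e.g. "slave0"
    device_index: int  # index of the GPU within that node


@dataclass
class Dispatcher:
    """Remembers GPU locations and routes converted GPU code to them."""
    locations: dict = field(default_factory=dict)  # gpu_id -> GpuLocation
    sent: list = field(default_factory=list)       # log of (node, index, command)

    def register(self, gpu_id, node, device_index):
        self.locations[gpu_id] = GpuLocation(node, device_index)

    def dispatch(self, gpu_id, command):
        loc = self.locations[gpu_id]  # raises KeyError for an unknown GPU
        # Stand-in for a send over the interconnection network.
        self.sent.append((loc.node, loc.device_index, command))
        return loc


dispatcher = Dispatcher()
# Assumed layout: three slave nodes with two GPUs each.
for gpu_id in range(6):
    dispatcher.register(gpu_id, node=f"slave{gpu_id // 2}", device_index=gpu_id % 2)
```

Calling `dispatcher.dispatch(3, "run_kernel")` in this sketch looks up GPU 3 and records a command addressed to the second device on `slave1`.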

Finally, the analyzer (130a-3) is an additional module of the GPU virtual device driver (100a-3) that analyzes whether the data used by the GPU code can be partitioned across the real GPU memories.

In this specification, a module may mean a functional and structural combination of hardware for carrying out the technical idea of the present invention and software for driving that hardware. For example, a module may mean a logical unit of predetermined code and the hardware resources on which that code runs; a person of average skill in the art will readily appreciate that it does not necessarily mean physically connected code or a single type of hardware.

If the data used by the GPU code can be partitioned, the analyzer (130a-3) causes the data to be delivered to the N real GPUs (10) through the emulator (110a-3) and the dispatcher (120a-3).

Conversely, if the analyzer (130a-3) determines that the data used by the GPU code cannot be partitioned, the GPU code can be executed in one of the following two ways. In the first, the MMU (13) implemented in the GPU (10) provides shared virtual memory, and the GPUs exchange data with one another whenever a memory access occurs. In the second, used when the GPU (10) has no MMU (13), the code is executed on the CPU. Both methods are examined in detail with reference to FIG. 5.
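The three execution paths described so far amount to a small decision procedure. A minimal sketch, assuming the analyzer's result and the presence of an MMU are available as two booleans; `Strategy` and `choose_strategy` are hypothetical names, not identifiers from the patent:

```python
from enum import Enum


class Strategy(Enum):
    SPLIT_ACROSS_GPUS = "split"    # emulator + dispatcher path
    SHARED_VIRTUAL_MEMORY = "svm"  # GPUs exchange pages via their MMUs
    RUN_ON_CPU = "cpu"             # fallback when the GPUs have no MMU


def choose_strategy(data_is_divisible, gpu_has_mmu):
    """Pick an execution path from the analyzer's verdict and the hardware."""
    if data_is_divisible:
        return Strategy.SPLIT_ACROSS_GPUS
    if gpu_has_mmu:
        return Strategy.SHARED_VIRTUAL_MEMORY
    return Strategy.RUN_ON_CPU
```

Note that the MMU only matters when the data cannot be partitioned; partitionable data takes the split path on either kind of GPU.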

FIG. 4 is a diagram for explaining an example of a GPU virtualization technique in a variable-number convertible GPU virtualization system according to an embodiment of the present invention. Referring to (1) of FIG. 4, in an environment with three cluster nodes, each fitted with two GPUs (10-0, 10-1, 10-2), virtualization can make it appear to the user that a single machine is equipped with multiple CPUs (2) and one virtual GPU (1).

This specification does not describe the virtualization by which the CPUs (2-0, 2-1, 2-2) and main memories (3-0, 3-1, 3-2) on the cluster nodes, which can be divided into the master node (100a) and the slave node (100b), are combined so as to appear as one machine, since that is already prior art. Only the method of combining the GPUs (10: 10-0, 10-1, 10-2) is presented, as shown in FIG. 4.

As already described with reference to FIG. 2, the master node (100a) is the node that runs the user's program, and the slave node (100b) is the system in which the real GPUs (10) are installed and which performs the actual GPU operations.

As described above, the GPU virtual device driver (100a-3) is installed on the master node (100a) and supports CUDA (100a-4) or OpenCL (100a-5). As already described with reference to FIG. 3, the GPU virtual device driver (100a-3) consists of the emulator (110a-3), the dispatcher (120a-3), and the analyzer (130a-3).

The GPU virtual server (100b-6) is installed on the slave node (100b); it receives commands from the GPU virtual device driver (100a-3) on the master node (100a) and processes them. Communication between the GPU virtual server (100b-6) and the GPU virtual device driver (100a-3) uses the interconnection network (100c) connecting the master node (100a) and the slave node (100b).

With this configuration in mind, (1) of FIG. 4 assumes, as one embodiment of GPU virtualization, that multiple GPUs (10: 10-0, 10-1, 10-2) are virtualized as one virtual GPU (1).

When a programmer runs a program written in the OpenCL programming model, the GPU virtual device driver (100a-3) receives the GPU code and delivers it to the real GPUs (10: 10-0, 10-1, 10-2).

At this point, the analyzer (130a-3) checks whether the data used by the GPU code can be cleanly partitioned across the GPUs (10: 10-0, 10-1, 10-2).

For example, when the code is run through OpenCL (100a-5), the analyzer (130a-3) analyzes the GPU code using the work-item size and work-group size supplied as input.

If the work-group size is 1024 and there are 16 real GPUs (10; for example, 10-0 to 10-15), the analyzer (130a-3) divides the work-group into portions of 64.

If, when the work-group is divided into portions of 64, the buffer regions read or written by the real GPUs can be partitioned without overlapping, the data used by the GPU code can be distributed to the real GPUs (10: 10-0 to 10-15) and executed there.
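The specification does not say how the analyzer detects overlap. As a rough sketch only, assume the buffer is one-dimensional and that the region each GPU touches can be summarized as a half-open index range; `per_gpu_regions`, `regions_are_disjoint`, and the `halo` parameter are hypothetical names introduced for illustration.

```python
def per_gpu_regions(total_items, num_gpus, halo=0):
    """Index range [start, end) that each GPU touches in a 1-D buffer.

    `halo` (an assumption) is how many neighbouring elements on each side
    a work-item also accesses, e.g. 1 for a 3-point stencil.
    """
    if total_items % num_gpus != 0:
        raise ValueError("total_items must divide evenly among the GPUs")
    chunk = total_items // num_gpus
    return [
        (max(0, g * chunk - halo), min(total_items, (g + 1) * chunk + halo))
        for g in range(num_gpus)
    ]


def regions_are_disjoint(regions):
    """True if no two half-open [start, end) regions overlap."""
    ordered = sorted(regions)
    return all(
        prev_end <= start
        for (_, prev_end), (start, _) in zip(ordered, ordered[1:])
    )


independent = per_gpu_regions(1024, 16)      # each item touches only its own element
stencil = per_gpu_regions(1024, 16, halo=1)  # each item also touches its neighbours
```

With 1024 items on 16 GPUs, the `independent` regions are 64 elements wide and disjoint, so the split path applies. The `stencil` regions overlap at every boundary, which corresponds to the non-partitionable case handled by shared virtual memory or CPU execution.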

In this case, the partitioned buffer regions are each copied to the GPU memory (12) of a GPU (10), after which the work-group size is set to 64 and the code is executed on each real GPU (10). When execution is complete, the partitioned buffers can be copied back to the memory (3) of the master node (100a) and merged.
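The copy, execute, and merge sequence for the 1024-item, 16-GPU example can be simulated on the host. This is an illustrative sketch under stated assumptions: Python lists stand in for GPU memories, an element-wise Python function stands in for the kernel, and `split_buffer` and `run_on_virtual_gpu` are hypothetical names.

```python
def split_buffer(buffer, num_gpus):
    """Cut a buffer into equal contiguous chunks, one per real GPU."""
    if len(buffer) % num_gpus != 0:
        raise ValueError("buffer length must divide evenly among the GPUs")
    chunk = len(buffer) // num_gpus
    return [buffer[g * chunk:(g + 1) * chunk] for g in range(num_gpus)]


def run_on_virtual_gpu(buffer, num_gpus, kernel):
    """Simulate one virtual GPU backed by `num_gpus` real GPUs."""
    # 1. Copy each partition to its own (simulated) GPU memory.
    chunks = split_buffer(buffer, num_gpus)
    # 2. Each real GPU runs the kernel over its reduced share of the work.
    results = [[kernel(x) for x in chunk] for chunk in chunks]
    # 3. Copy the partial buffers back to the master node and merge them.
    merged = [y for part in results for y in part]
    return merged, len(chunks[0])


data = list(range(1024))
merged, per_gpu_size = run_on_virtual_gpu(data, 16, lambda x: x * 2)
```

Here `per_gpu_size` is 64, matching the per-GPU size in the text, and `merged` equals the result of applying the kernel to the whole buffer on a single device.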

FIG. 5 is a conceptual diagram for explaining how a variable-number convertible GPU virtualization system according to an embodiment of the present invention is implemented when the data accessed by the virtualized GPU cannot be partitioned. When the data accessed by the virtualized GPU cannot be partitioned, execution proceeds in one of two ways.

First, the MMU (13) of the GPU (10) is used. If the GPU (10) structure of FIG. 1 includes an MMU (13), its functions are used to implement shared virtual memory. FIG. 5 shows one embodiment in which shared virtual memory is implemented.

When GPU0 {10(0)} attempts to access page r, the MMU {13(0)} of GPU0 {10(0)} recognizes that the page is not present on its local GPU.

In this case, the page is fetched from the master node (100a).

If the page is to be modified, a twin page is created, as shown for GPU1 {10(1)} in FIG. 5, so that the modifications can be identified later. After code execution completes, the modified data (diff), determined from the twin page, is applied to the main memory (3).

Because OpenCL (100a-5) and CUDA (100a-4) do not provide synchronization across all threads during kernel execution, applying the diffs to the master node (100a) after the kernel has finished allows all modifications to be merged at the master node (100a). This protocol is consistent with existing implementations of shared virtual memory.
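The twin-and-diff step can be illustrated with byte arrays standing in for pages. This is a sketch under assumptions, not the protocol as specified: the byte-granularity diff, the 8-byte page, and the names `make_twin`, `compute_diff`, and `apply_diff` are all introduced here for illustration.

```python
def make_twin(page):
    """Snapshot a page before its first write so changes can be found later."""
    return bytes(page)


def compute_diff(twin, page):
    """Map each modified byte offset to its new value."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}


def apply_diff(main_page, diff):
    """Apply one GPU's modifications to the master copy of the page."""
    for offset, value in diff.items():
        main_page[offset] = value


# Master copy of one 8-byte page, fetched by two GPUs.
main_page = bytearray(8)
page_gpu0 = bytearray(main_page)
page_gpu1 = bytearray(main_page)

# Each GPU creates a twin before writing, then writes a different byte.
twin_gpu0 = make_twin(page_gpu0)
twin_gpu1 = make_twin(page_gpu1)
page_gpu0[1] = 7
page_gpu1[5] = 9

# After the kernel finishes, both diffs are merged at the master node.
apply_diff(main_page, compute_diff(twin_gpu0, page_gpu0))
apply_diff(main_page, compute_diff(twin_gpu1, page_gpu1))
```

Because only changed bytes are sent back, two GPUs that write different parts of the same page do not overwrite each other's results. If both wrote the same byte, the later diff would win in this sketch.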

Second, the code is executed on the CPU 2. If there is no MMU 13 and shared virtual memory therefore cannot be provided, the memories of different GPUs cannot be synchronized while the GPU code is executing, so the correctness of the result cannot be guaranteed. As an alternative, the emulator 110a-3 executes the code on the CPU 2.
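
The choice among the execution paths described in this section can be summarized in a small sketch. The `Strategy` enum and `choose_strategy` function are hypothetical names introduced for illustration; how divisibility and MMU availability are actually detected is outside the scope of this example.

```python
from enum import Enum


class Strategy(Enum):
    SPLIT_ACROSS_GPUS = "split"      # data can be divided among the actual GPUs
    SHARED_VIRTUAL_MEMORY = "svm"    # indivisible data, GPU has an MMU
    RUN_ON_CPU = "cpu"               # indivisible data, no MMU available


def choose_strategy(data_divisible, gpu_has_mmu):
    """Pick an execution path from the two conditions discussed above."""
    if data_divisible:
        return Strategy.SPLIT_ACROSS_GPUS
    if gpu_has_mmu:
        return Strategy.SHARED_VIRTUAL_MEMORY
    return Strategy.RUN_ON_CPU
```

For example, `choose_strategy(False, False)` selects `Strategy.RUN_ON_CPU`, matching the second method in which the emulator runs the code on the CPU.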

As described above, preferred embodiments of the present invention have been disclosed in this specification and the drawings. Although specific terms have been used, they are used only in a general sense to explain the technical content of the present invention and to aid understanding of the invention, and are not intended to limit its scope. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be practiced.

10: Actual GPU
11: PE (Processing Element)
12: GPU memory
13: MMU (Memory Management Unit)
100a: Master node
100a-1: Hardware
100a-2: Operating system
100a-3: GPU virtual device driver
110a-3: Emulator
120a-3: Dispatcher
130a-3: Analyzer
100a-4: CUDA
100a-5: OpenCL
100a-6: CUDA App.
100a-7: OpenCL App.
100b: Slave node
100b-3: GPU device driver
100b-6: GPU virtual server

Claims

1. A variable-number-convertible GPU virtualization system composed of a plurality of cluster nodes divided into a master node, which includes a GPU virtual device driver that receives GPU code and transfers it to a slave node, and the slave node, which includes a GPU virtual server that receives the GPU code and causes it to be executed on actual GPUs, wherein the GPU virtual device driver comprises:
an emulator that converts GPU code for M virtual GPUs (M is a natural number), received from a parallel computing framework, into a form for delivery to N actual GPUs (N is a natural number equal to or different from M); and
a dispatcher that delivers the GPU code converted by the emulator to the N actual GPUs.

2. The GPU virtualization system of claim 1,
wherein the dispatcher stores the locations of the N actual GPUs and delivers the converted GPU code based on the stored locations of the actual GPUs.

3. The GPU virtualization system of claim 2,
wherein the GPU virtual device driver further comprises an analyzer that analyzes, based on the work-item size and the work-group size, whether the data used by the GPU code can be divided among the memories of the actual GPUs.

4. The GPU virtualization system of claim 3,
wherein, when the data used by the GPU code can be divided, the GPU code is divided among the N actual GPUs through the emulator and the dispatcher, and
when the data used by the GPU code cannot be divided, the undivided GPU code is executed by a shared virtual memory method or a CPU execution method.

5. The GPU virtualization system of claim 4,
wherein, for the data used by the GPU code, the work-group size is divided into N parts, one for each of the N actual GPUs, and the N parts are distributed to the N actual GPUs.

6. The GPU virtualization system of claim 5,
wherein, when the data used by the GPU code is not divided among the N actual GPUs, the N actual GPUs read from or write to overlapping buffer areas using shared virtual memory provided by the MMUs of the actual GPUs.

7. The GPU virtualization system of claim 1,
wherein the master node is a node on which a program is executed by a user, and
the slave node is a node that has the N actual GPUs and performs GPU operations.

8. The GPU virtualization system of claim 1,
wherein the GPU virtual device driver supports a parallel computing framework.

9. The GPU virtualization system of claim 1,
wherein communication between the GPU virtual server and the GPU virtual device driver uses an interconnection network connecting the master node and the slave node.

10. A GPU virtualization system that performs GPU virtualization for GPU parallel processing, comprising:
a master node on which a user executes GPU code for implementation as M virtual GPUs; and
one or more slave nodes that communicate with the master node to receive the GPU code and have N actual GPUs to perform GPU operations,
wherein the master node comprises an analyzer that computes the distribution of the data used by the GPU code based on a work-item size and a work-group size.

11. The GPU virtualization system of claim 10,
wherein the master node further comprises:
an emulator configured to convert the GPU code for N slave nodes according to the computation by the analyzer; and
a dispatcher configured to provide the converted GPU code to each of the N slave nodes.