CN103713938A

CN103713938A - OpenMP-based multi-GPU (graphics processing unit) cooperative computing method in a virtualized environment

Info

Publication number: CN103713938A
Application number: CN201310695055.4A
Authority: CN
Inventors: 秦谦; 袁家斌
Original assignee: Jiangsu Mingtong Tech Co Ltd
Current assignee: Jiangsu Mingtong Tech Co Ltd
Priority date: 2013-12-17
Filing date: 2013-12-17
Publication date: 2014-04-09

Abstract

The invention discloses an OpenMP-based multi-GPU (graphics processing unit) cooperative computing method for a virtualized environment. The method comprises the following steps: OpenMP is used on the host side to create as many host-side threads as there are GPUs, each host-side thread controls one GPU, and each thread allocates video memory on its device and launches a kernel function. Each thread sets up its own private host-side and device-side data pointers; the data is then computed and merged: results are copied back to the private host side through the device-side data pointers owned by each GPU, and are merged on the private host side. Compared with the prior art, the method achieves multi-GPU cooperative computing in a virtualized environment and accelerates a single task using multiple GPUs, which is of theoretical and practical significance for supercomputing, cloud computing and grid computing on heterogeneous CPU (central processing unit) + GPU platforms.

Description

OpenMP-based multi-GPU cooperative computing method in a virtualized environment

Technical field

The present invention relates to the field of single-task multi-GPU computing in a virtualized environment, and in particular to an OpenMP-based multi-GPU cooperative computing method in a virtualized environment.

Background art

Existing multi-GPU cooperative computing techniques are all based on physical machines. OpenMP is generally used for CPU parallel computing; its use with GPUs is limited to an example provided in the official NVIDIA SDK, and no complete API supports multiple GPUs. gVirtuS is currently a relatively mature GPU virtualization solution: it solves the problem of using a GPU for CUDA programming in a virtualized environment, but it targets a single GPU only, and the multi-GPU case has not been studied. In short, the prior art cannot use multiple GPUs simultaneously in a virtualized environment to accelerate a task with a very large data scale.

Summary of the invention

The present invention overcomes the deficiencies of the prior art and provides an OpenMP-based multi-GPU cooperative computing method in a virtualized environment.

To solve the above technical problem, the present invention adopts the following technical solution:

An OpenMP-based multi-GPU cooperative computing method in a virtualized environment comprises the following steps:

Step S01: the GPU virtualization (gVirtuS) server-side component is deployed on the server; the parameters of calls that would otherwise execute locally are intercepted at the client, and the calls are executed on the physical machine; on the host side, the physical machine uses OpenMP to create as many host-side threads as there are GPUs; each host-side thread is responsible for controlling one GPU; an API function maps the ID of each host-side thread to the device number of a GPU; and the object computed by the GPUs is defined as an N*N matrix. The API function is the cudaSetDevice(cpu_thread_id) function;

In the prior art, no GPU driver is available inside the virtualized environment. In step S01, once the GPU virtualization (gVirtuS) server-side component has been deployed on the server, the actual execution is forwarded to the server, and after the server finishes, the result is returned to the virtual machine;

Step S02: video memory is allocated for each thread on its device, and each thread launches its own kernel function to compute the compound matrix operation; the amount of video memory is allocated according to the size of the data to be computed, and the kernel function is a matrix-multiplication kernel;

Step S03, data decomposition: each thread sets up its own private host-side and device-side data pointers; each private host-side pointer points to a different starting position within the original pointer, and a CUDA copy function, namely the cudaMemcpy() function, copies a block of data of size N/n starting from the thread's own private host-side position, thereby partitioning the problem scale, where N is the number of rows or columns of the matrix and n is the number of GPUs;

Step S04, data computation: the matrices on the n GPUs are computed according to the rules of matrix multiplication; the OpenMP synchronization module controls when the GPU results are output, and the data is output synchronously by means of the cudaDeviceSynchronize() function;

Step S05, data merging: the data synchronized in step S04 is copied back to the private host side through each GPU's private device-side data pointer and is merged on the private host side; after completing the computation, the server passes the result to the client over socket communication; the client is a virtual machine.
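Steps S01 to S05 can be illustrated with a minimal CPU-only sketch. It is not the CUDA implementation of the invention: Python threads stand in for the GPUs, list slices stand in for the cudaMemcpy() transfers, and joining the threads stands in for the synchronization step. The names `matmul` and `multi_device_matmul` are hypothetical, and the sketch assumes that N is divisible by the number of devices.

```python
# CPU-only sketch of steps S01-S05 (assumption: threads stand in for GPUs,
# list slices stand in for cudaMemcpy() transfers).
import threading


def matmul(rows, b):
    """Stand-in for the matrix-multiplication kernel: multiply a row block by b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in rows]


def multi_device_matmul(a, b, n_devices):
    """Compute a*b by giving each worker thread N/n consecutive rows of a."""
    n = len(a)
    if n % n_devices != 0:
        raise ValueError("this sketch assumes N is divisible by the device count")
    block = n // n_devices        # N/n rows per device (S03)
    partial = [None] * n_devices  # one private result slot per thread

    def worker(thread_id):
        # S01: the thread ID doubles as the device number.
        start = thread_id * block                     # S03: private start offset
        private_rows = a[start:start + block]         # stand-in for host-to-device copy
        partial[thread_id] = matmul(private_rows, b)  # S02/S04: run the kernel

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # S04: wait until every device has finished

    # S05: merge the per-device row blocks in device order.
    return [row for part in partial for row in part]
```

Because each thread writes only to its own slot of `partial`, no locking is needed in this sketch; the merge simply concatenates the row blocks in device order.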

In step S03, the number of GPUs is 4, namely GPU0, GPU1, GPU2 and GPU3; the matrices are A, B, C and D; and the data decomposition of A*B+C*D comprises the following steps:

1) GPU0 multiplies one half of matrix A by B, GPU1 multiplies the other half of matrix A by B, GPU2 multiplies one half of matrix C by D, and GPU3 multiplies the other half of matrix C by D;

2) the OpenMP synchronization module waits until the multiplications on GPU0, GPU1, GPU2 and GPU3 have all completed; all data is then copied to GPU0 by means of the cudaMemcpy() function, and the host side verifies the correctness of the result.
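The four-GPU decomposition of A*B+C*D can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the patented CUDA code: a thread pool stands in for the four GPUs, each "half" is assumed to be a block of rows, and the final element-wise addition is assumed to happen after the partial products are gathered. The names `matmul` and `ab_plus_cd` are hypothetical.

```python
# CPU-only sketch of the A*B + C*D decomposition over four workers
# (assumption: a thread pool stands in for GPU0..GPU3).
from concurrent.futures import ThreadPoolExecutor


def matmul(rows, b):
    """Stand-in for the matrix-multiplication kernel: multiply a row block by b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in rows]


def ab_plus_cd(a, b, c, d):
    """Compute A*B + C*D with one row-half of A or C per worker."""
    half = len(a) // 2
    jobs = [
        (a[:half], b),  # GPU0: first half of A times B
        (a[half:], b),  # GPU1: second half of A times B
        (c[:half], d),  # GPU2: first half of C times D
        (c[half:], d),  # GPU3: second half of C times D
    ]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Leaving the "with" block waits for all four workers (synchronization).
        parts = list(pool.map(lambda job: matmul(*job), jobs))

    ab = parts[0] + parts[1]  # stack the two row halves of A*B
    cd = parts[2] + parts[3]  # stack the two row halves of C*D
    # Gather step: add the two products element by element.
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(ab, cd)]
```

`pool.map` returns results in job order, so the halves can be stacked without tracking which worker finished first.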

Data computation uses matrix multiplication and comprises the following steps: matrix A is divided into 4 blocks A/4 (each of size N/4*N), and each block is multiplied by matrix B, giving 4 AB/4 matrices; combining the 4 AB/4 matrices yields the result matrix AB. Matrix multiplication is computed in this way on GPU0, GPU1, GPU2 and GPU3, and matrix AB is likewise an N*N matrix.
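The row-block decomposition described above can be written out as follows. The block names A_1 to A_4 are introduced here for illustration only and are assumed to be consecutive row blocks of A, with N divisible by 4.

```latex
A = \begin{pmatrix} A_1 \\ A_2 \\ A_3 \\ A_4 \end{pmatrix},
\qquad A_k \in \mathbb{R}^{(N/4) \times N},
\qquad B \in \mathbb{R}^{N \times N},
\qquad
AB = \begin{pmatrix} A_1 B \\ A_2 B \\ A_3 B \\ A_4 B \end{pmatrix}
\in \mathbb{R}^{N \times N}.
```

Each product A_k B has size (N/4)*N and depends only on its own row block and on B, which is why the four blocks can be computed independently on GPU0 to GPU3 and then stacked.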

Compared with the prior art, the beneficial effects of the present invention are as follows: the invention achieves multi-GPU cooperative computing in a virtualized environment and uses multiple GPUs to accelerate a single task, which is of great theoretical and practical significance for supercomputing, cloud computing and grid computing on heterogeneous CPU+GPU platforms.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the present invention.

Fig. 2 is a schematic diagram of the data decomposition algorithm of the present invention.

Fig. 3 is a schematic diagram of the data computation algorithm of the present invention.

Fig. 4 compares the computation time of A*B+C*D in the multi-GPU and single-GPU environments.

Embodiment

The present invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1, an OpenMP-based multi-GPU cooperative computing method in a virtualized environment comprises the following steps:

Step S01: the GPU virtualization (gVirtuS) server-side component is deployed on the server; the parameters of calls that would otherwise execute locally are intercepted at the client, and the calls are executed on the physical machine; on the host side, the physical machine uses OpenMP to create as many host-side threads as there are GPUs; each host-side thread is responsible for controlling one GPU; the cudaSetDevice(cpu_thread_id) function maps the ID of each host-side thread to the device number of a GPU; and the object computed by the GPUs is defined as an N*N matrix;

Step S03, data decomposition: each thread sets up its own private host-side and device-side data pointers; each private host-side pointer points to a different starting position within the original pointer, and the cudaMemcpy() function copies a block of data of size N/n starting from the thread's own private host-side position, thereby partitioning the problem scale, where N is the number of rows of the matrix and n is the number of GPUs;

Step S05, data merging: the data synchronized in step S04 is copied back to the private host side through each GPU's private device-side data pointer and is merged on the private host side; after completing the computation, the server passes the result to the client (the virtual machine) over socket communication.

As shown in Fig. 2, in step S03 the number of GPUs is 4, namely GPU0, GPU1, GPU2 and GPU3; the matrices are A, B, C and D; and the data decomposition of A*B+C*D comprises the following steps:

As shown in Fig. 3, data computation uses matrix multiplication and comprises the following steps: matrix A is divided into 4 blocks A/4 (each of size N/4*N), and each block is multiplied by matrix B, giving 4 AB/4 matrices; combining the 4 AB/4 matrices yields the result matrix AB. Matrix multiplication is computed in this way on GPU0, GPU1, GPU2 and GPU3, and matrix AB is an N*N matrix.

Fig. 4 compares the computation time of A*B+C*D in the virtualized environment with multiple GPUs and with a single GPU. In the figure, as the matrix order increases, the single-GPU computation time rises exponentially and becomes very long, whereas with OpenMP-based multi-GPU cooperative computing in the virtualized environment the time grows approximately linearly as the matrix order increases, so efficiency is high.

The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. An OpenMP-based multi-GPU cooperative computing method in a virtualized environment, characterized by comprising the following steps:

Step S01: a GPU virtualization server-side component is deployed on the server; the parameters of calls that would otherwise execute locally are intercepted at the client, and the calls are executed on the physical machine; on the host side, the physical machine uses OpenMP to create as many host-side threads as there are GPUs; each host-side thread is responsible for controlling one GPU; an API function maps the ID of each host-side thread to the device number of a GPU; and the task computed by the GPUs is defined as an N*N matrix computation;

Step S02: video memory is allocated for each thread, and a kernel function is launched on each device;

Step S03, data decomposition: each thread sets up its own private host-side and device-side data pointers; each private host-side pointer points to a different starting position within the original pointer, and a CUDA copy function copies N/n of the data starting from the thread's own private host-side position, thereby partitioning the problem scale, where N is the number of rows of the matrix and n is the number of GPUs;

Step S04, data computation: the matrices on the n GPUs are computed according to the rules of matrix multiplication, and OpenMP controls when the GPU results are output so that the data is output synchronously;

Step S05, data merging: the data synchronized in step S04 is copied back to the private host side through each GPU's private device-side data pointer and is merged on the private host side; after completing the computation, the server passes the result to the client over socket communication.

2. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that: in step S03, the number of GPUs is 4, namely GPU0, GPU1, GPU2 and GPU3; the matrices are A, B, C and D; and the data decomposition of A*B+C*D comprises the following steps:

2) the OpenMP synchronization module waits until the multiplications on GPU0, GPU1, GPU2 and GPU3 have all completed; all data is then copied to GPU0, and the host side verifies the correctness of the result.

3. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 2, characterized in that: the data computed on GPU1, GPU2 and GPU3 is copied to GPU0 by means of the cudaMemcpy() function.

4. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1 or 2, characterized in that: the data computation uses matrix multiplication and comprises the following steps: matrix A is divided into 4 blocks A/4 (each of size N/4*N), and each block is multiplied by matrix B, giving 4 AB/4 matrices; combining the 4 AB/4 matrices yields the result matrix AB; and matrix multiplication is computed in this way on GPU0, GPU1, GPU2 and GPU3.

5. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the API function is the cudaSetDevice(cpu_thread_id) function.

6. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the synchronization in step S04 is performed by the cudaDeviceSynchronize() function.

7. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the CUDA copy function is the cudaMemcpy() function.

8. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the client is a virtual machine.

9. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the amount of video memory is allocated according to the size of the data to be computed.

10. The OpenMP-based multi-GPU cooperative computing method in a virtualized environment according to claim 1, characterized in that the kernel function is a kernel function for computing matrix multiplication.