CN119201360B - A GPU virtual sharing system based on multi-system isolation

Info

Publication number: CN119201360B
Application number: CN202411697419.7A
Authority: CN
Inventors: 吴宁; 吴春光; 刘仁学; 黄顺玉; 申利飞
Original assignee: Kirin Software Co Ltd
Current assignee: Kirin Software Co Ltd
Priority date: 2024-11-26
Filing date: 2024-11-26
Publication date: 2025-02-28
Anticipated expiration: 2044-11-26
Also published as: CN119201360A

Abstract

The embodiment of the present invention discloses a GPU virtual sharing system based on multi-system isolation, which divides multiple isolated systems into GPU control domains and GPU application domains. The GPU control domain includes: a GPU control description file for describing the hardware information of GPU hardware resources and the result of allocating GPU hardware resources; a GPU control description information linked list for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module for receiving a GPU application data packet sent by the GPU application domain, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet. The corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transmission between system domains during use and the large number of interrupts it generates, thereby improving the utilization of GPU hardware resources.

Description

GPU virtual sharing system based on multi-system isolation

Technical Field

The invention relates to the technical field of GPU sharing, in particular to a GPU virtual sharing system based on multi-system isolation.

Background

With the development of multi-core CPU technology in recent years, it has become increasingly common to partition the cores and hardware resources of one CPU so that they run different operating systems. Hardware isolation methods such as Jailhouse use virtualization-based hardware isolation to partition a multi-core SoC into a plurality of system domains that run different systems; at the same time, through passthrough technology, each system domain can access the same physical address of a hardware device.

In virtualization, the GPU may be isolated to one system domain, and other system domains may use the GPU through emulated sharing. However, emulating a shared GPU device between system domains relies on interrupts. GPU usage involves a large amount of data input and output, and bulk data transfer triggers a large number of interrupts, which greatly increases the data communication and interrupt handling pressure between the isolated systems. In addition, the extra sending and receiving of data between the two system domains involved in GPU use reduces the real-time performance of GPU data processing and adds useless intermediate data transfer. Moreover, if a plurality of GPUs are used, different system domains support different GPU functions, which greatly increases the difficulty of use and the error rate.

Disclosure of Invention

The embodiment of the invention provides a GPU virtual sharing system based on multi-system isolation, which aims to solve the technical problems in the prior art that a virtual GPU shared among a plurality of isolated operating systems causes a large number of interrupts and bulk data transmission, and that the error rate of a GPU shared by a plurality of isolated operating systems is high.

The embodiment of the invention provides a GPU virtual sharing system based on multi-system isolation, which comprises the following components:

a GPU control domain, used for managing the usage of GPU hardware resources; and

a GPU application domain, used for requesting GPU hardware resources for temporary use by its own domain.

The GPU control domain includes:

a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources;

a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain; and

a GPU control module, used for: receiving a GPU application data packet sent by a GPU application domain; judging, according to the GPU application data packet and the GPU control description file, whether resources corresponding to the application data packet exist; when the resources exist, partitioning and allocating idle GPU hardware resources to the GPU application domain and returning a GPU hardware resource allocation data packet to the GPU application domain; updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet; receiving a GPU setting data packet sent by the GPU application domain and setting the allocated GPU hardware resources according to the GPU setting data packet; and sending a setting success data packet to the GPU application domain after the setting succeeds, or a setting failure command packet after the setting fails.

Further, the GPU application domain includes:

a GPU application description file, used for describing the allocated GPU hardware resource information;

a GPU application description information linked list, used for describing the GPU hardware resource application information of the GPU application domain; and

a GPU application module, used for sending a GPU hardware resource application data packet to the GPU control module of the GPU control domain, and updating the GPU application description information linked list according to the GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain has the corresponding GPU hardware resource usage permission and directly uses the corresponding GPU hardware resources through VFIO; and further used for sending a GPU setting data packet to the GPU control module of the GPU control domain to request that the GPU control module set the allocated GPU hardware resources according to the setting information, using the GPU hardware resources set by the GPU control domain after receiving the setting success data packet sent by the GPU control module, and reporting a setting error after receiving the setting failure data packet.

Furthermore, the GPU control description information linked list comprises linked list nodes, wherein each linked list node describes one GPU application domain's application to use GPU hardware resources.

The linked list node comprises:

information of the system domain to which the GPU hardware resources are allocated;

the permissions of the allocated GPU hardware resources with respect to that system domain;

a system domain structure of the allocated GPU hardware resources, comprising GPU computing resources and a GPU display interface, wherein the GPU computing resources are partitioned and managed by the GPU control module using GPU pooling technology; and

a pointer to the system domain structure corresponding to the next allocated GPU hardware resource.

Further, the GPU application description information linked list includes:

at least one node, each node representing information of a GPU control domain whose GPU hardware resources can be used.

The node comprises:

information of the GPU control domain in which the allocated GPU hardware resources are located;

a system domain structure of the allocated GPU hardware resources; and

a pointer to the GPU control domain structure corresponding to the next allocated GPU hardware resource, which may point to another GPU control domain.

Further, the GPU control module is further configured to:

when the GPU application domain actively releases GPU hardware resources, receive and parse the GPU hardware resource release data packet, reclaim the released GPU hardware resources according to the parsing result, and update the GPU control description information linked list and the GPU control description file according to the reclaimed resources.

Further, the simulation module further includes:

a simulation unit, used for calling a driver interface of the slave system domain to process the USB device packet when the simulation unit determines, through the simulated device linked list, that the peripheral device uses a USB interface, so as to obtain a peripheral device driver device stream of the slave system domain.

Further, the GPU application module is further configured to:

when it is determined that the application domain has finished using the allocated GPU hardware resources, send a hardware resource release data packet to the GPU control module, which receives and parses it, and update the GPU application description information linked list and the GPU application description file according to the parsing result.

Further, the GPU control module is further configured to:

when the corresponding resources do not exist, search the GPU application description information linked list for the application domain occupying those resources, and send a GPU hardware resource release request data packet to that occupying application domain.

Furthermore, the GPU control module of the GPU control domain and the GPU application module of the GPU application domain communicate through the shared memory, and the corresponding interrupt is triggered after the data is written into the shared memory.

Further, the GPU application domain is further configured to:

and actively sending a GPU hardware resource use data packet to acquire GPU hardware resource information and corresponding authority of a corresponding GPU control domain.

Furthermore, when its own resources do not meet the requirements, the GPU control domain can apply for GPU hardware resources from other GPU control domains that have GPU hardware resources;

Correspondingly, when the GPU application domain has GPU hardware resources, the GPU application domain can act as a GPU control domain and provide services to other GPU application domains.

In the GPU virtual sharing system based on multi-system isolation provided by the embodiment of the invention, a plurality of isolated systems are divided into GPU control domains and GPU application domains. The GPU control domain is used for managing the usage of GPU hardware resources, and the GPU application domain is used for requesting GPU hardware resources for temporary use by its own domain. The GPU control domain comprises: a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources; a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module, used for receiving a GPU application data packet sent by the GPU application domain, judging according to the GPU application data packet and the GPU control description file whether resources corresponding to the application data packet exist, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet. When the GPU control domain satisfies a GPU request from a GPU application domain, the corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transmission and the large number of interrupts otherwise generated during use. And because the GPU resources are uniformly managed by the GPU control domain according to GPU usage requests, the error rate and the difficulty of use are reduced.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:

Fig. 1 is a schematic structural diagram of a GPU virtual sharing system based on multi-system isolation according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Fig. 1 is a schematic structural diagram of a GPU virtual sharing system based on multi-system isolation provided by the embodiment of the present invention. Referring to Fig. 1, the system may include: a GPU control domain, used for managing the usage of GPU hardware resources; and a GPU application domain, used for requesting GPU hardware resources for temporary use by its own domain. The GPU control domain includes: a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources; a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module, used for receiving GPU application data packets sent by the GPU application domain, judging according to the GPU application data packet and the GPU control description file whether resources corresponding to the application data packet exist, and, when such resources exist, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet.

The GPU control module is further used for receiving a GPU setting data packet sent by the GPU application domain, setting the allocated GPU hardware resources according to the GPU setting data packet, sending a setting success data packet to the GPU application domain after the setting succeeds, and sending a setting failure command packet to the GPU application domain after the setting fails.

In this embodiment, the isolated systems may be multiple different operating systems that run on the same multi-core CPU, each on its own isolated hardware resources. They comprise a master system and slave systems, and there may be a plurality of slave systems. In this embodiment, the GPU control domain may be a master system or a slave system that has GPU hardware resources, and the GPU application domain may be a master system or a slave system that needs to apply for the use of GPU hardware resources. Further, there may be a plurality of GPU control domains and, correspondingly, a plurality of GPU application domains.

The GPU control domain is configured to manage the usage of GPU hardware resources. Illustratively, GPU hardware resources may be allocated to the GPU control domain, which may thus use them directly. Correspondingly, the other domains that apply to a GPU control domain for GPU hardware resources are GPU application domains: they send a request to the corresponding GPU control domain and use the GPU hardware resources that the GPU control domain allocates to them by means of VFIO passthrough.

The GPU control module is a core component of the GPU control domain. The GPU control module can partition the hardware resources of the GPU device, share the partitioned GPU hardware resources with different system domains, and set permissions on the partitioned GPU hardware resources to control whether other system domains can use them. The permissions include read-only permission, read-write permission, control permission and the like. With read-only permission, another system domain can only read the corresponding GPU hardware resource data; with read-write permission, it can read and write the corresponding GPU hardware resource data; and with control permission, it can control the corresponding GPU hardware resources to complete more complex work.
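
As a minimal sketch of how the three permission levels might be checked, the C fragment below encodes them as ordered values. The names `gpu_perm_t`, `gpu_op_t` and `gpu_perm_allows`, and the assumption that each level includes the rights of the levels below it, are illustrative only and are not specified in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the permission levels named in the text. */
typedef enum {
    GPU_PERM_NONE       = 0,
    GPU_PERM_READ_ONLY  = 1,
    GPU_PERM_READ_WRITE = 2,
    GPU_PERM_CONTROL    = 3
} gpu_perm_t;

/* Hypothetical operations a system domain may attempt on a resource. */
typedef enum { GPU_OP_READ, GPU_OP_WRITE, GPU_OP_CONTROL } gpu_op_t;

/* Returns true when a domain holding `granted` may perform `op`.
 * Assumption: each level includes the rights of the levels below it. */
bool gpu_perm_allows(uint8_t granted, gpu_op_t op)
{
    assert(granted <= GPU_PERM_CONTROL); /* reject values outside the enum */
    switch (op) {
    case GPU_OP_READ:    return granted >= GPU_PERM_READ_ONLY;
    case GPU_OP_WRITE:   return granted >= GPU_PERM_READ_WRITE;
    case GPU_OP_CONTROL: return granted >= GPU_PERM_CONTROL;
    }
    return false;
}
```

Under this encoding, a value such as `GPU_PERM_READ_WRITE` could be stored in the one-byte permission field of a linked list node and checked before each access.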

The GPU control module can load the GPU device and initialize it according to the GPU control description file. One function of the GPU control description file is to describe GPU hardware resource information, for example which GPU devices the GPU control module can load. After the initialization of the GPU device is completed, the GPU control domain may use the hardware resources of the GPU, and the GPU control module can be used to provide the corresponding GPU hardware resources to GPU application domains. The GPU control module also creates and maintains the GPU control description information linked list, which describes the GPU hardware resource application information of each GPU application domain.

The GPU control description information linked list has a linked list structure, in which each node describes one GPU application domain's application to use GPU hardware resources. Each node includes the GPU computing resources, which the GPU control module partitions and manages using GPU pooling technology, and the GPU display interface, which describes the hardware display interface resources provided by the GPU device, such as HDMI, VGA and DP. Each node also holds a pointer to the system domain structure corresponding to the next allocated GPU hardware resource, which makes it easy for the GPU control module to find the next system domain and to obtain a complete view of how the GPU hardware resources have been allocated and what remains.

Illustratively, a node of the GPU control description information linked list may be defined as follows:

#include <stdint.h>

typedef struct control_domain_g {
    uint32_t os;                    // corresponding system domain
    uint8_t limit;                  // corresponding GPU hardware resource permissions
    struct computing_unit gpu_unit; // GPU computing unit structure
    struct display_dev gpu_dev;     // GPU hardware display interface
    struct control_domain_g *next;  // pointer to the GPU resource structure of the next system domain
} control_domain_t;

The GPU control module can receive a GPU application data packet sent by a GPU application domain, parse it to determine the GPU hardware resources requested, and judge whether the idle GPU hardware resources can satisfy the request. When the request can be satisfied, the GPU control module allocates the corresponding GPU hardware resources to the GPU application domain and sends it a GPU hardware resource allocation data packet, so that the GPU application domain can use the allocated GPU hardware resources. The allocation information is then written into the GPU control description information linked list as a new linked list node. The allocation information, that is, the correspondence between the allocated GPU hardware resources and the GPU application domain, describes how the GPU hardware resources are partitioned and the permissions with which the GPU application domain may use the allocated GPU hardware resources.
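
The allocation step described above can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the pool is modeled as a plain count of idle compute units, the display interface field is omitted, and the names `gpu_control_t`, `gpu_apply_packet_t` and `gpu_control_handle_apply` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: a share of the pooled GPU compute resources. */
struct computing_unit { uint32_t units; };

typedef struct control_domain_g {
    uint32_t os;                     /* system domain holding the resource */
    uint8_t limit;                   /* permission granted to that domain */
    struct computing_unit gpu_unit;  /* compute share allocated */
    struct control_domain_g *next;   /* next allocation record */
} control_domain_t;

/* Hypothetical control-module state: idle pool plus allocation list. */
typedef struct {
    uint32_t free_units;
    control_domain_t *head;
} gpu_control_t;

/* Hypothetical layout of a GPU application data packet. */
typedef struct {
    uint32_t os;     /* requesting system domain */
    uint32_t units;  /* compute units requested */
    uint8_t limit;   /* permission requested */
} gpu_apply_packet_t;

/* Tries to satisfy a request from the idle pool. On success the
 * allocation is recorded as a new list node and true is returned.
 * On false, a caller would fall back to the reclamation path. */
bool gpu_control_handle_apply(gpu_control_t *ctl, const gpu_apply_packet_t *req)
{
    assert(ctl != NULL && req != NULL);
    if (req->units == 0 || req->units > ctl->free_units)
        return false;

    control_domain_t *node = malloc(sizeof *node);
    if (node == NULL)
        return false;
    node->os = req->os;
    node->limit = req->limit;
    node->gpu_unit.units = req->units;
    node->next = ctl->head;          /* push the record at the list head */
    ctl->head = node;
    ctl->free_units -= req->units;
    return true;
}
```

In a fuller version, the `true` result would be followed by sending the GPU hardware resource allocation data packet and by writing the same information to the GPU control description file.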

When the request cannot be satisfied, the GPU control module sends a GPU hardware resource recovery data packet to the GPU application domain occupying the resources, receives the GPU hardware resource release data packet returned by that GPU application domain, parses it, reclaims the corresponding GPU hardware resources, and updates the GPU control description information linked list. For example, the corresponding node may be deleted and the pointers associated with it modified.

If no GPU hardware resource release data packet is received, the resources are not reclaimed for the time being. In addition, after the GPU application domain finishes using the allocated GPU hardware resources, it needs to release them actively by sending a GPU hardware resource release data packet. The GPU control module receives and parses this packet, reclaims the corresponding GPU hardware resources, and updates the GPU control description information linked list; from then on the GPU application domain can no longer access those GPU hardware resources. The updated information is synchronized to the GPU control description file, which facilitates initialization of the GPU control module after a start or restart.
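
The node deletion and pointer fix-up mentioned above might look like the following C sketch. The node is reduced to the fields needed here, with `units` as a hypothetical stand-in for the computing unit structure, and the function name `gpu_control_reclaim` is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced allocation record; `units` stands in for gpu_unit. */
typedef struct control_domain_g {
    uint32_t os;                    /* system domain holding the resource */
    uint32_t units;                 /* compute units held by that domain */
    struct control_domain_g *next;
} control_domain_t;

/* Removes the first record held by domain `os`, returns its units to
 * the idle pool and frees the node. Returns false if `os` holds nothing. */
bool gpu_control_reclaim(control_domain_t **head, uint32_t os,
                         uint32_t *free_units)
{
    assert(head != NULL && free_units != NULL);
    for (control_domain_t **link = head; *link != NULL; link = &(*link)->next) {
        control_domain_t *node = *link;
        if (node->os == os) {
            *link = node->next;     /* repair the pointer that led to the node */
            *free_units += node->units;
            free(node);
            return true;
        }
    }
    return false;
}
```

Walking the list through a pointer to the link, rather than a pointer to the node, lets the head and interior nodes be removed by the same code path.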

In addition, the GPU control module can also receive a GPU setting data packet sent by a GPU application domain, set the allocated GPU hardware resources according to the GPU setting data packet, send a setting success data packet to the GPU application domain after the setting succeeds, and send a setting failure command packet to the GPU application domain after the setting fails. In this embodiment, the GPU application domain may select a library suited to the requirements of its task in order to perform various calculations using the computing capability of the GPU. For example, when the GPU application domain needs to use the GPU for image drawing and rendering through the OpenGL library, it may send a setting data packet based on the functional characteristics of the OpenGL library; after the GPU control module completes the setting successfully, it returns a setting success data packet to the GPU application domain, so that the GPU application domain can make better use of the allocated GPU resources under that hardware setting. Illustratively, the GPU hardware resources may be used by calling libraries such as OpenGL, OpenCL, FFmpeg, GStreamer and Mesa.

The GPU application domain comprises a GPU application description file, a GPU application description information linked list and a GPU application module. The GPU application description file is used for describing GPU hardware resource information. The GPU application description information linked list is used for describing the GPU hardware resource application information of the GPU application domain. The GPU application module is used for sending a GPU hardware resource application data packet to the GPU control module of the GPU control domain and updating the GPU application description information linked list according to the GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain has the corresponding GPU hardware resource usage permission and uses the corresponding GPU hardware resources directly through VFIO. The permissions include read-only permission, read-write permission, control permission and the like. With read-only permission, the GPU application domain can only read the corresponding GPU hardware resource data; with read-write permission, it can read and write the corresponding GPU hardware resource data; and with control permission, it can control the corresponding GPU hardware resources to complete more complex work.

In this embodiment, the GPU application module is a core component of the GPU application domain. Specifically, the GPU application module is configured to generate a GPU application data packet according to the operating requirements of its system domain and to send the packet to the GPU control module of the corresponding GPU control domain. There may be one or several corresponding GPU control domains, depending on the operating requirements and the multi-system configuration. After the GPU control module of the GPU control domain receives the GPU application data packet, it judges from the current hardware resource information whether the request can be met; if so, it allocates idle GPU hardware resources to the GPU application domain and returns a GPU application confirmation data packet to the GPU application module according to the allocation information. The GPU application module receives and parses the GPU application confirmation data packet and, according to the parsing result, writes the allocated GPU hardware resources into the GPU application description file. The GPU application module then initializes according to the GPU application description file and creates the GPU application description information linked list accordingly. This initialization is equivalent to mounting the corresponding GPU hardware for the GPU application domain, so that the allocated GPU hardware resources can later be used directly through passthrough.

Furthermore, the GPU application module can also send a GPU setting data packet to request that the GPU control module of the GPU control domain set the GPU hardware resources allocated to the GPU application domain. After receiving the setting success data packet sent by the GPU control domain, the GPU application module uses the GPU hardware resources allocated and set by the GPU control domain according to the setting information; after receiving the setting failure data packet, the GPU application module reports a setting error. The setting data packet may include information that directly configures the allocated GPU hardware resources, or it may provide only configuration description information, in which case the GPU control module configures the allocated GPU hardware resources using the corresponding configuration information that it holds itself, which reduces the amount of data transmitted.

In this embodiment, the GPU application description information linked list is also a linked list structure containing at least one node, where each node represents information of a GPU control domain whose GPU hardware resources can be used. By way of example, each node may include: information of the GPU control domain in which the allocated GPU hardware resources are located; the permissions of the allocated GPU hardware resources; a system domain structure of the allocated GPU hardware resources, which describes the allocated hardware resource information, namely the GPU computing resources that the GPU control module partitions and manages using GPU pooling technology and the hardware display interface resources provided by the GPU hardware device; and a pointer to the GPU control domain structure corresponding to the next allocated GPU hardware resource. The pointer facilitates computation that uses several partitioned GPU resources, which may reside in one or more GPU control domains.

Illustratively, a node of the GPU application description information linked list may be defined as follows:

#include <stdint.h>

typedef struct apply_domain_g {
    uint32_t os;                    // corresponding system domain
    uint8_t limit;                  // corresponding GPU hardware resource permissions
    struct computing_unit gpu_unit; // GPU computing unit structure
    struct display_dev gpu_dev;     // GPU hardware display interface
    struct apply_domain_g *next;    // pointer to the GPU resource structure of the next system domain
} apply_domain_t;
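
To illustrate how an application domain might use this list when its resources come from several control domains, the C sketch below records grants and sums the usable compute units. The `units` field is a hypothetical stand-in for the computing unit structure, the display interface is omitted, and both function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced application-side record; `units` stands in for gpu_unit. */
typedef struct apply_domain_g {
    uint32_t os;                   /* control domain that granted the resource */
    uint8_t limit;                 /* permission level granted */
    uint32_t units;                /* compute units granted */
    struct apply_domain_g *next;
} apply_domain_t;

/* Records a granted allocation at the head of the list.
 * The node remains owned by the caller. */
void apply_list_push(apply_domain_t **head, apply_domain_t *node)
{
    assert(head != NULL && node != NULL);
    node->next = *head;
    *head = node;
}

/* Sums the units granted with at least `min_limit` permission,
 * across all control domains present in the list. */
uint32_t apply_list_usable_units(const apply_domain_t *head, uint8_t min_limit)
{
    uint32_t total = 0;
    for (const apply_domain_t *n = head; n != NULL; n = n->next) {
        if (n->limit >= min_limit)
            total += n->units;
    }
    return total;
}
```

A task that needs write access could, for example, call `apply_list_usable_units` with the read-write level to see how much compute is available to it across all granting control domains.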

In addition, the GPU application description file can be updated and maintained according to the applied GPU hardware resources, so that the updated GPU application description file can be used for initialization after the GPU application domain is restarted.

When the GPU application domain has finished using the allocated GPU hardware resources, the GPU application module can send a GPU hardware resource release data packet to the GPU control module that allocated them. The GPU control module receives and parses the packet and then reclaims the corresponding GPU hardware resources. The GPU application module updates the GPU application description information linked list accordingly, for example by deleting the corresponding node. When GPU hardware resources from a plurality of GPU control modules are in use, the corresponding pointers are modified according to which resources have been finished with.

In addition, the GPU application module can also receive a GPU hardware resource release request data packet, parse it, and judge whether the GPU hardware resources named in the request are no longer in use. If their use is complete, the GPU application module returns a GPU hardware resource release data packet to the GPU control module so that the GPU control module can release and reclaim the corresponding GPU hardware resources. Otherwise, no action is required.

In this embodiment, the GPU control domain and the GPU application domain may communicate through shared memory: when data is written to the shared memory, a soft interrupt is raised on the receiving side to process the data. Because only these control packets cross domains in this way, the number of interrupts is significantly reduced compared with the traditional approach, and the real-time performance of GPU data processing is not affected.
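
The write-then-interrupt pattern can be illustrated with a single-slot mailbox in C. This is a single-process sketch under stated assumptions: the `mailbox_t` layout and all function names are hypothetical, the interrupt is stood in for by a callback, and the memory barriers and hypervisor-provided doorbell that a real cross-domain implementation would need are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAILBOX_PAYLOAD_MAX 64

/* Hypothetical layout of one mailbox slot in the shared memory region. */
typedef struct {
    uint32_t len;
    uint8_t payload[MAILBOX_PAYLOAD_MAX];
    bool full;
} mailbox_t;

/* Stands in for raising the interrupt on the receiving domain. */
typedef void (*doorbell_fn)(mailbox_t *box);

/* Example doorbell that only counts how often it was rung. */
unsigned doorbell_count = 0;
void count_doorbell(mailbox_t *box)
{
    (void)box;
    doorbell_count++;
}

/* Writes one packet into the slot, then rings the doorbell once.
 * Returns false if the slot is still occupied or the packet is too large. */
bool mailbox_send(mailbox_t *box, const void *data, uint32_t len,
                  doorbell_fn ring)
{
    assert(box != NULL && data != NULL && ring != NULL);
    if (box->full || len > MAILBOX_PAYLOAD_MAX)
        return false;
    memcpy(box->payload, data, len);
    box->len = len;
    box->full = true;
    ring(box);            /* one notification per packet written */
    return true;
}

/* Receiver-side handler: copies the packet out and frees the slot.
 * Returns the packet length, or 0 if the slot is empty or `cap` is too small. */
uint32_t mailbox_receive(mailbox_t *box, void *out, uint32_t cap)
{
    assert(box != NULL && out != NULL);
    if (!box->full || box->len > cap)
        return 0;
    uint32_t len = box->len;
    memcpy(out, box->payload, len);
    box->full = false;
    return len;
}
```

Only small control packets such as application, allocation, setting and release packets would travel through a mailbox like this; the GPU data itself is accessed through passthrough and does not pass through the shared memory.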

In addition, in this embodiment, the roles of the GPU control domain and the GPU application domain are not fixed. Illustratively, the GPU control domain may also be provided with a GPU application module, a GPU application description file, and a GPU application description information linked list, so that it can use GPU hardware resources of other system domains. Correspondingly, the GPU application domain may also be provided with a GPU control module, a GPU control description file and a GPU control description information linked list, so that it can provide GPU hardware resources to other system domains.

The working process of the GPU virtual sharing system based on multi-system isolation provided in this embodiment is further described below.

First, Jailhouse can be used to partition the SoC into a plurality of isolated systems, and the GPU application domains and GPU control domains are set according to how the GPU hardware resources are configured. The GPU control domain is used for managing the usage of GPU hardware resources, and the GPU application domain is used for requesting GPU hardware resources for temporary use by its own domain. Each system domain independently runs its own operating system, such as Linux or an RTOS (real-time operating system), and the GPU device is passed through from the GPU control domain to the GPU application domain by a hardware passthrough method, so that the GPU control domain and the GPU application domain can share and directly access the physical GPU device resources.

After the GPU control domain and the GPU application domain are set, the GPU control module in the GPU control domain is initialized using the GPU hardware information described in the GPU control description file, so that the GPU control module determines which GPU hardware devices can be loaded and the characteristics of those devices. The GPU control description information linked list can then be generated from the GPU control description file and maintained in real time; it describes separately the GPU hardware resources corresponding to each system domain.

Correspondingly, in the GPU application domain, the GPU application module is initialized according to the GPU application description file, which first of all identifies the system domain as a GPU application domain.

After the initialization of the GPU application domain and the GPU control domain is completed, the GPU application domain sends a GPU hardware resource application data packet to the GPU control module of the GPU control domain according to the hardware resources it requires. The GPU control module receives the GPU application data packet from the GPU application module of the GPU application domain and, if idle GPU hardware resources exist, allocates the corresponding GPU hardware resources to the GPU application domain and sends it a GPU hardware resource allocation data packet, so that the GPU application domain can directly use the allocated GPU hardware resources by means of passthrough. The GPU control module then updates the GPU control description file, which may specifically describe the mapping between the allocated GPU hardware resources and the application domains, the way the GPU hardware resources are partitioned, the usage permissions of the different system domains on the GPU hardware resources, information related to sharing the GPU hardware resources, and the like. The GPU control description information linked list is also updated and maintained according to the allocation information.
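The allocation step described above can be sketched as follows. This is only an illustration under stated assumptions: the packet layouts, the notion of counting resources in "compute units", the fixed "rw" permission string, and the use of a dict in place of the linked list and the description file are all hypothetical and are not defined by this embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ApplicationPacket:
    """Hypothetical layout of a GPU hardware resource application data packet."""
    domain_id: str
    compute_units: int


@dataclass
class AllocationPacket:
    """Hypothetical layout of a GPU hardware resource allocation data packet."""
    domain_id: str
    compute_units: int
    permissions: str


@dataclass
class ControlModule:
    free_units: int                                             # idle GPU resources
    allocations: Dict[str, int] = field(default_factory=dict)   # stand-in for the linked list

    def handle_application(self, pkt: ApplicationPacket) -> Optional[AllocationPacket]:
        # No idle resources for this request: nothing is allocated.
        if pkt.compute_units <= 0 or pkt.compute_units > self.free_units:
            return None
        # Allocate, then record the mapping between resources and domain.
        self.free_units -= pkt.compute_units
        held = self.allocations.get(pkt.domain_id, 0)
        self.allocations[pkt.domain_id] = held + pkt.compute_units
        return AllocationPacket(pkt.domain_id, pkt.compute_units, "rw")

    def description_file(self) -> dict:
        # Stand-in for the updated GPU control description file.
        return {"free_units": self.free_units, "allocations": dict(self.allocations)}
```

A request that exceeds the idle resources returns `None` here; how the control domain actually answers such a request (for example by asking another domain to release resources, as in claim 6) is outside this sketch.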

If the GPU application domain successfully applies for GPU hardware resources and receives the GPU hardware resource allocation data packet, it updates the GPU application description information linked list, obtains the corresponding GPU hardware resource usage permission, and generates the GPU application description file. The GPU application domain can then directly use the allocated GPU hardware resources through VFIO passthrough, achieving an effect similar to using the GPU resources directly. The GPU application module can also actively send a GPU hardware resource usage data packet to obtain the GPU hardware resource information and the corresponding permissions of the corresponding GPU control domain.

When the GPU hardware resources allocated to a GPU application domain need to be recovered, the GPU control module sends a GPU hardware resource recovery data packet to the GPU application module of that GPU application domain, and the GPU application module checks whether the corresponding GPU hardware resources can be released. If the resources are idle and can be released by the GPU application domain, the GPU control module recovers them successfully and adds the relevant GPU hardware information to the GPU control description information linked list. If the resources are in use by the GPU application domain and cannot be released, the recovery fails and the GPU control module takes no further action on those resources.

When the GPU application domain actively releases GPU hardware resources, the GPU control module receives and parses the GPU hardware resource release data packet, recovers the corresponding GPU hardware resources, and adds the relevant GPU hardware information to the GPU control description information linked list. The results of GPU hardware resource allocation and recovery are synchronously updated in the GPU control description file and the GPU application description file.
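The two ways resources return to the GPU control domain, recovery initiated by the control module and active release by the application domain, can be sketched together. The class and method names, the `busy` flag, and the dict standing in for the linked list are assumptions for illustration; the packets are modelled as plain method calls rather than real inter-domain messages.

```python
from typing import Dict


class ApplicationModule:
    """Application-domain side; field names are illustrative."""

    def __init__(self, domain_id: str) -> None:
        self.domain_id = domain_id
        self.held_units = 0
        self.busy = False       # True while the domain is using the GPU resources

    def on_recovery_packet(self) -> int:
        """Answer a recovery packet: units given back, or 0 if they cannot be released."""
        if self.busy:
            return 0
        return self.release()

    def release(self) -> int:
        """Active release: give up everything held and report how much."""
        released, self.held_units = self.held_units, 0
        return released


class ControlModule:
    """Control-domain side; a dict stands in for the linked list."""

    def __init__(self, free_units: int) -> None:
        self.free_units = free_units
        self.allocations: Dict[str, int] = {}

    def grant(self, app: ApplicationModule, units: int) -> None:
        self.free_units -= units
        self.allocations[app.domain_id] = units
        app.held_units = units

    def on_release_packet(self, domain_id: str, units: int) -> None:
        """Take released units back and update the allocation records."""
        self.free_units += units
        self.allocations.pop(domain_id, None)

    def recover(self, app: ApplicationModule) -> bool:
        """Send a recovery packet; succeed only if the domain can release."""
        units = app.on_recovery_packet()
        if units == 0:
            return False        # resources in use: recovery fails, nothing changes
        self.on_release_packet(app.domain_id, units)
        return True
```

In this sketch a failed recovery leaves both sides untouched, matching the statement that the control module does not process resources that are still in use.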

When the GPU application domain lacks a GPU-related software library, the GPU application module of the GPU application domain can send a GPU setting data packet to the GPU control domain, requesting the GPU control module to set the GPU hardware resources allocated to the GPU application domain. The GPU control module receives the GPU setting data packet and sets the allocated GPU hardware resources accordingly. If the setting succeeds, the GPU control module sends a setting success data packet containing the setting success information to the GPU application module; if the setting fails, it sends a setting failure command packet containing error information. After receiving the setting success data packet, the GPU application module uses the GPU hardware resources allocated and set by the GPU control domain according to the setting information; after receiving the setting failure data packet, it throws the setting error information.
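The setting exchange can be sketched as a request and a reply. Everything named here is hypothetical: the packet fields, the `SUPPORTED_KEYS` set, and the example setting names are assumptions chosen only to make the success and failure paths concrete; this embodiment does not specify which settings exist or how they are validated.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SettingPacket:
    """Hypothetical GPU setting data packet."""
    domain_id: str
    settings: Dict[str, int]


@dataclass
class SettingReply:
    """Hypothetical setting success or setting failure packet."""
    success: bool
    info: str


class SettingError(Exception):
    """Raised by the application module when the control domain reports failure."""


# Assumed set of settings the control module knows how to apply.
SUPPORTED_KEYS = {"clock_mhz", "power_limit_w"}


def control_handle_setting(allocations: Dict[str, int],
                           applied: Dict[str, Dict[str, int]],
                           pkt: SettingPacket) -> SettingReply:
    """Control-module side: apply settings to resources already allocated."""
    if pkt.domain_id not in allocations:
        return SettingReply(False, "no GPU resources allocated to this domain")
    unknown = sorted(set(pkt.settings) - SUPPORTED_KEYS)
    if unknown:
        return SettingReply(False, "unsupported settings: " + ", ".join(unknown))
    applied.setdefault(pkt.domain_id, {}).update(pkt.settings)
    return SettingReply(True, "settings applied")


def application_request_setting(send: Callable[[SettingPacket], SettingReply],
                                pkt: SettingPacket) -> Dict[str, int]:
    """Application-module side: send the packet, then use or throw."""
    reply = send(pkt)
    if not reply.success:
        raise SettingError(reply.info)   # throw the setting error information
    return pkt.settings                  # resources are now usable as configured
```

Passing the transport as the `send` callable keeps the sketch independent of how the two domains actually exchange packets.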

By the method, the GPU hardware resources can be effectively managed and reasonably allocated, so that each application domain can use the GPU hardware resources.

The GPU virtual sharing system based on multi-system isolation provided in this embodiment divides a plurality of isolated systems into a GPU control domain and a GPU application domain, wherein the GPU control domain is used for managing the usage of GPU hardware resources and the GPU application domain is used for requesting GPU hardware resources for temporary use by that domain. The GPU control domain comprises: a GPU control description file for describing the hardware information of the GPU hardware resources and the result of allocating them; a GPU control description information linked list for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module for receiving a GPU application data packet sent by the GPU application domain, judging from the GPU application data packet and the GPU control description file whether the requested resources exist, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet. When the GPU control domain satisfies the GPU request of a GPU application domain, the corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transfer and the large number of interrupts that would otherwise be generated during use. Because the GPU resources are uniformly managed by the GPU control domain according to GPU usage requests, the error rate and the difficulty of use are reduced.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. A GPU virtual sharing system based on multi-system isolation, characterized by comprising:

A GPU control domain, wherein the GPU control domain is used to manage the usage of GPU hardware resources;

A GPU application domain, wherein the GPU application domain is used to request GPU hardware resources for temporary use by the domain;

The GPU control domain includes:

A GPU control description file, wherein the GPU control description file is used to describe the hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources;

A GPU control description information linked list, wherein the GPU control description information linked list is used to describe the GPU hardware resource information used by each GPU application domain;

A GPU control module, wherein the GPU control module is used to receive a GPU application data packet sent by a GPU application domain, judge whether there are resources corresponding to the application data packet according to the GPU application data packet and the GPU control description file, and if so, allocate idle GPU hardware resources to the GPU application domain, return a GPU hardware resource allocation data packet to the GPU application domain, and update and maintain the GPU control description information linked list and the GPU control description file according to the GPU application data packet; and

Receive a GPU setting data packet sent by the GPU application domain, set the allocated GPU hardware resources according to the GPU setting data packet, send a setting success data packet to the GPU application domain after the setting is successful, and send a setting failure command packet to the GPU application domain after the setting fails;

The GPU application domain includes:

A GPU application description file, wherein the GPU application description file is used to describe the allocated GPU hardware resource information;

A GPU application description information linked list, wherein the GPU application description information linked list is used to describe the GPU hardware resource application information of the GPU application domain;

A GPU application module, wherein the GPU application module is used to send a GPU hardware resource application data packet to a GPU control module of a GPU control domain, and update a GPU application description information linked list according to a GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain has the corresponding GPU hardware resource usage authority and directly uses the corresponding GPU hardware resources using VFIO; and

A GPU setting data packet is sent to the GPU control module of the GPU control domain, requesting the GPU control module to set the allocated GPU hardware resources according to the setting information; after receiving the setting success data packet sent by the GPU control module, the GPU hardware resources set by the GPU control domain are used according to the setting information; after receiving the setting failure data packet, the GPU application module throws a setting error message.

2. The system according to claim 1, characterized in that the GPU control description information linked list comprises: linked list nodes, each linked list node corresponding to a description of a GPU application domain applying for the use of GPU hardware resources;

The linked list nodes include:

Information about the system domain corresponding to the allocated GPU hardware resources;

The permissions of the allocated GPU hardware resources relative to the allocated system domain;

A system domain structure of allocated GPU hardware resources, the structure comprising: GPU computing resources divided and managed by the GPU control module using GPU pooling technology, and a GPU display interface;

And a pointer to the system domain structure corresponding to the next allocated GPU hardware resource.

3. The system according to claim 2, wherein the GPU application description information linked list comprises:

At least one node, each node represents information of a GPU control domain that can use GPU hardware resources;

The nodes include:

Information about the GPU control domain where the allocated GPU hardware resources are located;

Permissions for allocated GPU hardware resources;

System domain structure for allocated GPU hardware resources;

And a pointer to a GPU control domain structure corresponding to the next allocated GPU hardware resource, wherein the pointer points to another GPU control domain.

4. The system according to claim 1, wherein the GPU control module is further configured to:

When the GPU application domain actively releases GPU hardware resources, the GPU control module receives and parses the GPU hardware resource release data packet, recycles the released GPU hardware resources according to the parsing result, and updates the GPU control description information linked list and the GPU control description file according to the recycled and released information.

5. The system according to claim 4, wherein the GPU application module is further used for:

When it is determined that the application domain has finished using the allocated GPU hardware resources, a hardware resource release data packet is sent to the GPU control module, and a hardware resource recovery data packet is received and parsed, and the GPU application description information linked list and the GPU application description file are updated according to the parsing result.

6. The system according to claim 5, wherein the GPU control module is further configured to:

When there is no corresponding resource, the occupied application domain corresponding to the corresponding resource in the GPU application description information linked list is searched, and a GPU hardware resource release request data packet is sent to the occupied application domain.

7. The system according to claim 1, characterized in that the GPU control module of the GPU control domain and the GPU application module of the GPU application domain communicate through a shared memory, and a corresponding interrupt is triggered after data is written into the shared memory.

8. The system according to claim 3, wherein the GPU application domain is further used for:

Actively send GPU hardware resource usage data packets to obtain the GPU hardware resource information and corresponding permissions of the corresponding GPU control domain.

9. The system according to claim 1, characterized in that when its own resources do not meet the requirements, the GPU control domain can apply for GPU hardware resources from other GPU control domains that have GPU hardware resources;

Correspondingly, when the GPU application domain has GPU hardware resources, the GPU application domain can serve as a GPU control domain that provides services to other GPU application domains.