
CN119201360B - A GPU virtual sharing system based on multi-system isolation - Google Patents

A GPU virtual sharing system based on multi-system isolation

Info

Publication number
CN119201360B
Authority
CN
China
Prior art keywords
gpu
domain
application
control
hardware resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411697419.7A
Other languages
Chinese (zh)
Other versions
CN119201360A (en)
Inventor
吴宁
吴春光
刘仁学
黄顺玉
申利飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kirin Software Co Ltd
Original Assignee
Kirin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kirin Software Co Ltd filed Critical Kirin Software Co Ltd
Priority to CN202411697419.7A priority Critical patent/CN119201360B/en
Publication of CN119201360A publication Critical patent/CN119201360A/en
Application granted granted Critical
Publication of CN119201360B publication Critical patent/CN119201360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45587Isolation or security of virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the present invention discloses a GPU virtual sharing system based on multi-system isolation, in which multiple isolated systems are divided into GPU control domains and GPU application domains. The GPU control domain includes: a GPU control description file, which describes the hardware information of the GPU hardware resources and the result of allocating them; a GPU control description information linked list, which describes the GPU hardware resource information used by each GPU application domain; and a GPU control module, which receives a GPU application data packet sent by a GPU application domain, allocates idle GPU hardware resources to that GPU application domain, returns a GPU hardware resource allocation data packet to it, and updates and maintains the GPU control description information linked list and the GPU control description file according to the GPU application data packet. The corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transfer between system domains during use and the massive number of interrupts it would generate, thereby improving the utilization of GPU hardware resources.

Description

GPU virtual sharing system based on multi-system isolation
Technical Field
The invention relates to the technical field of GPU sharing, in particular to a GPU virtual sharing system based on multi-system isolation.
Background
With the development of multi-core CPU technology in recent years, it has become increasingly common to isolate different cores and hardware resources of one CPU in order to run different operating systems. Hardware isolation solutions such as Jailhouse use virtualization-based hardware isolation to partition a multi-core SoC into several system domains that run different systems; at the same time, passthrough technology lets each system domain access the same physical addresses of a hardware device.
In virtualization, the GPU may be isolated to one system domain, and the other system domains may use the GPU through emulated sharing. However, emulating a shared GPU device between system domains relies on interrupts, and GPU usage involves a large amount of data input and output; mass data transfer triggers a large number of interrupts, which substantially increases the data communication and interrupt handling pressure between the isolated systems. In addition, every use of the GPU adds extra data sending and receiving between the two system domains involved, which reduces the real-time performance of GPU data processing and inserts useless data transfers along the way. Moreover, if several GPUs are used, different system domains support the GPUs differently, which greatly increases the difficulty of use and the error rate.
Disclosure of Invention
The embodiment of the invention provides a GPU virtual sharing system based on multi-system isolation, which aims to solve the technical problems in the prior art that sharing a virtual GPU among multiple isolated operating systems generates a large number of interrupts, requires mass data transfer, and has a high error rate.
The embodiment of the invention provides a GPU virtual sharing system based on multi-system isolation, which comprises:
a GPU control domain, used for managing the usage of GPU hardware resources;
a GPU application domain, used for requesting GPU hardware resources for temporary use by its own domain;
The GPU control domain includes:
a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources;
a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain;
a GPU control module, used for receiving a GPU application data packet sent by a GPU application domain; judging, according to the GPU application data packet and the GPU control description file, whether the resources requested by the application data packet exist; when they exist, dividing and allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet; and receiving a GPU setting data packet sent by the GPU application domain, setting the allocated GPU hardware resources according to the GPU setting data packet, sending a setting success data packet to the GPU application domain after the setting succeeds, and sending a setting failure command packet to the GPU application domain after the setting fails.
Further, the GPU application domain includes:
a GPU application description file, used for describing the allocated GPU hardware resource information;
a GPU application description information linked list, used for describing the GPU hardware resource application information of the GPU application domain;
a GPU application module, used for sending a GPU hardware resource application data packet to the GPU control module of the GPU control domain, and updating the GPU application description information linked list according to the GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain holds the corresponding GPU hardware resource usage permission and uses the corresponding GPU hardware resources directly through VFIO; and
for sending a GPU setting data packet to the GPU control module of the GPU control domain to request that the GPU control module set the allocated GPU hardware resources according to the setting information, using the GPU hardware resources as set by the GPU control domain after receiving the setting success data packet sent by the GPU control module, and raising a setting error message after receiving the setting failure data packet.
Furthermore, the GPU control description information linked list comprises linked list nodes, wherein each linked list node describes one GPU application domain's application to use GPU hardware resources;
The linked list node comprises:
information of the system domain to which the GPU hardware resources are allocated;
permissions of the allocated GPU hardware resources with respect to that system domain;
a system domain structure of the allocated GPU hardware resources, comprising GPU computing resources and a GPU display interface, wherein the GPU computing resources are divided and managed by the GPU control module through GPU pooling technology;
and a pointer to the system domain structure corresponding to the next allocated GPU hardware resource.
Further, the GPU application description information linked list includes:
at least one node, each node representing information of a GPU control domain whose GPU hardware resources can be used;
The node comprises:
information of the GPU control domain in which the allocated GPU hardware resources are located;
a system domain structure of the allocated GPU hardware resources;
and a pointer to the GPU control domain structure corresponding to the next allocated GPU hardware resource, the pointer pointing to another GPU control domain.
Further, the GPU control module is further configured to:
when the GPU application domain actively releases GPU hardware resources, receive and parse the GPU hardware resource release data packet, recover the released GPU hardware resources according to the parsing result, and update the GPU control description information linked list and the GPU control description file according to the recovered resources.
Further, the simulation module further includes:
a simulation unit, used for calling a driver interface of the slave system domain to process the USB device packet when the simulation unit determines, through the simulated device linked list, that the peripheral device is a USB interface, so as to obtain a peripheral driver device stream of the slave system domain.
Further, the GPU application module is further configured to:
when it is determined that the GPU application domain has finished using the allocated GPU hardware resources, send a hardware resource release data packet to the GPU control module, and update the GPU application description information linked list and the GPU application description file accordingly.
Further, the GPU control module is further configured to:
when the corresponding resources do not exist, search the GPU application description information linked list for the application domain occupying those resources, and send a GPU hardware resource release request data packet to that occupying application domain.
Furthermore, the GPU control module of the GPU control domain and the GPU application module of the GPU application domain communicate through shared memory, and the corresponding interrupt is triggered after data is written into the shared memory.
Further, the GPU application domain is further configured to:
actively send a GPU hardware resource usage data packet to acquire the GPU hardware resource information and the corresponding permissions of the corresponding GPU control domain.
Furthermore, when its own resources do not meet the requirements, a GPU control domain can apply for GPU hardware resources from other GPU control domains that have GPU hardware resources;
Correspondingly, when a GPU application domain has GPU hardware resources, it can act as a GPU control domain and provide services to other GPU application domains.
In the GPU virtual sharing system based on multi-system isolation provided by the embodiment of the invention, multiple isolated systems are divided into GPU control domains and GPU application domains. The GPU control domain is used for managing the usage of GPU hardware resources, and the GPU application domain is used for requesting GPU hardware resources for temporary use by its own domain. The GPU control domain comprises a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating them; a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module, used for receiving a GPU application data packet sent by a GPU application domain, judging according to the GPU application data packet and the GPU control description file whether the requested resources exist, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet.
When the GPU control domain can satisfy the GPU request of a GPU application domain, the corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transfer and the massive number of interrupts otherwise generated during use. And because the GPU resources are managed uniformly by the GPU control domain according to the GPU usage requests, the error rate and the difficulty of use are reduced.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
Fig. 1 is a schematic structural diagram of a GPU virtual sharing system based on multi-system isolation according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic structural diagram of a GPU virtual sharing system based on multi-system isolation provided by an embodiment of the present invention. Referring to Fig. 1, the system may include a GPU control domain, used for managing the usage of GPU hardware resources, and a GPU application domain, used for requesting GPU hardware resources for temporary use by its own domain. The GPU control domain includes: a GPU control description file, used for describing hardware information of the GPU hardware resources and the result of allocating them; a GPU control description information linked list, used for describing the GPU hardware resource information used by each GPU application domain; and a GPU control module, used for receiving a GPU application data packet sent by a GPU application domain, judging according to the GPU application data packet and the GPU control description file whether the requested resources exist, and, when they exist, allocating idle GPU hardware resources to the GPU application domain, returning a GPU hardware resource allocation data packet to the GPU application domain, and updating and maintaining the GPU control description information linked list and the GPU control description file according to the GPU application data packet.
The GPU control module is further used for receiving a GPU setting data packet sent by the GPU application domain, setting the allocated GPU hardware resources according to the GPU setting data packet, sending a setting success data packet to the GPU application domain after the setting succeeds, and sending a setting failure command packet to the GPU application domain after the setting fails.
In this embodiment, the isolated systems may be multiple different operating systems that run on the same multi-core CPU, each on its own isolated hardware resources. They comprise a master system and one or more slave systems. In this embodiment, the GPU control domain may be a master system or a slave system that has GPU hardware resources, and the GPU application domain may be a master system or a slave system that needs to apply for the use of GPU hardware resources. Further, there may be a plurality of GPU control domains and, correspondingly, a plurality of GPU application domains.
The GPU control domain is configured to manage the usage of GPU hardware resources; illustratively, the GPU hardware resources may be assigned to the GPU control domain, which can therefore use them directly. Correspondingly, any other domain that applies to a GPU control domain for GPU hardware resources is a GPU application domain: it sends a request to the corresponding GPU control domain and then uses the GPU hardware resources allocated by that GPU control domain through VFIO passthrough.
The GPU control module is a core component of the GPU control domain. It can divide the hardware resources of a GPU device, share the divided GPU hardware resources with different system domains, and set permissions on the divided GPU hardware resources to control whether other system domains can use them. The permissions include read-only permission, read-write permission, control permission and the like. With read-only permission, another system domain can only read the corresponding GPU hardware resource data; with read-write permission, it can read and write that data; with control permission, it can control the corresponding GPU hardware resources to complete more complex work.
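The three permission levels described above could be encoded in the one-byte permission field roughly as follows. This is a minimal sketch, not the patented implementation: the names gpu_perm_t, gpu_op_t and gpu_perm_allows are hypothetical, the numeric values are arbitrary, and the rule that control permission also implies read and write access is an assumption that the source does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the permission levels named in the text. */
typedef enum {
    GPU_PERM_READ_ONLY  = 1,  /* may only read resource data */
    GPU_PERM_READ_WRITE = 2,  /* may read and write resource data */
    GPU_PERM_CONTROL    = 3   /* may also control the resource (assumed to imply read/write) */
} gpu_perm_t;

/* Hypothetical operation classes a system domain may attempt. */
typedef enum { GPU_OP_READ, GPU_OP_WRITE, GPU_OP_CONTROL } gpu_op_t;

/* The permission is stored in a uint8_t field, so it must fit in one byte. */
static_assert(GPU_PERM_CONTROL <= UINT8_MAX, "permission must fit in uint8_t");

/* Returns true if a domain holding permission `limit` may perform `op`. */
bool gpu_perm_allows(uint8_t limit, gpu_op_t op)
{
    switch (op) {
    case GPU_OP_READ:
        return limit >= GPU_PERM_READ_ONLY && limit <= GPU_PERM_CONTROL;
    case GPU_OP_WRITE:
        return limit == GPU_PERM_READ_WRITE || limit == GPU_PERM_CONTROL;
    case GPU_OP_CONTROL:
        return limit == GPU_PERM_CONTROL;
    }
    return false;  /* unknown operation: deny */
}
```

An unset permission value of 0 is denied for every operation, so a node that was never granted access cannot be used by accident.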
The GPU control module can load the GPU devices and initialize them according to the GPU control description file. One function of the GPU control description file is to describe GPU hardware resource information, for example, which GPU devices the GPU control module can load. After the GPU devices are initialized, the GPU control domain can use the GPU hardware resources, and the GPU control module can provide the corresponding GPU hardware resources to GPU application domains, creating and maintaining the GPU control description information linked list as it does so. The GPU application description information linked list is used for describing the GPU hardware resource application information of the GPU application domain.
The GPU control description information linked list has a linked-list structure in which each node describes one GPU application domain's application to use GPU hardware resources. Each node includes: information of the system domain to which the GPU hardware resources are allocated; the permissions of the allocated GPU hardware resources with respect to that system domain; a system domain structure of the allocated GPU hardware resources, comprising GPU computing resources and a GPU display interface, where the GPU computing resources are the GPU computing power that the GPU control module divides and manages through GPU pooling technology, and the GPU display interface describes the hardware display interfaces provided by the GPU device, such as HDMI, VGA and DP; and a pointer to the system domain structure of the next allocated GPU hardware resource, which makes it easy for the GPU control module to look up the next system domain and to obtain a complete view of the allocation result and the remaining GPU hardware resources.
Illustratively, a node of the GPU control description information linked list may be defined as follows:
#include <stdint.h>
// struct computing_unit and struct display_dev are defined elsewhere (not given in the source)
typedef struct control_domain_g {
    uint32_t os;                     // corresponding system domain
    uint8_t limit;                   // permissions on the corresponding GPU hardware resources
    struct computing_unit gpu_unit;  // GPU computing unit structure
    struct display_dev gpu_dev;      // GPU hardware display interface
    struct control_domain_g *next;   // pointer to the GPU resource structure of the next system domain
} control_domain_t;
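To illustrate how such a node list might be maintained and traversed, the following is a minimal sketch. The field layouts of struct computing_unit and struct display_dev are not given in the source, so the one-field placeholders here are assumptions, as are the helper names control_list_add, control_list_find and control_list_units_in_use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder payloads: the source names these structures but not their fields. */
struct computing_unit { uint32_t units; };      /* hypothetical: pooled compute slices */
struct display_dev    { uint8_t port_mask; };   /* hypothetical: bitmask of display ports */

typedef struct control_domain_g {
    uint32_t os;                     /* corresponding system domain */
    uint8_t limit;                   /* permissions on the allocated resources */
    struct computing_unit gpu_unit;  /* GPU computing unit structure */
    struct display_dev gpu_dev;      /* GPU hardware display interface */
    struct control_domain_g *next;   /* next allocation record */
} control_domain_t;

/* Prepend a record for `os`; returns the new node, or NULL if allocation fails. */
control_domain_t *control_list_add(control_domain_t **head, uint32_t os,
                                   uint8_t limit, uint32_t units)
{
    assert(head != NULL);
    control_domain_t *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->os = os;
    n->limit = limit;
    n->gpu_unit.units = units;
    n->next = *head;
    *head = n;
    return n;
}

/* Find the first record belonging to system domain `os`, or NULL. */
control_domain_t *control_list_find(control_domain_t *head, uint32_t os)
{
    for (; head != NULL; head = head->next)
        if (head->os == os)
            return head;
    return NULL;
}

/* Walk the list to total the compute slices currently handed out. */
uint32_t control_list_units_in_use(const control_domain_t *head)
{
    uint32_t total = 0;
    for (; head != NULL; head = head->next)
        total += head->gpu_unit.units;
    return total;
}
```

Following the next pointers in this way is what lets the control module derive both the allocation result and, by subtraction from the pool size, the remaining resources.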
The GPU control module can receive a GPU application data packet sent by a GPU application domain, parse it to obtain the requested GPU hardware resources, and judge whether the idle GPU hardware resources satisfy the request. When the request can be satisfied, the GPU control module allocates the corresponding GPU hardware resources to the GPU application domain and sends it a GPU hardware resource allocation data packet so that the GPU application domain can use the allocated resources. The allocation information is then written into the GPU control description information linked list as a new node. The allocation information, that is, the correspondence between the allocated GPU hardware resources and the GPU application domain, describes how the GPU hardware resources are divided and which usage permissions the GPU application domain holds on them.
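The request-handling path described above can be sketched as follows. The packet layouts (gpu_apply_pkt_t, gpu_alloc_pkt_t), the pool bookkeeping (gpu_pool_t) and the function name gpu_control_handle_apply are all hypothetical, and modelling GPU resources as a count of interchangeable compute slices is a simplifying assumption; the source does not define any of these details.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical packet layouts; the source does not define them. */
typedef struct { uint32_t os; uint32_t units; uint8_t limit; } gpu_apply_pkt_t;
typedef struct { uint32_t os; uint32_t units; uint8_t limit; int granted; } gpu_alloc_pkt_t;

/* One allocation record, standing in for a control linked-list node. */
typedef struct alloc_node {
    uint32_t os;
    uint8_t limit;
    uint32_t units;
    struct alloc_node *next;
} alloc_node_t;

typedef struct {
    uint32_t total_units;  /* compute slices offered by the pooled GPU */
    uint32_t used_units;   /* slices currently handed out */
    alloc_node_t *allocs;  /* head of the allocation list */
} gpu_pool_t;

/* Grant the request if enough idle slices exist; otherwise reply "not granted". */
gpu_alloc_pkt_t gpu_control_handle_apply(gpu_pool_t *pool, const gpu_apply_pkt_t *req)
{
    assert(pool != NULL && req != NULL);
    gpu_alloc_pkt_t reply = { req->os, 0, req->limit, 0 };
    uint32_t free_units = pool->total_units - pool->used_units;

    if (req->units == 0 || req->units > free_units)
        return reply;                  /* idle resources cannot satisfy the request */

    alloc_node_t *n = malloc(sizeof *n);
    if (n == NULL)
        return reply;

    /* Record the allocation as a new list node, then update the pool. */
    n->os = req->os;
    n->limit = req->limit;
    n->units = req->units;
    n->next = pool->allocs;
    pool->allocs = n;
    pool->used_units += req->units;

    reply.units = req->units;
    reply.granted = 1;
    return reply;
}
```

A real control module would also persist the change to the GPU control description file; that step is omitted here.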
When the request cannot be satisfied, the GPU control module sends a GPU hardware resource recovery data packet to the GPU application domain that occupies the resources, receives the GPU hardware resource release data packet returned by that domain, parses it, recovers the corresponding GPU hardware resources, and updates the GPU control description information linked list; for example, the corresponding node may be deleted and the pointers associated with it modified.
If no GPU hardware resource release data packet is received, recovery is not performed for the time being. In addition, after a GPU application domain finishes using its allocated GPU hardware resources, it needs to release them actively by sending a GPU hardware resource release data packet. The GPU control module receives and parses this packet, recovers the corresponding GPU hardware resources, and updates the GPU control description information linked list; during this process, the GPU application domain concerned can no longer access the corresponding GPU hardware resources. The processed information is synchronized to the GPU control description file, which facilitates initialization of the GPU control module after a start or restart.
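Deleting the corresponding node and fixing up the pointers, as described above, might look like the following sketch. The node layout alloc_node_t and the function gpu_control_handle_release are hypothetical, and counting resources in interchangeable units is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocation record; nodes are assumed to be heap-allocated. */
typedef struct alloc_node {
    uint32_t os;              /* system domain holding the resources */
    uint32_t units;           /* compute slices held by that domain */
    struct alloc_node *next;
} alloc_node_t;

/*
 * Unlink and free every record held by system domain `os`.
 * Returns the number of units reclaimed (0 if the domain held nothing).
 */
uint32_t gpu_control_handle_release(alloc_node_t **head, uint32_t os)
{
    assert(head != NULL);
    uint32_t reclaimed = 0;
    alloc_node_t **link = head;   /* points at the pointer that leads to the current node */

    while (*link != NULL) {
        alloc_node_t *cur = *link;
        if (cur->os == os) {
            *link = cur->next;    /* bypass the node; works for head and interior nodes alike */
            reclaimed += cur->units;
            free(cur);
        } else {
            link = &cur->next;
        }
    }
    return reclaimed;
}
```

Using a pointer to the link rather than a pointer to the node avoids a special case for removing the list head.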
In addition, the GPU control module can also receive a GPU setting data packet sent by a GPU application domain, set the allocated GPU hardware resources according to the GPU setting data packet, send a setting success data packet to the GPU application domain after the setting succeeds, and send a setting failure command packet to the GPU application domain after the setting fails. In this embodiment, the GPU application domain may select a library suited to its task in order to perform various calculations with the computing capability of the GPU. For example, when the GPU application domain needs to use the GPU to draw and render images through the OpenGL library, it may send a setting data packet based on the functional characteristics of the OpenGL library; after the setting succeeds, the GPU control module returns a setting success data packet to the GPU application module, so that the GPU application domain can make better use of the allocated GPU resources under the configured hardware settings. Illustratively, the GPU control module may use GPU hardware resources by calling libraries such as OpenGL, OpenCL, FFmpeg, GStreamer and Mesa.
The GPU application domain comprises a GPU application description file, a GPU application description information linked list and a GPU application module. The GPU application description file is used for describing GPU hardware resource information. The GPU application description information linked list is used for describing the GPU hardware resource application information of the GPU application domain. The GPU application module is used for sending a GPU hardware resource application data packet to the GPU control module of the GPU control domain and for updating the GPU application description information linked list according to the GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain holds the corresponding GPU hardware resource usage permission and uses the corresponding GPU hardware resources directly through VFIO. The permissions include read-only permission, read-write permission, control permission and the like. With read-only permission, the GPU application domain can only read the corresponding GPU hardware resource data; with read-write permission, it can read and write that data; with control permission, it can control the corresponding GPU hardware resources to complete more complex work.
In this embodiment, the GPU application module is a core component of the GPU application domain. Specifically, the GPU application module generates a GPU application data packet according to the operating requirements of its system domain and sends it to the GPU control module of the corresponding GPU control domain; there may be one or several such GPU control domains, depending on the operating requirements and the multi-system configuration. After the GPU control module of the GPU control domain receives the GPU application data packet, it judges from the current hardware resource information whether the requirements can be met and, if so, allocates idle GPU hardware resources to the GPU application domain and returns a GPU application confirmation data packet to the GPU application module according to the allocation information. The GPU application module receives and parses the GPU application confirmation data packet and writes the allocated GPU hardware resources into the GPU application description file according to the parsing result. The GPU application module then initializes itself according to the GPU application description file and creates the corresponding GPU application description information linked list. This initialization is equivalent to mounting the corresponding GPU hardware for the GPU application domain, so that the allocated GPU hardware resources can later be used directly through passthrough.
Furthermore, the GPU application module can also send a GPU setting data packet to request that the GPU control module of the GPU control domain set the GPU hardware resources allocated to the GPU application domain. After receiving the setting success data packet sent by the GPU control domain, the GPU application module uses the GPU hardware resources as allocated and set by the GPU control domain; after receiving the setting failure data packet, it raises a setting error message. The setting data packet may carry the information needed to configure the allocated GPU hardware resources directly, or it may carry only configuration description information, in which case the GPU control module configures the allocated GPU hardware resources with the matching configuration information that it already holds, which reduces the amount of data transferred.
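The two forms of setting data packet described above, full configuration versus a short description that the control module resolves locally, can be sketched as a tagged union. Everything here is hypothetical: the source does not say what is being configured, so the display-mode fields in gpu_config_t, the preset table gpu_presets, and the function gpu_control_apply_setting are assumptions chosen only to make the idea concrete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration payload (a display mode is used purely as an example). */
typedef struct { uint32_t width; uint32_t height; uint32_t refresh_hz; } gpu_config_t;

typedef enum {
    GPU_SET_INLINE,  /* packet carries the full configuration */
    GPU_SET_PRESET   /* packet carries only an identifier; less data is transferred */
} gpu_set_kind_t;

typedef struct {
    gpu_set_kind_t kind;
    union {
        gpu_config_t cfg;    /* valid when kind == GPU_SET_INLINE */
        uint32_t preset_id;  /* valid when kind == GPU_SET_PRESET */
    } u;
} gpu_set_pkt_t;

/* Configuration information assumed to be held by the control module itself. */
static const gpu_config_t gpu_presets[] = {
    { 1920, 1080, 60 },
    { 3840, 2160, 30 },
};

/*
 * Resolve and apply a setting packet.
 * Returns 0 (reply with a setting success packet) or -1 (reply with a failure packet).
 * `active` is left untouched on failure.
 */
int gpu_control_apply_setting(const gpu_set_pkt_t *pkt, gpu_config_t *active)
{
    assert(pkt != NULL && active != NULL);
    const gpu_config_t *cfg = NULL;

    if (pkt->kind == GPU_SET_INLINE) {
        cfg = &pkt->u.cfg;
    } else if (pkt->kind == GPU_SET_PRESET &&
               pkt->u.preset_id < sizeof gpu_presets / sizeof gpu_presets[0]) {
        cfg = &gpu_presets[pkt->u.preset_id];
    }

    if (cfg == NULL || cfg->width == 0 || cfg->height == 0 || cfg->refresh_hz == 0)
        return -1;

    *active = *cfg;
    return 0;
}
```

With the preset form, the packet body shrinks to a single identifier, which is the data-reduction effect the paragraph above describes.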
In this embodiment, the GPU application description information linked list is also a linked-list structure containing at least one node, where each node represents information of a GPU control domain whose GPU hardware resources can be used. By way of example, each node may include: information of the GPU control domain in which the allocated GPU hardware resources are located; the permissions of the allocated GPU hardware resources; a system domain structure of the allocated GPU hardware resources, which describes the allocated hardware resource information, namely the GPU computing power that the GPU control module divides and manages through GPU pooling technology and the hardware display interface resources provided by the GPU hardware device; and a pointer to the GPU control domain structure of the next allocated GPU hardware resource, which makes it easy to compute with several divided GPU resources that may reside in one or more GPU control domains.
Illustratively, a node of the GPU application description information linked list may be defined as follows:
typedef struct apply_domain_g{
uint_32_tos;// corresponding System Domain
Uint_8_t limit;// corresponding GPU hardware resource permissions
Struct computing_unit gpu_unit/GPU computing unit structure
Struct display_dev gpu_dev;// GPU hardware display interface
Struct apply_domain_g next;// GPU resource structure pointer to next System Domain
} apply_domain_t;
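A minimal sketch of how such a list might be built and walked is given below. It assumes that `next` is a pointer to the following node, and it fills in `struct computing_unit` and `struct display_dev` with placeholder fields, because the patent does not define them. The helper names `apply_list_add`, `apply_list_total_units`, and `apply_list_free` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder member types: the patent does not give their fields. */
struct computing_unit { uint32_t units; };  /* assumed: granted compute units */
struct display_dev { uint32_t port; };      /* assumed: display interface index */

typedef struct apply_domain_g {
    uint32_t os;                     /* corresponding system domain */
    uint8_t limit;                   /* permissions on the granted resources */
    struct computing_unit gpu_unit;
    struct display_dev gpu_dev;
    struct apply_domain_g *next;     /* next granted resource, or NULL */
} apply_domain_t;

/* Prepend a node for a new grant. Returns the new head, or NULL if
 * allocation fails (the caller's existing list is left untouched). */
apply_domain_t *apply_list_add(apply_domain_t *head, uint32_t os,
                               uint8_t limit, uint32_t units)
{
    assert(units > 0);  /* this sketch treats an empty grant as a caller bug */
    apply_domain_t *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;
    node->os = os;
    node->limit = limit;
    node->gpu_unit.units = units;
    node->next = head;
    return node;
}

/* Total compute units granted across every node in the list. */
uint32_t apply_list_total_units(const apply_domain_t *head)
{
    uint32_t total = 0;
    for (; head != NULL; head = head->next)
        total += head->gpu_unit.units;
    return total;
}

/* Free every node, for example before the domain shuts down. */
void apply_list_free(apply_domain_t *head)
{
    while (head != NULL) {
        apply_domain_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Prepending keeps insertion constant-time; the order of nodes carries no meaning in this sketch.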
In addition, the GPU application description file can be updated and maintained according to the applied GPU hardware resources, so that the updated GPU application description file can be used for initialization after the GPU application domain is restarted.
When the GPU application domain has finished using the allocated GPU hardware resources, the GPU application module can send a GPU hardware resource release data packet to the GPU control module that allocated them. The GPU control module receives and parses the packet and then reclaims the corresponding GPU hardware resources. The GPU application module updates the GPU application description information linked list accordingly, for example by deleting the corresponding node. When GPU hardware resources from several GPU control modules are in use, the corresponding pointers are modified as each use completes.
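Deleting the node for a released resource and fixing up the neighbouring pointer can be sketched as below. The node is deliberately simplified to a single identifying field; `grant_node`, `control_domain`, and `grant_list_remove` are hypothetical names, not taken from the patent.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified list node: one entry per control domain that granted resources. */
struct grant_node {
    uint32_t control_domain;   /* assumed identifier of the granting domain */
    struct grant_node *next;
};

/* Unlink and free the node for `control_domain`.
 * Returns 1 if a node was removed, 0 if no such node exists. */
int grant_list_remove(struct grant_node **head, uint32_t control_domain)
{
    assert(head != NULL);
    for (struct grant_node **link = head; *link != NULL; link = &(*link)->next) {
        if ((*link)->control_domain == control_domain) {
            struct grant_node *victim = *link;
            *link = victim->next;  /* predecessor (or head) now skips the node */
            free(victim);
            return 1;
        }
    }
    return 0;
}
```

Walking a pointer to the link rather than to the node means the head and interior cases are handled by the same assignment.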
In addition, the GPU application module can receive and parse a GPU hardware resource release request data packet and determine whether the GPU hardware resources named in the request are no longer in use. If their use is complete, it returns a GPU hardware resource release data packet to the GPU control module so that the GPU control module can release and reclaim the corresponding GPU hardware resources. Otherwise, no action is required.
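The decision made on receiving a release request can be sketched as a small lookup. The table layout, the `in_use` and `held` flags, and the function `on_release_request` are assumptions made for illustration; the patent only states that a release packet is returned when use is complete and that nothing happens otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct held_resource {
    uint32_t id;  /* hypothetical resource identifier */
    int in_use;   /* nonzero while the domain is still using the resource */
    int held;     /* nonzero while the resource is granted to this domain */
};

/* Returns 1 if a release packet should be sent back to the control module,
 * 0 if no action is required (resource busy, or not held by this domain). */
int on_release_request(struct held_resource *table, size_t count,
                       uint32_t requested_id)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (table[i].held && table[i].id == requested_id) {
            if (table[i].in_use)
                return 0;       /* still busy: no action */
            table[i].held = 0;  /* give it up locally, then reply */
            return 1;
        }
    }
    return 0;
}
```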
In this embodiment, the GPU control domain and the GPU application domain may communicate through shared memory: when data is written to the shared memory, a soft interrupt is triggered so that the receiving party processes the data. Compared with the traditional approach, the number of interrupts is significantly reduced, and the real-time performance of GPU data processing is not affected.
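One way to picture this exchange is a single-slot mailbox placed in the shared region, with one notification raised per complete message. The sketch below is an assumption-laden illustration: the patent does not describe the memory layout, and `struct mailbox`, `mailbox_send`, `mailbox_receive`, and the counter `pending_irqs` (which merely stands in for the receiver's soft interrupt) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAILBOX_CAPACITY 64u

struct mailbox {
    atomic_uint ready;                  /* 0 = empty, 1 = message waiting */
    uint32_t length;
    uint8_t payload[MAILBOX_CAPACITY];
};

unsigned pending_irqs;  /* stand-in for the receiver's soft interrupt line */

static void raise_soft_irq(void) { pending_irqs++; }

void mailbox_init(struct mailbox *mb)
{
    assert(mb != NULL);
    atomic_init(&mb->ready, 0u);
    mb->length = 0;
}

/* Copy one message into the slot and notify the receiver once.
 * Returns 0 on success, -1 if the message is too large or the slot is full. */
int mailbox_send(struct mailbox *mb, const void *data, uint32_t len)
{
    assert(mb != NULL && data != NULL);
    if (len > MAILBOX_CAPACITY)
        return -1;
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) != 0u)
        return -1;  /* previous message not yet consumed */
    memcpy(mb->payload, data, len);
    mb->length = len;
    /* Release ordering publishes the payload before the flag becomes visible. */
    atomic_store_explicit(&mb->ready, 1u, memory_order_release);
    raise_soft_irq();
    return 0;
}

/* Copy the waiting message out and free the slot.
 * Returns the message length, or -1 if nothing is waiting or `out` is too small. */
int mailbox_receive(struct mailbox *mb, void *out, uint32_t capacity)
{
    assert(mb != NULL && out != NULL);
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0u)
        return -1;
    uint32_t len = mb->length;
    if (len > capacity)
        return -1;
    memcpy(out, mb->payload, len);
    atomic_store_explicit(&mb->ready, 0u, memory_order_release);
    return (int)len;
}
```

Because the notification is raised once per whole message rather than per transferred unit of data, the interrupt count stays proportional to the number of control packets.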
In addition, in this embodiment the roles of the GPU control domain and the GPU application domain are not fixed. For example, a GPU control domain may also be provided with a GPU application module, a GPU application description file, and a GPU application description information linked list, so that it can use GPU hardware resources of other system domains. Correspondingly, a GPU application domain may also be provided with a GPU control module, a GPU control description file, and a GPU control description information linked list, so that it can provide GPU hardware resources to other system domains.
The working process of the GPU virtual sharing system based on multi-system isolation provided in this embodiment is further described below.
First, Jailhouse can be used to partition the SoC system into a plurality of isolated systems, and the GPU application domain and the GPU control domain are set up according to how the GPU hardware resources are configured. The GPU control domain manages the usage of GPU hardware resources, and the GPU application domain requests GPU hardware resources for its own temporary use. Each system domain independently runs its own operating system, such as Linux or an RTOS (Real-Time Operating System), and the GPU device is passed through from the GPU control domain to the GPU application domain by hardware pass-through, so that the GPU control domain and the GPU application domain can share and directly access the physical GPU device resources.
After the GPU control domain and the GPU application domain have been set up, the GPU control module in the GPU control domain is initialized with the GPU hardware information described in the GPU control description file, so that the GPU control module knows which GPU hardware devices can be loaded and the characteristics of those devices. The GPU control module can further generate the GPU control description information linked list from the GPU control description file and maintain it in real time; the linked list describes separately the GPU hardware resources corresponding to each system domain.
Correspondingly, in the GPU application domain, the GPU application module is initialized according to the GPU application description file, which first of all identifies the system domain as a GPU application domain.
After the GPU application domain and the GPU control domain have been initialized, the GPU application domain sends a GPU hardware resource application data packet to the GPU control module of the GPU control domain according to the hardware resources it requires. The GPU control module receives the application data packet from the GPU application module of the GPU application domain and, if idle GPU hardware resources exist, allocates the corresponding GPU hardware resources to the GPU application domain and sends it a GPU hardware resource allocation data packet, so that the GPU application domain can use the relevant GPU hardware resources directly by means of pass-through technology. The GPU control module then updates the GPU control description file, which can record the mapping between the allocated GPU hardware resources and the application domains, the way the GPU hardware resources are partitioned, the usage permissions of the different system domains over the GPU hardware resources, and information on shared GPU hardware resources. It also updates and maintains the GPU control description information linked list according to the allocation information.
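The control-side bookkeeping for allocation, together with the matching reclaim step, can be sketched as follows. The sketch models GPU resources as a simple count of idle compute units, which is an assumption: the patent does not say how pooled resources are represented. `struct gpu_pool`, `pool_allocate`, and `pool_reclaim` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One record per grant, playing the role of a control-side list node. */
struct alloc_node {
    uint32_t domain;          /* application domain that received the grant */
    uint32_t units;           /* compute units granted (assumed unit of account) */
    struct alloc_node *next;
};

struct gpu_pool {
    uint32_t idle_units;             /* units not currently granted */
    struct alloc_node *allocations;  /* list of outstanding grants */
};

/* Grant `units` to `domain` if enough idle units exist.
 * Returns 0 on success, -1 if the request cannot be satisfied. */
int pool_allocate(struct gpu_pool *pool, uint32_t domain, uint32_t units)
{
    assert(pool != NULL);
    if (units == 0 || units > pool->idle_units)
        return -1;
    struct alloc_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->domain = domain;
    node->units = units;
    node->next = pool->allocations;
    pool->allocations = node;
    pool->idle_units -= units;
    return 0;
}

/* Reclaim every grant held by `domain`; returns the units given back. */
uint32_t pool_reclaim(struct gpu_pool *pool, uint32_t domain)
{
    assert(pool != NULL);
    uint32_t reclaimed = 0;
    struct alloc_node **link = &pool->allocations;
    while (*link != NULL) {
        if ((*link)->domain == domain) {
            struct alloc_node *victim = *link;
            *link = victim->next;
            reclaimed += victim->units;
            free(victim);
        } else {
            link = &(*link)->next;
        }
    }
    pool->idle_units += reclaimed;
    return reclaimed;
}
```

In a fuller implementation the same update would also be written back to the description file so that the state survives a restart.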
If the GPU application domain's request succeeds and it receives the GPU hardware resource allocation data packet, it updates the GPU application description information linked list, obtains the corresponding usage permissions for the GPU hardware resources, and generates the GPU application description file. The GPU application domain can then use the allocated GPU hardware resources directly through VFIO pass-through, which gives an effect similar to using the GPU resources natively. The GPU application module can also actively send a GPU hardware resource usage data packet to obtain the GPU hardware resource information and the corresponding permissions of the corresponding GPU control domain.
When GPU hardware resources allocated to a GPU application domain need to be reclaimed, the GPU control module sends a GPU hardware resource recovery data packet to the GPU application module of that GPU application domain, and the GPU application module checks whether the corresponding GPU hardware resources can be released. If the resources are idle and the GPU application domain can release them, the recovery succeeds and the GPU control module adds the relevant GPU hardware information back to the GPU control description information linked list. If the resources are still in use by the GPU application domain and cannot be released, the recovery fails and the GPU control module takes no further action.
When the GPU application domain actively releases GPU hardware resources, the GPU control module receives and parses the GPU hardware resource release data packet, reclaims the corresponding GPU hardware resources, and adds the relevant GPU hardware information to the GPU control description information linked list. The results of GPU hardware resource allocation and recovery are synchronized into the GPU control description file and the GPU application description file.
When the GPU application domain lacks a GPU-related software library, the GPU application module of the GPU application domain can send a GPU setting data packet to the GPU control domain to request that its GPU control module configure the GPU hardware resources allocated to the GPU application domain. The GPU control module receives the GPU setting data packet and configures the allocated GPU hardware resources accordingly. If the configuration succeeds, the GPU control module sends a setting success data packet containing the success information to the GPU application module; if it fails, the GPU control module sends a setting failure command packet containing the error information. After receiving the setting success data packet, the GPU application module uses the GPU hardware resources allocated and configured by the GPU control domain according to the setting information; after receiving the setting failure data packet, it reports a setting error.
By the method, the GPU hardware resources can be effectively managed and reasonably allocated, so that each application domain can use the GPU hardware resources.
The GPU virtual sharing system based on multi-system isolation described above comprises a plurality of isolated systems divided into a GPU control domain and a GPU application domain. The GPU control domain manages the usage of GPU hardware resources, and the GPU application domain requests GPU hardware resources for its own temporary use. The GPU control domain comprises: a GPU control description file, which describes the hardware information of the GPU hardware resources and the results of allocating them; a GPU control description information linked list, which describes the GPU hardware resource information used by each GPU application domain; and a GPU control module, which receives the GPU application data packet sent by a GPU application domain, determines from the packet and the GPU control description file whether resources matching the request exist, allocates idle GPU hardware resources to the GPU application domain when they do, returns a GPU hardware resource allocation data packet to the GPU application domain, and updates and maintains the GPU control description information linked list and the GPU control description file according to the GPU application data packet. When the GPU control domain can satisfy a GPU request from a GPU application domain, the corresponding GPU hardware resources can be passed through to the GPU application domain via VFIO, which avoids the large volume of data transfer and the large number of interrupts otherwise generated during use. Because the GPU resources are managed uniformly by the GPU control domain in response to GPU usage requests, the error rate and the difficulty of use are reduced.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A GPU virtual sharing system based on multi-system isolation, characterized by comprising:
a GPU control domain, wherein the GPU control domain is used to manage the usage of GPU hardware resources;
a GPU application domain, wherein the GPU application domain is used to request GPU hardware resources for temporary use by the domain;
the GPU control domain comprises:
a GPU control description file, wherein the GPU control description file is used to describe the hardware information of the GPU hardware resources and the result of allocating the GPU hardware resources;
a GPU control description information linked list, used to describe the GPU hardware resource information used by each GPU application domain;
a GPU control module, wherein the GPU control module is used to receive a GPU application data packet sent by the GPU application domain, determine according to the GPU application data packet and the GPU control description file whether resources corresponding to the application data packet exist, and, when they exist, partition and allocate idle GPU hardware resources to the GPU application domain, return a GPU hardware resource allocation data packet to the GPU application domain, and update and maintain the GPU control description information linked list and the GPU control description file according to the GPU application data packet; and
to receive a GPU setting data packet sent by the GPU application domain, set the allocated GPU hardware resources according to the GPU setting data packet, send a setting success data packet to the GPU application domain after the setting succeeds, and send a setting failure command packet to the GPU application domain after the setting fails;
the GPU application domain comprises:
a GPU application description file, wherein the GPU application description file is used to describe the allocated GPU hardware resource information;
a GPU application description information linked list, wherein the GPU application description information linked list is used to describe the GPU hardware resource application information of the GPU application domain in which it is located;
a GPU application module, wherein the GPU application module is used to send a GPU hardware resource application data packet to the GPU control module of the GPU control domain, and update the GPU application description information linked list according to the GPU hardware resource allocation data packet returned by the GPU control module, so that the GPU application domain has the corresponding GPU hardware resource usage permissions and directly uses the corresponding GPU hardware resources through VFIO; and
to send a GPU setting data packet to the GPU control module of the GPU control domain, requesting the GPU control module to set the allocated GPU hardware resources according to the setting information, to use, after receiving the setting success data packet sent by the GPU control module, the GPU hardware resources set by the GPU control domain according to the setting information, and to throw a setting error message after receiving the setting failure data packet.
2. The system according to claim 1, characterized in that the GPU control description information linked list comprises linked list nodes, each linked list node describing one GPU application domain's application for and use of GPU hardware resources;
the linked list node comprises:
information on the system domain corresponding to the allocated GPU hardware resources;
the permissions of the allocated GPU hardware resources relative to the system domain to which they are allocated;
a system domain structure of the allocated GPU hardware resources, the structure comprising the GPU computing power resources divided and managed by the GPU control module using GPU pooling technology, and a GPU display interface;
and a pointer to the system domain structure corresponding to the next allocated GPU hardware resources.
3. The system according to claim 2, characterized in that the GPU application description information linked list comprises:
at least one node, each node representing the information of one GPU control domain whose GPU hardware resources can be used;
the node comprises:
information on the GPU control domain in which the allocated GPU hardware resources are located;
the permissions on the allocated GPU hardware resources;
a system domain structure of the allocated GPU hardware resources;
and a pointer to the GPU control domain structure corresponding to the next allocated GPU hardware resources, wherein the pointer points to another GPU control domain.
4. The system according to claim 1, characterized in that the GPU control module is further used to:
when the GPU application domain actively releases GPU hardware resources, receive and parse the GPU hardware resource release data packet, reclaim the released GPU hardware resources according to the parsing result, and update the GPU control description information linked list and the GPU control description file according to the information on the reclaimed resources.
5. The system according to claim 4, characterized in that the GPU application module is further used to:
when it is determined that the application domain in which it is located has finished using the allocated GPU hardware resources, send a hardware resource release data packet to the GPU control module, receive and parse a hardware resource recovery data packet, and update the GPU application description information linked list and the GPU application description file according to the parsing result.
6. The system according to claim 5, characterized in that the GPU control module is further used to:
when no corresponding resources exist, search the GPU application description information linked list for the occupying application domain that corresponds to the requested resources, and send a GPU hardware resource release request data packet to the occupying application domain.
7. The system according to claim 1, characterized in that the GPU control module of the GPU control domain and the GPU application module of the GPU application domain communicate through shared memory, and a corresponding interrupt is triggered after data is written into the shared memory.
8. The system according to claim 3, characterized in that the GPU application domain is further used to:
actively send a GPU hardware resource usage data packet to obtain the GPU hardware resource information and the corresponding permissions of the corresponding GPU control domain.
9. The system according to claim 1, characterized in that, when its own resources do not meet the requirements, the GPU control domain can apply for GPU hardware resources from other GPU control domains that have GPU hardware resources;
correspondingly, when the GPU application domain has GPU hardware resources, the GPU application domain can serve as a GPU control domain that provides services to other GPU application domains.
CN202411697419.7A 2024-11-26 2024-11-26 A GPU virtual sharing system based on multi-system isolation Active CN119201360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411697419.7A CN119201360B (en) 2024-11-26 2024-11-26 A GPU virtual sharing system based on multi-system isolation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411697419.7A CN119201360B (en) 2024-11-26 2024-11-26 A GPU virtual sharing system based on multi-system isolation

Publications (2)

Publication Number Publication Date
CN119201360A CN119201360A (en) 2024-12-27
CN119201360B true CN119201360B (en) 2025-02-28

Family

ID=94058618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411697419.7A Active CN119201360B (en) 2024-11-26 2024-11-26 A GPU virtual sharing system based on multi-system isolation

Country Status (1)

Country Link
CN (1) CN119201360B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286645A (en) * 2020-12-29 2021-01-29 北京泽塔云科技股份有限公司 GPU resource pool scheduling system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11388054B2 (en) * 2019-04-30 2022-07-12 Intel Corporation Modular I/O configurations for edge computing using disaggregated chiplets
CN114237913B (en) * 2021-12-27 2025-06-10 中国科学院深圳先进技术研究院 Interference-aware GPU heterogeneous cluster scheduling method, system and medium
CN114510319B (en) * 2021-12-29 2025-05-02 中国科学院信息工程研究所 A method for sharing GPU space based on Kubernetes cluster
US12034647B2 (en) * 2022-08-29 2024-07-09 Oracle International Corporation Data plane techniques for substrate managed containers
CN115687294A (en) * 2022-09-19 2023-02-03 浙江大华技术股份有限公司 Database deployment method, domain name resolution method and related device

Also Published As

Publication number Publication date
CN119201360A (en) 2024-12-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant