
CN116991553A - A virtual GPU allocation method and system in a container cloud environment based on API interception and forwarding - Google Patents


Info

Publication number
CN116991553A
CN116991553A
Authority
CN
China
Prior art keywords
gpu
resources
container
api
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310792607.7A
Other languages
Chinese (zh)
Inventor
吴恒
吴悦文
罗荣周
余甜
张文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202310792607.7A
Publication of CN116991553A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a virtual GPU allocation method and system for container cloud environments based on API interception and forwarding. The method includes: abstracting the physical GPUs in the cloud environment into fine-grained virtual GPU resources along the memory and compute dimensions; allocating to each container, in a declarative manner, a virtual GPU with a specified quota of compute resources and a specified quota of GPU memory resources; scheduling containers to suitable worker nodes and binding them to suitable GPU devices through a custom scheduler; injecting custom interception libraries for CUDA, OpenCL, HIP, and other APIs, which intercept the relevant API calls made by a container; applying resource-quota algorithms to the intercepted compute- and memory-related API calls, then forwarding them to the real CUDA (or other) APIs; and, when the elastic allocation mode is enabled, reclaiming the virtual GPU compute resources of idle containers and reallocating them to other suitable containers.

Description

A Virtual GPU Allocation Method and System in a Container Cloud Environment Based on API Interception and Forwarding

Technical Field

The invention belongs to the field of software technology, and specifically relates to a virtual GPU allocation method and system for container cloud environments based on API interception and forwarding.

Background

With the continuous development of cloud-native technology, more and more GPU-based high-performance computing tasks are deployed as containers on GPU clusters in cloud environments, using cloud GPU resources for compute acceleration. According to public data, Microsoft's cloud cluster contains 2,490 GPUs, and a single Alibaba cluster contains 6,500 GPUs. Mainstream cloud service providers (such as Amazon, Google, Alibaba, and Huawei) have integrated container orchestration frameworks such as Kubernetes into their infrastructure to support container clouds. Many companies and communities have explored ways to use GPUs in containers and to deploy high-performance computing tasks in container clouds. However, most current solutions give each container exclusive use of a GPU; for example, Kubernetes uses device plugins to allocate an entire GPU to a single container. With such exclusive allocation, running high-performance computing tasks inevitably leads to GPU under-utilization. Solutions that do allow containers to share a GPU do not reclaim and reallocate resources while a container's GPU computation is idle, which also wastes GPU resources. Therefore, in a multi-tenant environment, how to share GPU resources efficiently among different containers is very important.

The solution for efficiently sharing GPUs among containers is to allocate virtual GPUs to different containers, each virtual GPU having a specified compute quota and GPU memory quota; this process is called GPU virtualization. Current mainstream GPU virtualization approaches operate at different levels of the GPU software and hardware call stack.

GPU virtualization at the application software layer. For example, AntMan (Xiao W, Ren S, Li Y, et al. AntMan: Dynamic Scaling on GPU Clusters for Deep Learning[C]//OSDI. 2020: 533-548.), published by an Alibaba research team, implements GPU virtualization at the deep learning framework layer. By modifying deep learning frameworks such as PyTorch and TensorFlow, AntMan lets deep learning tasks share GPUs at the framework layer, allocating and limiting the compute and GPU memory quotas of different tasks. AntMan works at the granularity of operators in the deep learning computational graph, limiting a task's GPU compute usage by throttling operator execution, and maintains a GPU memory pool to limit a task's GPU memory usage. The disadvantage of virtualization at the application software layer is that it only supports high-performance computing tasks in specific application scenarios and is tightly coupled to application software such as deep learning frameworks.

GPU virtualization at the API layer (CUDA, OpenCL, HIP, etc.). For example, the recent KubeShare (Yeh T A, Chen H H, Chou J. KubeShare: A framework to manage GPUs as first-class and shared resources in container cloud[C]//Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. 2020: 173-184.) implements GPU virtualization at the CUDA driver API layer. KubeShare's main idea is to allocate a specified quota of GPU compute and memory resources to a high-performance computing task, intercept the task's compute- and memory-related calls at the CUDA driver API layer, and decide whether to forward each call to the real API, thereby enforcing the resource limits. The advantage of virtualization at the API layer is that it is independent of the application software and therefore general; the disadvantages are lower security and the need for frequent updates to track the latest APIs.

GPU virtualization at the GPU driver layer. This approach supports richer virtualization features, such as live migration, and is more secure, since the GPU acceleration libraries need not be modified frequently. However, because the virtualization level is low, the implementation is complex; and because GPU vendors keep their driver code strictly closed-source, virtualization at the driver layer is difficult and usually requires techniques such as reverse engineering.

GPU virtualization at the hardware level. For example, NVIDIA's Tesla A100 adopts hardware-level virtualization: it divides the SM compute units evenly into 7 parts and the GPU memory evenly into 8 parts. Task processes running on a Tesla A100 have exclusive use of their own hardware resources (such as on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses). Hardware-level virtualization lets applications access the physical GPU directly, rather than through API or driver forwarding. Its advantages are that users need not change the GPU acceleration libraries or driver, near-native performance is achieved through hardware support, and application execution is fully secure. Its disadvantages are that the virtualization must be implemented by the GPU hardware vendor and is therefore expensive, and, because it bypasses higher levels such as the operating system, policies cannot be customized: it is difficult to implement resource scheduling algorithms, checkpointing, live migration, fault tolerance, and similar features that API-interception-based GPU virtualization supports.

Summary of the Invention

To overcome the shortcomings of existing technical solutions, the purpose of the present invention is to provide a virtual GPU allocation method and system for container cloud environments based on API interception and forwarding. First, the physical GPUs in the cloud environment are abstracted into fine-grained virtual GPU resources along the memory and compute dimensions and allocated to containers. Then, the logic by which high-performance computing tasks use compute and GPU memory resources on the GPU is analyzed, performance is modeled, and a virtualization strategy based on API interception and forwarding is applied to enforce resource quotas. Finally, an elastic compute allocation mode is provided, which reclaims the virtual GPU compute resources of idle containers and reallocates them to other suitable containers, improving the utilization of GPU resources in the cluster.

The technical solution of the present invention is as follows:

A virtual GPU allocation method in a container cloud environment based on API interception and forwarding, comprising the following steps:

abstracting the physical GPUs in the cloud environment into fine-grained virtual GPU resources along the memory and compute dimensions;

allocating to each container, in a declarative manner, a virtual GPU with a specified quota of compute resources and a specified quota of GPU memory resources;

scheduling each container to a suitable worker node and binding it to a suitable GPU device through a custom scheduler;

injecting custom interception libraries for CUDA, OpenCL, HIP, and other APIs; intercepting the relevant API calls when the container calls these APIs; applying resource-quota algorithms to the intercepted compute- and memory-related calls; and then forwarding the calls to the real CUDA (or other) APIs;

when the elastic allocation mode is enabled, reclaiming the virtual GPU compute resources of idle containers and reallocating them to other suitable containers.

Further, abstracting the physical GPUs in the cloud environment into fine-grained virtual GPU resources along the memory and compute dimensions comprises:

detecting the physical GPUs on each node using the detection interfaces provided by the GPU vendor;

dividing the execution time of a physical GPU evenly into 100 shares, each share corresponding to one unit of virtual GPU compute resources and to one percentage point of GPU utilization;

dividing the memory of a physical GPU into units of 1 MB, each unit corresponding to one unit of virtual GPU memory resources;

extending the device plugin mechanism provided by Kubernetes to register and report the fine-grained virtual GPU compute and memory resources.

Further, allocating to each container, in a declarative manner, a virtual GPU with a specified quota of compute resources and a specified quota of GPU memory resources comprises: the user submits, via a YAML description file, a high-performance computing task Pod that uses GPU resources to the Kubernetes cluster, specifying the amounts of virtual GPU memory resources and virtual GPU compute resources the container requires.

Further, scheduling each container to a suitable worker node and binding it to a suitable GPU device through a custom scheduler comprises:

the Pod scheduler monitors Pods that require virtual GPUs through the Kubernetes watch mechanism and schedules them;

after the watcher observes a submitted Pod, it places the Pod into the scheduler's task queue;

a notifier watches the CRD to track the usage of GPU resources in the cluster at all times;

when the available GPU resources satisfy a Pod's requirements, the notifier tells the scheduler to dequeue the Pod at the head of the task queue;

the dequeued Pod goes through the scheduling and binding workflow, which filters and scores candidates and then binds the Pod to a suitable node and GPU;

the Pod scheduler provides common scheduling algorithms (first fit, best fit, worst fit, etc.) for users to choose from in different scenarios, and users can also implement custom scheduling algorithms through a reserved extension interface.
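The first-fit and best-fit policies mentioned above can be sketched as follows; this is an illustrative C fragment, assuming each candidate GPU (after the filtering step) is summarized by its free compute shares and free 1 MB memory units. The `GpuView` struct and tie-breaking rule are assumptions for the sketch.

```c
#include <stddef.h>

/* Hypothetical view of one candidate GPU after the filtering step. */
typedef struct {
    int free_cores;  /* free compute shares (0..100) */
    int free_mem;    /* free 1 MB memory units */
} GpuView;

/* First fit: return the index of the first GPU that can hold the request. */
static int first_fit(const GpuView *gpus, int n, int req_cores, int req_mem) {
    for (int i = 0; i < n; i++)
        if (gpus[i].free_cores >= req_cores && gpus[i].free_mem >= req_mem)
            return i;
    return -1; /* no GPU fits; the Pod waits in the queue */
}

/* Best fit: among feasible GPUs, pick the one with the least leftover
 * compute capacity (ties broken by order), to pack GPUs tightly. */
static int best_fit(const GpuView *gpus, int n, int req_cores, int req_mem) {
    int best = -1, best_left = 101;
    for (int i = 0; i < n; i++) {
        if (gpus[i].free_cores < req_cores || gpus[i].free_mem < req_mem)
            continue;
        int left = gpus[i].free_cores - req_cores;
        if (left < best_left) { best = i; best_left = left; }
    }
    return best;
}
```

Worst fit would be the mirror image of `best_fit` (maximize the leftover), and a custom policy would plug into the same scoring hook.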

Further, injecting the custom CUDA, OpenCL, HIP, and other API interception libraries, intercepting the relevant API calls when the container calls these APIs, applying resource-quota algorithms to the intercepted compute- and memory-related calls, and then forwarding the calls to the real CUDA (or other) APIs comprises:

compiling a dynamic link library containing all the APIs of the native CUDA, OpenCL, HIP, and other acceleration platforms;

using the hooking mechanism of the Linux operating system, setting the container environment variable LD_LIBRARY_PATH to the directory of the API interception libraries, so that the APIs the container calls are intercepted;

when an API call related to compute or GPU memory resources is intercepted, executing the resource-quota algorithm;

for compute quotas, using a monitoring-based GPU compute resource quota algorithm;

for GPU memory quotas, using a quota-based GPU memory resource limit algorithm;

using Linux system calls such as dlopen and dlsym to obtain the addresses of the real CUDA (or other) APIs and invoke them, thereby forwarding the calls.

Further, the monitoring-based GPU compute resource quota algorithm comprises:

monitoring the container's GPU utilization in real time through the GPU vendor's interfaces;

quantifying the execution time of compute units such as CUDA kernels and the available time of the GPU;

running in a producer-consumer pattern, dynamically adjusting the rate at which the high-performance computing tasks in the container launch compute units such as CUDA kernels.
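A minimal sketch of this producer-consumer pattern, under assumptions not taken from the patent: a producer periodically converts the container's compute quota into "GPU time" tokens, and each intercepted kernel launch consumes tokens sized by an estimate of the kernel's execution time. The struct, function names, and numbers are illustrative.

```c
/* Token-based rate limiter sketch for the compute quota. */
typedef struct {
    double tokens;       /* available GPU time, in milliseconds */
    double max_tokens;   /* cap so idle periods don't accumulate forever */
} RateLimiter;

/* Producer: called once per monitoring period (e.g. every 100 ms);
 * quota_pct is the container's share of the 100 compute shares. */
static void refill(RateLimiter *rl, double period_ms, double quota_pct) {
    rl->tokens += period_ms * quota_pct / 100.0;
    if (rl->tokens > rl->max_tokens) rl->tokens = rl->max_tokens;
}

/* Consumer: called from the intercepted kernel-launch API.
 * Returns 1 if the launch may proceed now, 0 if it must wait. */
static int try_launch(RateLimiter *rl, double est_kernel_ms) {
    if (rl->tokens < est_kernel_ms) return 0;  /* throttle the task */
    rl->tokens -= est_kernel_ms;
    return 1;                                   /* forward the launch */
}
```

The real algorithm closes the loop against measured GPU utilization rather than fixed kernel-time estimates, but the throttling decision has this shape.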

Further, the quota-based GPU memory resource limit algorithm comprises:

maintaining a GPU memory quota for each container that records the amount of GPU memory the container currently uses, initialized to 0 when the container is created;

when a process in the container calls an API that allocates or frees GPU memory, deciding from the current quota whether to forward the call to the real API.
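The bookkeeping behind that decision can be sketched as follows; the struct and function names are assumptions for illustration. The interception library would run `quota_alloc` before forwarding an allocation call and `quota_free` after the real free succeeds.

```c
#include <stddef.h>

/* Per-container GPU memory quota, starting at 0 bytes used. */
typedef struct {
    size_t limit_bytes;  /* quota assigned in the Pod description */
    size_t used_bytes;   /* current usage tracked by the shim */
} MemQuota;

/* Called from the intercepted allocation API.
 * Returns 1 if the call may be forwarded to the real API, 0 to deny. */
static int quota_alloc(MemQuota *q, size_t bytes) {
    if (q->used_bytes + bytes > q->limit_bytes) return 0;
    q->used_bytes += bytes;
    return 1;
}

/* Called from the intercepted free API after the real free succeeds. */
static void quota_free(MemQuota *q, size_t bytes) {
    q->used_bytes = (bytes > q->used_bytes) ? 0 : q->used_bytes - bytes;
}
```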

Further, when the elastic allocation mode is enabled, reclaiming the virtual GPU compute resources of idle containers and reallocating them to other suitable containers comprises:

dividing tasks into two categories: resource-sensitive tasks (RST) and resource-insensitive tasks (RIT);

using a sliding-window sampling algorithm to determine whether a high-performance computing task is in an "idle phase";

dividing the compute resources on a GPU into fixed compute resources and elastic compute resources, and redistributing the reclaimed elastic compute resources;

the fixed compute resources equal the sum of the compute resources of all RSTs that have not been elastically increased;

the elastic compute resources equal the total resources minus the fixed resources minus the sum of the compute resources of RITs in the throttled state;

when no unthrottled RIT is present on the GPU, distributing the elastic compute resources evenly among the RSTs;

when unthrottled RITs are present on the GPU, distributing the elastic compute resources evenly among those RITs.
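The redistribution rule above can be sketched directly from those formulas. The task struct and flag names are illustrative assumptions; `quota` is a task's declared share of the GPU's 100 compute shares.

```c
typedef struct {
    int is_rst;       /* 1 = resource-sensitive, 0 = resource-insensitive */
    int quota;        /* declared compute shares (out of 100) */
    int throttled;    /* RIT currently held in the throttled state */
    int boosted;      /* RST whose share was elastically increased */
    int extra;        /* output: elastic shares granted to this task */
} Task;

static void redistribute(Task *t, int n, int total) {
    int fixed = 0, throttled_rit = 0, rst_n = 0, free_rit_n = 0;
    for (int i = 0; i < n; i++) {
        t[i].extra = 0;
        if (t[i].is_rst) {
            rst_n++;
            if (!t[i].boosted) fixed += t[i].quota;  /* fixed resources */
        } else if (t[i].throttled) {
            throttled_rit += t[i].quota;
        } else {
            free_rit_n++;
        }
    }
    /* elastic = total - fixed - throttled RITs' compute */
    int elastic = total - fixed - throttled_rit;
    if (elastic <= 0) return;
    if (free_rit_n > 0) {              /* unthrottled RITs get it evenly */
        for (int i = 0; i < n; i++)
            if (!t[i].is_rst && !t[i].throttled)
                t[i].extra = elastic / free_rit_n;
    } else if (rst_n > 0) {            /* otherwise RSTs share it evenly */
        for (int i = 0; i < n; i++)
            if (t[i].is_rst) t[i].extra = elastic / rst_n;
    }
}
```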

A virtual GPU allocation system in a container cloud environment based on API interception and forwarding using the above method, the system comprising:

a Pod scheduler, which selects a suitable node and GPU for each high-performance computing task Pod submitted by a user, and performs binding and execution;

a device plugin module, which abstracts virtual GPU resources, publishes them to the Kubelet, and applies for and allocates to each high-performance computing task Pod a virtual GPU with the specified compute and GPU memory quotas;

a Pod resource manager, which interacts with the CUDA (and other) API interception libraries, maintains resource quotas throughout a Pod's life cycle, and implements elastic compute resource allocation;

an API interception library module, which intercepts and forwards CUDA and other APIs and executes the resource-quota algorithms, thereby enforcing the virtual GPU's compute and GPU memory limits.

Compared with the prior art, the advantages of the present invention are:

Targeting the characteristics of the GPU software and hardware call stack of high-performance computing tasks, the present invention virtualizes the GPU at the open API layer of acceleration platforms such as CUDA, OpenCL, and HIP, avoiding the impact of GPU vendors' closed-source policies.

Based on how high-performance computing tasks use GPU compute and memory resources, the present invention proposes corresponding GPU resource quota algorithms, which enforce quotas accurately and with low overhead.

Targeting the phenomenon of idle compute resources while high-performance computing tasks run, the present invention proposes a compute resource allocation strategy based on an adaptive elastic allocation algorithm, reducing the waste of GPU compute resources.

Targeting the deployment characteristics of current cloud environments, the present invention designs and develops a corresponding software system based on Kubernetes, which is compatible with native Kubernetes, easy to deploy, and highly interactive.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the structure of the virtual GPU allocation system of the present invention.

Figure 2 is a schematic flowchart of the execution of the Pod scheduler of the present invention.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention.

Some techniques well known to those skilled in the art may not be described in detail.

A virtual GPU allocation method in a container cloud environment based on API interception and forwarding of the present invention comprises the following steps:

abstracting the physical GPUs in the cloud environment into fine-grained virtual GPU resources along the memory and compute dimensions;

allocating to each container, in a declarative manner, a virtual GPU with a specified quota of compute resources and a specified quota of GPU memory resources;

scheduling each container to a suitable worker node and binding it to a suitable GPU device through a custom scheduler;

injecting custom interception libraries for CUDA, OpenCL, HIP, and other APIs, and intercepting the relevant API calls when the container calls these APIs;

applying resource-quota algorithms to the intercepted compute- and memory-related API calls, then forwarding the calls to the real CUDA (or other) APIs;

when the elastic allocation mode is enabled, reclaiming the virtual GPU compute resources of idle containers and reallocating them to other suitable containers.

A virtual GPU allocation system in a container cloud environment based on API interception and forwarding of the present invention consists of four parts: a Pod scheduler, a device plugin module, a Pod resource manager, and an API interception library module, wherein:

Pod scheduler: runs a series of scheduling algorithms to select suitable nodes and GPUs for the high-performance computing task Pods that users submit to the Kubernetes cluster (a Pod is the smallest unit of computation that Kubernetes manages and schedules, representing a group of running containers), and performs binding and execution. The Pod scheduler implements scheduling through the watch mechanism. Common scheduling algorithms (first fit, best fit, worst fit, etc.) are built in for users to choose from in different scenarios, and users can also implement custom scheduling algorithms through a reserved extension interface.

Device plugin module: extends the framework that Kubernetes provides for supporting custom resources, abstracts virtual GPU resources and publishes them to the Kubelet, and applies for and allocates to each high-performance computing task Pod a virtual GPU with the specified compute and GPU memory quotas. The device plugin is deployed on every worker node via a DaemonSet and registers with the Kubelet through a gRPC service. At runtime, the device plugin starts a gRPC service on its own socket to support the Kubelet, covering two processes: monitoring and reporting physical GPU resources, and allocating virtual GPUs to high-performance computing task Pods.

Pod resource manager: interacts with the CUDA (and other) API interception libraries, maintains resource quotas throughout each Pod's life cycle, and implements elastic compute resource allocation. When a Pod starts, the Pod resource manager obtains information such as the process IDs of the Pod's containers and the virtual GPU resource quotas, and then communicates with the API interception libraries over gRPC so that they can complete their initialization. At runtime, the Pod resource manager manages the allocation of virtual GPUs to each high-performance computing task Pod over its whole life cycle, continuously monitors the Pod's resource usage, and executes the elastic compute resource reclamation and allocation strategy.

API interception library module: through custom CUDA, OpenCL, HIP, and other API interception libraries, intercepts and forwards CUDA and other APIs and executes the resource-quota algorithms, thereby enforcing the virtual GPU's compute and GPU memory limits. The module implements interception using the hooking mechanism provided by the Linux operating system, and implements forwarding by using the relevant Linux system calls to obtain the addresses of the real CUDA (or other) APIs. The module interacts with the Pod resource manager over gRPC to obtain the virtual GPU resource quota allocated to the container and to enforce the GPU resource limits.

Illustratively, virtual GPU allocation in the container cloud environment means that a user declaratively submits a GPU-using high-performance computing task Pod to the Kubernetes cluster, specifying custom quotas of GPU video-memory and computing resources for the Pod; the system schedules and binds the Pod to a suitable physical GPU for execution and limits GPU video-memory and computing resources according to the specified quotas, so that different Pods are isolated from each other.

Illustratively, in the declarative approach the Pod description file is written in YAML, with the field conventions described below.

The Pod description file provides the Pod name, the Pod scheduler name, the container image name, the container startup arguments, the resource quotas required by the container, and other information. In the field descriptions below, the angle brackets after each field denote <field type: field content>, where str is the string type.

The apiVersion field holds the Kubernetes API version number, which specifies the version of the corresponding API server and ensures compatibility between API server versions; the kind field holds the resource type, and the resource type supported by this system is Pod; the metadata→name field holds the Pod's name, which must be unique; spec→schedulerName holds the name of the Pod scheduler used by the cluster; spec→containers→image holds the container's image name and version; spec→containers→name holds the container's name; spec→containers→command holds the container's startup arguments; spec→containers→resources→limits→doslab.io/gpu-memory holds the GPU video-memory quota allocated to the container, in MB; and spec→containers→resources→limits→doslab.io/gpu-core holds the GPU computing-resource quota allocated to the container, where each physical GPU is divided into 100 units of computing resources corresponding to GPU utilization.
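The original YAML listing is not reproduced in this extraction; the following is a minimal sketch reconstructed from the field descriptions above. The Pod name, image, command, and quota values are hypothetical placeholders, not values taken from the patent.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hpc-task-example            # <str: unique Pod name> (placeholder)
spec:
  schedulerName: gpu-scheduler      # <str: custom Pod scheduler name> (placeholder)
  containers:
  - name: cuda-worker               # <str: container name> (placeholder)
    image: nvidia/cuda:11.2.2-base  # <str: image name and version> (placeholder)
    command: ["python3", "train.py"]  # <str: container startup arguments> (placeholder)
    resources:
      limits:
        doslab.io/gpu-memory: 4096  # GPU video-memory quota, in MB
        doslab.io/gpu-core: 50      # GPU computing quota, out of 100 per physical GPU
```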

Preferred implementations of the key processes of the invention are described below:

(1) Pod submission and scheduling

At deployment time the Pod scheduler runs on the cluster's Master node; whenever a Pod that uses a virtual GPU is submitted, the scheduling flow is executed. The user submits a GPU-using high-performance computing task Pod to the Kubernetes cluster as a declarative YAML description file, specifying the GPU video-memory and computing-resource quotas required by the container. Kubernetes parses the file and creates a Pod instance object in the cluster, which is captured by the Pod scheduler's watch mechanism, triggering the scheduling logic.

The Pod scheduler places the captured high-performance computing task Pod into the scheduler task queue. When the GPU resources satisfy the Pod's requirements, the notifier dequeues it into the scheduling-and-binding flow. The cluster's GPU resource usage comes from the CRDs (Custom Resource Definitions) reported by the device plug-in module. The filtering stage runs first, selecting all nodes that satisfy the Pod's scheduling requirements (CPU, memory, and other resource needs) for subsequent scheduling. The scoring stage then selects the most suitable node, and the GPU device on that node, for the Pod according to the configured scheduling algorithm and the current resource allocation. Finally the Pod is bound to the chosen node, and the GPU ID is written into the Pod's Annotation field.
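The filter/score/bind flow can be sketched in a few lines of Python. The data layout, the best-fit scoring rule, and the annotation key are illustrative choices for this sketch, not details fixed by the patent:

```python
def schedule(pod, nodes):
    """pod: {'mem': MB, 'core': percent}; nodes: list of
    {'name': str, 'gpus': [{'id': str, 'free_mem': MB, 'free_core': percent}]}."""
    # Filtering: keep (node, gpu) pairs with enough free memory and compute.
    feasible = [(n, g) for n in nodes for g in n["gpus"]
                if g["free_mem"] >= pod["mem"] and g["free_core"] >= pod["core"]]
    if not feasible:
        return None  # Pod stays queued until the notifier sees enough free resources.
    # Scoring: best-fit example rule - prefer the GPU with the least leftover memory.
    node, gpu = min(feasible, key=lambda ng: ng[1]["free_mem"] - pod["mem"])
    # Binding: record the chosen GPU in the Pod's Annotation field (hypothetical key).
    pod.setdefault("annotations", {})["doslab.io/gpu-id"] = gpu["id"]
    return node["name"], gpu["id"]
```
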

(2) Virtual GPU abstraction and allocation

The device plug-in is a framework provided by Kubernetes for supporting custom resources. At deployment time the device plug-in module runs on every worker node as a DaemonSet, abstracts the node's physical GPUs into fine-grained virtual GPU resources, and reports them to the Kubelet; at runtime the module detects changes to the physical GPUs and allocates the specified quota of virtual GPU resources to containers.

The device plug-in module must first register with the Kubelet over gRPC, i.e., it must implement the Register interface provided by the framework. It first monitors the GPU information on the node, including GPU IDs, through the interfaces provided by GPU vendors such as Nvidia. It then registers with the Kubelet over gRPC, transmitting the device plug-in's Unix socket, the device plug-in API version, the device resource names, and other information; GPU video-memory and computing resources are abstracted into fine-grained virtual GPU resources, registered with the Kubelet, and at the same time reported to the CRD for scheduling. For each physical GPU, video memory is abstracted into virtual GPU video-memory resources at 1 MB granularity, and computing power is abstracted into virtual GPU computing resources at 1% utilization granularity. Allocating a virtual GPU to a container therefore amounts to allocating, through the device plug-in, the specified numbers of virtual GPU computing-resource units and video-memory units to the container.

At runtime, the device plug-in module also starts a gRPC service on its own Unix socket to serve the Kubelet. It must implement a series of interfaces provided by the framework, chief among them ListAndWatch and Allocate. ListAndWatch returns a stream of device lists, i.e., the numbers of virtual GPU video-memory units and computing-resource units; when devices are updated or disappear, ListAndWatch also notifies the Kubelet so that it can update accordingly. When a GPU-accelerated container is created, it requests virtual GPU video-memory and computing resources, and the Kubelet calls Allocate to allocate them and run some device-specific operations. During the Allocate call, the device plug-in module mounts directories into the container (for example, the directory holding the CUDA and other API interception libraries), sets environment variables (for example, NVIDIA_VISIBLE_DEVICES, which indicates which physical GPU to use), sets Annotations, and performs other steps that enable the container to use the physical GPU.
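As a minimal sketch of the two processes above (function names, paths, and the response layout are illustrative, not the patent's code), the resource abstraction and the device-specific Allocate response can be modeled as:

```python
def virtualize_gpu(total_memory_mb: int):
    """Split one physical GPU into fine-grained virtual units:
    one unit per MB of video memory, and 100 computing units (1% utilization each),
    as advertised to the Kubelet via ListAndWatch."""
    memory_units = ["gpu-memory-%d" % i for i in range(total_memory_mb)]
    core_units = ["gpu-core-%d" % i for i in range(100)]
    return memory_units, core_units

def allocate(requested_memory_mb: int, requested_cores: int, gpu_id: str):
    """Sketch of the device-specific response assembled during Allocate:
    mount the interception-library directory (hypothetical path) and
    point the container at one physical GPU."""
    return {
        "mounts": [{"host_path": "/usr/local/vgpu-intercept",
                    "container_path": "/usr/local/vgpu-intercept"}],
        "envs": {"NVIDIA_VISIBLE_DEVICES": gpu_id},
        "annotations": {"doslab.io/gpu-memory": str(requested_memory_mb),
                        "doslab.io/gpu-core": str(requested_cores)},
    }
```

For example, a 16 GB GPU is advertised as 16384 video-memory units and 100 computing units.
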

(3) API interception/forwarding and resource limiting

After the user-submitted high-performance computing Pod has been scheduled by the Pod scheduler and has had virtual GPU resources allocated by the device plug-in module, the Pod resource manager obtains the Pod container's process ID, specified resource quotas, and other information, then communicates with the CUDA (and other) interception libraries over gRPC, transmitting this information so that the interception libraries can complete their initialization. The CUDA (and other) API interception library is a custom dynamic link library of this invention, exposing the same APIs as the native CUDA, OpenCL, and HIP computing acceleration libraries. Through the hooking mechanism of the Linux operating system, the container environment variable LD_LIBRARY_PATH is set to the directory of the interception libraries, so that API calls are intercepted; by using the Linux dlopen and dlsym system calls, the addresses of the real CUDA (and other) APIs are obtained and the relevant APIs are invoked, so that the calls are forwarded.

A container is allocated fixed amounts of GPU computing and video-memory resources, and at runtime its usage must be strictly kept within those allocations: the average GPU utilization of all processes in the container should approximately equal the allocated amount of GPU computing resources, and the total GPU video-memory usage of all processes in the container must not exceed the allocated amount of GPU video-memory resources.

For limiting computing resources, the invention uses a monitoring-based GPU computing-resource quota algorithm. Since the main way a high-performance computing task uses the GPU for acceleration is by launching computing units such as CUDA kernel functions on the GPU, the core idea is a monitor-and-adjust algorithm: using the monitoring interfaces provided by GPU vendors such as Nvidia, the container's GPU utilization is monitored in real time, and the rate at which the container's high-performance computing tasks launch CUDA kernels and other computing units is adjusted dynamically, so that the actual GPU utilization approximately equals the container's specified computing-resource quota.

To make the monitor-and-adjust process smoother and reduce fluctuations in GPU utilization, the invention quantifies the execution time of computing units such as CUDA kernels against the GPU's available time. The GPU's available time quota can be viewed as a globally maintained resource operated in the producer-consumer pattern. The real-time GPU utilization monitor is the producer: when GPU utilization is below or above the allocated value, it increases or decreases the available time quota by increment = (allocation − current) × α, where α is an empirical value that differs for each GPU model, allocation is the GPU computing-resource quota allocated to the container (i.e., the expected GPU utilization), and current is the sum of the actual GPU utilization of all processes in the container. Computing units such as CUDA kernels are the consumers: launching a computing unit consumes time quota consume = f(Block, Thread) × α, where f is an empirical function, Block and Thread are the launch parameters of the computing unit, and α is again a per-GPU-model empirical value. In addition, to guarantee that the GPU resource limits take effect, the producer's production rate must be strictly kept no lower than the consumers' consumption rate; otherwise starvation occurs and GPU utilization cannot reach the target value.
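A minimal Python simulation of this producer-consumer time-quota scheme. The class name, the concrete α, and the toy linear form of f(Block, Thread) are made-up stand-ins for the per-GPU-model empirical values the patent describes:

```python
class TimeQuota:
    """Globally maintained GPU available-time quota: the utilization monitor
    produces tokens, kernel launches consume them; a launch is refused when
    the quota is exhausted, which throttles the container's launch rate."""

    def __init__(self, allocation: float, alpha: float):
        self.allocation = allocation  # expected GPU utilization (allocated quota)
        self.alpha = alpha            # empirical per-GPU-model coefficient
        self.tokens = 0.0             # available time quota

    def monitor_tick(self, current_utilization: float):
        # Producer: increment = (allocation - current) * alpha; may be negative.
        self.tokens += (self.allocation - current_utilization) * self.alpha
        self.tokens = max(self.tokens, 0.0)

    def try_launch(self, block: int, thread: int) -> bool:
        # Consumer: consume = f(Block, Thread) * alpha; f is empirical,
        # modeled here as a toy linear function of the launch geometry.
        cost = (block * thread / 1024.0) * self.alpha
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```
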

For limiting video-memory resources, the invention uses a quota-based GPU video-memory resource quota algorithm. A container continuously applies for and releases GPU video memory throughout its life cycle. The invention therefore maintains a video-memory quota record for each container, tracking the amount of video memory the container currently uses; the recorded value is 0 when the container is created.

When a process in the container calls a video-memory allocation API such as cuArray3DCreate or cuMemAlloc, the interception library (for CUDA, etc.) intercepts the call, obtains the requested size, adds it to the container's current video-memory usage, and checks whether the sum exceeds the container's specified quota. If it does, an out-of-video-memory error is raised and the interception library does not call the real acceleration library; if it does not, the interception library calls the real acceleration library, executes the allocation API, and returns its actual result. When a process calls a video-memory release API such as cuArrayDestroy or cuMemFree, the interception library intercepts the call, obtains the size of the memory being released via its pointer, subtracts it from the container's recorded usage, then calls the real acceleration library, executes the release API, and returns its actual result.
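The accounting on this path can be modeled in a few lines of Python. The real library intercepts the CUDA calls in C and forwards them via the dlsym-resolved addresses; here real_alloc/real_free are stand-ins for those forwarded calls, and sizes are tracked in MB for simplicity:

```python
class MemoryQuota:
    """Per-container video-memory accounting performed by the interception library."""

    def __init__(self, limit_mb: int):
        self.limit = limit_mb
        self.used = 0  # 0 at container creation

    def on_alloc(self, size_mb: int, real_alloc):
        # Intercepted cuMemAlloc-style call: check the quota before forwarding.
        if self.used + size_mb > self.limit:
            raise MemoryError("out of video memory: container quota exceeded")
        self.used += size_mb
        return real_alloc(size_mb)  # forward to the real acceleration library

    def on_free(self, size_mb: int, real_free):
        # Intercepted cuMemFree-style call: update accounting, then forward.
        self.used -= size_mb
        return real_free(size_mb)
```
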

(4) Elastic shrinking and allocation of computing resources

High-performance computing tasks submitted by users often do not fully use their allocated GPU computing resources while running, wasting resources. To solve this problem, the invention proposes a computing-resource allocation strategy based on an adaptive elastic allocation algorithm: by monitoring the container's actual GPU computing-resource usage, GPU computing resources are dynamically allocated and reclaimed, improving GPU resource utilization.

The invention divides high-performance computing tasks into two classes: resource-sensitive tasks (RST) and resource-insensitive tasks (RIT). A resource-sensitive task has high requirements on GPU computing resources; when it needs computing resources, its specified computing-resource demand is strictly guaranteed and cannot be preempted. A resource-insensitive task has low requirements on GPU computing resources, and its computing resources may be preempted at any point during its life cycle. Both RSTs and RITs specify a concrete GPU video-memory size when submitted; an RIT does not specify a concrete GPU computing-resource size at submission, and that value is fixed at 0.

When a high-performance computing task's actual GPU utilization stays significantly below its specified GPU computing-resource quota for a period of time, exceeding a threshold, the task is considered to be in an "idle phase". The task's specified GPU computing-resource quota is then reduced to its actual GPU utilization, avoiding wasted GPU computing resources.

The invention uses a sliding-window sampling algorithm to decide whether a high-performance computing task is in the idle phase. Concretely, based on monitoring, a sliding window is maintained that captures the task's GPU utilization over the window period in real time. Once a task enters the idle phase, its GPU computing-resource quota is automatically reduced and the task enters a restricted state. After a period of time, the restriction is lifted and the task's specified GPU computing-resource quota is restored; this step, called a performance probe, checks whether the task has returned to its original GPU computing-resource utilization. If the task is still in the idle phase, the resource-shrinking flow starts immediately and the task re-enters the restricted state.
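A sketch of the sliding-window idle test. The window length, the "significantly below" margin, and the all-samples rule are illustrative parameters for this sketch; the patent does not fix concrete values:

```python
from collections import deque

class IdleDetector:
    """Keep the last `window` utilization samples for one task; report 'idle'
    once every sample in a full window falls below a fraction of its quota."""

    def __init__(self, quota: float, window: int = 5, margin: float = 0.5):
        self.quota = quota
        self.margin = margin
        self.samples = deque(maxlen=window)

    def observe(self, utilization: float) -> bool:
        self.samples.append(utilization)
        full = len(self.samples) == self.samples.maxlen
        return full and all(u < self.quota * self.margin for u in self.samples)
```
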

A GPU's computing resources can be divided into two parts: elastic resources and fixed resources. Let the total computing resources of a GPU be Core, the set of RSTs running on the GPU be J = {J1, J2, ..., Jn}, and the set of RITs running on the GPU be j = {j1, j2, ..., jm}. At a given moment, the fixed computing resources Core_b allocated on the GPU are given by formula (1): their value equals the sum of the computing resources of all RSTs that have not been elastically increased,

Core_b = Σ_{Ji ∈ J, not elastically increased} Alloc_i.    (1)

At a given moment, the elastic computing resources Core_e on the GPU are given by formula (2): their value equals the total resources, minus the fixed resources, minus the sum of the computing resources of the RITs in the restricted state,

Core_e = Core − Core_b − Σ_{ji ∈ j∩restricted} Core_i,    (2)

where restricted denotes the tasks in the restricted state, so j∩restricted denotes the RITs in the restricted state.

When the GPU has no RIT outside the restricted state, the elastic computing resources are evenly distributed to the RSTs, as in formula (3):

Core_i = Alloc_i + Core_e / n,    (3)

where Core_i is the current resource allocation of task i, Alloc_i is the initial resource allocation of task i, and n is the number of RSTs not in the restricted state.

When the GPU has RITs outside the restricted state, the elastic computing resources are evenly distributed to those RITs, as in formula (4):

Core_i = Core_e / m,    (4)

where Core_i is the current resource allocation of task i and m is the number of RITs not in the restricted state.
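Formulas (1)-(4) can be written out as a short Python sketch. The dict-based task records and field names are invented for illustration; the arithmetic follows the verbal definitions of Core_b, Core_e, n, and m above:

```python
def elastic_allocation(core_total, rsts, rits):
    """rsts/rits: lists of dicts with keys 'alloc' (initial quota),
    'elastic' (RST already elastically increased?) and 'restricted'.
    Writes each active task's current allocation into its 'core' field."""
    # (1) fixed resources: sum of allocations of RSTs not elastically increased
    core_b = sum(t["alloc"] for t in rsts if not t["elastic"])
    # (2) elastic resources: total - fixed - restricted RITs' allocations
    core_e = core_total - core_b - sum(t["alloc"] for t in rits if t["restricted"])
    active_rits = [t for t in rits if not t["restricted"]]
    if not active_rits:
        # (3) no unrestricted RIT: share core_e evenly among unrestricted RSTs
        active_rsts = [t for t in rsts if not t["restricted"]]
        n = len(active_rsts)
        if n:
            for t in active_rsts:
                t["core"] = t["alloc"] + core_e / n
    else:
        # (4) otherwise: share core_e evenly among unrestricted RITs
        m = len(active_rits)
        for t in active_rits:
            t["core"] = core_e / m
    return core_b, core_e
```
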

In one embodiment of the invention, a virtual GPU allocation method in a container cloud environment based on API interception and forwarding is provided. As shown in Figure 1, the method as a whole can be divided into the following steps:

Step 001: the user deploys the device plug-in module to every Worker node as a DaemonSet, and the device plug-in module registers the virtual GPU resources with the Kubelet.

Step 101: the user submits a Pod to the cluster as a YAML description file.

Step 102: the Pod scheduler observes the submitted Pod through the watch mechanism.

Step 103: the Pod scheduler schedules the Pod to a suitable Worker node and GPU through the scheduling algorithm and binds it.

Step 104: the device plug-in module observes the successfully scheduled Pod through the watch mechanism.

Step 105: the device plug-in module interacts with the Kubelet to allocate virtual GPU resources to the Pod.

Step 106: after allocating the virtual GPU resources, the device plug-in module notifies the Pod resource manager.

Step 107: the Pod resource manager communicates with the API interception library module over gRPC to perform initialization and elastic allocation of computing resources.

Step 108: the API interception/forwarding module intercepts and forwards the container's API calls to enforce the resource quotas.

In this embodiment, the Pod scheduler's scheduling flow, shown in Figure 2, can be divided into the following steps:

Step 201: the monitor watches for submitted Pods through the watch mechanism.

Step 202: upon observing a submitted Pod, the monitor places it into the scheduler task queue.

Step 203: the notifier watches the CRD, tracking the usage of GPU resources in the cluster at all times.

Step 204: when the available GPU resources satisfy the requirements, the notifier dequeues the Pod at the head of the scheduler task queue.

Step 205: the dequeued Pod goes through the scheduling and binding flow; via the filtering, scoring, and binding stages it is scheduled to a suitable node and GPU and bound.

Based on the same inventive concept, another embodiment of the invention provides an electronic device (computer, server, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out each step of the method of the invention.

Based on the same inventive concept, another embodiment of the invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disc) storing a computer program which, when executed by a computer, implements each step of the method of the invention.

The above description of specific implementations of the embodiments of the invention is merely illustrative; the scope of protection of the invention is defined by the claims. Any changes and modifications made by those skilled in the art on the basis of the above description fall within the scope of protection of the invention.

Claims (10)

1. A virtual GPU allocation method in a container cloud environment based on API interception and forwarding, characterized by comprising the following steps:
abstracting each physical GPU in the cloud environment into fine-grained virtual GPU resources along the video-memory dimension and the computing dimension;
allocating to a container, in a declarative manner, a virtual GPU with a specified quota of computing resources and a specified quota of video-memory resources;
scheduling the container to a suitable worker node through a custom scheduler, and binding the container to a suitable GPU device;
inserting an API interception library so that, when the container calls an API, the relevant API calls are intercepted, a resource quota algorithm is applied to the intercepted computing-resource and video-memory-resource APIs, and the real API is then forwarded and called;
when the elastic allocation mode is enabled, reclaiming the computing resources of virtual GPUs in idle-phase containers and allocating them for use by other suitable containers.
2. The method of claim 1, wherein abstracting each physical GPU in the cloud environment into fine-grained virtual GPU resources along the video-memory dimension and the computing dimension comprises:
detecting the information of the physical GPUs on a node by using the detection interfaces provided by GPU manufacturers;
evenly dividing the execution time of the physical GPU into 100 parts, each part of execution time corresponding to one granule of virtual GPU computing resources and to one percent of GPU utilization;
dividing the video memory of the physical GPU into parts of 1 MB, each part of video memory corresponding to one granule of virtual GPU video-memory resources;
extending the device plug-in function provided by Kubernetes to register and report the fine-grained virtual GPU computing resources and virtual GPU video-memory resources.
3. The method according to claim 1, wherein scheduling the container to a suitable worker node through the custom scheduler and binding it to a suitable GPU device comprises:
the Pod scheduler monitoring Pods that use virtual GPUs through the watch mechanism of Kubernetes and scheduling them;
after observing a submitted Pod, the monitor placing the Pod into the scheduler task queue;
the notifier monitoring the CRD and tracking the usage of GPU resources in the cluster at all times;
when the available GPU resources satisfy the requirements, the notifier notifying the Pod at the head of the scheduler task queue to dequeue;
the dequeued Pod going through the scheduling and binding flow, being scheduled to a suitable node and GPU and bound through the filtering, scoring, and binding stages;
wherein scheduling algorithms are built into the Pod scheduler for the user to select for different scenarios, or the user implements a custom scheduling algorithm through a reserved scheduling-algorithm extension interface.
4. The method according to claim 1, wherein inserting the API interception library, intercepting the relevant API calls when the container calls an API, applying the resource quota algorithm to the intercepted computing-resource and video-memory-resource APIs, and then forwarding and calling the real API comprises:
compiling a dynamic link library containing all APIs of the native acceleration platforms, the APIs including CUDA, OpenCL, and HIP;
setting the container environment variable LD_LIBRARY_PATH to the directory of the API interception library through the hooking mechanism of the Linux operating system, so that the APIs the container calls are intercepted by the interception library;
intercepting the API calls related to computing resources and video-memory resources, and executing the resource quota algorithms;
for the computing-resource quota, adopting a monitoring-based GPU computing-resource quota algorithm;
for the video-memory-resource quota, adopting a quota-based GPU video-memory resource quota algorithm;
obtaining the addresses of the real APIs and calling the relevant APIs by using the dlopen and dlsym system calls of the Linux operating system, thereby forwarding the APIs.
5. The method of claim 4, wherein the monitoring-based GPU computing-resource quota algorithm comprises:
monitoring the GPU utilization of the container in real time through the relevant interfaces of GPU manufacturers;
quantifying the execution time of computing units, including CUDA kernel functions, against the available time of the GPU;
running in the producer-consumer pattern, dynamically adjusting the rate at which the high-performance computing tasks in the container launch the computing units.
6. The method of claim 4, wherein the quota-based GPU video-memory resource quota algorithm comprises:
maintaining a video-memory quota for each container, recording the amount of video memory currently used by the container, the value of the video-memory quota being set to 0 when the container is created;
when a process in the container calls an API for allocating or releasing video memory, deciding according to the current quota whether to forward and call the real API.
7. The method according to claim 1, wherein, when the flexible allocation mode is enabled, reclaiming the computing resources of the virtual GPU in idle-phase containers and allocating them to other suitable containers comprises:
dividing tasks into two types: resource-sensitive tasks (RST) and resource-insensitive tasks (RIT);
judging whether a high-performance computing task is in an idle phase by an algorithm based on sliding-window sampling;
dividing the computing resources on the GPU into fixed computing resources and elastic computing resources, and reallocating the reclaimed elastic computing resources;
the value of the fixed computing resources equals the sum of the computing resources of all RSTs that have not been elastically increased;
the value of the elastic computing resources equals the total resources minus the fixed resources minus the computing resources of the RITs in the limited state;
when there is no RIT in the GPU that is not in the limited state, the elastic computing resources are evenly distributed among the RSTs;
when there are RITs in the GPU that are not in the limited state, the elastic computing resources are evenly distributed among those RITs.
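The allocation arithmetic of claim 7 can be written out directly. The function and parameter names below are illustrative, not from the patent; `rit_tasks` models each resource-insensitive task as a (demand, limited) pair:

```python
def split_elastic(total, rst_base, rit_tasks):
    """Fixed/elastic split per claim 7 (illustrative sketch).
    total:     total computing resource on the GPU (e.g. in percent)
    rst_base:  base computing resources of the resource-sensitive tasks (RST)
    rit_tasks: list of (demand, limited) pairs for resource-insensitive tasks (RIT)
    """
    # Fixed resources: sum over RSTs that have not been elastically increased.
    fixed = sum(rst_base)
    # Resources held by RITs currently in the limited state.
    limited_rit = sum(d for d, limited in rit_tasks if limited)
    # Elastic resources: total minus fixed minus limited-RIT resources.
    elastic = total - fixed - limited_rit
    unrestricted = [d for d, limited in rit_tasks if not limited]
    if not unrestricted:
        # No RIT outside the limited state: share evenly among the RSTs.
        return {"rst_share": elastic / len(rst_base), "rit_share": 0.0}
    # Otherwise share evenly among the unrestricted RITs.
    return {"rst_share": 0.0, "rit_share": elastic / len(unrestricted)}
```

For example, on a GPU with 100 units, two RSTs of 30 units each, and one limited RIT holding 20 units, the 20 elastic units are split evenly between the two RSTs.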
8. A virtual GPU allocation system in a container cloud environment based on API interception and forwarding, employing the method of any one of claims 1-7, the system comprising:
a Pod scheduler, configured to select a suitable node and GPU for a high-performance computing task Pod submitted by a user, and to bind the Pod to them for execution;
a device plug-in module, configured to abstract virtual GPU resources and publish them to the Kubelet, and to apply for and allocate, for the high-performance computing task Pod, a virtual GPU with the specified computing resource and video memory resource limits;
a Pod resource manager, configured to interact with the API interception library, maintain resource quotas throughout the life cycle of the Pod, and implement flexible computing resource allocation;
an API interception library module, configured to intercept and forward APIs and execute the resource quota algorithms, thereby limiting the computing resources and video memory resources of the virtual GPU.
9. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-7.
CN202310792607.7A 2023-06-30 2023-06-30 A virtual GPU allocation method and system in a container cloud environment based on API interception and forwarding Pending CN116991553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310792607.7A CN116991553A (en) 2023-06-30 2023-06-30 A virtual GPU allocation method and system in a container cloud environment based on API interception and forwarding

Publications (1)

Publication Number Publication Date
CN116991553A true CN116991553A (en) 2023-11-03

Family

ID=88524042


Country Status (1)

Country Link
CN (1) CN116991553A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012566A (en) * 2023-12-13 2024-05-10 天翼云科技有限公司 A GPU virtualization system and method in K8S
CN119336448A (en) * 2024-10-17 2025-01-21 中国电信股份有限公司技术创新中心 Business processing method, processing device, equipment, storage medium and program product
CN119376897A (en) * 2024-12-27 2025-01-28 浙江大华技术股份有限公司 A resource scheduling method, resource scheduling device and computer storage medium
CN119759594A (en) * 2025-03-07 2025-04-04 山东浪潮科学研究院有限公司 A cloud-native distributed task scheduling method, system and medium for heterogeneous tasks
CN119781969A (en) * 2024-12-16 2025-04-08 广东图灵智新技术有限公司 Calculation force sharing method and device based on time-sharing management and control
CN119806851A (en) * 2025-03-14 2025-04-11 北京九章云极科技有限公司 Developing a state computing power scheduling method and device for the universal computing power intelligent computing center
WO2025124260A1 (en) * 2023-12-15 2025-06-19 天翼云科技有限公司 Gpu pass-through and resource hybrid scheduling method and system, and chip
CN121029324A (en) * 2025-10-31 2025-11-28 麒麟软件(北京)有限公司 A method for GPU memory isolation and scheduling between containers
CN121092328A (en) * 2025-11-10 2025-12-09 支付宝(杭州)数字服务技术有限公司 GPU resource processing methods and equipment
CN121187706A (en) * 2025-11-24 2025-12-23 杭州谐云科技有限公司 Heterogeneous acceleration card sharing and isolating method and system based on API hijacking

Similar Documents

Publication Publication Date Title
CN116991553A (en) A virtual GPU allocation method and system in a container cloud environment based on API interception and forwarding
Xiao et al. AntMan: Dynamic scaling on GPU clusters for deep learning
US12073242B2 (en) Microservice scheduling
US11010053B2 (en) Memory-access-resource management
US8276145B2 (en) Protected mode scheduling of operations
US10277477B2 (en) Load response performance counters
CN103176845B (en) A virtual machine deployment method, system and device
US20240256333A1 (en) Task-centric job scheduling method and system for heterogeneous clusters
US11334477B2 (en) Virtualization of multiple coprocessor memory
JP2005534116A (en) A method for dynamically allocating and managing resources in a multi-consumer computer system.
AU2014311463A1 (en) Virtual machine monitor configured to support latency sensitive virtual machines
Siavashi et al. GPUCloudSim: an extension of CloudSim for modeling and simulation of GPUs in cloud data centers
CN116578416A (en) Signal-level simulation acceleration method based on GPU virtualization
Wang et al. Resource scheduling techniques in cloud from a view of coordination: a holistic survey
Abeni Virtualized real-time workloads in containers and virtual machines
Segarra et al. GRANNY: Granular Management of Compute-Intensive Applications in the Cloud
Fotsing et al. Cheddar architecture description language
CN114281529B (en) Method, system and terminal for dispatching optimization of distributed virtualized client operating system
Samimi et al. Enabling containerisation of distributed applications with real-time constraints
Tilles Serverless computing on constrained edge devices
Plauth et al. CloudCL: distributed heterogeneous computing on cloud scale
Xiong et al. DCS3: A Dual-Layer Co-Aware Scheduler With Stealing Balance and Synchronized Priority in Virtualization Environments
Gheibi-Fetrat et al. Explicitly Considering Real-Time Constraints
Gheibi-Fetrat et al. RTGPU: Real-Time Computing with Graphics Processing Units: Architectures, Challenges, and Solutions for Time-Critical GPU Systems
Venkata Hierarchical Real-Time Scheduling In Paravirtualized Systems: Design, Implementation, And Optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination