CN114356543A - A Kubernetes-based Multi-tenant Machine Learning Task Resource Scheduling Method - Google Patents
- Publication number
- CN114356543A (application number CN202111460970.6A)
- Authority
- CN
- China
- Prior art keywords
- node
- resource
- gpu
- cpu
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a Kubernetes-based multi-tenant machine learning task resource scheduling method. It applies quota management to the computing resources available to each user while monitoring the resource status of every Node in the Kubernetes platform, taking into account the resource utilization of each Node's host machine to avoid inaccurate scheduling results. By monitoring real-time and pre-scheduling request information, it ranks the Nodes by priority according to the scheduling task's resource demands, obtains the host label of the optimal Node, and allocates resources to the various machine learning model training and prediction tasks according to that label. The invention effectively prevents and reduces skewed node resource usage in the Kubernetes platform, achieves load balancing across nodes, and improves node resource utilization.
Description
Technical Field

The invention relates to a Kubernetes-based multi-tenant machine learning task resource scheduling method, and belongs to the technical field of power grid dispatching and control.
Background Art

The application of artificial intelligence in power grid dispatching and control has achieved preliminary results, but computing power management still faces the problem of scattered resources, which constrains application breakthroughs. Each application deploys its own siloed ("chimney-style") AI development and runtime environment, leading to duplicated construction of the underlying hardware, fragmented computing power, and poor scalability.

The IaaS layer of cloud computing platforms mainly relies on virtualization to achieve multi-tenant resource isolation and dynamic allocation. However, traditional virtualization itself consumes a significant share of hardware resources and is ill-suited to the high-utilization scenarios of machine learning model training and prediction; moreover, the complexity of configuring, running, and managing applications on it is high, which hampers unified cluster management.

Kubernetes is popular with developers for its ability to automatically orchestrate, deploy, and schedule services. The present invention performs custom resource orchestration and scheduling based on Kubernetes to support the development of the AI application development and service support platform in the new-generation dispatching technical support system. It is used for resource scheduling of machine learning training and prediction tasks such as grid fault identification and analysis, grid operation forecasting and analysis, and intelligent dispatching decision support; the application results verify the technical approach and reliability of the invention.
SUMMARY OF THE INVENTION

Objective: To overcome the deficiencies of the prior art, the present invention provides a Kubernetes-based multi-tenant machine learning task resource scheduling method. It uses Kubernetes and container technology to uniformly manage the IaaS-layer CPU, GPU, and memory resources and builds a standardized runtime environment for multi-tenant machine learning model training and prediction applications, improving the controllability, elastic scalability, and resource isolation of the power grid dispatching and control system.

Technical solution: To solve the above technical problems, the technical solution adopted by the present invention is as follows:

A Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:
Compute, for each Node in the cluster, the difference between the Node's used resources and the resources used by its created containers, yielding the resources occupied by the Node operating system's own processes.

Call the Kubernetes API to obtain the resources requested by all machine learning model training and prediction task containers on the Node.

Subtract from the Node's total resource capacity both the resources occupied by the operating system's own processes and the resources requested by all training and prediction containers on the Node, yielding the Node's real-time available resources.

From the Node's real-time available resources and its total resource capacity, compute the availability rates of the Node's CPU, GPU, and memory.

The system cluster resource management service presets a resource threshold percentage; only Nodes whose CPU, GPU, and memory availability rates are all at or above this threshold may be assigned computing resources for training and prediction tasks.

The machine learning task scheduling service sends the CPU, GPU, and memory amounts requested by each user's training and prediction tasks to the system cluster resource management service.

The system cluster resource management service computes the difference between the multi-tenant resource quota table and the user resource usage table to obtain the remaining resources the user may request, and verifies that the CPU, GPU, and memory amounts requested by the task do not exceed that remainder.

If the request does not exceed the user's remaining quota, the system cluster resource management service subtracts the requested CPU, GPU, and memory amounts from each candidate Node's real-time available resources and divides the result by the Node's total capacity, obtaining the percentage of CPU, GPU, and memory that would remain after allocation.

Select the Nodes whose post-allocation remaining percentages of CPU, GPU, and memory all exceed the preset resource threshold percentage, compute a score for each such Node from its remaining percentages, and sort the Nodes by score in descending order.

The system cluster resource management service takes the first Node in the sorted sequence as the optimal node, returns its node name to the machine learning task scheduling service, and persists the allocation in the user resource usage table.

The machine learning task scheduling service dynamically generates a Kubernetes YAML file and calls the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
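The node-selection steps above can be sketched compactly as follows. This is a minimal illustration, not the patented implementation: per-node metrics are assumed to be already collected, and all node figures, names, and the 10% threshold are made-up assumptions.

```python
# Minimal sketch of the selection steps above: filter out nodes that would
# drop below the preset threshold, score the rest, pick the best.
THRESHOLD = 0.10  # hypothetical preset resource threshold percentage

def schedule(nodes, req):
    """Return the name of the optimal Node, or None if scheduling fails."""
    best = None
    for name, n in nodes.items():
        # Remaining fraction of each resource if the request were allocated.
        rem = {r: (n[r] - req[r]) / n[r + "_total"]
               for r in ("cpu", "gpu", "mem")}
        if min(rem.values()) < THRESHOLD:      # node would be overloaded
            continue
        score = sum(req[r] * rem[r] for r in rem)
        if best is None or score > best[0]:
            best = (score, name)
    return best[1] if best else None

nodes = {
    "node-a": {"cpu": 32, "cpu_total": 64, "gpu": 2, "gpu_total": 4,
               "mem": 32, "mem_total": 64},
    "node-b": {"cpu": 8,  "cpu_total": 64, "gpu": 1, "gpu_total": 4,
               "mem": 16, "mem_total": 64},
}
# node-b would fall to (8-4)/64 ~ 6% remaining CPU, below the threshold,
# so node-a is chosen.
assert schedule(nodes, {"cpu": 4, "gpu": 1, "mem": 8}) == "node-a"
```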
Preferably, a CPU, GPU, and memory usage collection program is deployed on every Kubernetes Node in the cluster.

Preferably, the user ID is used as the Kubernetes namespace to logically partition and isolate the virtual resource pool formed from the Nodes' resource capacities.

Preferably, the multi-tenant resource quota table is as follows:

Preferably, the user resource usage table is as follows:

Preferably, Kubernetes role-based access control grants each user access only to the namespaces that user may operate on.
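Such per-tenant access control could be expressed, for example, as a Role/RoleBinding pair confined to the user's namespace. The sketch below only builds the dicts one would serialize to YAML; the user name, verbs, and resource list are illustrative assumptions, not taken from the patent.

```python
# Illustrative per-tenant RBAC objects: a Role scoped to the tenant's
# namespace and a RoleBinding granting that Role to the tenant's user.
def tenant_rbac(user_id):
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"namespace": user_id, "name": f"{user_id}-role"},
        "rules": [{"apiGroups": [""], "resources": ["pods"],
                   "verbs": ["get", "list", "create", "delete"]}],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"namespace": user_id, "name": f"{user_id}-binding"},
        "subjects": [{"kind": "User", "name": user_id}],
        "roleRef": {"kind": "Role", "name": f"{user_id}-role",
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
    return role, binding

role, binding = tenant_rbac("user1")
assert role["metadata"]["namespace"] == "user1"
assert binding["roleRef"]["name"] == "user1-role"
```

Because both objects are namespaced, a tenant granted only this binding cannot touch pods in another tenant's namespace, which is the isolation property the paragraph above describes.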
Preferably, the Kubernetes cluster comprises the following components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, and Container runtime.

Preferably, the score of each Node is computed from its post-allocation remaining percentages of CPU, GPU, and memory as follows:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i

where Score_i is the score of the i-th Node; percent_cpu_i, percent_gpu_i, and percent_mem_i are the percentages of CPU, GPU, and memory remaining on the i-th Node after allocation; and request_cpu, request_gpu, and request_mem are the CPU, GPU, and memory amounts requested by the task.
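The scoring formula can be written directly as a function. The request amounts and remaining fractions below are illustrative numbers only; note how weighting by the request makes the score favor nodes with headroom in the resources the task actually needs most.

```python
# Score_i = request_cpu*percent_cpu_i + request_gpu*percent_gpu_i
#         + request_mem*percent_mem_i
def score(request_cpu, request_gpu, request_mem,
          percent_cpu, percent_gpu, percent_mem):
    return (request_cpu * percent_cpu
            + request_gpu * percent_gpu
            + request_mem * percent_mem)

# Two candidate nodes for a task requesting 4 CPUs, 1 GPU, 8 units of memory:
s1 = score(4, 1, 8, 0.50, 0.25, 0.40)   # 4*0.50 + 1*0.25 + 8*0.40
s2 = score(4, 1, 8, 0.30, 0.75, 0.20)   # 4*0.30 + 1*0.75 + 8*0.20
assert s1 > s2   # the first node ranks higher: more spare CPU and memory,
                 # which this memory/CPU-heavy request weights most
```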
Beneficial effects: The Kubernetes-based multi-tenant machine learning task resource scheduling method provided by the invention applies quota management to the computing resources available to each user while monitoring the resource status of every Node in the Kubernetes platform, taking into account the resource utilization of each Node's host machine to avoid inaccurate scheduling results. By monitoring real-time and pre-scheduling request information, it ranks the Nodes by priority according to the scheduling task's resource demands, obtains the host label of the optimal Node, and allocates resources to the various machine learning model training and prediction tasks according to that label. It thereby effectively prevents and reduces skewed node resource usage in the Kubernetes platform, achieves load balancing across nodes, and improves node resource utilization.
Description of Drawings

FIG. 1 is a schematic diagram of multi-tenant cluster resource management in an example of the invention.

FIG. 2 is a schematic diagram of the Kubernetes cluster resource management architecture in an example of the invention.

FIG. 3 is a flowchart of machine learning training and prediction task creation in an embodiment of the invention.

Detailed Description

The invention is further described below with reference to specific embodiments.
A Kubernetes-based multi-tenant machine learning task resource scheduling method comprises the following steps:

1) Compute the difference between the resources used by the Node (node_cpu_used_i, node_gpu_used_i, and node_mem_used_i) and the resources used by its created containers (pod_cpu_used_i, pod_gpu_used_i, and pod_mem_used_i) to obtain the resources occupied by the node operating system's own processes.

2) Call the Kubernetes API to obtain the resources requested by all machine learning model training and prediction task containers on the node (pod_cpu_req_i, pod_gpu_req_i, and pod_mem_req_i).

3) Subtract the two values above from the Node's total resource capacity (node_cpu_total_i, node_gpu_total_i, and node_mem_total_i) to obtain the Node's real-time available resources (node_cpu_i, node_gpu_i, and node_mem_i).

4) Compute the availability rates of each Node's CPU, GPU, and memory with the following formulas:
percent_cpu_i = node_cpu_i / node_cpu_total_i

percent_gpu_i = node_gpu_i / node_gpu_total_i

percent_mem_i = node_mem_i / node_mem_total_i
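Steps 1) to 4) amount to a small amount of arithmetic per node. The sketch below shows it for a single resource; quantities are plain numbers in one unit per resource (cores, GPU count, memory units), and all values are illustrative.

```python
# Steps 1-4 for one resource on one node: derive the OS overhead from
# node vs. container usage, then the real-time available amount and the
# availability rate.
def available(total, node_used, pods_used, pods_requested):
    os_overhead = node_used - pods_used   # processes outside any container
    return total - os_overhead - pods_requested

def percent(avail, total):
    return avail / total

# CPU on node i: 64 cores total, 30 in use overall, 26 of those by pods,
# and 40 cores requested by training/prediction containers.
node_cpu = available(64, 30, 26, 40)
assert node_cpu == 20                     # 64 - (30 - 26) - 40
assert percent(node_cpu, 64) == 0.3125    # availability rate ~31%
```

Using the containers' *requested* amounts (rather than their current usage) keeps the availability figure conservative: resources a pod has reserved but is not yet consuming are still treated as taken.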
5) The system cluster resource management service presets a resource threshold percentage; nodes whose availability falls below it are no longer assigned computing resources for training and prediction tasks, ensuring that no Node becomes overloaded.

6) The machine learning task scheduling service sends the CPU, GPU, and memory amounts required by each user's training and prediction tasks (request_cpu, request_gpu, and request_mem) to the system cluster resource management service.

7) The system cluster resource management service computes the difference between the multi-tenant resource quota table and the user resource usage table to obtain the remaining resources the user may request, and verifies that the request_cpu, request_gpu, and request_mem amounts requested by the task do not exceed that remainder.
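The quota check in step 7) can be sketched as follows. The database tables are replaced here by in-memory dicts, and the user name, field names, and numbers are assumptions for illustration.

```python
# Hedged sketch of step 7: remaining quota = per-tenant quota minus recorded
# usage; a request passes only if every resource fits within the remainder.
quota = {"user1": {"cpu": 32, "gpu": 4, "mem": 65536}}   # quota table stand-in
usage = {"user1": {"cpu": 20, "gpu": 1, "mem": 40000}}   # usage table stand-in

def remaining(user):
    return {k: quota[user][k] - usage[user][k] for k in quota[user]}

def check_request(user, request):
    rem = remaining(user)
    return all(request[k] <= rem[k] for k in request)

assert remaining("user1") == {"cpu": 12, "gpu": 3, "mem": 25536}
assert check_request("user1", {"cpu": 8, "gpu": 2, "mem": 20000})       # fits
assert not check_request("user1", {"cpu": 16, "gpu": 1, "mem": 1024})   # CPU over
```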
8) The system cluster resource management service subtracts the requested request_cpu, request_gpu, and request_mem amounts from the Node's real-time available resources (node_cpu_i, node_gpu_i, and node_mem_i) and divides the result by the Node's total capacity, obtaining the percentage of resources remaining after allocation:
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i

percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i

percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining resource percentage is below the preset resource threshold percentage are filtered out; the remaining nodes are then scored and sorted by score:
Score_i = request_cpu × percent_cpu_i + request_gpu × percent_gpu_i + request_mem × percent_mem_i
9) The system cluster resource management service selects the highest-scoring Node as the optimal node, returns its node name to the machine learning task scheduling service, and persists the allocation in the user resource usage table.

10) The machine learning task scheduling service dynamically generates a Kubernetes YAML file and calls the Kubernetes API to create a container on the optimal node to run the machine learning model training or prediction task.
The purpose of the invention is to use Kubernetes and container technology to uniformly manage the IaaS-layer CPU, GPU, memory, and storage resources, build a standardized runtime environment for multi-tenant machine learning model training and prediction applications, and improve the controllability, elastic scalability, and resource isolation of the power grid system. The technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings:

The Nodes in the cluster are labeled according to their available CPU, GPU, memory, and storage resources; Kubernetes integrates the cluster's resources into a single resource pool, and the user ID is used as the Kubernetes namespace (Namespace) to logically partition and isolate the virtual resource pool, as shown in FIG. 1.

The system administrator allocates resources to each user through the cluster's multi-tenant resource management interface tool; this information is persisted in the multi-tenant resource quota table, as shown in Table 1. Kubernetes role-based access control (RBAC) grants each user access only to the namespaces that user may operate on, preventing users' resource usage from interfering with one another.

A CPU, GPU, and memory usage collection program is deployed on every Kubernetes Node in the cluster, and from the collected information the available resources and availability rates of all Nodes are computed.
Table 1. Multi-tenant resource quota table

Table 2. User resource usage table
FIG. 2 is a schematic diagram of the Kubernetes cluster resource management architecture of this embodiment, in which the cluster consists of 2 Master nodes and 6 Nodes. The Master node is the cluster's main control unit and is mainly responsible for scheduling and management; to cope with growing project demands and access volume, this embodiment builds a high-availability configuration with dual Master nodes. The Nodes are workload nodes that run the business application containers and are divided into a CPU cluster and a GPU cluster: the CPU cluster mainly creates regular pod tasks, while the GPU cluster mainly creates pod tasks involving image computation. This dual-cluster arrangement lets the deployed applications run more rationally and efficiently.

The Kubernetes cluster comprises seven main components: API Server, Controller Manager, Scheduler, Kubelet, Kube-proxy, Etcd, and Container runtime, which cooperate to run the whole cluster. The scheduling strategy of the invention chiefly acts on the Scheduler by computing an evaluation score for real-time and scheduled tasks on each Node. The score covers two aspects: on one hand the actual usage of the Node's own resources, and on the other the pod's relative demand for CPU, GPU, and memory. The strategy comprehensively evaluates each Node for real-time and scheduled tasks, selects the Node with the highest score as the target scheduling node, bypasses the Scheduler's default predicate and priority policies, and creates the pod directly on the designated Node by setting a unique label. FIG. 3 shows the pod task creation flow of this embodiment; the specific procedure is as follows:

Step 1: Obtain the CPU, GPU, and memory usage of the host machine of each Node in the Kubernetes platform, together with the CPU, GPU, and memory usage and request allocations of the pods on that node, and from this information compute each Node's available resources and availability rates.

First, the host's resource usage outside the pod containers is obtained as the difference between the host's usage and the pods' usage; next, the pods' actual allocated resources are obtained, and subtracting both quantities from the total capacity yields the Node's actual availability. The actual available CPU, GPU, and memory resources of all Nodes are computed with the following formulas:
node_cpu_i = node_cpu_total_i - (host_cpu_used_i - pod_cpu_used_i) - pod_cpu_req_i

node_mem_i = node_mem_total_i - (host_mem_used_i - pod_mem_used_i) - pod_mem_req_i

node_gpu_i = node_gpu_total_i - (host_gpu_used_i - pod_gpu_used_i) - pod_gpu_req_i
where node_cpu_i, node_mem_i, and node_gpu_i are the actual available CPU, memory, and GPU resources of the Node; node_cpu_total_i, node_mem_total_i, and node_gpu_total_i are the Node's total CPU, memory, and GPU capacities; host_cpu_used_i, host_mem_used_i, and host_gpu_used_i are the host machine's CPU, memory, and GPU usage; pod_cpu_used_i, pod_mem_used_i, and pod_gpu_used_i are the CPU, memory, and GPU usage of the pods on the Node; and pod_cpu_req_i, pod_mem_req_i, and pod_gpu_req_i are the resource request allocations of the pods on the Node.

The availability rates of each Node's CPU, GPU, and memory are computed with the following formulas:
percent_cpu_i = node_cpu_i / node_cpu_total_i

percent_mem_i = node_mem_i / node_mem_total_i

percent_gpu_i = node_gpu_i / node_gpu_total_i
Step 2: Compare each Node's CPU, GPU, and memory availability rates against the preset threshold. A node below the threshold is considered overloaded and is filtered out. If no nodes pass the filter, scheduling fails; if at least one node passes, proceed to Step 3.

Step 3: Through the K8s scheduler, obtain the CPU, GPU, and memory requests of the real-time or scheduled task pod (request_cpu, request_gpu, and request_mem) together with the user ID. Look up the user's remaining resources by user ID and compare them with the request to decide whether the pod may be created; if not, scheduling fails, otherwise proceed to the next step.

Step 4: Compare the task resource request obtained in Step 3 against each Node's available resources and filter out the Nodes with insufficient CPU, GPU, or memory. If no nodes remain, scheduling fails; if exactly one node remains, that Node is set as the host for the pod to be created; if more than one node remains, proceed to the next step.

Step 5: Score the remaining Nodes by computing, with the following formulas, the percentage of resources that would remain on each Node after the requested task's resources are allocated:
percent_cpu_i = (node_cpu_i - request_cpu) / node_cpu_total_i

percent_gpu_i = (node_gpu_i - request_gpu) / node_gpu_total_i

percent_mem_i = (node_mem_i - request_mem) / node_mem_total_i
Nodes whose post-allocation remaining resource percentage is below the preset threshold are excluded; for the remaining nodes, the post-allocation remaining percentages of CPU, GPU, and memory are accumulated and the nodes sorted.

The Nodes are prioritized and the optimal node is determined from the ranking: if one node remains, it is the optimal node and its label is obtained; if more than one remains, the optimal Node is selected according to the ranking and its label obtained. Finally, the pod is started by specifying that label in the machine learning task's YAML file.
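Pinning the pod to the labeled node comes down to emitting a manifest with a nodeSelector matching that label. The sketch below builds only the dict one would dump to the task's YAML file; the label key, container image, and resource names are illustrative assumptions (nvidia.com/gpu is the usual extended-resource name for NVIDIA GPUs, but the patent does not specify one).

```python
# Hedged sketch of the final step: a pod manifest that targets the chosen
# node via a label selector, in the tenant's namespace.
def build_pod_manifest(task_name, user_id, node_label, req):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name, "namespace": user_id},
        "spec": {
            # Hypothetical label key carrying the optimal node's label.
            "nodeSelector": {"scheduler/target": node_label},
            "containers": [{
                "name": task_name,
                "image": "ml-train:latest",   # hypothetical training image
                "resources": {"requests": {
                    "cpu": str(req["cpu"]),
                    "memory": f'{req["mem"]}Gi',
                    "nvidia.com/gpu": str(req["gpu"]),
                }},
            }],
        },
    }

pod = build_pod_manifest("train-job-1", "user1", "node-a",
                         {"cpu": 4, "gpu": 1, "mem": 8})
assert pod["spec"]["nodeSelector"]["scheduler/target"] == "node-a"
assert pod["metadata"]["namespace"] == "user1"
```

Serializing this dict (e.g. with a YAML library) and submitting it through the Kubernetes API reproduces the "generate YAML, create container on the optimal node" step, with the default scheduler constrained to the pre-selected node.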
The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111460970.6A CN114356543B (en) | 2021-12-02 | 2021-12-02 | A multi-tenant machine learning task resource scheduling method based on Kubernetes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114356543A true CN114356543A (en) | 2022-04-15 |
CN114356543B CN114356543B (en) | 2025-01-28 |
Family
ID=81096598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111460970.6A Active CN114356543B (en) | 2021-12-02 | 2021-12-02 | A multi-tenant machine learning task resource scheduling method based on Kubernetes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114356543B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180074855A1 (en) * | 2016-09-14 | 2018-03-15 | Cloudera, Inc. | Utilization-aware resource scheduling in a distributed computing cluster |
CN109885389A (en) * | 2019-02-19 | 2019-06-14 | Shandong Inspur Cloud Information Technology Co., Ltd. | A container-based parallel deep learning scheduling and training method and system
CN113157379A (en) * | 2020-01-22 | 2021-07-23 | Hitachi, Ltd. | Cluster node resource scheduling method and device
US20210365290A1 (en) * | 2020-04-16 | 2021-11-25 | Nanjing University Of Posts And Telecommunications | Multidimensional resource scheduling method in kubernetes cluster architecture system
CN112418438A (en) * | 2020-11-24 | 2021-02-26 | NARI Technology Co., Ltd. | Container-based machine learning procedural training task execution method and system
Non-Patent Citations (1)
Title |
---|
XIE, Wenzhou; SUN, Yanxia: "Research on a Resource Prediction Model Based on Kubernetes Load Characteristics", Network Security Technology and Application, no. 004, 31 December 2018 (2018-12-31) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114780245A (en) * | 2022-05-07 | 2022-07-22 | Bank of China Limited | Server resource allocation method and device
CN114661482A (en) * | 2022-05-25 | 2022-06-24 | Chengdu Sobey Digital Technology Co., Ltd. | GPU computing power management method, medium, equipment and system
CN115145704A (en) * | 2022-06-02 | 2022-10-04 | Suzhou Sicui Industrial Internet Technology Research Institute Co., Ltd. | Pod scheduling method and system based on genetic algorithm
CN115098238A (en) * | 2022-07-07 | 2022-09-23 | Beijing Dingcheng Zhizao Technology Co., Ltd. | Application program task scheduling method and device
CN115098238B (en) * | 2022-07-07 | 2023-05-05 | Beijing Dingcheng Zhizao Technology Co., Ltd. | Application program task scheduling method and device
CN115237608A (en) * | 2022-09-21 | 2022-10-25 | Zhejiang Lab | A multi-mode scheduling system and method based on multi-cluster unified computing power
CN115604362A (en) * | 2022-09-30 | 2023-01-13 | Suzhou Inspur Intelligent Technology Co., Ltd. | Scheduling management method and device based on Kubernetes
CN115604362B (en) * | 2022-09-30 | 2024-06-21 | Suzhou Inspur Intelligent Technology Co., Ltd. | Scheduling management method and device based on Kubernetes
CN115373764A (en) * | 2022-10-27 | 2022-11-22 | Zhongcheng Hualong Computer Technology Co., Ltd. | Automatic container loading method and device
WO2024114483A3 (en) * | 2022-11-28 | 2024-08-02 | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences | Resource allocation method and network based on dynamic programming, and storage medium and processor
CN118069379A (en) * | 2024-04-24 | 2024-05-24 | Kylin Software Co., Ltd. | Scheduling realization method based on GPU resources
CN118069379B (en) * | 2024-04-24 | 2024-06-18 | Kylin Software Co., Ltd. | Scheduling realization method based on GPU resources
CN118502969A (en) * | 2024-07-17 | 2024-08-16 | Beijing Kedong Electric Power Control System Co., Ltd. | K8S platform-based multi-scene training exercise application building method and system
Also Published As
Publication number | Publication date |
---|---|
CN114356543B (en) | 2025-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114356543A (en) | A Kubernetes-based Multi-tenant Machine Learning Task Resource Scheduling Method | |
CN113342477B (en) | Container group deployment method, device, equipment and storage medium | |
US7945913B2 (en) | Method, system and computer program product for optimizing allocation of resources on partitions of a data processing system | |
CN104503838B (en) | A virtual CPU scheduling method | |
CN106095569B (en) | An SLA-based cloud workflow engine resource scheduling and control method | |
CN113064712B (en) | Micro-service optimization deployment control method, system and cluster based on cloud edge environment | |
CN105446816B (en) | An energy-optimized scheduling method for heterogeneous platforms | |
CN110221920B (en) | Deployment method, device, storage medium and system | |
CN107346264A (en) | A virtual machine load-balancing scheduling method, apparatus, and server device | |
CN114996018A (en) | Resource scheduling method, node, system, device and medium for heterogeneous computing | |
CN104679594B (en) | A middleware distributed computing method | |
CN102968344A (en) | Method for migration scheduling of multiple virtual machines | |
CN113672391B (en) | Parallel computing task scheduling method and system based on Kubernetes | |
CN102708003A (en) | Method for allocating resources under cloud platform | |
CN114968566A (en) | Container scheduling method and device under shared GPU cluster | |
CN114625500B (en) | Topology-aware microservice application scheduling method and application in cloud environment | |
CN114911613B (en) | Cross-cluster resource high-availability scheduling method and system in inter-cloud computing environment | |
CN106371893A (en) | Cloud computing scheduling system and method | |
CN112559122A (en) | Virtualization instance management and control method and system based on electric power special security and protection equipment | |
CN108694083B (en) | Data processing method and device for server | |
CN116708454A (en) | Multi-cluster cloud computing system and multi-cluster job distribution method | |
CN114968601B (en) | Scheduling method and scheduling system for AI training jobs with resources reserved in proportion | |
CN110084507B (en) | A hierarchical-aware scientific workflow scheduling optimization method in cloud computing environment | |
CN107992351B (en) | Hardware resource allocation method and device and electronic equipment | |
CN112416520B (en) | Intelligent resource scheduling method based on vSphere |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||