CN118093170A - A GPU resource scheduling method and system based on heterogeneous computing platform - Google Patents
- Publication number
- CN118093170A (publication number); application CN202410178435.9A
- Authority
- CN
- China
- Prior art keywords
- data
- gpu
- platform
- resources
- resource
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention discloses a GPU resource scheduling method and system based on heterogeneous computing platforms. First, the GPU is treated as a general-purpose resource, and the GPU resources scattered across heterogeneous computing platforms are pooled into a single GPU resource pool for the scheduler to allocate. Second, monitoring data is collected from the heterogeneous platforms using platform-specific tools; because the collected data may contain anomalies caused by equipment failures, network problems, or software faults, anomaly detection algorithms are used to identify and locate abnormal data so that corrective measures can be taken promptly. Third, to address the high scheduling overhead and low efficiency of heterogeneous platforms, a predictive scheduling algorithm forecasts each platform's resource usage trend so that resources can be allocated to it in advance. Finally, based on these predictions, the scheduler allocates resources appropriately, improving GPU resource utilization.
Description
Technical Field
The present invention belongs to the field of resource scheduling and middleware in computer science, and relates to a method and system for scheduling GPU resources across heterogeneous platforms. Middleware of the different computing platforms is used to collect monitoring data; in particular, data is collected from an OpenStack cluster on a cloud computing platform, a Slurm cluster on a high-performance computing platform, and a Kubernetes cluster on an artificial intelligence platform.
Background Art
With the development of information technology, data centers face an ever-growing demand for computing power and must make better use of it to process explosively growing volumes of data. Computing power has become a core productive force. As centers of computing-power production and supply, data centers provide a variety of compute-based services. With their powerful compute capability, high memory bandwidth, and ability to process massive datasets in parallel, GPUs have become an essential computing resource.
Cluster scheduling for data centers is currently a research focus, and different scheduling systems have emerged in different fields, such as artificial intelligence, big-data analytics, and cloud computing. Within a single computing platform, such a scheduling system can effectively schedule tasks and allocate resources, improving resource utilization.
On artificial intelligence platforms, the dominant GPU scheduling framework is Kubernetes, a container-based scheduling framework that can virtualize one physical resource into multiple logical resources usable by different applications. On high-performance computing platforms, the dominant GPU resource scheduling framework is Slurm, a cluster manager and job scheduling system widely used in HPC clusters and valued for being open source, highly fault tolerant, and highly scalable.
Common data anomaly detection algorithms are based on machine learning and deep learning. Machine-learning-based anomaly detection models include SR, SVR, and Prophet; deep-learning-based models mainly include VAE and LSTM. Different detection algorithms suit different data scenarios, so the anomaly detection model must be chosen according to the specific characteristics of the data.
The ARIMA model is a time series model whose basic idea is to use the historical information in the data itself to predict future values; it can be used to predict the resource usage of tasks awaiting scheduling.
Existing resource scheduling for heterogeneous computing platforms suffers from large differences in the feature values of data collected on different platforms, a high degree of data anomaly, and poor scheduling accuracy and timeliness, which leads to wasted resources.
Summary of the Invention
The present invention proposes a GPU resource scheduling method and system for heterogeneous computing platforms, intended to partially solve the problem of low GPU resource utilization on heterogeneous computing platforms in the prior art.
The technical solution adopted by the method of the present invention is a GPU resource scheduling method based on heterogeneous computing platforms, comprising the following steps:
Step 1: Treat the GPU as a general-purpose resource and pool the GPU resources scattered across heterogeneous computing platforms into a GPU resource pool for the scheduler to allocate;
Step 2: Collect monitoring data from the heterogeneous platforms to provide data support for anomaly detection and predictive scheduling;
Step 3: Run anomaly detection on the data and process the detected abnormal data;
Step 4: Predict each platform's resource usage trend and allocate resources to the platform in advance, so as to make fuller use of GPU resources;
Step 5: Based on the predictions of step 4, schedule resources across the different computing platforms.
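The five steps above can be sketched end to end in a few lines. Every name below (predict_next, schedule, the platform keys, and the numbers) is a hypothetical placeholder for illustration, not an interface defined by the patent:

```python
# Minimal sketch of the five-step scheduling loop; all names and figures
# are illustrative placeholders, not part of the patented system.

def predict_next(series):
    """Step 4 stand-in: naive trend forecast (last value plus mean delta)."""
    deltas = [b - a for a, b in zip(series, series[1:])]
    return series[-1] + sum(deltas) / len(deltas)

def schedule(pool_size, demands):
    """Step 5 stand-in: give each platform its predicted share of the pool."""
    total = sum(demands.values())
    return {p: pool_size * d / total for p, d in demands.items()}

# Steps 1-2 are assumed done: a pooled GPU count and per-platform usage history.
pool = 16  # GPUs in the pooled resource pool
history = {
    "openstack":  [4.0, 4.2, 4.1, 4.3, 4.5],
    "kubernetes": [6.0, 6.5, 7.0, 7.4, 8.1],
    "slurm":      [2.0, 2.1, 1.9, 2.0, 2.1],
}
demand = {p: predict_next(s) for p, s in history.items()}  # step 4
plan = schedule(pool, demand)                              # step 5
```

In a real deployment, steps 1-3 would be backed by the pooling, collection, and anomaly detection mechanisms described below; here they are reduced to a fixed pool size and a hard-coded, already-cleaned usage history.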
Preferably, in step 1, GPU virtualization is used to split large cards into smaller virtual cards, and remote-GPU functionality is used to break the boundaries of physical servers, extending GPU management and use from a single server to the entire data center. This pools the GPU resources of the different heterogeneous platforms into a GPU resource pool that the scheduler can allocate.
Preferably, in step 2, data collection on the cloud computing platform uses the Ceilometer framework from the OpenStack community; Ceilometer uses agents to collect cluster resource information.
Preferably, in step 2, historical data collection on the artificial intelligence platform uses the Resource Metrics API and the Custom Metrics API to collect the cluster's resource usage data.
Preferably, in step 2, historical resource usage data of the high-performance computing platform is collected by deploying Prometheus on the Slurm cluster. Prometheus is an open-source framework for monitoring and alerting, and its exporters can export various resource information from the high-performance computing platform.
Preferably, in step 3, an anomaly detection model for heterogeneous computing platform data is used to detect anomalies in the data and to process the detected anomalies. The model extracts stability and periodicity feature values from each application platform's data; a model router then selects a matching anomaly detection algorithm according to these feature values.
Preferably, in step 3, anomaly detection is performed on time series data; the most frequent and most widely applicable anomaly category is the point anomaly.
A data point is judged anomalous according to the point-anomaly rule:

|x_t − x̂_t| > τ

where x_t is the data feature value at time t, x̂_t is the expected value of the data at time t, and τ is a threshold greater than 0. If, at time t, the absolute difference between the observed feature value and the expected value exceeds the given threshold, the data collected at time t is anomalous and must be de-anomalized.
Preferably, in step 4, the resource usage trend is predicted. The ARIMA model first takes the collected time series as input and checks it for stationarity, applying differencing to remove volatility until the required stationarity is reached; the order of differencing is determined in this process. Once the data is stationary, it is tested for white noise; in this process the AR and MA components are used to reduce the residual to white noise within a given threshold, and the ARIMA model is fitted, with the autoregressive model determining the order p of the autoregressive part and the moving average model determining the order q of the moving average part. When the stationarity and white-noise requirements are met, the ARIMA modeling is complete.
The technical solution adopted by the system of the present invention is a GPU resource scheduling system based on heterogeneous computing platforms, comprising the following modules:
a GPU resource pool construction module, for treating the GPU as a general-purpose resource and pooling the GPU resources scattered across heterogeneous computing platforms into a GPU resource pool for the scheduler to allocate;
a heterogeneous platform monitoring data collection module, for collecting monitoring data from the heterogeneous platforms to provide data support for anomaly detection and predictive scheduling;
an abnormal data detection and processing module, for detecting anomalies in the data and processing the detected abnormal data;
a trend prediction module, for predicting each platform's resource usage trend and allocating resources to the platform in advance, so as to make fuller use of GPU resources;
a resource scheduling module, for scheduling resources across the different computing platforms according to the prediction results.
Compared with the prior art, the beneficial effects of the present invention include:
(1) By decoupling GPU nodes from their platforms and pooling them, the present invention enables resource scheduling across heterogeneous computing platforms and improves the overall resource utilization of the data center.
(2) The present invention proposes a data anomaly detection model for heterogeneous computing platforms. To address the high anomaly rates and large feature-value differences of data from different platforms, it extracts data feature values and selects a matching anomaly detection algorithm, improving data quality.
(3) The present invention proposes a resource prediction method. To address the high overhead and low efficiency of resource scheduling on heterogeneous computing platforms, it uses an ARIMA model to predict resource usage before scheduling, improving resource utilization.
Brief Description of the Drawings
The technical solution is further described below through embodiments and specific implementations, with reference to the accompanying drawings. Those skilled in the art can derive other drawings, and the intent of the invention, from these drawings without creative effort.
FIG. 1 is a flowchart of a GPU resource scheduling method based on heterogeneous computing platforms provided by an embodiment of the present invention;
FIG. 2 is a flowchart of ARIMA model construction in the present invention;
FIG. 3 is a schematic diagram of data collection by the OpenStack cluster in an embodiment of the present invention;
FIG. 4 is a schematic diagram of data collection by the Kubernetes cluster in an embodiment of the present invention;
FIG. 5 is a flowchart of the data anomaly detection model in an embodiment of the present invention;
FIG. 6 is a flowchart of resource prediction in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the principle of resource scheduling in an embodiment of the present invention.
Detailed Description of the Embodiments
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the examples described here serve only to illustrate and explain the invention, not to limit it.
Referring to FIG. 1, the GPU resource scheduling method based on heterogeneous computing platforms provided by this embodiment comprises the following steps:
Step 1: Treat the GPU as a general-purpose resource and pool the GPU resources scattered across heterogeneous computing platforms into a GPU resource pool for the scheduler to allocate. GPU virtualization is used to split large cards into smaller virtual cards; the virtualization technology provided by NVIDIA, namely vCUDA, enables fine-grained partitioning, recombination, and reuse of GPU resources, supports fine-grained GPU sharing, and supports dynamic allocation and automatic release of GPU resources. Remote-GPU functionality breaks the boundaries of physical servers, extending GPU management and use from a single server to the entire data center and pooling the GPU resources of the different heterogeneous platforms into a GPU resource pool that the scheduler can allocate.
Step 2: Collect the monitoring data of the heterogeneous computing platforms, i.e., their resource information. Data is collected from the OpenStack cluster, the Kubernetes cluster, and the Slurm cluster.
In one embodiment, the data collection flow of the OpenStack cluster is shown in FIG. 3. The Ceilometer framework of the OpenStack community is responsible for data collection, which is carried out mainly by agents; Ceilometer includes the compute agent, the central agent, and others. Resource information collection on the OpenStack cluster proceeds as follows.
The compute agent collects resource usage data of the VM instances deployed on each compute node of the OpenStack platform.
The central agent collects resource usage of OpenStack components that do not publish messages through the message queue, and can also collect usage of hardware-related resources.
The collected resource information is sent through the message queue to the collector for aggregation, and the aggregated information is stored in the database.
In one embodiment, the flow of resource information collection in the Kubernetes cluster is shown in FIG. 4.
Cluster resource usage data is collected through the Resource Metrics API and the Custom Metrics API. The kubelet acts as the node-level and application-level metrics collector, monitoring the resource usage of each node.
The metrics-server stores locally the latest metric values scraped from the kubelets and exposes the main metrics API to external clients. By querying the resource metrics API, the resource usage of pods and nodes in the cluster can be viewed, and the collected usage data is aggregated into the database.
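As an illustration of the Resource Metrics API data flow described above, the sketch below parses the NodeMetricsList shape that metrics-server serves at /apis/metrics.k8s.io/v1beta1/nodes. The payload here is fabricated for the example, not captured from a real cluster:

```python
# Parse a (fabricated) NodeMetricsList payload of the kind served by
# metrics-server at /apis/metrics.k8s.io/v1beta1/nodes.

def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('1500m' or '2') to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

sample = {  # illustrative response body
    "kind": "NodeMetricsList",
    "items": [
        {"metadata": {"name": "gpu-node-1"}, "usage": {"cpu": "1500m", "memory": "2048Ki"}},
        {"metadata": {"name": "gpu-node-2"}, "usage": {"cpu": "2", "memory": "4096Ki"}},
    ],
}

# Per-node CPU usage in cores, keyed by node name.
usage = {item["metadata"]["name"]: parse_cpu(item["usage"]["cpu"])
         for item in sample["items"]}
```

In a live cluster the same dictionary would come from an authenticated GET against the API server rather than a literal.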
In one embodiment, the Slurm cluster is monitored by deploying Prometheus, which scrapes the metrics exposed over HTTP by each client. Prometheus is an open-source framework for monitoring and alerting. The system uses Prometheus's database to store job metrics, and Prometheus exporters collect the metrics of the entire Slurm scheduling system, including various resource information.
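For context on what the Prometheus side returns, an instant query against Prometheus's HTTP API (GET /api/v1/query) yields the standard envelope parsed below; the payload is again a fabricated example:

```python
# Parse a (fabricated) instant-query response in the standard shape returned
# by the Prometheus HTTP API at /api/v1/query.

sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "slurm-node-1:9100"}, "value": [1700000000, "0.73"]},
            {"metric": {"instance": "slurm-node-2:9100"}, "value": [1700000000, "0.41"]},
        ],
    },
}

assert sample["status"] == "success"
# Metric value per instance; Prometheus encodes sample values as strings.
load = {r["metric"]["instance"]: float(r["value"][1])
        for r in sample["data"]["result"]}
```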
Step 3: Detect anomalies in the collected data and process them; the flow is shown in FIG. 5. The specific steps are as follows:
In one embodiment, an anomaly detection model for heterogeneous computing platform data detects anomalies in the data and processes them.
This embodiment proposes such a model: it extracts stability and periodicity feature values from each application platform's data, and a model router then selects a matching anomaly detection algorithm according to these feature values (model routing is precisely the process of selecting the detection algorithm). This improves data quality and provides data support for the scheduling algorithm.
Common models for data anomaly detection: for data with different feature values, different anomaly detection models can be used. In the present invention, the anomaly detection models used include the SR, SVR, Prophet, VAE, and LSTM algorithm models. These models suit anomaly detection scenarios with different requirements for detection time, accuracy, computational cost, and scope of application, enabling efficient anomaly detection.
In one embodiment, anomaly detection and processing of the detected anomalies are based on machine learning and deep learning, comprising the following sub-steps:
Step 3.1: Take the data collected in step 2 as input.
Step 3.2: Clean the collected data.
Step 3.2.1: Remove duplicate records. The collected dataset may contain identical records; the redundant copies carry no value and must be removed. The drop_duplicates() function of the Pandas library can be used for this.
Step 3.2.2: Fill missing values. Collected data is often incomplete and typically contains missing values, which must be handled to make the data more complete. The fillna() function of the Pandas library can replace missing values with a specified feature value.
Step 3.2.3: Data correction. To unify data standards, reduce secondary errors, and mitigate the instability of data collection caused by network fluctuations, the data must be corrected.
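Steps 3.2.1 through 3.2.3 can be sketched with Pandas as follows; the correction rule used here (clipping utilization to [0, 1]) is an assumed example, since the text does not fix a specific correction method:

```python
import pandas as pd

# Raw monitoring records with a duplicate row, a missing GPU-utilization
# value, and an out-of-range reading.
raw = pd.DataFrame({
    "node": ["n1", "n1", "n2", "n3"],
    "ts":   [1, 1, 1, 1],
    "gpu_util": [0.80, 0.80, None, 1.70],
})

cleaned = (
    raw.drop_duplicates()          # 3.2.1: remove duplicate records
       .fillna({"gpu_util": 0.0})  # 3.2.2: fill missing values
       .assign(gpu_util=lambda d: d["gpu_util"].clip(0.0, 1.0))  # 3.2.3: correct (assumed rule)
       .reset_index(drop=True)
)
```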
Step 3.3: Feature extraction. Features are extracted from the cleaned data; the quality of feature extraction directly affects the effectiveness of anomaly detection. Feature extraction focuses on the stability and periodicity of the data. Stability can be tested with the ADF test, a time series stationarity test that checks whether the series has a unit root. Periodicity is judged from the Fourier coefficients.
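A simplified stand-in for this step: rather than a full ADF test, the sketch below uses the variance ratio of first differences as a rough stationarity proxy, and, as the text suggests, takes periodicity from the dominant Fourier coefficient. Both heuristics are assumptions for illustration, not the patent's exact tests:

```python
import numpy as np

def periodicity_score(series):
    """Fraction of spectral energy carried by the dominant nonzero
    Fourier coefficient; near 1 for a strongly periodic series."""
    x = np.asarray(series, dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return spectrum[1:].max() / spectrum[1:].sum()

def is_stationary(series, threshold=0.5):
    """Rough stationarity proxy (not an ADF test): a trending series has
    tiny increments relative to its overall spread, so its ratio
    var(diff) / var(x) is small; a bounded, mean-reverting series keeps
    the ratio large."""
    x = np.asarray(series, dtype=float)
    return np.var(np.diff(x)) / np.var(x) > threshold
```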
Step 3.4: The model routing process is the anomaly-detection-algorithm recommendation process. Commonly used anomaly detection algorithms are based on machine learning and deep learning: the machine-learning-based algorithms mainly include SR, SVR, and Prophet, while the deep-learning-based algorithms include VAE and LSTM; different algorithms suit different data scenarios. The usage scenarios of the algorithms are compared in Table 1.
Table 1. Comparison of anomaly detection algorithms
Using the feature coefficients obtained in the previous step, i.e., the stability and periodicity of the data, the model routing process recommends the most suitable anomaly detection algorithm according to these feature coefficients and other algorithm metrics (such as accuracy, detection time, and detection cost).
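Because the body of Table 1 is not reproduced in this text, the mapping below from (stationarity, periodicity) to an algorithm is an invented illustration of how such a router could be wired, not the patent's actual routing table:

```python
# Illustrative model router. The feature-to-algorithm mapping is an
# assumption for this sketch, not the routing table from the patent.

ROUTING_TABLE = {
    (True,  True):  "SR",       # stable, periodic
    (True,  False): "SVR",      # stable, aperiodic
    (False, True):  "Prophet",  # unstable, periodic
    (False, False): "LSTM",     # unstable, aperiodic
}

def route(stationary: bool, periodic: bool) -> str:
    """Recommend an anomaly detection algorithm from the extracted features."""
    return ROUTING_TABLE[(stationary, periodic)]
```

A fuller router would also weigh the secondary metrics mentioned above (accuracy, detection time, detection cost) rather than only the two feature flags.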
Step 3.5: The detector runs the algorithm recommended by the model router to detect anomalies, retains the data judged normal, and discards the data judged abnormal.
Data monitored in a data center is usually time series data, and each collected point generally carries a timestamp. Time series anomalies take different forms; the most frequent and most widely applicable category is the point anomaly.
A data point is judged anomalous according to the point-anomaly rule:

|x_t − x̂_t| > τ

where x_t is the data feature value at time t, x̂_t is the expected value of the data at time t, and τ is a threshold greater than 0. If, at time t, the absolute difference between the observed feature value and the expected value exceeds the given threshold, the data collected at time t is anomalous and must be de-anomalized.
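A minimal sketch of the point-anomaly rule above, taking a trailing window median as the expected value x̂_t (the text does not fix how the expectation is computed, so this choice is an assumption):

```python
import numpy as np

def point_anomalies(series, window=3, tau=0.5):
    """Flag every point where |x_t - x_hat_t| > tau; x_hat_t is taken as
    the median of the trailing window (an assumed choice of expectation)."""
    x = np.asarray(series, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for t in range(window, len(x)):
        expected = np.median(x[t - window:t])  # robust x_hat_t
        flags[t] = abs(x[t] - expected) > tau
    return flags

series = [1.0, 1.1, 0.9, 1.0, 5.0, 1.0, 1.1]  # 5.0 is an injected spike
flags = point_anomalies(series)
```

Using a median rather than a mean keeps the spike itself from contaminating the expected value of the points that follow it.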
Step 4: Using the de-anomalized data, a resource prediction model and algorithm forecast the resource usage trend. In the present invention, the ARIMA model is used to predict resources and plan their allocation in advance.
ARIMA stands for AutoRegressive Integrated Moving Average. Its basic idea is to use the historical information in the data itself to predict the future: through the autocorrelation and differencing of the data, it extracts the time series patterns hidden in the data and uses them to forecast future values, thereby predicting resource usage trends. The ARIMA model has three components: the autoregressive model AR, the differencing process I, and the moving average model MA.
In one embodiment, the ARIMA modeling flow is shown in FIG. 2. First, the collected time series is input and checked for stationarity; differencing is applied to remove volatility until the required stationarity is reached, and the order of differencing is determined in this process. Once the data is stationary, it is tested for white noise; in this process the AR and MA components are used to reduce the residual to white noise within a given threshold, and the ARIMA model is fitted, with the autoregressive model determining the order p of the autoregressive part and the moving average model determining the order q of the moving average part. When the stationarity and white-noise requirements are met, the ARIMA modeling is complete, and the resource usage trend can be forecast from the de-anomalized data and the parameters obtained during modeling.
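The modeling flow above can be illustrated with a deliberately simplified sketch: one round of differencing (d = 1) followed by a least-squares AR(p) fit, omitting the MA term and the formal stationarity and white-noise tests:

```python
import numpy as np

def fit_ar(series, p=2):
    """Least-squares fit of an AR(p) model Y_t = c + phi_1*Y_{t-1} + ...;
    returns [c, phi_1, ..., phi_p]."""
    x = np.asarray(series, dtype=float)
    rows = [x[t - p:t][::-1] for t in range(p, len(x))]  # lagged regressors
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef

def forecast_next(series, p=2):
    """ARIMA(p,1,0)-style one-step forecast: difference once, predict the
    next increment with AR(p), then integrate back to the original scale."""
    x = np.asarray(series, dtype=float)
    d = np.diff(x)                                  # I: first-order differencing
    coef = fit_ar(d, p)                             # AR on the differenced series
    next_diff = coef[0] + coef[1:] @ d[-1:-p - 1:-1]
    return x[-1] + next_diff                        # undo the differencing

usage = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # linear growth in GPU demand
pred = forecast_next(usage)
```

For the steadily growing series above, the fitted increment is 1.0 per step, so the one-step forecast continues the trend to 16.0.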
The three main components of the ARIMA model are described as follows:
1) The autoregressive model (AR) is a statistical model for analyzing time-series data that describes the relationship between a variable and its past values. For a time series, the order-p form of the AR model can be expressed as:

Yt = c + φ1Yt-1 + φ2Yt-2 + ... + φpYt-p + εt

where Yt is the observed value at time t, c is a constant, φ1, ..., φp are the autoregressive coefficients, Yt-1 is the observed value at time t-1, and the last term εt is the error.
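The AR(p) form above can be illustrated with a minimal one-step-ahead forecast. This is a sketch, not the patent's implementation; the coefficient and series values are made-up examples:

```python
def ar_forecast(history, coeffs, c=0.0):
    """One-step AR(p) forecast: Yt = c + phi_1*Y_{t-1} + ... + phi_p*Y_{t-p}.

    history: observed series, oldest first; coeffs[i] multiplies the
    (i+1)-lagged value.
    """
    p = len(coeffs)
    lags = history[-p:][::-1]  # most recent observation first
    return c + sum(phi * y for phi, y in zip(coeffs, lags))

# Example: AR(2) with phi1=0.5, phi2=0.3, c=1.0 on the series [1.0, 2.0, 4.0]
print(ar_forecast([1.0, 2.0, 4.0], [0.5, 0.3], c=1.0))  # 1 + 0.5*4 + 0.3*2 = 3.6
```

The error term εt is omitted because a point forecast uses its expectation (zero).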
2) The moving average model (MA) describes the relationship between the data at the current time point and past noise. MA is based on the assumption of a white-noise sequence; white noise is a special time-series model in which the data at each time point are independent and identically distributed with constant mean and variance. Given a white-noise sequence ηt, the MA model is defined as:
Yt = μ + ηt + θ1ηt-1 + θ2ηt-2 + ... + θqηt-q
where Yt is the observed value of the time series of interest at time t, μ is the mean (expected value) of the series, ηt, ηt-1, ..., ηt-q are the white-noise terms at the corresponding times, and θ1, θ2, ..., θq are the model parameters, which measure the influence of past white noise on the current time point.
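The MA(q) definition above amounts to a weighted sum of recent noise terms. A minimal sketch, with illustrative values for μ, the noise, and the θ parameters (none taken from the patent):

```python
def ma_value(mu, noise, thetas):
    """MA(q) value: Yt = mu + eta_t + theta_1*eta_{t-1} + ... + theta_q*eta_{t-q}.

    noise: [eta_t, eta_{t-1}, ..., eta_{t-q}], current term first.
    """
    return mu + noise[0] + sum(th * e for th, e in zip(thetas, noise[1:]))

# mu=10, current noise 0.5, past noise [0.2, -0.1], theta=[0.4, 0.3]
print(ma_value(10.0, [0.5, 0.2, -0.1], [0.4, 0.3]))  # 10 + 0.5 + 0.08 - 0.03 = 10.55
```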
3) The differencing process (I) is a mathematical operation that describes the differences between adjacent data points of a numerical sequence. For a time series Yt, the first-order difference is defined as:
ΔYt = Yt - Yt-1
where Yt-1 is the value of the series at time t-1 and ΔYt is the difference between the two consecutive values.
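The differencing step can be sketched directly from the definition; repeated application gives higher-order differencing (the order d referred to above):

```python
def difference(series, order=1):
    """Apply first-order differencing (delta_Yt = Yt - Y_{t-1}) `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(difference([3, 5, 9, 15]))     # [2, 4, 6]
print(difference([3, 5, 9, 15], 2))  # [2, 2] -- a linear-trend series flattens
```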
The specific process is shown in Figure 6; the steps are described as follows:
Step 4.1: Input the historical data, i.e., the de-anomalized data from the previous step.
Step 4.2: Perform preliminary data analysis with the differencing model, the autoregressive model, and the moving average model to obtain the orders to be used by the ARIMA model.
Step 4.2.1: The historical data form a time series whose attributes carry the timestamp of collection. Using the differencing model, compute the differences between adjacent points of the series; this removes data fluctuations to a certain extent.
Step 4.2.2: Use the autoregressive model (AR) to handle the autoregressive part of the historical data and make preliminary predictions based on it, processing the data to make it more stationary. The AR model accounts for the influence of observations from several past periods on the current value.
Step 4.2.3: Use the MA (moving average) model to handle transient, sudden changes or noisy time-series data. The MA model accounts for the influence of past forecast errors on the current value.
Step 4.3: Take the orders obtained in the previous step as the orders of the ARIMA model, run the ARIMA model to make predictions, and obtain the forecast results.
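Steps 4.1-4.3 can be sketched end to end with a deliberately tiny ARIMA(1,1,0)-style pipeline: difference once, fit the single AR coefficient on the differenced series by least squares, forecast one step, then integrate the differencing back out. This is an illustrative toy, not the patent's fitting procedure, and the sample series is invented:

```python
def arima_110_forecast(series):
    """One-step ARIMA(1,1,0)-style forecast of a series (oldest value first)."""
    d = [b - a for a, b in zip(series, series[1:])]  # order-1 differencing
    x, y = d[:-1], d[1:]                             # lag-1 regression pairs
    phi = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    next_diff = phi * d[-1]                          # AR(1) forecast of delta-Y
    return series[-1] + next_diff                    # undo the differencing

usage = [10, 12, 15, 19, 24]      # e.g. hypothetical GPU utilisation samples
print(arima_110_forecast(usage))  # about 30.55: the upward trend is extrapolated
```

A production implementation would instead fit all three orders (p, d, q), e.g. with a statistics library, but the integrate-forecast-differentiate structure is the same.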
Step 5: Schedule GPU resources based on the prediction results.
The specific process is shown in Figure 7; the steps are described as follows:
Step 5.1: Through the scheduler of each heterogeneous platform, the scheduling module obtains the state of the GPU resources, including GPU usage, GPU type, GPU count, and which heterogeneous platform each GPU belongs to.
Step 5.2: The scheduling module records the GPU and other resources required by each task, records each task's status, and monitors task completion.
Step 5.3: Based on the resource prediction results of step 4 and the current GPU usage, determine whether enough GPU resources are available to meet the task's requirements. If not, wait until sufficient GPU resources can be allocated; if so, proceed to step 5.4.
Step 5.4: According to its GPU resource record table, the scheduling module allocates the required GPU resources to the task by calling the scheduler of each platform; that is, the GPU resources required by a single task may come from different platforms.
Step 5.5: The scheduling module monitors the running status of the job. When the job finishes running, the module calls the scheduler of each platform to release the GPU resources and return them to the GPU resource pool.
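Steps 5.1-5.5 can be sketched as a small scheduling module that pools free GPUs per platform, grants a task's demand across platforms when possible, and returns the grant to the pool when the job finishes. All names (`Scheduler`, the platform labels, the greedy policy) are illustrative assumptions, not the patent's implementation:

```python
class Scheduler:
    def __init__(self, pools):
        self.pools = dict(pools)  # platform name -> free GPU count

    def can_satisfy(self, demand):
        """Step 5.3: is the pooled free capacity enough for the task?"""
        return sum(self.pools.values()) >= demand

    def allocate(self, demand):
        """Step 5.4: greedy cross-platform grant; returns {platform: gpus}."""
        grant, need = {}, demand
        for name, free in self.pools.items():
            take = min(free, need)
            if take:
                grant[name] = take
                self.pools[name] -= take
                need -= take
            if need == 0:
                break
        return grant

    def release(self, grant):
        """Step 5.5: return the granted GPUs to the resource pool."""
        for name, gpus in grant.items():
            self.pools[name] += gpus

sched = Scheduler({"cloud": 2, "ai": 3, "hpc": 4})
grant = sched.allocate(4)          # spans the cloud and AI platforms
print(grant)                       # {'cloud': 2, 'ai': 2}
sched.release(grant)
print(sum(sched.pools.values()))   # 9 -- pool restored after the job finishes
```

A real scheduler would also queue waiting tasks (the "wait" branch of step 5.3) and track per-task records; this sketch shows only the pool accounting.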
This embodiment also provides a GPU resource scheduling system based on a heterogeneous computing platform, comprising the following modules:
a GPU resource pool construction module, configured to treat GPUs as general-purpose resources and pool the GPU resources scattered across the heterogeneous computing platforms into a GPU resource pool for the scheduler to schedule;
a heterogeneous platform monitoring data collection module, configured to collect monitoring data from the heterogeneous platforms and provide data support for anomaly detection and predictive scheduling;
an abnormal data detection and processing module, configured to inspect the data and process any detected anomalies;
a trend prediction module, configured to predict each platform's resource usage trend and allocate resources to the platform in advance, further improving GPU utilization;
a resource scheduling module, configured to schedule resources among the different computing platforms according to the prediction results.
The GPU resources of the different heterogeneous computing platforms are relatively scattered, so the GPU resources of these heterogeneous platforms must be pooled into a GPU resource pool, with a scheduler performing unified scheduling and allocation of the pool in order to make full use of the GPUs. The present invention can use the data tools of the different heterogeneous computing platforms to collect data. The three distributed cluster scheduling platforms commonly found in data centers are the cloud computing platform, the artificial intelligence platform, and the high-performance computing platform, and the components each platform provides are used for data collection: the cloud computing platform mainly uses an OpenStack cluster to collect data, the artificial intelligence platform mainly uses a Kubernetes cluster, and the high-performance computing platform uses a Slurm cluster.
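The per-platform collection described above can be sketched as a dispatch table mapping each platform to its collector. The collector bodies here are stand-in stubs, not real OpenStack, Kubernetes, or Slurm client calls:

```python
# Stub collectors standing in for the platform components named in the text.
def collect_openstack():
    return {"platform": "cloud", "gpu_util": 0.61}   # illustrative value

def collect_kubernetes():
    return {"platform": "ai", "gpu_util": 0.83}      # illustrative value

def collect_slurm():
    return {"platform": "hpc", "gpu_util": 0.45}     # illustrative value

COLLECTORS = {
    "cloud": collect_openstack,
    "ai": collect_kubernetes,
    "hpc": collect_slurm,
}

def collect_all():
    """Gather one monitoring sample from every registered platform."""
    return [fn() for fn in COLLECTORS.values()]

print([m["platform"] for m in collect_all()])  # ['cloud', 'ai', 'hpc']
```

In practice each stub would wrap the platform's own API (e.g. an OpenStack telemetry query, the Kubernetes metrics endpoint, or Slurm accounting commands).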
The data collected from the different platforms are diverse and complex, so the heterogeneous platform data have a high degree of anomaly and widely differing feature values. For the collected data, the present invention first cleans the data and filters out data that do not meet the requirements, uses a model to extract the stability and periodicity feature values of each application platform's data, and then has a model router select a matching anomaly algorithm according to those feature values. Commonly used anomaly detection algorithms are mainly based on machine learning and deep learning; they detect anomalies in the data and process the detected anomalies.
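The model-routing idea can be sketched with a single extracted feature: a series whose coefficient of variation is low (stable) is routed to a simple detector, otherwise to a heavier one. The feature, threshold, and detector names are illustrative assumptions, not the patent's routing rule:

```python
import statistics

def route_detector(series):
    """Route a series to an anomaly detector by its stability feature
    (coefficient of variation); the 0.3 threshold is an assumption."""
    cv = statistics.pstdev(series) / abs(statistics.mean(series))
    return "3-sigma" if cv < 0.3 else "isolation-forest"

print(route_detector([10, 11, 10, 12, 11]))  # stable series -> "3-sigma"
print(route_detector([1, 30, 2, 45, 3]))     # volatile series -> "isolation-forest"
```

A fuller router would also extract the periodicity feature mentioned in the text (e.g. via autocorrelation) and consult both features when choosing the algorithm.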
The resource scheduler selects a suitable node; the present invention allocates GPU resources according to the number and type of GPUs required by the task and migrates the task to a node that satisfies the allocation for execution. If another task's GPU requirements match those of the current task, the two tasks can time-share that node's GPU resources, thereby sharing GPUs and improving GPU utilization.
It should be understood that the embodiments described above are only some, not all, of the embodiments of the present invention. In addition, the technical features of the various embodiments, or of a single embodiment, provided by the present invention may be combined with one another arbitrarily to form a feasible technical solution; such combination is not constrained by the order of steps or by the structural composition, but must be implementable by a person of ordinary skill in the art. When a combination of technical solutions is contradictory or infeasible, such combination shall be deemed not to exist and falls outside the scope of protection claimed by the present invention.
It should be understood that the above detailed description of the preferred embodiments shall not be regarded as limiting the scope of patent protection of the present invention. Under the teaching of the present invention, a person of ordinary skill in the art may make substitutions or modifications without departing from the scope protected by the claims of the present invention, and all such substitutions and modifications fall within the scope of protection of the present invention; the claimed scope of the present invention shall be defined by the appended claims.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410178435.9A CN118093170A (en) | 2024-02-09 | 2024-02-09 | A GPU resource scheduling method and system based on heterogeneous computing platform |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118093170A true CN118093170A (en) | 2024-05-28 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118567853A (en) * | 2024-05-31 | 2024-08-30 | 国网江苏省电力有限公司信息通信分公司 | A cloud resource management scheduling method, device, equipment and medium based on deep reinforcement learning |
| CN118972302A (en) * | 2024-08-01 | 2024-11-15 | 中国电信股份有限公司 | Heterogeneous GPU computing power cluster monitoring method, device, electronic device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||