CN116302404B - Resource decoupling data center-oriented server non-perception calculation scheduling method - Google Patents
Resource decoupling data center-oriented server non-perception calculation scheduling method
- Publication number: CN116302404B
- Application number: CN202310149359.4A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Description
Technical Field
The present invention relates to the field of data center scheduling, and in particular to a serverless computing scheduling method for resource-decoupled data centers.
Background
Serverless computing is a new-generation cloud computing paradigm that lets developers build applications from serverless functions without attending to the details of underlying resource management. Serverless computing adopts a resource-decoupled architecture that separates compute and storage into two independent resource pools: server nodes in the compute pool fetch data from the storage pool over the network and perform the computation. Because the two pools are independent of each other, the resource-decoupled architecture offers good scalability and high resource utilization.
However, the network overhead of remote data access is often non-negligible. For IO-intensive tasks (IO denotes data input/output), moving data from storage nodes to compute nodes may require multiple RTTs (round-trip times) or consume substantial bandwidth. This problem can be mitigated by storage-side computing, in which an RPC (Remote Procedure Call) invokes a procedure registered on the storage node. However, because the storage resource pool is provisioned for storage, its computing resources are usually limited and cannot satisfy the computational demand of all tasks. The prior art schedules the workload to storage and compute nodes according to a fixed allocation ratio so that both resource pools are fully utilized. Its limitation is that the scheduling decision is made only at the workload level, i.e., a single uniform allocation ratio is computed for the entire workload; this ignores the properties of the individual tasks, leading to poor scheduling quality and inefficient system operation.
Summary of the Invention
In view of this, embodiments of the present invention provide a serverless computing scheduling method for resource-decoupled data centers. It aims to assign each type of task to nodes that match the type's runtime characteristics, thereby improving task execution efficiency and system resource utilization, and in turn system throughput.
A first aspect of the embodiments of the present invention provides a serverless computing scheduling method for resource-decoupled data centers, applied to a scheduler. The method includes:
determining the corresponding task type from a received task-request RPC;
determining, from the task type, the allocation ratio corresponding to that task type; and
according to the allocation ratio, dispatching tasks of that task type to compute nodes or storage nodes for execution with the corresponding probability.
Optionally, determining the allocation ratio corresponding to the task type includes:
when the task type is a first task type that has not been processed before, setting the allocation ratio of the first task type to a default allocation ratio.
Dispatching tasks of the task type to compute nodes or storage nodes for execution with the corresponding probability then includes:
dispatching tasks of the first task type to compute nodes or storage nodes at the default allocation ratio, recording the execution costs of a preset number of first-task-type tasks on the compute nodes and storage nodes, and thereby determining the runtime characteristics of the first task type;
determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into a preset scheduling algorithm; and
dispatching tasks of the first task type to compute nodes or storage nodes for execution with the corresponding probability, according to the optimal allocation ratio of the first task type.
When the task type is a second task type that has been processed before, the optimal allocation ratio of the second task type, as determined by the preset scheduling algorithm, is retrieved.
Dispatching tasks of the task type to compute nodes or storage nodes for execution with the corresponding probability then includes:
dispatching tasks of the second task type to compute nodes or storage nodes for execution with the corresponding probability, according to the optimal allocation ratio of the second task type.
Optionally, the runtime characteristics include the network overhead of data transmission when a compute node executes the task, and the CPU overhead when a storage node executes the task.
Optionally, determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into the preset scheduling algorithm includes:
determining the relative cost of the first task type from the runtime characteristics;
sorting the first task type and the second task types by relative cost;
determining the optimal value of the split-point parameter via a sub-algorithm of the preset scheduling algorithm; and
determining the optimal allocation ratio of the first task type from the sorted order and the optimal split-point value.
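The split-point scheme above can be sketched as follows. This is a minimal illustration under assumptions the text does not spell out: the relative cost is taken to be s_i/c_i, task types are sorted with the types cheapest on storage first, and the split point is a fractional index into that order. All names and the tie-breaking at the boundary are illustrative, not the patent's algorithm.

```python
def allocation_ratios(features, split_point):
    """Derive per-type storage-side allocation ratios from a sorted
    order and a split-point parameter.  `features` maps a task type to
    its runtime pair (c_i, s_i): compute-node cost and storage-node
    cost.  Types wholly before the split point run on storage nodes
    (ratio 1.0), types after it on compute nodes (ratio 0.0), and the
    type straddling the split gets the fractional remainder."""
    # Sort by relative cost s_i / c_i: cheap-on-storage types first.
    order = sorted(features, key=lambda t: features[t][1] / features[t][0])
    ratios, position = {}, 0.0
    for t in order:
        if position + 1.0 <= split_point:
            ratios[t] = 1.0                      # fully on storage nodes
        elif position >= split_point:
            ratios[t] = 0.0                      # fully on compute nodes
        else:
            ratios[t] = split_point - position   # boundary type: split
        position += 1.0
    return ratios

feats = {"scan": (10.0, 2.0), "train": (2.0, 10.0), "mix": (5.0, 5.0)}
ratios = allocation_ratios(feats, split_point=1.5)
# "scan" goes fully to storage, "train" fully to compute, "mix" is split
```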
Optionally, the method further includes:
adjusting the system throughput with a rate limiter, based on the end-to-end latency observed during task execution and the service-level objective given by the user; and
adjusting the scheduler's split-point parameter according to the system throughput, so as to maximize it.
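The limiter's feedback loop can be sketched as follows; the AIMD-style rule, the step sizes, and the function name are assumptions for illustration, not taken from the patent:

```python
def adjust_rate(current_rate, p99_latency, slo_latency, step=0.1):
    """Back off the admitted request rate when observed end-to-end
    latency violates the user's service-level objective; probe upward
    gently otherwise."""
    if p99_latency > slo_latency:
        return current_rate * (1.0 - step)       # latency too high: shed load
    return current_rate * (1.0 + step / 2.0)     # SLO met: admit more

rate = adjust_rate(1000.0, p99_latency=120.0, slo_latency=100.0)
# the admitted rate drops from 1000 to 900 requests/s
```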
Optionally, the method further includes:
computing, for each task type, the difference between its runtime characteristics in the current time window and the moving average of its historical runtime characteristics;
resetting to the default allocation ratio any third task type whose difference exceeds a first threshold;
dispatching tasks of the third task type to compute nodes or storage nodes at the default allocation ratio, recording the new execution costs of a preset number of such tasks on the compute and storage nodes, and thereby determining new runtime characteristics;
determining the relative cost of the third task type by feeding the new runtime characteristics into the preset scheduling algorithm;
sorting all task types by their relative costs;
determining the optimal value of the split-point parameter via the sub-algorithm of the preset scheduling algorithm; and
determining the optimal allocation ratios of all task types from the sorted order and the optimal split-point value.
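The change-detection step that triggers re-profiling can be sketched as follows; the exponential moving average and its smoothing factor are assumptions (the patent specifies only a sliding average of historical runtime characteristics), and all names are illustrative:

```python
def update_ema(ema, sample, alpha=0.2):
    """Exponential moving average of a type's historical cost; the
    smoothing factor alpha is an illustrative choice."""
    return (1.0 - alpha) * ema + alpha * sample

def drift_detected(window_cost, ema, threshold):
    """True when the current window's cost deviates from the historical
    average by more than the first threshold, in which case the task
    type falls back to the default allocation ratio and is re-profiled."""
    return abs(window_cost - ema) > threshold

ema = update_ema(5.0, 10.0)                       # history 5.0, sample 10.0 -> 6.0
redo = drift_detected(9.0, ema, threshold=2.0)    # |9.0 - 6.0| > 2.0 -> re-profile
```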
Optionally, the method further includes:
monitoring whether the rate limiter is dropping requests;
while the rate limiter is dropping requests, increasing the capacity of the compute resource pool in steps until it no longer drops requests; and
while the rate limiter is not dropping requests, reducing the number of compute nodes in steps according to the system load.
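The stepwise elastic scaling can be sketched as follows; the step size and the 0.5 low-load mark are assumptions for illustration:

```python
def resize_pool(n_compute, limiter_dropping, load_factor,
                step=2, min_nodes=1):
    """Grow the compute pool while the rate limiter is still dropping
    requests; shrink it when the system load is light; otherwise hold."""
    if limiter_dropping:
        return n_compute + step                  # scale out one step
    if load_factor < 0.5 and n_compute > min_nodes:
        return max(min_nodes, n_compute - step)  # scale in one step
    return n_compute                             # hold steady

grown = resize_pool(8, limiter_dropping=True, load_factor=0.9)    # 8 -> 10
shrunk = resize_pool(8, limiter_dropping=False, load_factor=0.3)  # 8 -> 6
```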
Optionally, the method further includes:
determining the actual load of each storage node;
when the difference between a storage node's actual load and the system's average load exceeds a second threshold, moving that storage node into a new, independent scheduling policy group;
assigning a preset number of compute nodes to the new policy group; and
running each policy group as an independent subsystem for task scheduling and execution.
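The hotspot-isolation step can be sketched as follows; the node ids and the grouping layout are illustrative. A preset number of compute nodes would then be attached to each isolated group, and each group would run as an independent scheduling-and-execution subsystem:

```python
def split_policy_groups(storage_loads, avg_load, threshold):
    """A storage node whose load deviates from the system average by
    more than the second threshold is moved into its own scheduling
    policy group; the rest stay in the default group."""
    isolated = [node for node, load in storage_loads.items()
                if abs(load - avg_load) > threshold]
    default = [node for node in storage_loads if node not in isolated]
    return {"default": default, "isolated": isolated}

loads = {"s1": 0.9, "s2": 0.4, "s3": 0.5}
groups = split_policy_groups(loads, avg_load=0.5, threshold=0.3)
# "s1" is isolated into its own group; "s2" and "s3" stay in the default group
```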
Embodiments of the present invention have the following advantages:
The serverless computing scheduling method for resource-decoupled data centers provided by the embodiments determines a task's type from the received task-request RPC and determines the allocation ratio for that type (for example, compute-intensive tasks are entirely or mostly assigned to compute nodes, whose computing resources are ample, with the remainder going to storage nodes; IO-intensive tasks are entirely or mostly assigned to storage nodes, whose IO efficiency is higher, with the small remainder going to compute nodes). According to the determined allocation ratio, tasks of that type are dispatched to a compute node or a storage node with a probability equal to the ratio. Because the allocation ratio is determined per task type, tasks of each type execute on nodes matched to their runtime characteristics, which improves task execution efficiency and system resource utilization and, in turn, system throughput.
Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a flow chart of a serverless computing scheduling method for resource-decoupled data centers according to an embodiment of the present invention;
Figure 2 is a module diagram of a system implementing the serverless computing scheduling method for resource-decoupled data centers according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the scope of protection of the present invention.
Before the invention is described, its background is explained. In practical scenarios, failing to distinguish task types and scheduling only at the level of the overall workload (the set of all tasks) leads to low scheduling accuracy and inefficient system operation. The invention observes that in practice the tasks a system executes are of diverse types, and tasks of different types have different runtime characteristics. Assigning each task to a node matched to its runtime characteristics (e.g., placing compute-intensive tasks on compute nodes with ample computing power, and IO-intensive tasks on storage nodes with higher IO efficiency to avoid repeated RTTs and bulk data transfers) exploits the respective strengths of both kinds of nodes, improving task execution efficiency and resource utilization and, in turn, the system's task throughput.
Figure 1 is a flow chart of a serverless computing scheduling method for resource-decoupled data centers according to an embodiment of the present invention. Referring to Figure 1, the method, applied to a scheduler, includes:
Step S11: determining the corresponding task type from the received task-request RPC;
Step S12: determining, from the task type, the allocation ratio corresponding to that task type;
Step S13: according to the allocation ratio, dispatching tasks of that task type to compute nodes or storage nodes for execution with the corresponding probability.
In an embodiment of the present invention, the user submits pre-written application code to the system; the system compiles it and stores it on both the compute nodes and the storage nodes, with each particular program corresponding to one task type. The user then submits task invocation requests to the scheduler as RPCs. The scheduler is responsible for dispatching tasks to compute or storage nodes for execution. Upon receiving a task-request RPC, it parses the RPC to identify the corresponding task type, looks up the allocation ratio for that type, and schedules the current request to a storage node with probability equal to that ratio. Concretely, if a task type's allocation ratio is 60%, each of its tasks has a 60% probability of being scheduled to a storage node and a 40% probability of being scheduled to a compute node. In the present invention, the allocation ratio denotes the probability of a task being scheduled to a storage node; in practice, whether the ratio denotes the storage-node or the compute-node probability is a free choice. Macroscopically, by the law of large numbers, the ratio between the numbers of a type's tasks executed on storage nodes and on compute nodes converges to that type's allocation ratio. Continuing the example, if the user submits 10,000 tasks of this type in total, roughly 6,000 (about 60%) execute on storage nodes and roughly 4,000 (about 40%) on compute nodes; hence the 60% figure is called the allocation ratio.
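The per-request probabilistic dispatch described above can be sketched as follows; this is a minimal illustration assuming one uniform random draw per request (function and type names are hypothetical):

```python
import random

def dispatch(task_type, alloc_ratio, rng):
    """Route one task: with probability alloc_ratio to a storage node,
    otherwise to a compute node (the ratio is the storage-side share)."""
    return "storage" if rng.random() < alloc_ratio else "compute"

# By the law of large numbers, the empirical split converges to the ratio:
rng = random.Random(42)  # fixed seed so the sketch is repeatable
counts = {"storage": 0, "compute": 0}
for _ in range(10000):
    counts[dispatch("example-type", 0.6, rng)] += 1
# counts["storage"] ends up close to 6000, counts["compute"] close to 4000
```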
The scheduler jointly computes the allocation ratio of every task type submitted within the current preset time window. This runs asynchronously in the scheduler's background, i.e., computing the per-type allocation ratios does not interfere with the scheduler's dispatching of current tasks. When an actual task-request RPC arrives, the scheduler identifies its task type and dispatches the task to a compute or storage node for execution according to that type's allocation ratio.
The allocation ratio of each task type is computed by a preset scheduling algorithm inside the scheduler; its concrete implementation is described in subsequent embodiments.
The serverless computing scheduling method for resource-decoupled data centers provided by the embodiments thus determines the task type from the received task-request RPC, determines the corresponding allocation ratio from the task type, and dispatches tasks of that type to compute or storage nodes with the corresponding probability. Setting a separate allocation ratio per task type lets each type of task execute on nodes matched to its runtime characteristics (compute-intensive tasks entirely or mostly on compute nodes with ample computing resources, the remainder on storage nodes; IO-intensive tasks entirely or mostly on storage nodes with higher IO efficiency, the small remainder on compute nodes), improving task execution efficiency and resource utilization and, in turn, the system's task throughput.
In the present invention, determining the allocation ratio corresponding to the task type includes: when the task type is an unprocessed first task type, setting its allocation ratio to the default allocation ratio. Dispatching tasks of the task type to compute nodes or storage nodes with the corresponding probability then includes: dispatching tasks of the first task type at the default allocation ratio, recording the execution costs of a preset number of first-task-type tasks on the compute and storage nodes, and thereby determining the runtime characteristics of the first task type; determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into the preset scheduling algorithm; and dispatching tasks of the first task type to compute or storage nodes with the corresponding probability according to that optimal ratio. When the task type is a processed second task type, the optimal allocation ratio previously determined for it by the preset scheduling algorithm is retrieved, and tasks of the second task type are dispatched to compute or storage nodes for execution with the corresponding probability according to that ratio.
In an embodiment of the present invention, one way to determine the allocation ratio from the task type is: when the task type of a received request is a first task type that the system has not processed, tasks of that type are dispatched to compute or storage nodes at the default allocation ratio.
In an embodiment, the default allocation ratio is preferably 50%; it should be understood that other default values are possible and no specific limitation is imposed here. Dispatching unprocessed first-task-type tasks at the default ratio serves to measure their runtime characteristics on both compute and storage nodes, so that the optimal allocation ratio of the first task type can then be computed from those characteristics.
Once the number of first-task-type tasks dispatched to compute nodes reaches a preset count, their execution cost on the compute nodes is aggregated; likewise, once the number dispatched to storage nodes reaches the preset count, their execution cost on the storage nodes is aggregated. Together, the two aggregates constitute the runtime characteristics of the first task type. The preset count can be set according to the actual application scenario and is not specifically limited here.
In an embodiment, after the runtime characteristics of the first task type have been collected on both compute and storage nodes, the two are fed into the preset scheduling algorithm to obtain the type's optimal allocation ratio. Tasks of the first task type are then dispatched to compute or storage nodes for execution with the corresponding probability according to that ratio.
In an embodiment, when the task type is a second task type the system has already processed, the scheduler has previously collected that type's runtime characteristics on compute and storage nodes and computed its optimal allocation ratio with the preset scheduling algorithm; it therefore only needs to dispatch tasks of the second task type to compute or storage nodes for execution with the corresponding probability according to that ratio.
在本发明中,所述运行时特征包括:计算节点执行任务时的数据传输的网络开销,以及,存储节点执行任务时的CPU开销。In the present invention, the runtime characteristics include: network overhead of data transmission when the computing node performs the task, and CPU overhead when the storage node performs the task.
在本发明的实施例中,一类任务的运行时特征指的是一个二元组(ci,si),其中下标i为任务类型的编号,ci是第i个任务类型在计算节点上的执行代价,si是第i个任务类型在存储节点上的执行代价。任务在计算节点和存储节点上的执行代价可以表示一种特定资源的开销(如CPU开销,网络开销等),也可以是由多种资源按照某种启发式进行加权得到的抽象执行代价。In the embodiment of the present invention, the runtime characteristics of a type of task refer to a tuple ( ci , si ) , where the subscript i is the number of the task type, and c i is the i-th task type in the calculation The execution cost on the node, s i is the execution cost of the i-th task type on the storage node. The execution cost of a task on computing nodes and storage nodes can represent the cost of a specific resource (such as CPU overhead, network overhead, etc.), or it can be an abstract execution cost weighted by multiple resources according to a certain heuristic.
In an embodiment of the present invention, because the invention supports elastic scaling of the computing resource pool, the CPUs of the computing nodes do not become a performance bottleneck; the execution cost of a task on a computing node is therefore preferably the network overhead of data transfer. The number of storage nodes, by contrast, is determined by the amount of data the application must store persistently and usually does not fluctuate significantly over short periods, so the CPU resources of the storage nodes are limited; the execution cost of a task on a storage node is therefore preferably its CPU overhead. In this setting, c_i denotes the network overhead of data transfer when a task of the i-th type executes on a computing node, and s_i denotes the CPU overhead incurred when a task of the i-th type executes on a storage node.
After a task of the i-th type executes on a computing node or a storage node and the corresponding c_i or s_i is measured, the value is fed back to the scheduler. The scheduler aggregates and averages the c_i and s_i reported by all tasks of the i-th type to obtain the overall runtime characteristics (c_i, s_i) of that task type. That is, for a task type as a whole, c_i and s_i in the runtime-characteristic pair are mathematical expectations, whereas for a single task they are concrete execution costs. By feeding the overall runtime characteristics (c_i, s_i) of the i-th task type into the preset scheduling algorithm, the scheduler can then determine the optimal allocation ratio of that task type.
In an embodiment of the present invention, the runtime characteristics of a task change dynamically. To cope with these dynamics, the runtime characteristics c_i and s_i recorded by the scheduler are maintained as moving averages.
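The moving-average bookkeeping described above can be sketched as follows. This is only an illustration: the exponential decay factor ALPHA, the class name, and the method names are assumptions, since the source does not specify how the moving average is computed.

```python
# Sketch (assumed details): tracking per-task-type runtime characteristics
# (c_i, s_i) with an exponential moving average, one way to realize the
# "moving average" described in the text. ALPHA is illustrative.

ALPHA = 0.2  # weight of the newest sample; value not given in the source

class RuntimeStats:
    """Moving-average estimate of (c_i, s_i) for one task type."""

    def __init__(self):
        self.c = None  # network cost when executed on compute nodes
        self.s = None  # CPU cost when executed on storage nodes

    def feed_compute(self, cost):
        self.c = cost if self.c is None else ALPHA * cost + (1 - ALPHA) * self.c

    def feed_storage(self, cost):
        self.s = cost if self.s is None else ALPHA * cost + (1 - ALPHA) * self.s

stats = RuntimeStats()
for sample in (10.0, 12.0, 11.0):
    stats.feed_compute(sample)
```

A single shared decay factor keeps old samples relevant while letting the estimate track the drifting workload the text describes.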
In the present invention, determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into the preset scheduling algorithm comprises: feeding the runtime characteristics into the preset scheduling algorithm to determine the relative cost of the first task type; sorting the first task type and the second task types by relative cost; determining the optimal value of the split-point parameter through a sub-algorithm of the preset scheduling algorithm; and determining the optimal allocation ratio of the first task type from the sorting result and the optimal split-point parameter.
In an embodiment of the present invention, the proportions of the n task types in the system's workload are P = {p_0, p_1, ..., p_{n-1}} with Σ_i p_i = 1, where p_i is the proportion of the i-th task type and i is numbered from 0. The total throughput of the system is R, so the throughput of the i-th task type is p_i R. The allocation ratios of the n task types form a vector X = {x_0, x_1, ..., x_{n-1}}, where x_i ∈ [0, 1] is the fraction of the i-th task type scheduled to the storage nodes (equivalently, the probability that a task of the i-th type is scheduled to a storage node). T_i is the tail latency of the i-th task type (for example, the 99th-percentile latency of its tasks), and t_i is the latency SLO (Service Level Objective) of the i-th task type, which constrains T_i: the tail latency T_i must not exceed the latency SLO.
In an embodiment of the present invention, given the proportion of the i-th task type among all task types, T_i is a function of R and X: on one hand latency is positively correlated with system throughput, and on the other hand, because of resource sharing, a task is also affected by the other tasks scheduled to the same node. The goal of the present invention is to maximize the total system throughput while meeting the latency SLO of every task type. The problem can be formalized as formulas (1) and (2):

max_X R #(1)

s.t. T_i ≤ t_i, i = 0, 1, ..., n-1 #(2)
For this problem the unknown X = {x_0, x_1, ..., x_{n-1}} is an n-dimensional vector, so the search space is too large to solve directly. The present invention uses mathematical properties from queueing theory to transform the original problem given by formulas (1) and (2) into an equivalent set of expressions, and on that basis derives that the optimal solution has a one-cut structure, which greatly shrinks the search space. Specifically, since a task is scheduled either to a computing node or to a storage node for execution, the tail latency T_i is decomposed into two parts, T_i^C and T_i^S, the tail latencies of the i-th task type when executed on computing nodes and on storage nodes respectively. Given the allocation ratio x_i of the i-th task type, the relationship between the overall tail latency T_i and T_i^C, T_i^S is expressed by formulas (3) to (5):
T_i = T_i^C, if x_i = 0 #(3)

T_i = T_i^S, if x_i = 1 #(4)

T_i ≤ max{T_i^C, T_i^S}, if x_i ∈ (0, 1) #(5)
Therefore, to satisfy the latency-SLO constraint in formula (2), one can equivalently impose constraints on T_i^C and T_i^S separately.
Further, according to queueing theory, T_i^C and T_i^S can be related to the load on the system; the advantage is that the system load can be written as a closed-form expression. For the runtime characteristics (c_i, s_i) of the i-th task type, the embodiments above take the execution cost on computing nodes to be the network overhead of data transfer, and the execution cost on storage nodes to be the CPU overhead. Let C denote the capacity of the network bandwidth between the computing nodes and the storage nodes, and S the CPU capacity of the storage resource pool. The loads on the network bandwidth and on the storage-node CPUs can then be defined as ρ_C = (1/C) Σ_i p_i R (1 - x_i) c_i and ρ_S = (1/S) Σ_i p_i R x_i s_i respectively, i.e., the ratio between resource occupancy and resource capacity.
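As a numeric illustration of these load definitions, the sketch below computes the two loads from an allocation vector. All workload numbers and capacities are invented for the example, and x_i is taken, as above, to be the fraction sent to storage nodes, so computing nodes serve the remaining 1 - x_i.

```python
# Illustrative computation of rho_C and rho_S: resource occupancy divided by
# resource capacity, summed over task types. Numbers are made up.

def loads(p, R, c, s, x, C, S):
    """rho_C: network load from tasks kept on compute nodes (fraction 1 - x_i);
    rho_S: storage-CPU load from tasks sent to storage nodes (fraction x_i)."""
    rho_c = sum(pi * R * (1 - xi) * ci for pi, xi, ci in zip(p, x, c)) / C
    rho_s = sum(pi * R * xi * si for pi, xi, si in zip(p, x, s)) / S
    return rho_c, rho_s

p = [0.5, 0.5]    # workload mix over two task types
R = 100.0         # total throughput
c = [2.0, 4.0]    # network cost per task on compute nodes
s = [1.0, 8.0]    # CPU cost per task on storage nodes
rho_c, rho_s = loads(p, R, c, s, x=[1.0, 0.0], C=1000.0, S=500.0)
```

With type 0 fully on storage and type 1 fully on compute, only type 1 contributes to the network load and only type 0 to the storage-CPU load.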
In an embodiment of the present invention, when the arrival process of task requests (RPCs) is a Poisson process, queueing theory implies that T_i^C is a monotonically increasing function of ρ_C and T_i^S is a monotonically increasing function of ρ_S. T_i^C and T_i^S can therefore be adjusted by adjusting the load on the system. On this basis, the present invention further transforms the original problem of formulas (1) and (2) into the problem given by formulas (6) to (10):

max_{X, R} R #(6)

ρ_C = (1/C) Σ_i p_i R (1 - x_i) c_i #(7)

ρ_S = (1/S) Σ_i p_i R x_i s_i #(8)

T_i^C ≤ t_i^C, i = 0, 1, ..., n-1 #(9)

T_i^S ≤ t_i^S, i = 0, 1, ..., n-1 #(10)
Here t_i^C and t_i^S are the equivalent latency SLOs for T_i^C and T_i^S respectively, i.e., separate latency constraints on T_i^C and T_i^S. The existence of t_i^C and t_i^S can be proven mathematically, but their actual values usually cannot be obtained by splitting the latency SLO given by the user. The main significance of introducing these two equivalent latency constraints is that the original problem of formulas (1) and (2) can be transformed into an optimization over ρ_C and ρ_S.
Specifically, first the allocation vector X is adjusted to balance ρ_C and ρ_S as far as possible; second, since T_i^C and T_i^S are monotone in ρ_C and ρ_S respectively, adjusting the total throughput R makes the equivalent latency constraints in formulas (9) and (10) satisfiable; finally, by the equivalence given in formulas (3) to (5), the satisfiability of the two equivalent latency constraints translates back into the satisfiability of the constraint in formula (2) of the original problem, guaranteeing that the tail latency of a task does not exceed the latency SLO given by the user.
In a real system, the rate limiter inside the scheduler only measures the overall tail latency T_i. The limiter gradually increases (or decreases) the total throughput of the system until T_i reaches the latency SLO given by the user. Taking an increase in throughput as an example, in terms of formulas (6) to (10) this process corresponds to the limiter, under the current allocation vector X, raising the total throughput R so that ρ_C and ρ_S rise, whereupon one of the constraints in formulas (9) and (10) is the first to become tight, which in turn causes the measured statistic T_i to reach the user-given latency SLO.
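A minimal sketch of one limiter step is shown below. The additive-increase, multiplicative-decrease rule and its step sizes are assumptions for illustration; the source only says the limiter raises or lowers R until the tail latency reaches the SLO.

```python
# Sketch of the limiter's inner-loop step: admit more load while the observed
# tail latency is below the SLO, back off once it is exceeded. The step size
# and backoff factor are illustrative assumptions.

def adjust_throughput(R, tail_latency, slo, step=1.0, backoff=0.9):
    if tail_latency < slo:
        return R + step   # headroom left: admit more load
    return R * backoff    # SLO violated: shed load

R = 100.0
R = adjust_throughput(R, tail_latency=8.0, slo=10.0)   # below SLO: R grows
R = adjust_throughput(R, tail_latency=12.0, slo=10.0)  # above SLO: R shrinks
```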
Based on this reasoning, embodiments of the present invention solve the original problem of formulas (1) and (2) by optimizing ρ_C and ρ_S. From formulas (6) to (10) and the monotonicity of T_i^C and T_i^S in ρ_C and ρ_S, the optimal solution should make ρ_C and ρ_S as small as possible; specifically, only Pareto-optimal pairs (ρ_C, ρ_S) need be considered. In other words, for a fixed ρ_C the optimal solution should minimize ρ_S, and conversely, for a fixed ρ_S it should minimize ρ_C. Consider the initial situation in which all tasks are scheduled to computing nodes, i.e., x_i = 0 for all i and hence ρ_S = 0, and then gradually move some tasks to storage nodes to relieve the load on the network bandwidth. If the i-th task type is chosen to be moved to the storage nodes, then for every unit of load removed from ρ_C, ρ_S rises by an amount proportional to s_i / c_i (specifically, by (s_i / c_i)·(C / S) units). Therefore, to keep ρ_S as small as possible, task types with smaller s_i / c_i should be moved to the storage nodes first. The present invention accordingly uses s_i / c_i as the evaluation function that converts the runtime characteristics of a task type into a relative cost, and uses the ordering of relative costs as the priority with which each task type is scheduled to the storage nodes: the smaller the relative cost, the higher the priority.
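The greedy priority just described can be sketched in a few lines: each task type's runtime characteristics (c_i, s_i) are turned into a relative cost s_i / c_i, and the cheapest types are scheduled to storage nodes first. The sample costs are invented for the example.

```python
# Relative-cost ordering: the smaller s_i / c_i is, the sooner the task type
# is moved to the storage nodes.

def storage_priority(costs):
    """costs: list of (c_i, s_i) pairs; returns task-type indices, lowest
    relative cost s_i / c_i first."""
    return sorted(range(len(costs)), key=lambda i: costs[i][1] / costs[i][0])

costs = [(2.0, 6.0), (4.0, 2.0), (1.0, 1.0)]  # types 0, 1, 2
order = storage_priority(costs)               # relative costs: 3.0, 0.5, 1.0
```

Type 1 (relative cost 0.5) is moved first, then type 2, then type 0.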
In an embodiment of the present invention, after all task types are sorted by relative cost in ascending order, the types that rank earliest are scheduled to the storage nodes first. Consequently, the optimal solution contains one special split-point task type k: every task type with relative cost smaller than that of k is scheduled entirely to the storage nodes (allocation ratio 100%), every task type with relative cost larger than that of k is scheduled entirely to the computing nodes (allocation ratio 0%), and only the split-point task type k itself has an allocation ratio strictly between 0% and 100%, with its tasks executing on both storage and computing nodes. Determining the split-point task type k therefore suffices to generate the optimal allocation ratios of all task types in one cut, which simplifies the search space and speeds up convergence. To ease numerical solution, the present invention represents this one-cut structure by a real number α ∈ [0, 1] (the split-point parameter). After sorting, the relationship between the allocation ratio x_i of each task type and the split-point parameter α is given by formula (11):

x_i = 1, if α ≥ (i+1)/n; x_i = αn - i, if i/n ≤ α < (i+1)/n; x_i = 0, if α < i/n #(11)
Given the mapping of formula (11) and the value of the split-point parameter α, the allocation ratio of every task type is determined. Concretely, formula (11) sorts the n task types (numbered 0 to n-1) and maps them onto the interval [0, 1], with the i-th task type mapped to the sub-interval [i/n, (i+1)/n); the task type whose sub-interval contains α is the split-point task type k. For example, suppose there are 10 task types numbered 0 to 9. If the optimal solution given by the preset scheduling algorithm is α* = 0.55, it lies in the sub-interval [0.5, 0.6], which corresponds to the 6th task type, so the split-point task type has index k = 5, and its allocation ratio is αn - i = 0.5 (50%). The first 5 task types, whose relative costs are lower than that of the 6th, all have allocation ratio 1 (100%), while the last 4 task types, whose relative costs are higher, all have allocation ratio 0%.
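The one-cut mapping can be written as a single clamp, x_i = min(max(αn - i, 0), 1), which reproduces the worked example above; this closed form is inferred from that example rather than stated explicitly in the source.

```python
# Split-point mapping: given alpha in [0, 1] and n task types already sorted
# by relative cost, type i receives allocation ratio clamp(alpha * n - i, 0, 1).

def allocation_ratios(alpha, n):
    return [min(max(alpha * n - i, 0.0), 1.0) for i in range(n)]

x = allocation_ratios(0.55, 10)
# Types 0-4 go fully to storage, type 5 is split 50/50, types 6-9 stay on
# compute nodes, matching the example in the text.
```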
In an embodiment of the present invention, the relative cost of the first task type is obtained by feeding its runtime characteristics into the preset scheduling algorithm. For example, when the runtime characteristics of the first task type are (c_1, s_1), its relative cost is s_1 / c_1. After the relative cost of the first task type is obtained, and because the second task types have already been processed by the system so that their relative costs are known, the first task type and each second task type are sorted together by relative cost to obtain the sorting result. At the same time, the optimal value of the split-point parameter α is determined through the sub-algorithm of the preset scheduling algorithm.
After the sorting result over the first and second task types and the optimal split-point parameter α have been obtained, the optimal allocation ratio of the first task type is determined from the mapping between allocation ratios and the split-point parameter, i.e., formula (11).
In an embodiment of the present invention, the sub-algorithm of the preset scheduling algorithm that determines the optimal split-point parameter α* proceeds as follows. First, the maximum throughput the system can sustain while meeting the user-given latency SLO is a function of the split-point parameter α, written R(α). For a given α, the corresponding value of R(α) can be approximated via the rate limiter in the scheduler, and the curve of R(α) can be fitted by sampling α. The sub-algorithm rests on one basic assumption: R(α) is a unimodal function. It maintains a lower bound α_lo and an upper bound α_hi on α*, and iteratively shrinks the search space (the interval [α_lo, α_hi]). Concretely, the algorithm divides the current search interval of α into several sub-intervals, samples the endpoints of the sub-intervals one by one, and uses the values of R(α) observed at the endpoints to decide which interval contains the peak of R(α), i.e., α*. This process is called one round of iteration. The initial search space of each round is the interval determined by the previous round, i.e., each round begins by subdividing the interval produced by the previous round. The interval determined in each round is strictly contained in that round's initial search space, so the algorithm progressively shrinks the search space until it converges to the optimal solution α*.
Specifically, in each round of iteration the algorithm divides [α_lo, α_hi] into M equal sub-intervals (M is a parameter of the algorithm) and traverses the interval endpoints in order to find a local maximum, defined as three consecutive endpoints α_{i-1} < α_i < α_{i+1} such that R(α_i) > R(α_{i-1}) and R(α_i) > R(α_{i+1}). Under the assumption that R(α) is unimodal, α* must then lie in [α_{i-1}, α_{i+1}], which shrinks the search space. Since each sub-interval is 1/M of the original interval, the algorithm guarantees that the search space shrinks by a constant factor after every round, and therefore has logarithmic time complexity.
In an embodiment of the present invention, the scheduler initializes the search interval [α_lo, α_hi] to [0, 1] and iterates until the interval length is smaller than a threshold δ; any point inside [α_lo, α_hi] can then be taken as an approximation of the optimal solution α*. After the algorithm converges, the scheduler keeps using this result until it detects a change in the workload or in the cluster state, at which point it reinitializes the preset scheduling algorithm and computes a new α* to respond to the change.
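The iterative narrowing described above can be sketched as follows. The bracket-update rule (keeping the sub-intervals adjacent to the best sampled endpoint) is one way to realize the local-maximum test in the text, and the toy R(α) merely stands in for the throughput the rate limiter would measure.

```python
# Sketch of the split-point sub-algorithm: divide the current interval into M
# sub-intervals, probe R(alpha) at the endpoints, keep the bracket around the
# best endpoint, stop when the interval is shorter than delta.

def find_alpha(R, M=4, delta=1e-3):
    lo, hi = 0.0, 1.0                       # initial search interval [0, 1]
    while hi - lo > delta:
        pts = [lo + (hi - lo) * j / M for j in range(M + 1)]
        vals = [R(a) for a in pts]
        best = max(range(M + 1), key=vals.__getitem__)
        # For a unimodal R, the peak lies between the neighbors of the best
        # sampled endpoint; clamp at the interval boundaries.
        lo = pts[max(best - 1, 0)]
        hi = pts[min(best + 1, M)]
    return (lo + hi) / 2

alpha_star = find_alpha(lambda a: -(a - 0.3) ** 2)  # toy unimodal R, peak 0.3
```

Each round keeps at most 2 of the M sub-intervals, so the interval shrinks geometrically, matching the logarithmic complexity claimed above.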
In the present invention, the method further comprises: adjusting the throughput of the system through the rate limiter according to the end-to-end latency observed during task execution and the service-level objective given by the user; and adjusting the split-point parameter of the scheduler according to the system throughput so as to maximize the system throughput.
In an embodiment of the present invention, the scheduler and the rate limiter inside it form a control system with two nested loops. The rate limiter constitutes the inner loop: it measures the end-to-end latency of task execution, derives the tail latency of each task type, and regulates the total system throughput R according to the user-given service-level objective (latency SLO), ensuring that the user's SLO is met. The scheduler constitutes the outer loop: it runs the sub-algorithm of the scheduling algorithm, adjusting the split-point parameter α according to the total throughput R reported by the rate limiter, and thereby optimizes the allocation ratios of the task types to maximize R. This two-loop control system runs asynchronously in the background of the scheduler, i.e., the scheduler can simultaneously dispatch tasks in the foreground using the allocation ratios currently produced by the control system.
In an embodiment of the present invention, the inner and outer loops of the control system iterate continuously until they converge to the optimal solution. To avoid coupling between the two loops, which would prevent convergence, their frequencies must not be too close. The present invention therefore preferably sets the inner-loop frequency to 200 Hz and the outer-loop frequency to 20 Hz. It should be understood that other frequencies may equally be used for the inner and outer loops, as long as the difference between them reaches a set threshold; this threshold can be chosen according to the actual application scenario and is not specifically limited here.
In an embodiment of the present invention, if a new task type appears in the system, and/or task types disappear, and/or the runtime characteristics of a task type change markedly, the workload of the system has changed. In that case the optimal value of the split-point parameter is re-determined through the sub-algorithm of the preset scheduling algorithm, in the same way as in the embodiments above, which is not repeated here. If nodes are added to or removed from the system, the cluster state of the system has changed, and the new optimal split-point parameter is likewise re-determined through the sub-algorithm, in the same way as in the embodiments above.
In the present invention, the method further comprises: determining, for each task type, the difference between its runtime characteristics in the current time window and the moving average of its historical runtime characteristics; setting the allocation ratio of any third task type, for which this difference exceeds a first threshold, to a default allocation ratio; dispatching the third task type to computing or storage nodes for execution at the default allocation ratio, and measuring the new execution costs of a preset number of its tasks on computing and storage nodes to determine its new runtime characteristics; feeding the new runtime characteristics into the preset scheduling algorithm to determine the relative cost of the third task type; sorting all task types by relative cost; determining the optimal split-point parameter through the sub-algorithm of the preset scheduling algorithm; and determining the optimal allocation ratios of all task types from the sorting result and the optimal split-point parameter.
In an embodiment of the present invention, the difference between each task type's runtime characteristics in the current time window and the moving average of its historical runtime characteristics is determined. When this difference exceeds the first threshold for some task type, the runtime characteristics of that task type have changed significantly; its relative cost and the corresponding optimal allocation ratio will change as well, the allocation ratio it currently uses is no longer accurate, and both the relative cost and the optimal allocation ratio must be corrected.
In this case, the present invention sets the allocation ratio of the third task type (the task type whose difference between its current-window runtime characteristics and its historical moving average exceeds the first threshold) to a default allocation ratio, dispatches its tasks to computing or storage nodes at that default ratio, and measures the new execution costs of a preset number of its tasks on computing and storage nodes to determine its new runtime characteristics. Feeding the new runtime characteristics into the preset scheduling algorithm yields the relative cost of the third task type. All task types are then re-sorted by relative cost, the optimal split-point parameter is re-determined through the sub-algorithm of the preset scheduling algorithm, and, from the resulting ordering, the optimal split-point parameter, and the mapping between allocation ratios and the split-point parameter (formula (11)), the optimal allocation ratios of all task types are re-determined. Each of these steps is carried out in the same way as in the embodiments above and is not repeated here.
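The drift check that triggers re-profiling can be sketched as follows. The threshold value and the use of relative change per component are illustrative assumptions; the source only requires comparing the window statistics with the historical moving average against a first threshold.

```python
# Sketch of the drift check: flag a task type as a "third task type" when its
# windowed runtime characteristics deviate from the historical moving average
# by more than a threshold. Threshold and metric are illustrative.

THRESHOLD = 0.3  # "first threshold" from the text; the value is made up

def drifted(window_cs, history_cs, threshold=THRESHOLD):
    """window_cs, history_cs: (c_i, s_i) pairs; a relative change beyond the
    threshold in either component marks the type for re-profiling."""
    return any(abs(w - h) / h > threshold for w, h in zip(window_cs, history_cs))

flag_stable = drifted((10.0, 5.0), (10.5, 5.2))  # small drift
flag_moved = drifted((20.0, 5.0), (10.0, 5.0))   # c_i doubled
```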
In an embodiment of the present invention, another optional implementation proactively re-measures the runtime characteristics of every task type at a preset interval, to cope with possible changes in those characteristics.
In the present invention, the method further comprises: determining whether the rate limiter is dropping requests; while the rate limiter is dropping requests, increasing the capacity of the computing resource pool step by step until the limiter no longer drops requests; and, when the limiter is not dropping requests, reducing the number of computing nodes step by step according to the load on the system.
In an embodiment of the present invention, with fixed resource capacity the rate limiter bounds the number of task requests in the system so as to meet the latency SLO. In real scenarios, however, the volume of task requests may fluctuate considerably over time. To avoid dropping overflowing requests when the request volume is high, or wasting idle resources when it is low, the scheduler of the present invention can adjust the capacity of the computing resource pool according to the real-time request volume. The demand for computing resources depends on the number of requests submitted by users and varies widely, whereas the demand for storage resources depends on the amount of data the application stores persistently and usually varies little. The present invention therefore chooses the computing resource pool as the object of elastic resource scaling.
In an embodiment of the present invention, to realize elastic resource scaling, a new outermost loop is added around the scheduler's existing two loops. This loop checks whether the rate limiter is dropping requests: while overflowing requests are being dropped, it increases the capacity of the computing resource pool step by step until the limiter no longer drops requests. When the limiter is not dropping requests, it reduces the number of computing nodes step by step according to the system's network-bandwidth load ρ_C and storage-CPU load ρ_S, without affecting throughput or the latency SLO. Embodiments of the present invention perform elastic scaling at the granularity of CPU cores, and the scaling granularity can be adjusted according to the specific configuration and implementation of the system.
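One step of this outermost loop can be sketched as follows. The step size and the headroom bound used for shrinking are illustrative assumptions; the source specifies only that capacity grows while requests are dropped and shrinks, subject to load, when they are not.

```python
# Step-wise elastic scaling sketch: grow the compute pool while the limiter
# drops requests; shrink it when both loads leave enough headroom.

def scale_compute_pool(nodes, dropping, rho_c, rho_s, step=1, headroom=0.5):
    if dropping:
        return nodes + step   # limiter overflowed: add capacity
    if rho_c < headroom and rho_s < headroom and nodes > step:
        return nodes - step   # idle capacity: shrink the pool
    return nodes              # otherwise hold steady

n = scale_compute_pool(8, dropping=True, rho_c=0.9, rho_s=0.7)   # grow
n = scale_compute_pool(n, dropping=False, rho_c=0.3, rho_s=0.2)  # shrink back
```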
In the present invention, the method further includes: determining the actual load of each storage node; when the difference between a storage node's actual load and the system's average load exceeds a second threshold, assigning that storage node to a new independent scheduling policy group; assigning a preset number of computing nodes to the new independent scheduling policy group; each policy group then runs independently as a subsystem for task scheduling and execution.
In real scenarios, an application's data is usually sharded across multiple storage nodes, and tasks may exhibit a skewed access pattern over the shards, that is, a few hot shards carry most of the accesses. In this case, the averages represented by ρC and ρS may diverge considerably from the actual load of the hot shards, degrading the scheduling effect.
To address this problem, the present invention proposes a scheduling policy group mechanism, which collects the actual load of each storage node. When the access pattern is not skewed, that is, when no hot shards exist, all nodes belong to a single default scheduling policy group. When the access pattern is skewed such that the difference between a storage node's actual load and the system's average load exceeds the second threshold, that storage node is assigned to a new independent scheduling policy group, and a preset number of computing nodes are assigned to the same group, so that each scheduling policy group owns independent computing and storage nodes. The number of computing nodes in the new group can be initialized according to the computational intensity of the tasks in the group; since the present invention supports elastic scaling of computing resources, the initial value does not affect final performance. As the workload changes, storage nodes whose load has returned to the average can be merged back into the default scheduling policy group. Each scheduling policy group independently runs the double-loop control system and elastic resource scaling of the above embodiments. Since no resources are shared between scheduling policy groups, each group runs independently as a subsystem for task scheduling and execution, which means the scheduling algorithms of the different groups can run in parallel.
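The regrouping rule above can be sketched as follows. The grouping data structure, the one-sided overload test, and the initial compute-node count are illustrative assumptions:

```python
# Hypothetical sketch of hot-shard detection and policy-group splitting:
# storage nodes whose load exceeds the system mean by more than the second
# threshold are moved into their own independent policy group.

def regroup(storage_loads, second_threshold, preset_compute_nodes):
    """storage_loads maps storage-node id -> measured load in [0, 1]."""
    mean = sum(storage_loads.values()) / len(storage_loads)
    groups = {"default": {"storage": [], "compute": None}}
    for node, load in storage_loads.items():
        if load - mean > second_threshold:          # hot shard detected
            groups[f"group-{node}"] = {
                "storage": [node],
                "compute": preset_compute_nodes,    # initial value only;
            }                                       # elastic scaling corrects it
        else:
            groups["default"]["storage"].append(node)
    return groups
```

Because the groups share no resources, each entry returned here could drive its own independent instance of the double-loop controller.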
In an embodiment of the present invention, the first task type denotes a task type the system has not yet processed, the second task type denotes a task type the system has already processed, and the third task type denotes a task type whose runtime characteristics have changed significantly and therefore need to be re-measured.
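The three states above can be sketched as a small classifier. The relative-drift threshold used to detect a "significant change" is an assumed placeholder, not a value from the patent:

```python
# Hypothetical classifier for the three task-type states described above.
from enum import Enum

class TaskTypeState(Enum):
    FIRST = 1   # never processed: runtime characteristics unknown
    SECOND = 2  # processed: characteristics known and stable
    THIRD = 3   # characteristics drifted significantly: re-measure

def classify(sample_count, old_cost, new_cost, drift=0.5):
    """Classify a task type from its sample count and execution costs."""
    if sample_count == 0:
        return TaskTypeState.FIRST
    if abs(new_cost - old_cost) / old_cost > drift:
        return TaskTypeState.THIRD
    return TaskTypeState.SECOND
```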
In an embodiment of the present invention, the present invention focuses on the scheduling of mixed workloads (workloads mixing multiple task types) in resource-disaggregated data centers, with the goal of maximizing system throughput while meeting the latency SLO (service level objective) of each task type. A latency SLO typically requires that the tail latency of a task not exceed a set constant; each task type may have its own latency SLO.
The core of the present invention is to schedule task types separately according to their runtime characteristics, that is, each task type is scheduled according to its own allocation ratio. Specifically, the present invention discovers that the optimal solution to the mixed-workload scheduling problem has a special structure that greatly reduces the complexity of the problem. On this basis, the present invention designs an efficient algorithm that solves for the optimal scheduling plan in real time. Furthermore, to cope with fluctuations in the number of task requests over time and to utilize resources better, the present invention supports elastic scaling of the computing resource pool. Finally, when the data access pattern is skewed, the present invention provides the scheduling policy group mechanism to handle load imbalance within the cluster.
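A per-type allocation ratio can be applied at dispatch time as a simple randomized split: a fraction p of each task type's requests goes to the computing pool and the remaining 1 − p to the storage pool. The ratios themselves would be produced by the patent's scheduling algorithm; the values below are placeholders:

```python
# Hypothetical sketch of dispatching by per-task-type allocation ratio.
import random

def dispatch(task_type, ratios, rng=random.random):
    """Return 'compute' with probability ratios[task_type], else 'storage'."""
    p = ratios[task_type]
    return "compute" if rng() < p else "storage"
```

Passing a deterministic `rng` makes the split reproducible for testing; in production the default `random.random` realizes the ratio in expectation.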
In an embodiment of the present invention, the prototype system consists of three parts: a scheduler, computing nodes, and storage nodes, as shown in Figure 2. The scheduler dispatches tasks to computing or storage nodes for execution. It handles each task type separately and computes a corresponding optimal allocation ratio for each. The scheduler obtains the runtime characteristics of tasks by collecting execution-cost statistics on the computing and storage nodes, and feeds them to the preset scheduling algorithm. The scheduler also includes a rate limiter, which limits the number of task requests based on end-to-end task latency so that the latency SLO is met.
Computing nodes receive the tasks dispatched by the scheduler and are responsible for executing them. They are equipped with many CPUs (central processing units) and have greater computing power than storage nodes. The CPUs on a computing node are organized into several worker units; when executing tasks, these workers access storage nodes remotely to fetch data. Each computing node also contains a monitoring unit that tracks the runtime characteristics of tasks. When a task completes, a feedback message containing these statistics is sent to the scheduler.
A storage node consists of a data store and several worker units. The worker units on a storage node are similar to those on a computing node, except that they can access local data directly when executing tasks (that is, they perform storage-side computation). The number of worker units on a storage node is usually smaller than on a computing node. Like a computing node, a storage node also contains a monitoring unit that tracks the runtime characteristics of tasks; when a task completes, a feedback message containing these statistics is sent to the scheduler.
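The feedback message sent by a worker unit might carry fields such as the following; the field names and units are assumptions made for illustration, not the patent's wire format:

```python
# Hypothetical shape of the per-task feedback message sent to the scheduler.
from dataclasses import dataclass, asdict

@dataclass
class TaskFeedback:
    task_type: str          # which task type's statistics to update
    executed_on: str        # "compute" or "storage"
    cpu_time_us: int        # end-to-end CPU time consumed by the task
    bytes_transferred: int  # network transfer (zero for storage-side runs)
```

Recording only end-to-end costs like these keeps the monitoring non-intrusive to the application, as the embodiment at the end of this section notes.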
In an embodiment of the present invention, the data accessed by a task may be sharded across multiple storage nodes. The present invention assumes by default that a task accesses only one shard, that is, the data required by each task can be obtained from a single storage node. This default assumption is consistent with how stored procedures behave in existing distributed databases. When submitting a task invocation request, the user must indicate which data shard it accesses. If the scheduler dispatches the task to a storage node, that node must hold the relevant shard; if the scheduler dispatches the task to a computing node, there is no such restriction, because the data is fetched remotely over the network.
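The placement constraint in this paragraph can be expressed as a simple check; the shard-map structure below is an illustrative assumption:

```python
# Hypothetical placement check: storage-side execution requires the target
# node to hold the task's shard; compute-side execution is unrestricted,
# since data is fetched remotely over the network.

def valid_placement(target_kind, target_node, task_shard, shard_map):
    """shard_map maps shard id -> id of the storage node holding it."""
    if target_kind == "compute":
        return True
    return shard_map.get(task_shard) == target_node
```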
In an embodiment of the present invention, scheduler instances can be added or removed independently of the computing and storage resource pools. Since the preset scheduling algorithm only needs to run periodically, the scheduler is integrated as a background process into the load balancers already widely deployed inside data centers. Because the preset scheduling algorithm is not on the critical path of task execution, the scheduler typically does not become a system bottleneck. To meet the throughput demands of large data centers, additional load balancer nodes can be added following industry practice.
To verify the effectiveness of the scheduling method provided by the present invention, the implemented prototype system was tested with a variety of synthetic and application workloads. Experimental results show that the present invention achieves sub-second convergence, adapts to a variety of dynamic workloads, and improves system throughput by 3 to 21 times over current scheduling methods. The system also exhibits good horizontal scalability.
The serverless computing scheduling method for resource-disaggregated data centers provided by the present invention has the following advantages: better application performance, as the present invention achieves high throughput while guaranteeing the latency SLO; higher resource utilization, as the present invention balances load between the computing and storage resource pools; fast and stable convergence, as the present invention reacts quickly when the workload or cluster configuration changes and stabilizes the system in the optimal state; and ease of deployment and use, as the present invention requires no code changes in upper-layer applications.
In an embodiment of the present invention, the rate limiter adjusts the system's throughput according to the end-to-end latency of tasks, relieving the queuing of task requests on the computing and storage nodes and thereby ensuring that the latency SLO is met. In the prototype system of the present invention, the rate limiter uses the AIMD algorithm; other congestion-control algorithms may also be used. To allow an application to execute either on a computing node (remote data access) or on a storage node (local data access), the present invention abstracts the remote data store into a local data store on the computing node, providing applications with an API (application programming interface) identical to that for local data access. The monitoring units on computing and storage nodes collect the runtime characteristics of tasks and feed them back to the scheduler. The monitoring unit records only the end-to-end execution cost of a task, such as the total CPU time consumed, and is therefore non-intrusive to the application.
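The AIMD behavior mentioned above can be sketched as a window update driven by observed latency. The additive step and multiplicative factor below are the conventional AIMD defaults, not values taken from the patent:

```python
# Hypothetical AIMD (additive-increase, multiplicative-decrease) update for
# the admission window of the rate limiter: grow additively while the
# observed tail latency meets the SLO, cut multiplicatively on a violation.

def aimd_update(window, latency_us, slo_us, add=1.0, mult=0.5, floor=1.0):
    """Return the new admission-window size (in in-flight requests)."""
    if latency_us <= slo_us:
        return window + add            # additive increase
    return max(window * mult, floor)   # multiplicative decrease
```

Any congestion-control-style rule with the same "probe up, back off hard" shape could be substituted, as the text notes.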
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another.
Finally, it should be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises it.
The serverless computing scheduling method for resource-disaggregated data centers provided by the present invention has been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention; the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310149359.4A CN116302404B (en) | 2023-02-16 | 2023-02-16 | Resource decoupling data center-oriented server non-perception calculation scheduling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116302404A CN116302404A (en) | 2023-06-23 |
CN116302404B true CN116302404B (en) | 2023-10-03 |
Family
ID=86837103
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116662290B (en) * | 2023-07-24 | 2023-09-29 | 北京大学 | Read optimization method and device for stateful server-less function |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912401A (en) * | 2016-04-08 | 2016-08-31 | 中国银行股份有限公司 | Distributed data batch processing system and method |
CN107832153A (en) * | 2017-11-14 | 2018-03-23 | 北京科技大学 | A kind of Hadoop cluster resources self-adapting distribution method |
CN112988360A (en) * | 2021-05-10 | 2021-06-18 | 杭州绿城信息技术有限公司 | Task distribution system based on big data analysis |
CN113238848A (en) * | 2021-05-27 | 2021-08-10 | 上海商汤科技开发有限公司 | Task scheduling method and device, computer equipment and storage medium |
CN113742059A (en) * | 2021-07-15 | 2021-12-03 | 上海朋熙半导体有限公司 | Task allocation method and device, computer equipment and storage medium |
WO2022028157A1 (en) * | 2020-08-03 | 2022-02-10 | 同济大学 | Elastic scaling method and system for microservice system in cloud environment, medium and device |
Non-Patent Citations (2)
Title |
---|
Shital Patil et al.; Performance improvement in cloud computing through dynamic task scheduling algorithm; 2015 1st International Conference on Next Generation Computing Technologies (NGCT) |
Zhang Shaohui et al.; Big data placement strategy with adaptively balanced data storage in heterogeneous Hadoop clusters; Modern Electronics Technique |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||