
CN112698911B - Cloud job scheduling method based on deep reinforcement learning - Google Patents

Cloud job scheduling method based on deep reinforcement learning

Info

Publication number
CN112698911B
CN112698911B (application CN202011578884.0A)
Authority
CN
China
Prior art keywords
job
virtual machine
scheduling
mentioned
ready
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011578884.0A
Other languages
Chinese (zh)
Other versions
CN112698911A (en)
Inventor
李启锐
彭志平
崔得龙
林建鹏
何杰光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Petrochemical Technology
Original Assignee
Guangdong University of Petrochemical Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Petrochemical Technology filed Critical Guangdong University of Petrochemical Technology
Priority to CN202011578884.0A priority Critical patent/CN112698911B/en
Publication of CN112698911A publication Critical patent/CN112698911A/en
Application granted granted Critical
Publication of CN112698911B publication Critical patent/CN112698911B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of cloud computing resource scheduling, and in particular to a cloud job scheduling method based on deep reinforcement learning, comprising the following steps: receiving user jobs sent by a user; decoupling the user jobs to obtain a ready job set; scheduling the ready job set through a job scheduler, where scheduling means taking an action according to a scheduling policy and deploying the jobs in the ready job set onto the corresponding virtual machines; executing the jobs through the virtual machines and returning the execution results; collecting training samples and establishing an experience pool, where a training sample stores the ready job set state, the virtual machine state, the action and the reward value, and the reward value is the reward obtained by taking the action; judging whether the number of training samples in the experience pool is smaller than a threshold value, and if so, receiving user jobs sent by the user again, otherwise optimizing the job scheduler with the training samples in the experience pool; and scheduling with the optimized job scheduler. The invention can shorten the completion time of user jobs.

Description

A cloud job scheduling method based on deep reinforcement learning

Technical Field

The invention relates to the field of cloud computing resource scheduling, and more particularly to a cloud job scheduling method based on deep reinforcement learning.

Background

Cloud computing is essentially a service provision model that pools IT resources such as computing, storage and applications through technology integration and supplies them efficiently and at low cost. Professor Liu Peng, secretary-general of the China Cloud Computing Expert Advisory Committee, has given both a long and a short definition of cloud computing. The long definition: cloud computing is a commercial computing model that distributes computing tasks over a resource pool composed of a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The short definition: cloud computing is the on-demand provision of cheap, dynamically scalable computing services over the network.

Today, cloud computing has developed into an integrated software and hardware technology resource platform that provides all kinds of cloud services and components, and it is a comprehensive carrier with a clear business model. Driven by the value of social demand, cloud computing, as a low-cost and flexibly configurable service provision model, will see richer application scenarios and broader room for development. Although cloud job scheduling is strongly goal-directed, its policy is continually adjusted as the state of the environment changes, so it has a degree of randomness; at the same time, a job scheduling decision uses only the current system state and is independent of past and future system states, so the job scheduling process has a clear Markov property (a concept in probability theory named after the Russian mathematician Andrey Markov: a stochastic process has the Markov property if, given the present state, it is conditionally independent of past states, i.e. of the historical path of the process). In general, the goal pursued by both the suppliers and the consumers of cloud resources is to receive a response in the shortest possible time after a job is submitted. With the rapid development of artificial intelligence, many researchers have begun to apply machine learning algorithms such as reinforcement learning to the many problems of cloud job scheduling in pursuit of this goal. At present, reinforcement-learning-based cloud job scheduling algorithms have achieved many good results in the field of cloud computing resource scheduling, but many problems remain. A cloud computing platform is a huge, rapidly changing system, so its state space is also huge and keeps growing; this makes reinforcement-learning-based cloud job scheduling algorithms severely limited when applied to complex, large-scale cloud computing systems such as distributed data centers, and this limitation seriously degrades the performance of the reinforcement learning algorithms. Degraded algorithm performance in turn leads to overly long job completion times and a poor user experience. In summary, there is an urgent need for a cloud job scheduling method based on deep reinforcement learning that can shorten the completion time of user jobs.

Summary of the Invention

To solve the above problems, the present invention provides a cloud job scheduling method based on deep reinforcement learning, which can shorten the completion time of user jobs.

The technical scheme adopted by the present invention is as follows:

A cloud job scheduling method based on deep reinforcement learning, comprising:

receiving user jobs sent by a user;

decoupling the user jobs to obtain a ready job set;

scheduling the ready job set through a job scheduler; the scheduling is to take an action according to a scheduling policy and deploy the jobs in the ready job set onto the corresponding virtual machines; the action is the virtual machine allocation mode of the jobs in the ready job set;

executing the jobs through the virtual machines and returning the execution results;

collecting training samples and establishing an experience pool; the training samples store the ready job set state, the virtual machine state, the action and the reward value; the reward value is the reward obtained by taking the action;

judging whether the number of training samples in the experience pool is smaller than a threshold value; if so, receiving user jobs sent by the user again, otherwise optimizing the job scheduler with the training samples in the experience pool;

scheduling with the optimized job scheduler.

Specifically, the job scheduling process is as follows. First, after an enterprise deploys virtual machine servers in multiple data centers, enterprise staff submit user jobs of various kinds to a job decoupler, which decouples the user jobs. A user job generally contains atomic jobs and dependent sub-jobs; decoupling here means splitting the user job into different sub-jobs according to the priorities and dependencies of the sub-jobs. After a user job has been decoupled into sub-jobs, the sub-jobs are stored in the ready job set. Then the ready job set is scheduled by the job scheduler, which assigns the sub-jobs in the ready job set to different data centers. The virtual machine servers deployed in the different data centers create virtual machines to process the sub-jobs assigned to their center. Finally, after a virtual machine executes a sub-job, it returns the execution result to the enterprise. Each virtual machine server is at a different distance from the enterprise, the servers differ in processing performance, and the type and number of sub-jobs each server has to process differ, all of which cause the virtual machines created by each server to be in different states. Given these differences between virtual machine states, assigning sub-jobs to virtual machines for execution according to a suitable scheduling policy (i.e. job scheduling) yields different scheduling effects, the most direct of which is a shorter machine response time. To shorten the machine response time and improve the user experience, this scheme uses a deep reinforcement learning algorithm to optimize the job scheduler. The deep reinforcement learning process is: the agent continually interacts with and explores the cloud environment and, through the reward-and-penalty mechanism and the experience replay mechanism, accumulates learning experience in order to find the optimal scheduling policy. The steps corresponding to this process are: the cloud computing platform continually receives user jobs from the enterprise and schedules and executes them; an experience pool is built by collecting the results of the reward function together with the data from job scheduling; and the experience pool is used to search for the optimal scheduling policy and optimize the job scheduler.
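The interaction loop described above can be sketched in Python as follows. This is only an illustration of the control flow; the names (CloudEnv-style environment object, JobScheduler, the threshold and batch-size values) are hypothetical placeholders and not taken from the patent, and the scheduler-optimization step is left abstract.

```python
from collections import deque
import random

class ExperiencePool:
    """Stores training samples (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, sample):
        self.buffer.append(sample)

    def sample(self, m):
        return random.sample(self.buffer, m)

    def __len__(self):
        return len(self.buffer)

def run_scheduling(env, scheduler, pool, threshold=1000, batch_size=32):
    """env and scheduler are hypothetical objects standing in for the cloud
    environment and the job scheduler described in the text."""
    while True:
        jobs = env.receive_user_jobs()            # user jobs sent by users
        ready_set = env.decouple(jobs)            # split into ready sub-jobs
        state = env.observe(ready_set)            # ready-job-set state + VM state
        action = scheduler.act(state)             # VM allocation per scheduling policy
        reward, next_state = env.execute(action)  # VMs execute jobs, return results
        pool.add((state, action, reward, next_state))
        # Start optimizing only once the pool holds enough samples;
        # otherwise keep collecting user jobs.
        if len(pool) >= threshold:
            scheduler.optimize(pool.sample(batch_size))
```

In this sketch, the `optimize` call corresponds to the DQN update that the text describes later.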

Further, the objective function of the scheduling is:

$$\min_{\pi}\ \sum_{k}\sum_{i} T_i^k$$

where $J$ is a user job, $\pi$ is the scheduling policy, $J_i^k$ is the i-th job of the k-th user, and $T_i^k$ is the completion time of the i-th job of the k-th user.

Specifically, the objective function shows that the goal of job scheduling is to minimize the completion time of the jobs while allocating each job to a virtual machine server in a suitable data center.

Further, the completion time is

$$T_i^k = t_{i,e}^k + t_{i,tr}^k + t_{i,w}^k$$

where $d_{i,in}^k$ is the amount of data that job $J_i^k$ transfers to the virtual machine, $L_k(i)$ is the length of job $J_i^k$, $d_{i,out}^k$ is the amount of data of the execution result returned after job $J_i^k$ is executed, $t_{i,e}^k$ is the execution time of job $J_i^k$, $t_{i,tr}^k$ is the transmission time of job $J_i^k$, and $t_{i,w}^k$ is the waiting time of job $J_i^k$. The waiting time is the time during which, after the ready job set has been scheduled by the job scheduler and before the job is executed by the virtual machine and the execution result is returned, the scheduled job waits to be executed in the waiting queue of the virtual machine because the computing capacity of the virtual machine is insufficient.

Specifically, the completion time of a job is the sum of its execution time, transmission time and waiting time. $L_k(i)$ is the length of job $J_i^k$, i.e. the file length of job $J_i^k$.

Further,

$$t_{i,e}^k = \frac{L_k(i)\,c\,p}{f_i^k}, \qquad t_{i,tr}^k = t_{i,in}^k + t_{i,out}^k, \qquad t_{i,w}^k = \sum_{J_j \in q} t_{j,e}$$

where $f_i^k$ is the MIPS allocated to job $J_i^k$, $c$ is the megabyte-to-byte conversion coefficient, $p$ is the number of CPU cycles the virtual machine needs to complete a job of unit length, $t_{i,in}^k$ is the time for job $J_i^k$ to transfer its data to the virtual machine, $t_{i,out}^k$ is the transmission time of the processing result returned after job $J_i^k$ is executed, $J_j$ is the j-th job, $q$ is the set of all jobs ahead of job $J_i^k$ in the waiting queue, and $t_{j,e}$ is the execution time of the j-th job.

Further,

$$t_{i,in}^k = \frac{d_{i,in}^k}{b_i^k}, \qquad t_{i,out}^k = \frac{d_{i,out}^k}{b_i^k}$$

the amount of data transferred by job $J_i^k$ is $d_i^k = d_{i,in}^k + d_{i,out}^k$, and $b_i^k$ is the bandwidth resource the virtual machine allocates to each job.

Specifically, it follows from the above that $t_{i,tr}^k = (d_{i,in}^k + d_{i,out}^k)/b_i^k$.

Further, $b_i^k = b / N_T$, where $b$ is the bandwidth resource of the virtual machine and $N_T$ is the number of jobs transferred to the virtual machine in time slot T.
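As a concrete illustration, the completion-time model above can be written as one short Python function. This is a sketch of the formulas as reconstructed from the definitions in the text; the exact arrangement of the execution-time expression is an assumption, and all argument names are chosen here for readability rather than taken from the patent.

```python
def completion_time(L, d_in, d_out, f, c, p, b, n_slot, queue_exec_times):
    """Completion time T of a job.

    L                 -- job length L_k(i)
    d_in, d_out       -- data sent to / returned from the virtual machine
    f                 -- MIPS allocated to the job
    c                 -- megabyte-to-byte conversion coefficient
    p                 -- CPU cycles per unit job length
    b                 -- total bandwidth of the virtual machine
    n_slot            -- number of jobs transferred to the VM in the time slot
    queue_exec_times  -- execution times of jobs ahead of this one in the queue
    """
    t_exec = L * c * p / f               # execution time (assumed arrangement)
    b_job = b / n_slot                   # bandwidth allocated per job
    t_trans = (d_in + d_out) / b_job     # transmission time (inbound + outbound)
    t_wait = sum(queue_exec_times)       # waiting time behind queued jobs
    return t_exec + t_trans + t_wait
```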

Further, a training sample is $(s_t, \alpha_t, r_t, s_{t+1})$. The ready job set state is $s_J=\{t_1,d_1,t_2,d_2,\ldots,t_n,d_n\}$; the virtual machine state $s_{VM}$ records, for each of the $m$ virtual machines, its remaining computing capacity at the current time step and the number of jobs waiting to be executed on it. The actions are stored in the action space A, $A=\{\alpha_1,\alpha_2,\ldots,\alpha_n\}$; the reward value is calculated by the reward function R. Here $s_t$ and $s_{t+1}$ are the states at time step t and time step t+1 respectively; the states are stored in the state space S, $S=\{s_J,s_{VM}\}$; $\alpha_t$ is the action selected from the action space A at time step t; $r_t$ is the reward value calculated by the reward function R at time step t. In the ready job set state $s_J$, $t_i$ and $d_i$ denote the execution time of the i-th job in the ready job set and the amount of data it transfers to the virtual machine; n is the number of jobs in the ready job set; m is the number of virtual machines. The action $\alpha_i$ in the action space A denotes the virtual machine allocation mode of the i-th job in the ready job set, and each action $\alpha_i$ has m+1 options. For the x-th virtual machine, the number of jobs it has already executed and the number of jobs waiting to be executed on it are also recorded.

Specifically, in reinforcement learning a training sample is generally a four-tuple $(S, \alpha, r, S')$, where $S$ is the state at the current time step, $\alpha$ is the action taken at the current time step, $r$ is the reward obtained by taking action $\alpha$ at the current time step, and $S'$ is the state at the next time step. Correspondingly, the training samples collected in this scheme are $(s_t, \alpha_t, r_t, s_{t+1})$: $s_t$ is the state at time step t, $\alpha_t$ is the action at time step t, $r_t$ is the reward obtained by taking action $\alpha_t$ at time step t, and $s_{t+1}$ is the state at the following time step t+1. The state of a training sample is stored in the state space S, which consists of the ready job set state and the virtual machine state, $S=\{s_J, s_{VM}\}$. The action $\alpha$ of a training sample is stored in the action space A; each action $\alpha_i$ in A has m+1 options, the first option being the empty action, the second option assigning the job to the first-ranked virtual machine, and so on. For example, $\alpha_1=(0,0,1,0)$ assigns job 1 to virtual machine 2, while $\alpha_2=(1,0,0,0)$ is the empty action, meaning that job 2 is not assigned to any virtual machine at the current time step. The reward value r is computed by the reward function, which gives a positive reward for each job completed at the current time step and a negative reward for each job still waiting. Reward function design is an extremely important part of deep reinforcement learning: whether the design matches the target requirement determines whether the machine can learn the desired policy, and it directly affects the convergence speed and final performance of the algorithm. Since the optimization objective is to minimize the makespan of job scheduling, giving a positive reward to jobs completed at the current time step and a negative reward to jobs still waiting encourages jobs to be completed as soon as possible.

Further, the objective function for optimizing the job scheduler is:

$$\max\ \mathbb{E}\Big[\sum_{t}\gamma^{t} r_t\Big]$$

where $\gamma$ is the discount factor, $\gamma \in [0, 1]$.

Specifically, once the number of collected training samples exceeds the experience pool threshold, batches of samples are drawn from the pool to optimize the job scheduler. The goal is to maximize the expected cumulative discounted return.

Further, the loss function for optimizing the job scheduler is:

$$L(\theta_z)=\mathbb{E}_{(s_t,\alpha_t,r_t,s_{t+1})\sim D(M)}\Big[\big(r_t+\gamma\max_{\alpha_{t+1}}Q(s_{t+1},\alpha_{t+1};\theta_z^{-})-Q(s_t,\alpha_t;\theta_z)\big)^{2}\Big]$$

where $\theta_z$ are the job scheduler parameters after the z-th iteration, $s_{t+1}$ is the state at the time step following $s_t$, D(M) means that M samples are drawn from the experience pool D each time, $\alpha_{t+1}$ is the action with the maximum Q value for $s_{t+1}$, and $\theta_z^{-}$ are the target network parameters used when optimizing the job scheduler after the z-th iteration.

Specifically, the job scheduler adopts mini-batch training: in each iteration, M samples $(s_t, \alpha_t, r_t, s_{t+1})$ are randomly selected from the experience pool, the state $s_t$ is fed to the online network to obtain the current Q value of action $\alpha_t$, and the next state $s_{t+1}$ is fed to the target network to obtain the maximum Q value over all actions of the target network.

Further, the gradient of the loss function with respect to the parameters $\theta$ is:

$$\nabla_{\theta_z}L(\theta_z)=\mathbb{E}_{(s_t,\alpha_t,r_t,s_{t+1})\sim D(M)}\Big[\big(r_t+\gamma\max_{\alpha_{t+1}}Q(s_{t+1},\alpha_{t+1};\theta_z^{-})-Q(s_t,\alpha_t;\theta_z)\big)\,\nabla_{\theta_z}Q(s_t,\alpha_t;\theta_z)\Big]$$

Compared with the prior art, the beneficial effects of the present invention are:

(1) The scheduling policy effectively shortens the response time of the machines and improves the user experience.

(2) The job scheduler is optimized with training samples, which avoids the degradation of algorithm performance caused by the growth of the state space and reduces the completion time of jobs.

Description of the Drawings

FIG. 1 is a schematic diagram of the cloud platform of the present invention;

FIG. 2 is simulation experiment data plot (a) of the present invention;

FIG. 3 is simulation experiment data plot (b) of the present invention;

FIG. 4 is simulation experiment data plot (c) of the present invention.

Detailed Description

The accompanying drawings of the present invention are only for illustrative purposes and should not be construed as limiting the present invention. In order to better illustrate the following embodiments, some parts in the drawings may be omitted, enlarged or reduced and do not represent the dimensions of the actual product; those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

Embodiment

This embodiment provides a cloud job scheduling method based on deep reinforcement learning, comprising:

receiving user jobs sent by a user;

decoupling the user jobs to obtain a ready job set;

scheduling the ready job set through a job scheduler; the scheduling is to take an action according to a scheduling policy and deploy the jobs in the ready job set onto the corresponding virtual machines; the action is the virtual machine allocation mode of the jobs in the ready job set;

executing the jobs through the virtual machines and returning the execution results;

collecting training samples and establishing an experience pool; the training samples store the ready job set state, the virtual machine state, the action and the reward value; the reward value is the reward obtained by taking the action;

judging whether the number of training samples in the experience pool is smaller than a threshold value; if so, receiving user jobs sent by the user again, otherwise optimizing the job scheduler with the training samples in the experience pool;

scheduling with the optimized job scheduler.

FIG. 1 is a schematic diagram of the cloud platform of the present invention; as shown in FIG. 1, the above method is used on the cloud platform in the figure. The job scheduling process is as follows. First, after an enterprise deploys virtual machine servers in multiple data centers, enterprise staff submit user jobs of various kinds to a job decoupler, which decouples the user jobs. A user job generally contains atomic jobs and dependent sub-jobs; decoupling here means splitting the user job into different sub-jobs according to the priorities and dependencies of the sub-jobs. After a user job has been decoupled into sub-jobs, the sub-jobs are stored in the ready job set. Then the ready job set is scheduled by the job scheduler, which assigns the sub-jobs in the ready job set to different data centers. The virtual machine servers deployed in the different data centers create virtual machines to process the sub-jobs assigned to their center. Finally, after a virtual machine executes a sub-job, it returns the execution result to the enterprise. Each virtual machine server is at a different distance from the enterprise, the servers differ in processing performance, and the type and number of sub-jobs each server has to process differ, all of which cause the virtual machines created by each server to be in different states. Given these differences between virtual machine states, assigning sub-jobs to virtual machines for execution according to a suitable scheduling policy (i.e. job scheduling) yields different scheduling effects, the most direct of which is a shorter machine response time. To shorten the machine response time and improve the user experience, this scheme uses a deep reinforcement learning algorithm to optimize the job scheduler. The deep reinforcement learning process is: the agent continually interacts with and explores the cloud environment and, through the reward-and-penalty mechanism and the experience replay mechanism, accumulates learning experience in order to find the optimal scheduling policy. The steps corresponding to this process are: the cloud computing platform continually receives user jobs from the enterprise and schedules and executes them; an experience pool is built by collecting the results of the reward function together with the data from job scheduling; and the experience pool is used to search for the optimal scheduling policy and optimize the job scheduler.

Further, the objective function of the scheduling is:

$$\min_{\pi}\ \sum_{k}\sum_{i} T_i^k$$

where $J$ is a user job, $\pi$ is the scheduling policy, $J_i^k$ is the i-th job of the k-th user, and $T_i^k$ is the completion time of the i-th job of the k-th user.

Specifically, the objective function shows that the goal of job scheduling is to minimize the completion time of the jobs while allocating each job to a virtual machine server in a suitable data center.

Further, the completion time is

$$T_i^k = t_{i,e}^k + t_{i,tr}^k + t_{i,w}^k$$

where $d_{i,in}^k$ is the amount of data that job $J_i^k$ transfers to the virtual machine, $L_k(i)$ is the length of job $J_i^k$, $d_{i,out}^k$ is the amount of data of the execution result returned after job $J_i^k$ is executed, $t_{i,e}^k$ is the execution time of job $J_i^k$, $t_{i,tr}^k$ is the transmission time of job $J_i^k$, and $t_{i,w}^k$ is the waiting time of job $J_i^k$. The waiting time is the time during which, after the ready job set has been scheduled by the job scheduler and before the job is executed by the virtual machine and the execution result is returned, the scheduled job waits to be executed in the waiting queue of the virtual machine because the computing capacity of the virtual machine is insufficient.

Specifically, the completion time of a job is the sum of its execution time, transmission time and waiting time. $L_k(i)$ is the length of job $J_i^k$, i.e. the file length of job $J_i^k$.

Further,

$$t_{i,e}^k = \frac{L_k(i)\,c\,p}{f_i^k}, \qquad t_{i,tr}^k = t_{i,in}^k + t_{i,out}^k, \qquad t_{i,w}^k = \sum_{J_j \in q} t_{j,e}$$

where $f_i^k$ is the MIPS allocated to job $J_i^k$, $c$ is the megabyte-to-byte conversion coefficient, $p$ is the number of CPU cycles the virtual machine needs to complete a job of unit length, $t_{i,in}^k$ is the time for job $J_i^k$ to transfer its data to the virtual machine, $t_{i,out}^k$ is the transmission time of the processing result returned after job $J_i^k$ is executed, $J_j$ is the j-th job, $q$ is the set of all jobs ahead of job $J_i^k$ in the waiting queue, and $t_{j,e}$ is the execution time of the j-th job.

Further,

$$t_{i,in}^k = \frac{d_{i,in}^k}{b_i^k}, \qquad t_{i,out}^k = \frac{d_{i,out}^k}{b_i^k}$$

the amount of data transferred by job $J_i^k$ is $d_i^k = d_{i,in}^k + d_{i,out}^k$, and $b_i^k$ is the bandwidth resource the virtual machine allocates to each job.

Specifically, it follows from the above that $t_{i,tr}^k = (d_{i,in}^k + d_{i,out}^k)/b_i^k$.

Further, $b_i^k = b / N_T$, where $b$ is the bandwidth resource of the virtual machine and $N_T$ is the number of jobs transferred to the virtual machine in time slot T.

Further, a training sample is $(s_t, \alpha_t, r_t, s_{t+1})$. The ready job set state is $s_J=\{t_1,d_1,t_2,d_2,\ldots,t_n,d_n\}$; the virtual machine state $s_{VM}$ records, for each of the $m$ virtual machines, its remaining computing capacity at the current time step and the number of jobs waiting to be executed on it. The actions are stored in the action space A, $A=\{\alpha_1,\alpha_2,\ldots,\alpha_n\}$; the reward value is calculated by the reward function R. Here $s_t$ and $s_{t+1}$ are the states at time step t and time step t+1 respectively; the states are stored in the state space S, $S=\{s_J,s_{VM}\}$; $\alpha_t$ is the action selected from the action space A at time step t; $r_t$ is the reward value calculated by the reward function R at time step t. In the ready job set state $s_J$, $t_i$ and $d_i$ denote the execution time of the i-th job in the ready job set and the amount of data it transfers to the virtual machine; n is the number of jobs in the ready job set; m is the number of virtual machines. The action $\alpha_i$ in the action space A denotes the virtual machine allocation mode of the i-th job in the ready job set, and each action $\alpha_i$ has m+1 options. For the x-th virtual machine, the number of jobs it has already executed and the number of jobs waiting to be executed on it are also recorded.

Specifically, in reinforcement learning a training sample is generally a four-tuple $(S, \alpha, r, S')$, where $S$ is the state at the current time step, $\alpha$ is the action taken at the current time step, $r$ is the reward obtained by taking action $\alpha$ at the current time step, and $S'$ is the state at the next time step. Correspondingly, the training samples collected in this scheme are $(s_t, \alpha_t, r_t, s_{t+1})$: $s_t$ is the state at time step t, $\alpha_t$ is the action at time step t, $r_t$ is the reward obtained by taking action $\alpha_t$ at time step t, and $s_{t+1}$ is the state at the following time step t+1. The state of a training sample is stored in the state space S, which consists of the ready job set state and the virtual machine state, $S=\{s_J, s_{VM}\}$. The action $\alpha$ of a training sample is stored in the action space A; each action $\alpha_i$ in A has m+1 options, the first option being the empty action, the second option assigning the job to the first-ranked virtual machine, and so on. For example, $\alpha_1=(0,0,1,0)$ assigns job 1 to virtual machine 2, while $\alpha_2=(1,0,0,0)$ is the empty action, meaning that job 2 is not assigned to any virtual machine at the current time step. The reward value r is computed by the reward function, which gives a positive reward for each job completed at the current time step and a negative reward for each job still waiting. Reward function design is an extremely important part of deep reinforcement learning: whether the design matches the target requirement determines whether the machine can learn the desired policy, and it directly affects the convergence speed and final performance of the algorithm. Since the optimization objective is to minimize the makespan of job scheduling, giving a positive reward to jobs completed at the current time step and a negative reward to jobs still waiting encourages jobs to be completed as soon as possible.

Further, the objective function for optimizing the job scheduler is:

$$\max\ \mathbb{E}\Big[\sum_{t}\gamma^{t} r_t\Big]$$

where $\gamma$ is the discount factor, $\gamma \in [0, 1]$.

Specifically, once the number of collected training samples exceeds the experience pool threshold, batches of samples are drawn from the pool to optimize the job scheduler. The goal is to maximize the expected cumulative discounted return.

Further, the loss function for optimizing the job scheduler is:

$$L(\theta_z)=\mathbb{E}_{(s_t,\alpha_t,r_t,s_{t+1})\sim D(M)}\Big[\big(r_t+\gamma\max_{\alpha_{t+1}}Q(s_{t+1},\alpha_{t+1};\theta_z^{-})-Q(s_t,\alpha_t;\theta_z)\big)^{2}\Big]$$

where $\theta_z$ are the job scheduler parameters after the z-th iteration, $s_{t+1}$ is the state at the time step following $s_t$, D(M) means that M samples are drawn from the experience pool D each time, $\alpha_{t+1}$ is the action with the maximum Q value for $s_{t+1}$, and $\theta_z^{-}$ are the target network parameters used when optimizing the job scheduler after the z-th iteration.

Specifically, the job scheduler adopts mini-batch training: in each iteration, M samples $(s_t, \alpha_t, r_t, s_{t+1})$ are randomly selected from the experience pool, the state $s_t$ is fed to the online network to obtain the current Q value of action $\alpha_t$, and the next state $s_{t+1}$ is fed to the target network to obtain the maximum Q value over all actions of the target network.

Further, the gradient of the loss function with respect to the parameters $\theta$ is:

$$\nabla_{\theta_z}L(\theta_z)=\mathbb{E}_{(s_t,\alpha_t,r_t,s_{t+1})\sim D(M)}\Big[\big(r_t+\gamma\max_{\alpha_{t+1}}Q(s_{t+1},\alpha_{t+1};\theta_z^{-})-Q(s_t,\alpha_t;\theta_z)\big)\,\nabla_{\theta_z}Q(s_t,\alpha_t;\theta_z)\Big]$$

This embodiment also carried out a simulation experiment; the goal of the experiment is to test the job scheduling method based on deep reinforcement learning.

A simulation platform was built in Python. The specific system parameters of the platform are: the number of users is 4; the number of user job queues is 4; at each time step each job queue delivers 3 jobs to the ready job set; the ready job set holds 12 jobs; and the resource utilization threshold is 0.6. The user jobs used in the experiment comprise four job types; the ratio of data transfer volume to computation volume for each job type is shown in Table 1.

Table 1. Ratio of data transfer volume to computation volume for each job type


The transfer data volume of each job is randomly generated between 10 and 20 MB, the dependencies between sub-jobs are randomly generated, and the number of jobs is 200. The computing capacities of the three virtual machines in the cloud platform are 650 MIPS, 850 MIPS and 1500 MIPS, their bandwidths are 200 M, 300 M and 500 M, and their numbers of computing cores are 4, 8 and 12 respectively. The key parameters of the DQN network are shown in Table 2. The following experiments were carried out in this environment.
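For reference, the experimental setup described above can be collected into a single configuration structure. The Python dictionary below simply restates the values listed in the text; the field names are chosen here for readability and are not taken from the paper's code.

```python
# Simulation configuration restating the parameters given above.
SIM_CONFIG = {
    "num_users": 4,
    "num_job_queues": 4,
    "jobs_per_queue_per_step": 3,
    "ready_set_size": 12,
    "resource_utilization_threshold": 0.6,
    "job_data_size_mb": (10, 20),        # transfer volume drawn uniformly at random
    "num_jobs": 200,
    "vms": [
        {"mips": 650,  "bandwidth_mb": 200, "cores": 4},
        {"mips": 850,  "bandwidth_mb": 300, "cores": 8},
        {"mips": 1500, "bandwidth_mb": 500, "cores": 12},
    ],
}
```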

Table 2. Parameters of the DQN model in the virtual machine usage stage


First, the convergence and convergence speed of the DQN algorithm during training at this stage are verified. FIG. 2 is simulation experiment data plot (a) of the present invention and shows the change of the reward value during training. It can be seen that as training proceeds, the total reward the agent obtains from the environment increases and begins to converge after about 1300 training episodes, showing that through continual training the model learns a policy that achieves the optimization objective.

Next, the optimization effect of the present invention on the global makespan is compared with the performance of other algorithms. The baseline algorithms are the random algorithm (Random), the round-robin algorithm (RR), and HDDL, an intelligent scheduling algorithm with learning ability. The HDDL algorithm coordinates multiple heterogeneous deep learning models as an intelligent scheduler and learns to explore optimal or near-optimal scheduling policies from historical experience. FIG. 3 is simulation experiment data plot (b) of the present invention; the experimental results are shown in FIG. 3. The results show that as the number of training iterations increases, the makespan curves of DQN and HDDL decrease and converge stably. They also show that both the DQN and HDDL agents can learn optimization policies from historical experience, achieve the system optimization objective and reduce the global makespan, but the makespan of DQN is better than that of HDDL.

Finally, the optimization effect of each algorithm is verified under different numbers of virtual machines, i.e. different amounts of system resources. FIG. 4 is simulation experiment data plot (c) of the present invention; the experimental results are shown in FIG. 4.

In the figure, because of the volatility of the DQN algorithm, the job makespan of the proposed algorithm is taken as the average of the last 100 episodes to make the experimental results more general. It can be clearly observed from FIG. 4 that, under different numbers of virtual machines, the job makespan of the proposed DQN-based selection algorithm is smaller than that of the other baseline algorithms. In addition, as the number of virtual machines increases, the job makespan of every algorithm gradually decreases and the gap narrows. These results show that when competition for cloud resources is heavy, the intelligent scheduler can adjust the task scheduling policy dynamically according to task attributes and the state of system resources, thereby reducing the global job makespan.

In a cloud computing environment, when scheduling jobs, this experiment uses the deep reinforcement learning based algorithm to obtain an optimized scheduling policy for the jobs and submits the jobs to the optimal virtual machines for execution according to this policy, solving the difficulty of user job scheduling caused by the dynamic changes of user job type, job size, virtual machine state and so on. By comprehensively considering quality-of-service factors such as job execution time and job waiting time, the overall completion time of the jobs is effectively reduced.

Obviously, the above embodiments of the present invention are merely examples given to illustrate the technical solution of the present invention clearly, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the claims of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (7)

1. A cloud job scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
receiving user jobs sent by a user;
decoupling the user jobs to obtain a ready job set;
scheduling the ready job set through a job scheduler; the scheduling is to take an action according to a scheduling policy and deploy the jobs in the ready job set onto the corresponding virtual machines; the action is the virtual machine allocation mode of the jobs in the ready job set;
executing the jobs through the virtual machines and returning execution results;
collecting training samples and establishing an experience pool; the training samples store the ready job set state, the virtual machine state, the action and the reward value; the reward value is the reward obtained by taking the action;
judging whether the number of training samples in the experience pool is smaller than a threshold value; if so, receiving user jobs sent by the user again, and otherwise optimizing the job scheduler with the training samples in the experience pool;
scheduling with the optimized job scheduler;
the objective function of the scheduling is:
$$\min_{\pi}\ \sum_{k}\sum_{i} T_i^k$$
wherein $J$ is a user job; $\pi$ is the scheduling policy; $J_i^k$ is the i-th job of the k-th user; $T_i^k$ is the completion time of the i-th job of the k-th user, and
$$T_i^k = t_{i,e}^k + t_{i,tr}^k + t_{i,w}^k$$
wherein $d_{i,in}^k$ is the amount of data that job $J_i^k$ transfers to the virtual machine; $L_k(i)$ is the length of job $J_i^k$; $d_{i,out}^k$ is the amount of data of the execution result returned after job $J_i^k$ is executed; $t_{i,e}^k$ is the execution time of job $J_i^k$; $t_{i,tr}^k$ is the transmission time of job $J_i^k$; $t_{i,w}^k$ is the waiting time of job $J_i^k$; the waiting time is the time during which, after the ready job set has been scheduled by the job scheduler and before the job is executed by the virtual machine and the execution result is returned, the scheduled job waits to be executed in the waiting queue of the virtual machine because the computing capacity of the virtual machine is insufficient;
$$t_{i,e}^k = \frac{L_k(i)\,c\,p}{f_i^k}, \qquad t_{i,tr}^k = t_{i,in}^k + t_{i,out}^k, \qquad t_{i,w}^k = \sum_{J_j \in q} t_{j,e}$$
wherein $f_i^k$ is the MIPS allocated to job $J_i^k$; $c$ is the megabyte-to-byte conversion coefficient; $p$ is the number of CPU cycles the virtual machine needs to complete a job of unit length; $t_{i,in}^k$ is the time for job $J_i^k$ to transfer its data to the virtual machine; $t_{i,out}^k$ is the transmission time of the processing result returned after job $J_i^k$ is executed; $J_j$ is the j-th job; $q$ is the set of all jobs ahead of job $J_i^k$ in the waiting queue; $t_{j,e}$ is the execution time of the j-th job.
2. The cloud job scheduling method based on deep reinforcement learning according to claim 1, wherein
$$t_{i,in}^k = \frac{d_{i,in}^k}{b_i^k}, \qquad t_{i,out}^k = \frac{d_{i,out}^k}{b_i^k},$$
the amount of data transferred by job $J_i^k$ is $d_i^k = d_{i,in}^k + d_{i,out}^k$, and $b_i^k$ is the bandwidth resource the virtual machine allocates to each job.
3. The cloud job scheduling method based on deep reinforcement learning according to claim 2, wherein $b_i^k = b / N_T$, where $b$ is the bandwidth resource of the virtual machine and $N_T$ is the number of jobs transferred to the virtual machine in time slot T.
4. The cloud job scheduling method based on deep reinforcement learning according to claim 3, wherein the training sample is (s_t, α_t, r_t, s_{t+1}); the ready job set state is s_J = {t_1, d_1, t_2, d_2, ..., t_n, d_n}; the virtual machine state is [formula]; actions are stored in an action space A, A = {α_1, α_2, ..., α_n}; the reward value is calculated by a reward function R, [formula]; s_t and s_{t+1} are the states at time step t and time step t+1, respectively; states are stored in a state space S, S = {s_J, s_VM}; α_t is the action selected from the action space A at time step t; r_t is the reward value calculated by the function R at time step t; t_i and d_i in the ready job set state s_J denote, respectively, the execution time of the i-th job in the ready job set and the amount of data it transfers to the virtual machine; n is the number of jobs in the ready job set; [formula] and [formula] in the virtual machine state s_VM denote, respectively, the remaining computing capacity of the x-th virtual machine at the current time step and the number of its jobs waiting to be executed; m is the number of virtual machines; the action α_i in the action space A indicates the virtual machine assignment of the i-th job in the ready job set; the action α_i has m + 1 possible values; [formula] and [formula] are, respectively, the number of executed jobs and the number of jobs waiting to be executed on the x-th virtual machine.
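Claim 4 fixes the learning interface of the scheduler: a state built from the ready-job-set vector s_J = {t_1, d_1, ..., t_n, d_n} and the virtual machine vector s_VM, one action α_i per ready job with m + 1 possible values, and a reward computed by a function R whose formula is an image in this record. The sketch below shows one way such a transition sample (s_t, α_t, r_t, s_{t+1}) could be assembled; the class and field names, and the interpretation of the extra (m+1)-th action value as "do not schedule yet", are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReadyJob:
    exec_time: float   # t_i: execution time of the i-th job in the ready job set
    data_in: float     # d_i: amount of data it transfers to the virtual machine

@dataclass
class VirtualMachine:
    remaining_capacity: float  # remaining computing capacity at the current time step
    waiting_jobs: int          # number of jobs waiting to be executed on this VM

def build_state(ready: List[ReadyJob], vms: List[VirtualMachine]) -> Tuple[float, ...]:
    # Flatten s_J = {t_1, d_1, ..., t_n, d_n} and s_VM into one observation vector,
    # i.e. the state space S = {s_J, s_VM} named in claim 4.
    s_j = [v for job in ready for v in (job.exec_time, job.data_in)]
    s_vm = [v for vm in vms for v in (vm.remaining_capacity, float(vm.waiting_jobs))]
    return tuple(s_j + s_vm)

def action_values(num_vms: int) -> range:
    # Each action α_i has m + 1 possible values: indices 0..m-1 select a virtual
    # machine; index m is assumed here to mean "leave the job in the ready set".
    return range(num_vms + 1)
```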
5. The cloud job scheduling method based on deep reinforcement learning according to claim 4, wherein the objective function of the optimized job scheduler is: [formula]; γ is the discount factor, γ ∈ [0, 1].
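The objective function of claim 5 is likewise an image in this record; with a discount factor γ ∈ [0, 1], a standard discounted cumulative-reward objective for such a scheduler would take the following form. This is a generic reconstruction, not the patent's exact expression.

```latex
% Generic discounted-return objective (assumed form, not the claimed formula):
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right],
\qquad \gamma \in [0,1]
```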
6. The cloud job scheduling method based on deep reinforcement learning according to claim 5, wherein the loss function of the optimized job scheduler is: [formula]; θ_z is the job scheduler parameter at the z-th iteration; s_{t+1} is the s_t of the next time step; D(M) denotes the M samples drawn from the experience pool D at each update; α_{t+1} is the action corresponding to the maximum Q value in state s_{t+1}; [formula] is the parameter of the optimized job scheduler after the z-th iteration.
7. The cloud job scheduling method based on deep reinforcement learning according to claim 6, wherein the gradient of the loss function with respect to the parameter θ is: [formula].
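The loss function of claim 6 and its gradient in claim 7 also appear only as images. Given the quantities named in those claims (the mini-batch D(M) drawn from the experience pool D, the current parameters θ_z, the parameters of the optimized scheduler after the z-th iteration used as a target, and α_{t+1} as the action with the maximum Q value in s_{t+1}), a standard deep Q-network loss and gradient would read as follows; this is an assumed reconstruction, and the patent's exact expressions may differ.

```latex
% Assumed DQN-style loss and its gradient (not the patent's exact formulas):
L(\theta_z) = \frac{1}{M} \sum_{(s_t,\alpha_t,r_t,s_{t+1}) \in D(M)}
  \Bigl( r_t + \gamma \max_{\alpha_{t+1}} Q\bigl(s_{t+1},\alpha_{t+1};\theta_z^{-}\bigr)
        - Q\bigl(s_t,\alpha_t;\theta_z\bigr) \Bigr)^{2}

\nabla_{\theta_z} L(\theta_z) = -\frac{2}{M} \sum_{(s_t,\alpha_t,r_t,s_{t+1}) \in D(M)}
  \Bigl( r_t + \gamma \max_{\alpha_{t+1}} Q\bigl(s_{t+1},\alpha_{t+1};\theta_z^{-}\bigr)
        - Q\bigl(s_t,\alpha_t;\theta_z\bigr) \Bigr)\,
  \nabla_{\theta_z} Q\bigl(s_t,\alpha_t;\theta_z\bigr)
```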
CN202011578884.0A 2020-12-28 2020-12-28 Cloud job scheduling method based on deep reinforcement learning Active CN112698911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011578884.0A CN112698911B (en) 2020-12-28 2020-12-28 Cloud job scheduling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112698911A CN112698911A (en) 2021-04-23
CN112698911B true CN112698911B (en) 2022-05-17

Family

ID=75511311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011578884.0A Active CN112698911B (en) 2020-12-28 2020-12-28 Cloud job scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112698911B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant