CN116321293A

CN116321293A - Edge Computing Offloading and Resource Allocation Method Based on Multi-agent Reinforcement Learning

Info

Publication number: CN116321293A
Application number: CN202211707655.3A
Authority: CN
Inventors: 刘旭; 朱绍恩; 杨龙祥; 朱洪波
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-06-23

Abstract

The invention discloses an edge computing offloading and resource allocation method based on multi-agent reinforcement learning, including: constructing a mobile edge computing offloading and resource allocation model based on complex scenarios of multiple mobile users and multiple edge servers, and setting optimization problems based on system overhead Objective function and constraints; Model the optimization problem as a Markov decision process, and set the state space, action space and reward function in deep reinforcement learning; use the method based on multi-agent deep reinforcement learning to find the best Optimize the unloading strategy and resource allocation strategy, and optimize the objective function, and at the same time use the NoisyNet method to add Gaussian noise to the output layer of the Actor network, improve the network model exploration efficiency, and improve the optimization effect; the method disclosed in the present invention integrates multi-agent DDPG and NoisyNet can effectively reduce the total system overhead, improve the optimization effect, and improve the overall experience of users in the system.

Description

Edge Computing Offloading and Resource Allocation Method Based on Multi-agent Reinforcement Learning

技术领域technical field

本发明涉及无线通信网络、移动边缘计算技术领域，具体为基于多智能体强化学习的边缘计算卸载和资源分配方法。The invention relates to the technical fields of wireless communication network and mobile edge computing, in particular to an edge computing unloading and resource allocation method based on multi-agent reinforcement learning.

背景技术Background technique

近些年来随着物联网和5G的飞速发展，低时延、高速率的万物互联成为可能，一系列新兴交互式应用应运而生，例如虚拟现实、增强显示、人脸识别、智能服务、自动驾驶等。由于物联网设备或者移动智能设备计算的局限性，一些计算任务不得不卸载到具有足够计算能力的云服务器，这就促使了移动云计算的发展；然而云计算的服务器是集中式的，其计算资源和带宽有限，在面对大量网络接入设备时，极易导致数据处理不及时，难以满足所有用户设备的需求，甚至容易造成系统故障，且距离用户较远，导致传播时延较高，难以满足对时延要求较高的任务的有效处理。In recent years, with the rapid development of the Internet of Things and 5G, low-latency, high-speed Internet of Everything has become possible, and a series of emerging interactive applications have emerged, such as virtual reality, augmented display, face recognition, intelligent services, and autonomous driving. wait. Due to the limitations of the computing of IoT devices or mobile smart devices, some computing tasks have to be offloaded to cloud servers with sufficient computing power, which promotes the development of mobile cloud computing; however, the server of cloud computing is centralized, and its computing Resources and bandwidth are limited. When faced with a large number of network access devices, it is easy to cause data processing to be untimely, difficult to meet the needs of all user equipment, and even easy to cause system failure, and the distance from users is long, resulting in high propagation delay. It is difficult to meet the effective processing of tasks with high latency requirements.

移动边缘计算(MEC)的提出可以有效弥补这些缺点，在MEC系统中，MEC服务器可以提供比本地设备更加强大的计算能力，虽然还不如云服务器，但是它更接近设备，同时MEC服务器的分布式结构使核心网络的流量不会出现拥塞；移动边缘计算是通过收获大量分布在网络边缘上的闲置计算能力和存储空间，将这些资源用于移动设备上，来处理移动设备所产生的对时延敏感或计算较为复杂的任务。在MEC的相关问题中，计算卸载的决策和资源的分配是决定MEC是否可以发挥出好的效果的关键技术，因此计算卸载和资源分配研究是提高MEC性能的迫切要求，具有非常重要的研究意义。The proposal of mobile edge computing (MEC) can effectively make up for these shortcomings. In the MEC system, the MEC server can provide more powerful computing capabilities than the local device. Although it is not as good as the cloud server, it is closer to the device. At the same time, the distributed MEC server The structure prevents the traffic of the core network from being congested; mobile edge computing handles the time delay generated by mobile devices by harvesting a large amount of idle computing power and storage space distributed on the edge of the network and using these resources on mobile devices. Sensitive or computationally complex tasks. In the related issues of MEC, the decision-making of computing offloading and the allocation of resources are the key technologies to determine whether MEC can play a good role. Therefore, the research on computing offloading and resource allocation is an urgent requirement to improve the performance of MEC, which has very important research significance. .

目前的研究工作中，边缘计算卸载场景主要分为单用户单边缘服务器、多用户单边缘服务器、多用户多边缘服务器三种；目前针对移动边缘计算卸载和资源分配方法主要以最小化能耗或最小化延迟或最小化两者的加权和为目标，以用户终端的能量、计算资源、边缘服务器的计算资源、任务最大允许时延等为约束条件建立优化问题并求解，从而获得最优策略；然而这种优化问题通常是NP-hard问题，尤其当网络规模较大时，即使通过例如遗传算法、粒子群算法等启发式算法仍然需要较长时间开销来获取最优策略；此外，网络的动态变化需要中心节点不断去求解复杂的优化问题，且难以自适应地跟踪网络动态变化。In the current research work, edge computing offloading scenarios are mainly divided into three types: single-user single-edge server, multi-user single-edge server, and multi-user multi-edge server; The goal is to minimize the delay or minimize the weighted sum of the two, and set up an optimization problem and solve it with constraints such as the energy of the user terminal, computing resources, computing resources of the edge server, and the maximum allowable delay of the task, so as to obtain the optimal strategy; However, this kind of optimization problem is usually NP-hard, especially when the network scale is large, even through heuristic algorithms such as genetic algorithm and particle swarm algorithm, it still takes a long time to obtain the optimal strategy; in addition, the dynamics of the network Changes require the central node to continuously solve complex optimization problems, and it is difficult to adaptively track network dynamic changes.

近年来，随着人工智能技术的迅速发展，强化学习算法受到广泛关注；智能体和环境不断交互获得奖励指导行为，从而使智能体随着时间的推进做出较好的动作决策，获得较大的奖励，即近似最优；由于强化学习是对动作进行评价并根据反馈修正动作选择，所以不需要依赖先验知识，能够自适应地跟踪环境变化，适合解决较为复杂的场景下的决策优化问题；因此可以借助强化学习算法进行边缘计算卸载和资源分配决策优化，实现系统的开销最小，改善用户体验，本发明在传统的强化学习算法DDPG上进行改进，以适应更加复杂的场景。In recent years, with the rapid development of artificial intelligence technology, reinforcement learning algorithms have received widespread attention; the agent and the environment continuously interact to obtain rewards to guide behaviors, so that the agent can make better action decisions over time and obtain greater The reward is approximately optimal; since reinforcement learning evaluates actions and corrects action choices based on feedback, it does not need to rely on prior knowledge and can adaptively track environmental changes, which is suitable for solving decision-making optimization problems in more complex scenarios ; Therefore, edge computing offloading and resource allocation decision optimization can be performed with the help of reinforcement learning algorithms to minimize system overhead and improve user experience. The present invention improves on the traditional reinforcement learning algorithm DDPG to adapt to more complex scenarios.

发明内容Contents of the invention

本部分的目的在于概述本发明的实施例的一些方面以及简要介绍一些较佳实施例。在本部分以及本申请的说明书摘要和发明名称中可能会做些简化或省略以避免使本部分、说明书摘要和发明名称的目的模糊，而这种简化或省略不能用于限制本发明的范围。The purpose of this section is to outline some aspects of embodiments of the invention and briefly describe some preferred embodiments. Some simplifications or omissions may be made in this section, as well as in the abstract and titles of this application, to avoid obscuring the purpose of this section, abstract and titles, and such simplifications or omissions should not be used to limit the scope of the invention.

鉴于上述存在的问题，提出了本发明。In view of the above problems, the present invention has been proposed.

因此，本发明解决的技术问题是：在移动边缘计算卸载和资源分配场景中传统的强化学习方法开销大以及复杂场景下模型难以快速收敛的问题。Therefore, the technical problem solved by the present invention is: the traditional reinforcement learning method is expensive in mobile edge computing offloading and resource allocation scenarios, and the model is difficult to quickly converge in complex scenarios.

为解决上述技术问题，本发明提供如下技术方案：基于多智能体强化学习的边缘计算卸载和资源分配方法，包括：根据多移动用户、多边缘服务器的复杂场景构建移动边缘计算卸载和资源分配模型，并设置所述模型的参数；基于所述模型计算的移动边缘计算卸载和资源分配的系统开销，设定优化问题的目标函数和约束条件；将所述优化问题建模成马尔科夫决策过程，并设置深度强化学习中的状态空间、动作空间和奖励函数；采用基于多智能体深度强化学习的方法为各移动用户寻找最优卸载策略和资源分配策略，并对所述目标函数进行优化，同时采用NoisyNet方法将高斯噪声添加到Actor网络的输出层，提高网络模型探索效率，提升优化效果。In order to solve the above technical problems, the present invention provides the following technical solution: a method for edge computing offloading and resource allocation based on multi-agent reinforcement learning, including: constructing a mobile edge computing offloading and resource allocation model according to complex scenarios with multiple mobile users and multiple edge servers , and set the parameters of the model; based on the system overhead of mobile edge computing offloading and resource allocation calculated by the model, set the objective function and constraint conditions of the optimization problem; model the optimization problem as a Markov decision process , and set the state space, action space and reward function in deep reinforcement learning; use the method based on multi-agent deep reinforcement learning to find the optimal unloading strategy and resource allocation strategy for each mobile user, and optimize the objective function, At the same time, the NoisyNet method is used to add Gaussian noise to the output layer of the Actor network to improve the exploration efficiency of the network model and improve the optimization effect.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述移动边缘计算卸载和资源分配模型包括系统模型、任务模型、移动性模型和计算模型；As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the mobile edge computing offloading and resource allocation model includes a system model, a task model, a mobility model, and a computing Model;

所述系统模型包括M个边缘服务器和N个移动用户设备，所述边缘服务器部署在无线接入点旁，每个无线接入点独立覆盖一片小区，所述移动用户设备可通过该小区的无线接入点向该小区的边缘服务器卸载计算任务，请求计算资源，所述无线接入点之间通过基站连接和传输数据；The system model includes M edge servers and N mobile user equipments, the edge servers are deployed next to wireless access points, each wireless access point independently covers a cell, and the mobile user equipment can pass through the cell's wireless The access point offloads computing tasks to the edge server of the cell, requests computing resources, and connects and transmits data between the wireless access points through the base station;

所述任务模型包括每个移动用户设备在每个时刻随机生成一个计算任务，生成的计算任务属性用一个三元组A_n表示，即

其中，L_n表示任务的数据量，X_n表示计算任务所需的CPU循环数，/>

表示完成任务所需的最大允许时延；The task model includes that each mobile user equipment randomly generates a computing task at each moment, and the attributes of the generated computing task are represented by a triplet A _n , namely

Among them, L _n represents the amount of data of the task, X _n represents the number of CPU cycles required to calculate the task, />

Indicates the maximum allowable delay required to complete the task;

所述移动性模型包括采用离散随机跳跃对用户移动性进行建模，用平均驻留时间表示跳跃之间的强度；The mobility model includes modeling user mobility using discrete random hops, and using an average dwell time to represent the strength between hops;

所述平均驻留时间的概率密度函数

的计算包括，The probability density function of the average dwell time

The calculation includes,

其中，β_n表示移动用户n的平均驻留时间，

表示用户实际驻留时间。Among them, β _n represents the average residence time of mobile user n,

Indicates the actual dwell time of the user.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：还包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, it also includes,

所述计算模型包括移动用户设备在不同卸载决策和资源分配策略下的总开销，所述总开销包括时延、能耗和资源成本；The calculation model includes the total overhead of the mobile user equipment under different offloading decisions and resource allocation strategies, and the total overhead includes delay, energy consumption and resource cost;

所述移动用户设备总开销Q_n的计算包括，The calculation of the mobile user equipment total cost Q _n includes,

其中，

表示没有发生任务迁移时移动用户设备的总开销，/>

表示发生任务迁移时移动用户设备的总开销，ω₁表示时延系数，ω₂表示能耗系数，ω₃表示资源成本系数，/>

表示本地计算和边缘计算的最大时延，/>

表示本地计算和边缘计算加上额外迁移时延的最大时延，E_n表示本地计算和边缘计算的能耗，f_m→n表示边缘服务器m给用户n分配的计算资源，P_βn表示事件发生的概率，T_mn表示边缘处理任务的总时延；in,

Indicates the total overhead of moving user equipment when no task migration occurs, />

Indicates the total overhead of mobile user equipment when task migration occurs, ω ₁ represents the delay coefficient, ω ₂ represents the energy consumption coefficient, ω ₃ represents the resource cost coefficient, />

Indicates the maximum latency of local computing and edge computing, />

Indicates the maximum delay of local computing and edge computing plus additional migration delay, E _n represents the energy consumption of local computing and edge computing, f _m→n represents the computing resources allocated by edge server m to user n, P _βn represents the occurrence of events The probability of , T _mn represents the total delay of edge processing tasks;

利用开销期望来衡量系统的性能，所述移动用户设备总开销的期望的计算包括，The performance of the system is measured by using the overhead expectation, and the calculation of the expected total overhead of the mobile user equipment includes,

其中，

表示时延、能耗以及计算资源的平均开销。in,

Indicates the average overhead of latency, energy consumption, and computing resources.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述优化问题的目标函数的计算包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the calculation of the objective function of the optimization problem includes,

其中，γ_n表示任务卸载比例，I(x_n)表示指示函数，x_n表示用户的初始位置索引。Among them, γ _n represents the proportion of task unloading, I(x _n ) represents the indicator function, and x _n represents the user's initial location index.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述优化问题的约束条件的设定包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the setting of the constraints of the optimization problem includes,

其中，U_m表示能接入边缘服务器一侧的无线接入点的用户设备集合，

表示每个边缘服务器的计算资源总量，T_n表示任务处理的总时延，/>

表示任务的最大允许时延，C1表示用户初始位置范围的约束，C2表示任务卸载比例的约束，C3表示保证边缘服务器分配给移动用户的计算资源是非负的，C4表示保证分配给每个任务的计算资源总和不会超过边缘服务器的全部计算资源，C5表示规定任务的最大允许时延。Wherein, U _m represents the user equipment set that can access the wireless access point on the side of the edge server,

Indicates the total amount of computing resources of each edge server, T _n indicates the total delay of task processing, />

Indicates the maximum allowable delay of the task, C1 indicates the constraint of the user's initial location range, C2 indicates the constraint of the task unloading ratio, C3 indicates that the computing resources allocated by the edge server to the mobile user are non-negative, and C4 indicates that the allocation of each task is guaranteed The sum of computing resources will not exceed all computing resources of the edge server, and C5 indicates the maximum allowable delay of the specified task.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述状态空间的设置包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the setting of the state space includes,

所述移动用户设备n的状态空间s_n表示为，The state space s _n of the mobile user equipment n is expressed as,

其中，A_n表示任务信息，h_nm表示信道增益，

表示用户移动性属性，

表示边缘服务器的剩余资源。Among them, A _n represents the task information, h _nm represents the channel gain,

represents the user mobility attribute,

Indicates the remaining resources of the edge server.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述动作空间的设置包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the setting of the action space includes,

所述移动用户设备n的动作空间a_n表示为，The action space a _n of the mobile user equipment n is expressed as,

a_n＝{γ_n，f_m→n}a _n ＝{γ _n ，f _m→n }

其中，γ_n表示要优化的卸载策略，f_m→n表示要优化的资源分配策略。Among them, γ _n represents the unloading strategy to be optimized, and f _m→n represents the resource allocation strategy to be optimized.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述奖励函数的设置包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the setting of the reward function includes,

所述移动用户设备n的奖励函数r_n表示为，The reward function r _n of the mobile user equipment n is expressed as,

其中，

表示用户完全在本地处理任务的开销，P_n表示任务超时惩罚因子。in,

Represents the overhead of the user completely processing the task locally, and P _n represents the task timeout penalty factor.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：采用基于多智能体深度强化学习的移动边缘计算卸载和资源分配方法对优化问题进行求解的步骤包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, the optimization problem is solved by using the mobile edge computing offloading and resource allocation method based on multi-agent deep reinforcement learning The steps include,

将每个移动用户设备看成一个智能体，每个智能体包括Actor网络和Critic网络，设置Actor网络的学习率为0.0004，Critic网络的学习率为0.004，网络采用软更新的方式进行每步更新，软更新系数τ设置为0.01；Consider each mobile user device as an agent, each agent includes Actor network and Critic network, set the learning rate of Actor network to 0.0004, and the learning rate of Critic network to 0.004, and the network adopts soft update method for each step update , the soft update coefficient τ is set to 0.01;

采用确定性策略梯度法获得每个移动用户的计算卸载策略和资源分配策略，利用中心化训练分布式执行的思想，每个移动用户的Critic网络收集所有移动用户的状态观测信息和动作信息进行训练，Actor网络分别为每个移动用户生成动作策略，以适应复杂的动态环境；Using the deterministic policy gradient method to obtain each mobile user's computing offloading strategy and resource allocation strategy, using the idea of centralized training and distributed execution, each mobile user's critic network collects the state observation information and action information of all mobile users for training , the Actor network generates action strategies for each mobile user separately to adapt to complex dynamic environments;

采用NoisyNet的方法给Actor网络的最后一层即输出层引入高斯噪声变量ε，使所述输出层变为噪声层，通过网络进行学习生成权重和偏差，由此提高模型探索效率；The NoisyNet method is used to introduce a Gaussian noise variable ε into the last layer of the Actor network, that is, the output layer, so that the output layer becomes a noise layer, and the network is used to learn to generate weights and deviations, thereby improving the efficiency of model exploration;

所述权重和偏差的生成的计算包括，The calculation of the generation of the weights and biases includes,

ω＝μ^ω+σ^ω⊙ε^ω ω＝μ ^ω +σ ^ω ⊙ε ^ω

b＝μ^b+σ^b⊙ε^b b＝μ ^b +σ ^b ⊙ε ^b

其中，ω表示噪声层生成的权重，b表示噪声层生成的偏差，μ^ω、σ^ω、μ^b、σ^b表示噪声层需要学习的参数，ε^ω、ε^b表示高斯噪声变量；Among them, ω represents the weight generated by the noise layer, b represents the deviation generated by the noise layer, μ ^ω , σ ^ω , μ ^b , σ ^b represent the parameters that the noise layer needs to learn, ε ^ω , ε ^b represent Gaussian noise variables;

引入存储所有智能体状态、动作、奖励和下一状态的经验池，所述经验池大小设置为64，当需要进行网络训练时，从所述经验池中随机抽取小批量训练数据进行训练，由此来减少样本间的依赖性和相关性；Introduce an experience pool that stores all agent states, actions, rewards, and next states. The size of the experience pool is set to 64. When network training is required, a small batch of training data is randomly selected from the experience pool for training. This reduces the dependence and correlation between samples;

利用梯度下降法和梯度上升法更新Critic网络和Actor网络参数，并进行迭代优化，使所述Critic网络的输出不断接近真实开销值，所述Actor网络输出不断优化的卸载和资源分配策略，从而降低系统开销。Utilize the gradient descent method and the gradient ascent method to update the Critic network and Actor network parameters, and perform iterative optimization, so that the output of the Critic network is constantly approaching the real cost value, and the Actor network outputs continuously optimized unloading and resource allocation strategies, thereby reducing System overhead.

作为本发明所述的基于多智能体强化学习的边缘计算卸载和资源分配方法的一种优选方案，其中：所述迭代优化的计算包括，As a preferred solution of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning in the present invention, wherein: the calculation of the iterative optimization includes,

其中，

表示Critic在线网络的损失函数，/>

表示Actor在线网络的策略梯度，I表示样本数量，/>

表示第n个智能体在线网络的评价值，/>

表示第n个智能体目标网络输出的评价值，/>

表示N个智能体的状态集合，/>

表示N个智能体的下一状态集合，/>

表示第n个智能体的动作，/>

表示第n个智能体下一状态输出的动作，/>

表示Critic在线网络的参数，/>

表示Critic目标网络的参数，/>

表示智能体开销相关的奖励，γ表示折扣因子，/>

表示Actor在线网络的输出的动作，/>

表示Actor在线网络中输出层的噪声，/>

表示Actor目标网络的参数，/>

表示Actor在线网络的参数。in,

Represents the loss function of the Critic online network, />

Indicates the strategy gradient of Actor's online network, I indicates the number of samples, />

Indicates the evaluation value of the nth agent online network, />

Indicates the evaluation value output by the nth agent target network, />

Represents the state set of N agents, />

Indicates the next state set of N agents, />

Indicates the action of the nth agent, />

Indicates the action of the next state output of the nth agent, />

Indicates the parameters of the Critic online network, />

Indicates the parameters of the Critic target network, />

Represents the agent's cost-related reward, γ represents the discount factor, />

Indicates the action of the output of the Actor's online network, />

Indicates the noise of the output layer in the Actor online network, />

Indicates the parameters of the Actor's target network, />

Represents the parameters of the Actor's online network.

本发明的有益效果：本发明提出了一种基于多智能体深度强化学习的计算卸载和资源分配方法，该方法采用集中式训练，分布式执行的技巧，将每个用户看成一个智能体，每个智能体可以利用彼此的状态观测信息训练Critic和Actor网络并分别做出决策，形成一种智能体之间的合作关系，相较传统的强化学习方法更适用于动态复杂环境，可以根据用户的移动性为用户灵活制定卸载决策和资源分配决策，降低系统的总开销；并且本发明采用NoisyNet在Actor网络的输出层引入参数化噪声，进一步提高了复杂场景中探索计算卸载和资源分配策略的能力和效率；此外，在同样场景下相较于DDPG，本发明公开的Noise-MADDPG方法在训练稳定性、收敛速度、优化效果均有所提高，具有很高的使用价值和推广价值。Beneficial effects of the present invention: the present invention proposes a computing offloading and resource allocation method based on multi-agent deep reinforcement learning, which adopts centralized training and distributed execution techniques, and regards each user as an agent, Each agent can use each other's state observation information to train the Critic and Actor networks and make decisions separately, forming a cooperative relationship between agents. Compared with traditional reinforcement learning methods, it is more suitable for dynamic and complex environments. The mobility allows users to flexibly make unloading decisions and resource allocation decisions, reducing the overall overhead of the system; and the present invention uses NoisyNet to introduce parameterized noise in the output layer of the Actor network, which further improves the ability to explore computing offloading and resource allocation strategies in complex scenarios. Ability and efficiency; In addition, compared with DDPG in the same scene, the Noise-MADDPG method disclosed in the present invention has improved training stability, convergence speed, and optimization effect, and has high use value and promotion value.

附图说明Description of drawings

为了更清楚地说明本发明实施例的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其它的附图。其中：In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following will briefly introduce the accompanying drawings that need to be used in the description of the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention. For Those of ordinary skill in the art can also obtain other drawings based on these drawings without any creative effort. in:

图1为本发明提供的基于多智能体强化学习的边缘计算卸载和资源分配方法的整体流程图；Fig. 1 is the overall flowchart of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning provided by the present invention;

图2为本发明提供的基于多智能体强化学习的边缘计算卸载和资源分配方法的移动边缘计算场景图；Fig. 2 is a mobile edge computing scene diagram of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning provided by the present invention;

图3为本发明提供的基于多智能体强化学习的边缘计算卸载和资源分配方法的基于多智能体深度强化学习框架图；Fig. 3 is a frame diagram based on multi-agent deep reinforcement learning of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning provided by the present invention;

图4为本发明提供的基于多智能体强化学习的边缘计算卸载和资源分配方法的训练轮数和平均奖励的关系图；Fig. 4 is a relationship diagram of the number of training rounds and the average reward of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning provided by the present invention;

图5为本发明提供的基于多智能体强化学习的边缘计算卸载和资源分配方法的系统总开销和用户数的关系图。Fig. 5 is a relationship diagram of total system overhead and number of users of the edge computing offloading and resource allocation method based on multi-agent reinforcement learning provided by the present invention.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面结合说明书附图对本发明的具体实施方式做详细的说明，显然所描述的实施例是本发明的一部分实施例，而不是全部实施例。基于本发明中的实施例，本领域普通人员在没有做出创造性劳动前提下所获得的所有其他实施例，都应当属于本发明的保护的范围。In order to make the above-mentioned purposes, features and advantages of the present invention more obvious and easy to understand, the specific implementation modes of the present invention will be described in detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present invention, not all of them. Example. Based on the embodiments of the present invention, all other embodiments obtained by ordinary persons in the art without creative efforts shall fall within the protection scope of the present invention.

在下面的描述中阐述了很多具体细节以便于充分理解本发明，但是本发明还可以采用其他不同于在此描述的其它方式来实施，本领域技术人员可以在不违背本发明内涵的情况下做类似推广，因此本发明不受下面公开的具体实施例的限制。In the following description, a lot of specific details are set forth in order to fully understand the present invention, but the present invention can also be implemented in other ways different from those described here, and those skilled in the art can do it without departing from the meaning of the present invention. By analogy, the present invention is therefore not limited to the specific examples disclosed below.

其次，此处所称的“一个实施例”或“实施例”是指可包含于本发明至少一个实现方式中的特定特征、结构或特性。在本说明书中不同地方出现的“在一个实施例中”并非均指同一个实施例，也不是单独的或选择性的与其他实施例互相排斥的实施例。Second, "one embodiment" or "an embodiment" referred to herein refers to a specific feature, structure or characteristic that may be included in at least one implementation of the present invention. "In one embodiment" appearing in different places in this specification does not all refer to the same embodiment, nor is it a separate or selective embodiment that is mutually exclusive with other embodiments.

本发明结合示意图进行详细描述，在详述本发明实施例时，为便于说明，表示器件结构的剖面图会不依一般比例作局部放大，而且所述示意图只是示例，其在此不应限制本发明保护的范围。此外，在实际制作中应包含长度、宽度及深度的三维空间尺寸。The present invention is described in detail in conjunction with schematic diagrams. When describing the embodiments of the present invention in detail, for the convenience of explanation, the cross-sectional view showing the device structure will not be partially enlarged according to the general scale, and the schematic diagram is only an example, which should not limit the present invention. scope of protection. In addition, the three-dimensional space dimensions of length, width and depth should be included in actual production.

同时在本发明的描述中，需要说明的是，术语中的“上、下、内和外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，术语“第一、第二或第三”仅用于描述目的，而不能理解为指示或暗示相对重要性。At the same time, in the description of the present invention, it should be noted that the orientation or positional relationship indicated by "upper, lower, inner and outer" in the terms is based on the orientation or positional relationship shown in the accompanying drawings, and is only for the convenience of describing the present invention. The invention and the simplified description do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operate in a specific orientation, and thus should not be construed as limiting the present invention. In addition, the terms "first, second or third" are used for descriptive purposes only, and should not be construed as indicating or implying relative importance.

本发明中除非另有明确的规定和限定，术语“安装、相连、连接”应做广义理解，例如：可以是固定连接、可拆卸连接或一体式连接；同样可以是机械连接、电连接或直接连接，也可以通过中间媒介间接相连，也可以是两个元件内部的连通。对于本领域的普通技术人员而言，可以具体情况理解上述术语在本发明中的具体含义。Unless otherwise specified and limited in the present invention, the term "installation, connection, connection" should be understood in a broad sense, for example: it can be a fixed connection, a detachable connection or an integrated connection; it can also be a mechanical connection, an electrical connection or a direct connection. A connection can also be an indirect connection through an intermediary, or it can be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention in specific situations.

实施例1Example 1

参照图1～3为本发明的一个实施例，提供了基于多智能体强化学习的边缘计算卸载和资源分配方法，包括：Referring to Figures 1 to 3, an embodiment of the present invention provides an edge computing offloading and resource allocation method based on multi-agent reinforcement learning, including:

S1：根据多移动用户、多边缘服务器的复杂场景构建移动边缘计算卸载和资源分配模型，并设置模型的参数。需要说明的是：S1: Construct a mobile edge computing offloading and resource allocation model according to the complex scenario of multiple mobile users and multiple edge servers, and set the parameters of the model. It should be noted:

移动边缘计算卸载和资源分配模型包括系统模型、任务模型、移动性模型和计算模型；Mobile edge computing offloading and resource allocation models include system model, task model, mobility model and computing model;

具体的，系统模型包括，如图2所示，系统采用基站和无线接入点共同覆盖的方式，系统包括M个边缘服务器和N个移动用户设备，定义M＝{1,2,...,M},N＝{1,2,...,N}作为所有边缘服务器和移动用户设备的集合，边缘服务器部署在无线接入点旁，每个无线接入点独立覆盖一片小区，移动用户设备可通过该小区的无线接入点向该小区的边缘服务器卸载计算任务，请求计算资源，无线接入点之间通过基站连接和传输数据；Specifically, the system model includes, as shown in Figure 2, the system adopts the common coverage mode of the base station and the wireless access point, the system includes M edge servers and N mobile user equipment, and M={1,2,... ,M},N={1,2,...,N} is a collection of all edge servers and mobile user equipment, the edge server is deployed next to the wireless access point, each wireless access point independently covers a cell, and the mobile The user equipment can offload computing tasks to the edge server of the cell through the wireless access point of the cell, request computing resources, and the wireless access points connect and transmit data through the base station;

具体的，任务模型包括每个移动用户设备在每个时刻随机生成一个计算任务，生成的计算任务属性用一个三元组A_n表示，即

表示完成任务所需的最大允许时延；Specifically, the task model includes that each mobile user equipment randomly generates a computing task at each moment, and the attributes of the generated computing task are represented by a triplet A _n , namely

Indicates the maximum allowable delay required to complete the task;

具体的，移动性模型包括采用离散随机跳跃对用户移动性进行建模，用平均驻留时间表示跳跃之间的强度；Specifically, the mobility model includes modeling user mobility using discrete random hops, and using the average dwell time to represent the strength between hops;

应说明的，驻留时间表示移动用户在指定无线接入点提供服务的区域停留的时间，是规划网络资源和提高服务质量的重要性能指标，一旦移动用户离开当前无线接入点提供服务的区域，如果任务还未处理完成，就会由于小区切换而导致任务迁移；It should be noted that the dwell time indicates the time that the mobile user stays in the area provided by the designated wireless access point. It is an important performance indicator for planning network resources and improving service quality. Once the mobile user leaves the area served by the current wireless access point , if the task has not been processed yet, the task will be migrated due to cell switching;

应说明的，平均驻留时间的概率密度函数

的计算包括，It should be stated that the probability density function of the average dwell time

The calculation includes,

其中，β_n表示移动用户n的平均驻留时间，

表示用户实际驻留时间；Among them, β _n represents the average residence time of mobile user n,

Indicates the actual residence time of the user;

具体的，计算模型包括本地计算模型和边缘计算模型；Specifically, the computing model includes a local computing model and an edge computing model;

①本地计算模型① Local computing model

设f_n表示用户设备n的本地计算能力，即本地CPU每秒转速，单位为CPU cycls/s，任务是可分割的，用户将任务的一部分卸载到选择的边缘服务器执行，卸载比例为γ_n，那么对于任务A_n，其本地计算时延

可表示为，Let f _n represent the local computing power of user device n, that is, the local CPU speed per second, the unit is CPU cycls/s, the task is divisible, the user offloads part of the task to the selected edge server for execution, and the unloading ratio is γ _n , then for task A _n , its local computation delay

can be expressed as,

具体的，任务在本地计算的能耗采用一个被广泛认可的计算每CPU转数能耗模型E＝κf²来计算，其中κ表示一种能耗转化系数，和硬件结构有关，f表示计算能力，那么任务A_n的本地能耗可表示为，Specifically, the energy consumption of tasks calculated locally is calculated using a widely recognized calculation energy consumption model per CPU revolution E=κf ² , where κ represents a conversion coefficient of energy consumption, which is related to the hardware structure, and f represents computing power , then the local energy consumption of task A _n can be expressed as,

因此，本地计算的总开销可以表示为，Therefore, the total overhead of local computation can be expressed as,

其中，ω₁表示时延的权重系数，ω₂表示能耗的权重系数，如果ω₁>ω₂，表示任务对时延更敏感；如果ω₁<ω₂，表示任务自身能源紧缺，在实际应用中可以根据实际任务需求确定合适的权重系数；Among them, ω ₁ represents the weight coefficient of delay, and ω ₂ represents the weight coefficient of energy consumption. If ω ₁ >ω ₂ , it means that the task is more sensitive to delay; if ω ₁ <ω ₂ , it means that the task itself is short of energy. In the application, the appropriate weight coefficient can be determined according to the actual task requirements;

②边缘计算模型②Edge Computing Model

当部分任务被上传到部署在无线接入点一侧的边缘服务器上进行处理时，其处理时延主要由三部分组成：1)任务从用户上传到边缘服务器的上行链路传输时延

2)边缘服务器的任务计算时延/>

3)任务结果从边缘服务器回传到用户的下行传输时延

假设任务处理结果比输入数据小的多，下行速率比上行速率大的多，那么边缘服务器上的任务结果回传到用户设备的时延可以忽略；When some tasks are uploaded to the edge server deployed on the side of the wireless access point for processing, the processing delay is mainly composed of three parts: 1) The uplink transmission delay of tasks uploaded from the user to the edge server

2) The task calculation delay of the edge server/>

3) The downlink transmission delay of the task result from the edge server to the user

Assuming that the task processing result is much smaller than the input data, and the downlink rate is much larger than the uplink rate, then the delay of the task result on the edge server back to the user equipment can be ignored;

每个边缘服务器可同时服务多个移动用户，移动用户和边缘服务器一侧的无线接入点建立传输连接时，该用户与无线接入点之间的上行链路信干噪比(SINR)必须满足约束SINR_nm≥SINR^th，其中SINR^th是保证传输质量的门限，因此能接入边缘服务器一侧的无线接入点的用户设备集合可以表示为U_m＝{n∈N|SINR_nm≥SINR^th}；Each edge server can serve multiple mobile users at the same time. When a mobile user establishes a transmission connection with the wireless access point on the side of the edge server, the uplink signal-to-interference-noise ratio (SINR) between the user and the wireless access point must be Satisfy the constraint SINR _nm ≥ ^{SINR th} , where SINR ^th is the threshold to ensure the transmission quality, so the set of user equipment that can access the wireless access point on the side of the edge server can be expressed as U _m ={n∈N|SINR _nm ≥SINR ^th };

假设上行链路采用OFDMA的接入方式，同一个无线接入点下的信道之间相互正交，用户设备间不会造成干扰，所有用户设备平均分配信道带宽，不同无线接入点共享信道资源，用户设备以最大功率传输任务数据，用户设备到边缘服务器一侧的无线接入点的信道增益为h_nm，和传输距离、小尺度衰落等因素有关，那么SINR_nm可以表示为，Assuming that the uplink adopts the OFDMA access method, the channels under the same wireless access point are orthogonal to each other, no interference will be caused between user equipment, all user equipment are equally allocated channel bandwidth, and different wireless access points share channel resources , the user equipment transmits task data at the maximum power, and the channel gain from the user equipment to the wireless access point on the side of the edge server is h _nm , which is related to factors such as transmission distance and small-scale fading, then SINR _nm can be expressed as,

其中，σ²表示高斯白噪声方差，I_n表示用户设备n和无线接入点之间的干扰，包括码间干扰、信道间干扰等；Among them, σ ² represents the variance of Gaussian white noise, _{and In} represents the interference between user equipment n and the wireless access point, including intersymbol interference, interchannel interference, etc.;

传输速率可以表示为，The transfer rate can be expressed as,

其中，B表示子信道带宽；Wherein, B represents sub-channel bandwidth;

因此，任务从设备到边缘服务器的传输时延可以表示为，Therefore, the transmission delay of the task from the device to the edge server can be expressed as,

假设每个边缘服务器的计算资源总量为F^max，卸载到边缘服务器进行处理的用户分配的计算资源表示为

因此，分配给每个任务的计算资源应满足以下约束：Assuming that the total amount of computing resources of each edge server is F ^max , the computing resources allocated by users offloaded to the edge server for processing are expressed as

Therefore, the computing resources allocated to each task should satisfy the following constraints:

所以边缘服务器的计算时延可以表示为，Therefore, the calculation delay of the edge server can be expressed as,

边缘处理任务的总时延为，The total latency of edge processing tasks is,

任务传输产生的能耗可以表示为：The energy consumption generated by task transmission can be expressed as:

最后，将任务A_n部分卸载到无线接入点一侧的边缘服务器进行处理的总开销表示为，Finally, the total overhead of offloading part of task A _n to the edge server on the side of the wireless access point for processing is expressed as,

其中，除了时延和能耗，开销还包括占用边缘服务器的资源成本ω₃f_m→n，ω₃表示为资源成本系数，范围为ω₃∈[0,1]。Among them, in addition to delay and energy consumption, overhead also includes the resource cost of occupying the edge server ω ₃ f _m→n , ω ₃ is expressed as the resource cost coefficient, and the range is ω ₃ ∈ [0,1].

进一步的，用户的移动性涉及任务的迁移，导致时延增大，增加了系统总开销，本发明通过公开的方法优化卸载策略和资源分配策略，降低任务迁移时延，从而降低系统总开销，具体可分为两种情况：Furthermore, user mobility involves task migration, resulting in increased delay and increased system overhead. The present invention optimizes the unloading strategy and resource allocation strategy through the disclosed method to reduce the task migration delay, thereby reducing the overall system overhead. Specifically, it can be divided into two situations:

①在移动用户设备离开原无线接入点覆盖范围前，如果卸载任务可以在原边缘服务器端执行完毕，则不会发生任务迁移，此时的用户设备总开销的计算包括，① Before the mobile user equipment leaves the coverage area of the original wireless access point, if the offloading task can be completed on the original edge server, the task migration will not occur. The calculation of the total cost of the user equipment at this time includes:

②在移动用户设备离开原无线接入点覆盖范围前，如果卸载任务不可以在原边缘服务器端执行完毕，则会发生任务迁移，原边缘服务器上的执行结果将会通过基站迁移到另一个边缘服务器所在的无线接入点处，该无线接入点会将执行结果传回给移动用户设备，此时会产生额外迁移时延

假设迁移时延只与数据量多少有关，此时用户设备总开销的计算包括，②Before the mobile user equipment leaves the coverage area of the original wireless access point, if the offloading task cannot be completed on the original edge server, task migration will occur, and the execution results on the original edge server will be migrated to another edge server through the base station At the wireless access point where the wireless access point is located, the wireless access point will send the execution result back to the mobile user equipment, and additional migration delay will be generated at this time

Assuming that the migration delay is only related to the amount of data, the calculation of the total cost of the user equipment includes,

更进一步的，移动用户设备总开销Q_n的计算包括，Furthermore, the calculation of the total cost Q _n of the mobile user equipment includes,

其中，

表示没有发生任务迁移时移动用户设备的总开销，/>

表示本地计算和边缘计算的最大时延，/>

表示本地计算和边缘计算加上额外迁移时延的最大时延，E_n表示本地计算和边缘计算的能耗，f_m→n表示边缘服务器m给用户n分配的计算资源，/>

表示事件发生的概率，T_mn表示边缘处理任务的总时延；in,

Indicates the maximum latency of local computing and edge computing, />

Indicates the maximum delay of local computing and edge computing plus additional migration delay, E _n represents the energy consumption of local computing and edge computing, f _m→n represents the computing resources allocated by edge server m to user n, />

Indicates the probability of event occurrence, and T _mn indicates the total delay of edge processing tasks;

利用开销期望来衡量系统的性能，则移动用户设备总开销的期望的计算包括，Using overhead expectations to measure the performance of the system, the calculation of the expected total overhead of mobile user equipment includes,

其中，

表示时延、能耗以及计算资源的平均开销。in,

S2：基于模型计算的移动边缘计算卸载和资源分配的系统开销，设定优化问题的目标函数和约束条件。需要说明的是：S2: System overhead of mobile edge computing offloading and resource allocation based on model computing, setting the objective function and constraints of the optimization problem. It should be noted:

优化问题的目标函数的计算包括，The computation of the objective function for an optimization problem involves,

其中，γ_n表示任务卸载比例，I(x_n)表示指示函数，x_n表示用户的初始位置索引；Among them, γ _n represents the task unloading ratio, I(x _n ) represents the indicator function, and x _n represents the user’s initial position index;

应说明的，若x_n>0，则I(x_n)＝1；It should be noted that if x _n >0, then I(x _n )=1;

进一步的，优化问题的约束条件的设定包括，Further, the setting of the constraint conditions of the optimization problem includes,

S3：将优化问题建模成马尔科夫决策过程，并设置深度强化学习中的状态空间、动作空间和奖励函数。需要说明的是：S3: Model the optimization problem as a Markov decision process, and set the state space, action space and reward function in deep reinforcement learning. It should be noted:

①设置移动用户设备n的状态空间s_n为，① Set the state space s _n of mobile user equipment n as,

其中，A_n表示任务信息，h_nm表示信道增益，

表示用户移动性属性，

表示边缘服务器的剩余资源；Among them, A _n represents the task information, h _nm represents the channel gain,

represents the user mobility attribute,

Indicates the remaining resources of the edge server;

应说明的，所有移动用户设备的状态空间表示为s＝{s₁,s₂,...,s_N}；It should be noted that the state space of all mobile user equipments is expressed as s={s ₁ , s ₂ ,...,s _N };

②设置移动用户设备n的动作空间a_n为，② Set the action space a _n of mobile user equipment n as,

a_n＝{γ_n，f_m→n}a _n ＝{γ _n ，f _m→n }

其中，γ_n表示要优化的卸载策略，f_m→n表示要优化的资源分配策略；Among them, γ _n represents the unloading strategy to be optimized, and f _m→n represents the resource allocation strategy to be optimized;

应说明的，所有移动用户设备的动作空间表示为a＝{a₁,a₂,...,a_N}；It should be noted that the action space of all mobile user equipments is expressed as a={a ₁ ,a ₂ ,...,a _N };

③设置移动用户设备n的奖励函数r_n为，③ Set the reward function r _n of mobile user equipment n as,

其中，

表示用户完全在本地处理任务的开销，P_n表示任务超时惩罚因子；in,

Indicates the overhead of the user completely processing the task locally, and P _n indicates the task timeout penalty factor;

应说明的，整个系统即所有用户的奖励函数表示为r＝Σ_n∈Nr_n。It should be noted that the reward function of the entire system, that is, all users, is expressed as r=Σ _n∈N r _n .

S4：采用基于多智能体深度强化学习的方法为各移动用户寻找最优卸载策略和资源分配策略，并对目标函数进行优化，同时采用NoisyNet方法将高斯噪声添加到Actor网络的输出层，提高网络模型探索效率，提升优化效果。S4: Use the method based on multi-agent deep reinforcement learning to find the optimal unloading strategy and resource allocation strategy for each mobile user, and optimize the objective function. At the same time, use the NoisyNet method to add Gaussian noise to the output layer of the Actor network to improve the network performance. Improve the efficiency of model exploration and improve the optimization effect.

需要说明的是：It should be noted:

采用基于多智能体深度强化学习的移动边缘计算卸载和资源分配方法对优化问题进行求解的步骤包括，The steps of solving the optimization problem using the mobile edge computing offloading and resource allocation method based on multi-agent deep reinforcement learning include,

①将每个移动用户设备看成一个智能体，每个智能体包括Actor网络和Critic网络，两者均包含在线神经网络和目标神经网络，采用相似的网络结构，包括输入层、两个FC层、两个BN层和输出层，两个FC层的神经元个数分别设置为500和300，BN层用于归一化FC层输出，防止FC层输出进入激活函数饱和区间，中间层采用Relu激活函数，Actor网络的输出层采用Tanh进行激活输出，Critic网络的输出层采用Relu激活输出，设置Actor网络的学习率为0.0004，Critic网络的学习率为0.004，网络采用软更新的方式进行每步更新，软更新系数τ设置为0.01；① Treat each mobile user device as an agent, each agent includes actor network and critic network, both of which include online neural network and target neural network, adopt similar network structure, including input layer, two FC layers , two BN layers and an output layer, the number of neurons of the two FC layers is set to 500 and 300 respectively, the BN layer is used to normalize the output of the FC layer, to prevent the output of the FC layer from entering the saturation range of the activation function, and the middle layer uses Relu Activation function, the output layer of the Actor network uses Tanh to activate the output, the output layer of the Critic network uses Relu to activate the output, set the learning rate of the Actor network to 0.0004, and the learning rate of the Critic network to 0.004, and the network adopts a soft update method for each step Update, the soft update coefficient τ is set to 0.01;

②采用确定性策略梯度法获得每个移动用户的计算卸载策略和资源分配策略，利用中心化训练分布式执行的思想，每个移动用户的Critic网络收集所有移动用户的状态观测信息和动作信息进行训练，Actor网络分别为每个移动用户生成动作策略，以适应复杂的动态环境；②Using the deterministic policy gradient method to obtain the computing offloading strategy and resource allocation strategy of each mobile user, using the idea of centralized training and distributed execution, the Critic network of each mobile user collects the state observation information and action information of all mobile users Training, the Actor network generates action strategies for each mobile user to adapt to complex dynamic environments;

③采用NoisyNet的方法给Actor网络的最后一层即输出层引入高斯噪声变量ε，使输出层变为噪声层，通过网络进行学习生成权重和偏差，由此提高模型探索效率；③Use the NoisyNet method to introduce Gaussian noise variable ε into the last layer of the Actor network, that is, the output layer, so that the output layer becomes a noise layer, and learn to generate weights and deviations through the network, thereby improving the efficiency of model exploration;

应说明的，权重和偏差的生成的计算包括，It should be noted that the calculations for the generation of weights and biases include,

ω＝μ^ω+σ^ω⊙ε^ω ω＝μ ^ω +σ ^ω ⊙ε ^ω

b＝μ^b+σ^b⊙ε^b b＝μ ^b +σ ^b ⊙ε ^b

其中，ω表示噪声层生成的权重，b表示噪声层生成的偏差，μ^ω、σ^ω、μ^b、σ^b表示噪声层需要学习的参数，跟随Actor网络的其他训练参数一起在训练中调整大小，ε^ω、ε^b表示高斯噪声变量；Among them, ω represents the weight generated by the noise layer, b represents the deviation generated by the noise layer, μ ^ω , σ ^ω , μ ^b , σ ^b represent the parameters that the noise layer needs to learn, and adjust the size during training along with other training parameters of the Actor network , ε ^ω , ε ^b represent Gaussian noise variables;

④引入存储所有智能体状态、动作、奖励和下一状态的经验池，经验池大小设置为64，当需要进行网络训练时，从经验池中随机抽取小批量训练数据进行训练，由此来减少样本间的依赖性和相关性；④ Introduce an experience pool that stores all agent states, actions, rewards, and next states. The size of the experience pool is set to 64. When network training is required, a small batch of training data is randomly selected from the experience pool for training, thereby reducing Dependencies and correlations between samples;

⑤利用梯度下降法和梯度上升法更新Critic网络和Actor网络参数，并进行迭代优化，使Critic网络的输出不断接近真实开销值，Actor网络输出不断优化的卸载和资源分配策略，从而降低系统开销；⑤ Utilize the gradient descent method and the gradient ascent method to update the critic network and actor network parameters, and perform iterative optimization, so that the output of the critic network is constantly approaching the real cost value, and the actor network outputs continuously optimized unloading and resource allocation strategies, thereby reducing system overhead;

应说明的，迭代优化的计算包括，It should be noted that the calculation of iterative optimization includes,

其中，

表示Critic在线网络的损失函数，/>

表示Actor在线网络的策略梯度，I表示样本数量，/>

表示第n个智能体在线网络的评价值，/>

表示第n个智能体目标网络输出的评价值，/>

表示N个智能体的状态集合，/>

表示N个智能体的下一状态集合，/>

表示第n个智能体的动作，/>

表示第n个智能体下一状态输出的动作，/>

表示Critic在线网络的参数，/>

表示Critic目标网络的参数，/>

表示智能体开销相关的奖励，γ表示折扣因子，/>

表示Actor在线网络的输出的动作，/>

表示Actor在线网络中输出层的噪声，/>

表示Actor目标网络的参数，/>

表示Actor在线网络的参数。in,

Represents the loss function of the Critic online network, />

Indicates the evaluation value of the nth agent online network, />

Indicates the evaluation value output by the nth agent target network, />

Represents the state set of N agents, />

Indicates the next state set of N agents, />

Indicates the action of the nth agent, />

Indicates the action of the next state output of the nth agent, />

Indicates the parameters of the Critic online network, />

Indicates the parameters of the Critic target network, />

Indicates the action of the output of the Actor's online network, />

Indicates the noise of the output layer in the Actor online network, />

Indicates the parameters of the Actor's target network, />

Represents the parameters of the Actor's online network.

应说明的，本发明提出了一种基于多智能体深度强化学习的计算卸载和资源分配方法，该方法采用集中式训练，分布式执行的技巧，将每个用户看成一个智能体，每个智能体可以利用彼此的状态观测信息训练Critic和Actor网络并分别做出决策，形成一种智能体之间的合作关系，相较传统的强化学习方法更适用于动态复杂环境，可以根据用户的移动性为用户灵活制定卸载决策和资源分配决策，降低系统的总开销；并且本发明采用NoisyNet在Actor网络的输出层引入参数化噪声，进一步提高了复杂场景中探索计算卸载和资源分配策略的能力和效率；此外，在同样场景下相较于DDPG，本发明公开的Noise-MADDPG方法在训练稳定性、收敛速度、优化效果均有所提高，具有很高的使用价值和推广价值。It should be noted that the present invention proposes a computing offloading and resource allocation method based on multi-agent deep reinforcement learning. This method adopts the technique of centralized training and distributed execution, and regards each user as an agent. Agents can use each other's state observation information to train Critic and Actor networks and make decisions separately, forming a cooperative relationship between agents. Compared with traditional reinforcement learning methods, it is more suitable for dynamic and complex environments. It can flexibly make unloading decisions and resource allocation decisions for users, reducing the overall overhead of the system; and the present invention uses NoisyNet to introduce parameterized noise in the output layer of the Actor network, further improving the ability and ability to explore computing offloading and resource allocation strategies in complex scenarios. Efficiency; In addition, compared with DDPG in the same scene, the Noise-MADDPG method disclosed in the present invention has improved training stability, convergence speed, and optimization effect, and has high use value and promotion value.

实施例2Example 2

参照图4～5为本发明的第二个实施例，该实施例不同于第一个实施例的是，提供了基于多智能体强化学习的边缘计算卸载和资源分配方法的验证测试，为对本方法中采用的技术效果加以验证说明。4-5 is the second embodiment of the present invention. This embodiment is different from the first embodiment in that it provides a verification test of edge computing offloading and resource allocation methods based on multi-agent reinforcement learning. The technical effect adopted in the method is verified and explained.

本实施例基于Python 3.7编程语言和PyTorch 1.13.1深度强化学习框架实现，使用Pycharm作为IDE；This embodiment is implemented based on the Python 3.7 programming language and the PyTorch 1.13.1 deep reinforcement learning framework, using Pycharm as the IDE;

设置实验仿真关键参数：用户数N＝15，边缘服务器数M＝5，任务数据量L_n的范围是[0.3,0.5]Mhz，计算任务所需的CPU循环数X_n的范围是[900,1100]MCycles，任务最大允许时延

信道带宽B＝1Mhz，信道增益h_nm的范围是[0.1,0.9]，用户设备计算能力f_n的范围是[800,900]Mhz，发射功率p_n的范围是[20,30]W，平均驻留时间β_n是服从均值为0.4，方差为1.2的高斯分布；基站总计算能力/>

的范围是[14000,16000]Mhz，时延系数ω₁＝0.8，能耗系数ω₂＝0.2，资源成本系数ω₃＝0.8。Set the key parameters of the experimental simulation: the number of users N=15, the number of edge servers M=5, the range of task data L _n is [0.3,0.5]Mhz, the range of CPU cycles X _n required for the calculation task is [900, 1100] MCycles, the maximum allowable delay of the task

Channel bandwidth B=1Mhz, the range of channel gain h _nm is [0.1,0.9], the range of user equipment computing capability f _n is [800,900]Mhz, the range of transmit power p _n is [20,30]W, the average resident Time β _n follows a Gaussian distribution with a mean of 0.4 and a variance of 1.2; the total computing power of the base station/>

The range is [14000, 16000]Mhz, the delay coefficient ω ₁ =0.8, the energy consumption coefficient ω ₂ =0.2, and the resource cost coefficient ω ₃ =0.8.

如图4所示，在本实施例设置的参数下，随着训练轮数的增加，本发明提供的方法采用的Noise-MADDPG方法达到收敛状态，与DDPG方法相比，训练效果有明显提升。As shown in Figure 4, under the parameters set in this embodiment, with the increase of the number of training rounds, the Noise-MADDPG method adopted by the method provided by the present invention reaches a state of convergence, and compared with the DDPG method, the training effect is significantly improved.

改变用户数N，得到系统总开销与用户数的关系图如图5所示，分别取用户数N＝10,15,20,25，进一步验证本发明提供的方法相较于DDPG方法的优越性以及相较于全本地计算和全卸载计算的有效性。Change the number of users N to obtain the relationship diagram between the total system overhead and the number of users as shown in Figure 5, respectively take the number of users N=10, 15, 20, 25, and further verify the superiority of the method provided by the present invention compared to the DDPG method And compared to the effectiveness of all local computing and full offload computing.

因此，在同样场景下相较于DDPG方法，本发明公开的Noise-MADDPG方法在训练稳定性、收敛速度、优化效果均有所提高，具有很高的使用价值和推广价值。Therefore, compared with the DDPG method in the same scene, the Noise-MADDPG method disclosed in the present invention has improved training stability, convergence speed, and optimization effect, and has high use value and promotion value.

应说明的是，以上实施例仅用以说明本发明的技术方案而非限制，尽管参照较佳实施例对本发明进行了详细说明，本领域的普通技术人员应当理解，可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明技术方案的精神和范围，其均应涵盖在本发明的权利要求范围当中。It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention without limitation, although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be carried out Modifications or equivalent replacements without departing from the spirit and scope of the technical solution of the present invention shall be covered by the claims of the present invention.

Claims

1. The edge computing unloading and resource allocation method based on multi-agent reinforcement learning is characterized by comprising the following steps:

constructing a mobile edge computing unloading and resource allocation model according to complex scenes of multiple mobile users and a multi-edge server, and setting parameters of the model;

calculating the system overhead of unloading and resource allocation based on the mobile edge calculated by the model, and setting an objective function and constraint conditions of the optimization problem;

modeling the optimization problem into a Markov decision process, and setting a state space, an action space and a reward function in deep reinforcement learning;

and searching an optimal unloading strategy and a resource allocation strategy for each mobile user by adopting a multi-agent deep reinforcement learning-based method, optimizing the objective function, and simultaneously adding Gaussian noise to an output layer of an Actor network by adopting a NoisyNet method, so that the network model exploration efficiency is improved, and the optimization effect is improved.

2. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 1, wherein: the mobile edge computing unloading and resource allocation model comprises a system model, a task model, a mobility model and a computing model;

the system model comprises M edge servers and N mobile user equipment, wherein the edge servers are deployed beside wireless access points, each wireless access point independently covers a cell, the mobile user equipment can offload calculation tasks to the edge servers of the cell through the wireless access point of the cell to request calculation resources, and the wireless access points are connected and transmit data through a base station;

the task model comprises that each mobile user equipment randomly generates a computing task at each moment, and the generated computing task attribute uses a triplet A _n Representation, i.e.

wherein ,L_n Representing the data volume of a task, X _n Represents the number of CPU cycles required for the calculation task, < >>

Representing a maximum allowable delay required to complete a task;

the mobility model comprises modeling user mobility by adopting discrete random hops, and representing the intensity between hops by using average residence time;

probability density function of the average residence time

The calculation of (c) includes the steps of,

wherein ,β_n Representing the average residence time of the mobile user n,

indicating the actual residence time of the user.

3. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 2, wherein: also included is a method of manufacturing a semiconductor device,

the calculation model comprises the total cost of the mobile user equipment under different unloading decisions and resource allocation strategies, wherein the total cost comprises time delay, energy consumption and resource cost;

the mobile user equipment overhead Q _n The calculation of (c) includes the steps of,

wherein ,

indicating the total overhead of the mobile user equipment when no task migration has occurred,/->

Representing the total overhead, ω, of the mobile user equipment when task migration occurs ₁ Representing the delay coefficient, omega ₂ Representing the energy consumption coefficient omega ₃ Representing resource cost coefficients, < >>

Maximum delay representing local computation and edge computation,/-for>

Representing the maximum delay of local computation and edge computation plus additional migration delay, E _n Representing the energy consumption of local computation and edge computation, f _m→n Representing the computing resources allocated by edge server m to user n,

representing the probability of occurrence of an event, T _mn Representing total latency of edge processing tasks；

The performance of the system is measured by using the overhead expectations, the calculation of which includes,

wherein ,

representing the average overhead of latency, energy consumption, and computing resources.

4. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 3, wherein: the calculation of the objective function of the optimization problem includes,

wherein ,γ_n Representing task offload ratio, I (x _n ) Representing an indication function, x _n Representing the initial position index of the user.

5. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 4, wherein: the setting of constraints for the optimization problem includes,

wherein ,U_m Representing a set of user equipments having access to a wireless access point on the side of an edge server,

representing the total amount of computing resources per edge server, T _n Representing the total delay of task processing, +.>

Representing maximum allowable time delay of a task, C1 representing a constraint of a user initial position range, C2 representing a constraint of a task offloading ratio, C3 representing a constraint of ensuring that computing resources allocated to a mobile user by an edge server are non-negative, C4 representing a constraint of ensuring that a sum of computing resources allocated to each task does not exceed all computing resources of the edge server, and C5 representing a maximum allowable time delay of a prescribed task.

6. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 5, wherein: the setting of the state space comprises that,

the state space s of the mobile user equipment n _n It is indicated that the number of the elements is,

wherein ,A_n Representing task information, h _nm Indicating the gain of the channel,

representing user mobility attributes, ++>

Representing the remaining resources of the edge server.

7. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 6, wherein: the setting of the action space includes,

the action space a of the mobile user equipment n _n It is indicated that the number of the elements is,

a _n ＝{γ _n ，f _m→n }

wherein ,γ_n Representing an offloading policy to be optimized, f _m→n Representing the resource allocation policy to be optimized.

8. The multi-agent reinforcement learning based edge computing offload and resource allocation method of claim 7, wherein: the setting of the bonus function includes,

the reward function r of the mobile user equipment n _n It is indicated that the number of the elements is,

wherein ,

representing the overhead of a user processing tasks entirely locally, P _n Representing a task timeout penalty factor.

9. The multi-agent reinforcement learning-based edge computing offloading and resource allocation method of claim 8, wherein: the step of solving the optimization problem by adopting the multi-agent deep reinforcement learning-based mobile edge computing unloading and resource allocation method comprises,

each mobile user equipment is regarded as an agent, each agent comprises an Actor network and a Critic network, the learning rate of the Actor network is set to be 0.0004, the learning rate of the Critic network is set to be 0.004, the network adopts a soft update mode to update each step, and the soft update coefficient tau is set to be 0.01;

acquiring a calculation unloading strategy and a resource allocation strategy of each mobile user by adopting a deterministic strategy gradient method, and collecting state observation information and action information of all mobile users by using a Critic network of each mobile user to train by utilizing a centralized training distributed execution idea, wherein an Actor network respectively generates the action strategy for each mobile user so as to adapt to a complex dynamic environment;

introducing Gaussian noise variable epsilon to the last layer of an Actor network, namely an output layer, by adopting a NoisyNet method, so that the output layer is changed into a noise layer, and learning through the network to generate weight and deviation, thereby improving the model exploration efficiency;

the calculation of the generation of the weights and deviations includes,

ω＝μ ^ω +σ ^ω ⊙ε ^ω

b＝μ ^b +σ ^b ⊙ε ^b

wherein ω represents the weight of noise layer generation, b represents the deviation of noise layer generation, μ ^ω 、σ ^ω 、μ ^b 、σ ^b Parameters epsilon representing the noise floor to be learned ^ω 、ε ^b Representing gaussian noise variation;

introducing an experience pool for storing all the states, actions, rewards and next states of the intelligent agents, wherein the size of the experience pool is set to 64, and when network training is required, small batches of training data are randomly extracted from the experience pool for training, so that dependence and correlation among samples are reduced;

and updating Critic network and Actor network parameters by using a gradient descent method and a gradient ascent method, and performing iterative optimization to ensure that the output of the Critic network is continuously close to a real overhead value, and the Actor network outputs continuously optimized unloading and resource allocation strategies, thereby reducing the system overhead.

10. The multi-agent reinforcement learning based edge computing offloading and resource allocation method of claim 9, wherein: the calculation of the iterative optimization includes,