CN114339879A - A service migration method based on reinforcement learning in mobile edge computing - Google Patents
- Publication number
- CN114339879A
- Authority
- CN
- China
- Prior art keywords
- user
- consumption
- migration
- state
- link
- Prior art date
- 2021-12-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Mobile Radio Communication Systems (AREA)
Abstract
Description
Technical Field

The present invention belongs to the field of mobile edge computing, and in particular relates to a service migration method based on reinforcement learning in mobile edge computing.
Background Art

Mobile edge computing provides better service to users by pushing computing resources down to nodes closer to the user, thereby improving the system's quality of service (QoS) and the user's quality of experience (QoE). The emergence of mobile edge computing has driven the development of the Internet of Things (IoT), 5G, and personalized operator services, and it is widely applied in fields such as augmented reality (AR), video optimization and acceleration, video stream analysis, the IoT, and the Internet of Vehicles. The central problem in mobile edge computing is task offloading, which comprises three aspects: the offloading decision, resource allocation, and mobility management. The mobility management problem arises from user mobility, and an effective way to address it is service migration. By moving a service running on a server far from the user to a server closer to the user, service migration greatly reduces latency and provides better service to the user.

Existing service migration strategies are mainly realized through Markov decision processes, time-window techniques, and prediction techniques. Because service migration is a problem caused by user mobility, it usually involves long-term optimization, so the service migration problem can be modeled and solved as a sequential decision problem; the Markov decision process, as the classical formalization of sequential decision problems, can be used to study service migration. Time-window and prediction techniques can predict future energy consumption well and thereby find an optimal service placement strategy, so they can also be used to solve the service migration problem.

Although service migration has been studied extensively with the above methods, studies based on Markov decision processes often give insufficient consideration to environmental factors, rarely account for users' actual movement characteristics, and seldom address how the migration is carried out once the migration decision has been made. It is therefore crucial to make migration decisions that take multiple environmental factors into account and to choose an appropriate migration path.
A prior art search found application publication No. CN110347495A, a task migration method for mobile edge computing using deep reinforcement learning: the parameters of the system model are first set, the decision formulas of reinforcement learning are then described, and a task migration algorithm is given based on those formulas. The method yields an efficient task migration mechanism, which improves the real-time performance of the system, makes full use of computing resources, and reduces energy consumption. It also applies deep reinforcement learning to task scheduling, i.e., deciding whether to migrate a computing task, and in particular uses a Markov decision process to give a near-optimal solution in a very short time with strong real-time performance; the method is suited to deciding whether to switch the serving base station when the user is moving at high speed. That patent solves the task migration problem in mobile edge computing with a deep reinforcement learning algorithm, but because of the uncertainty of the movement state it often cannot cover all of the user's movement trajectories. The present invention predicts the user's next movement direction from the previous one to build a user mobility model, and combines it with the migrate-or-not decision to build a state transition matrix that covers all possible user movement states, so that a migration decision problem closer to real scenarios can be solved; in addition, the present invention also uses a reinforcement learning algorithm to solve the migration path selection problem.

Application publication No. CN110830560A describes a multi-user mobile edge computing migration method based on reinforcement learning, comprising the following steps: the mobile device first determines states such as the current workload arrival rate, renewable energy, and battery level; it then accesses the action-state value matrix and, following an ε-greedy policy, decides how much of the task to process locally and takes the corresponding action; it next computes a reward reflecting how good the current action is and uses it to update the action-state value matrix; finally it computes the total cost of the mobile device (including delay cost and computation cost). That invention applies reinforcement learning to mobile edge computing, one of the key 5G technologies, and exploits the model-free nature of Q-learning to formulate a task allocation strategy that significantly reduces the cost of mobile devices. It solves a multi-user task migration problem through reinforcement learning, mainly to optimize the long-term system cost during task offloading. This differs from the present invention, which solves the problem of making migration decisions for different locations and different movement directions and, in addition, solves the migration path selection problem.
Summary of the Invention

The present invention aims to solve the service migration problem in existing mobile edge computing. It proposes a migration decision-making model that jointly considers the influence of multiple environmental factors and mobility prediction and solves it with a value iteration algorithm, and it uses a reinforcement learning algorithm to select an adaptive migration path in a dynamic network environment. The technical scheme of the present invention is as follows:

A service migration method based on reinforcement learning in mobile edge computing, comprising the following steps:

S1, constructing a reward function from the location of the server hosting the user's service, the user's current location, and the load of the server currently processing the task;

S2, constructing a state transition matrix from the user's current location, the previous movement direction, and whether migration occurs;

S3, using a value iteration algorithm, based on the reward function and the state transition matrix, to make migration decisions;

S4, assigning a link cost by normalizing the delay consumption and network consumption between routers;

S5, based on the normalized link cost, using a reinforcement learning algorithm to select the path and adaptively updating the link selection to adapt to link changes in the dynamic network.
Further, constructing the reward function in S1 from the location of the server hosting the user's service, the user's current location, and the load of the user's current server specifically comprises:

(1) constructing a user service satisfaction function from the distance d_t between the user and the server processing the task and the load h_t of that server;

(2) constructing a migration cost function from the distance d_t between the user and the server processing the task;

(3) using the weighted sum of the service satisfaction function and the migration cost function as the reward function.
Further, in (1) the user satisfaction c_1(s_t, a_t) is constructed from the distance between the user and the server processing the task and the load of that server, with the specific formula:

$$c_1(s_t, a_t) = D - \mu_1 d_t - \mu_2 h_t$$

where D is the maximum service satisfaction the user can obtain, d_t is the user's distance from the server processing the task at time t, h_t is the load of that server at time t, and μ_1 and μ_2 are proportional coefficients expressing how strongly distance and load affect the user's service satisfaction; d_t is obtained as the Euclidean distance between the user's current position l_t = (x_t, y_t) and the position l_s = (x_s, y_s) of the server processing the task;
(2) the migration cost function c_2(s_t, a_t) is constructed from the distance d_t between the user and the server processing the task:

$$c_2(s_t, a_t) = \mu_3 + \mu_4 d_t$$

where the migration cost is a linear function of the distance d_t, μ_3 is a constant cost, and μ_4 is the coefficient of the distance term;
(3) the weighted sum of the user service satisfaction function and the migration cost function is used as the reward function r(s, a):

$$r(s_t, a_t) = \begin{cases} c_1(s_t, a_t) - a_t\, c_2(s_t, a_t), & d_t \le d_{max} \\ -M, & d_t > d_{max} \end{cases}$$

where a is the migration decision, a = 0 meaning no migration and a = 1 meaning migration; d_max is the maximum distance allowed for processing the service, beyond which a large penalty M is incurred.
Further, constructing the state transition matrix in S2 from the user's current location, the previous movement direction, and whether migration occurs mainly comprises:

(1) converting the user's location into a position relative to the server and recording the user's movement direction at the previous moment;

(2) different movement directions affect the user's subsequent trajectory; in the user mobility model the user keeps the same direction with a large probability and changes direction with a small probability;

(3) determining the user's state at the next moment based on the user's mobility model and the migrate-or-not decision.
Further, in (1) the user's movement direction z_t at the previous moment is recorded, and the user's current state is represented by the current position l_t together with the previous movement direction z_t, i.e., s_t = (x_t, y_t, z_t);

(2) the different movement directions z_t affect the user's subsequent trajectory: in the next time slot the user keeps the movement direction z_t unchanged with a large probability p and moves to the corresponding next position, while with the remaining small probability the user changes to one of the other movement directions and moves to the corresponding position;

(3) based on the user's mobility model and the migration decision, the state transition probability P(s'|s, a) is determined, in which, after a migration (a = 1), the user and the server processing the task are at the same location, and the user keeps the same movement direction with probability p; when no migration is performed (a = 0), the user likewise keeps the same movement direction with probability p.
Further, using the value iteration algorithm in S3 to make migration decisions from the reward function and the state transition matrix mainly comprises:

(1) randomly initializing the state value function v(s) of the user for the different positions and movement directions;

(2) based on the Bellman optimality equation, using the state value function values of the previous iteration to update those of the next iteration, with the specific formula:

$$v_{k+1}(s) = \max_{a}\Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s') \Big]$$

where v_{k+1}(s) is the state value function of state s at iteration k+1, r(s, a) is the reward obtained by taking action a in state s, P(s'|s, a) is the probability of reaching state s' by taking action a in state s, γ is the discount factor, and v_k(s') is the state value function of state s' at iteration k;

(3) repeating step (2) until the state value functions for all positions and movement directions converge.
Further, the method in S4 of assigning the link cost c by normalizing the delay consumption t and the network consumption p between routers comprises the steps of:

(1) recording the delay consumption t and the network consumption p required for transmission on each link;

(2) normalizing the two quantities and assigning the link cost c as their weighted sum:

$$\bar{t}_i = \frac{t_i - t_{min}}{t_{max} - t_{min}}, \qquad \bar{p}_i = \frac{p_i - p_{min}}{p_{max} - p_{min}}, \qquad c_i = \omega_t \bar{t}_i + \omega_p \bar{p}_i$$

where t_i and p_i are the delay consumption and network consumption of link i, t_min and t_max are the minimum and maximum link delay consumption, p_min and p_max are the minimum and maximum link network consumption, and ω_t and ω_p are the weighting coefficients of delay consumption and network consumption, respectively.
Further, the method in S5 of selecting the migration path using Sarsa reinforcement learning mainly comprises:

(1) randomly initializing the information of the links connected to each router, including the delay consumption t and the network consumption p;

(2) transmitting the data from the original server to the target server along a randomly selected path;

(3) recording the delay consumption t and the network consumption p incurred during the data transmission, normalizing them, and weighting them to obtain the corresponding link cost c;

(4) each router selecting the link for data transmission according to the ε-greedy policy, recording the link cost of transmitting over that link to the next router, and updating its state-action Q-value table according to this data transmission;

(5) as data continue to be transmitted, each router repeating step (4), dynamically updating its own Q-value table and selecting a better path.
Further, the ε-greedy action selection rule and the state-action Q-value table update rule are, respectively:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{m}, & a = a^{*} \\ \dfrac{\varepsilon}{m}, & a \ne a^{*} \end{cases}$$

$$Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma Q(S', A') - Q(S, A)\big)$$

where π(a|s) is the probability of selecting action a in state s, a* is the action that maximizes the Q value in the current state s, and m is the number of available actions; Q(S, A) is the state-action value of selecting each action in each state, α is the learning rate, γ is the discount factor, and Q(S', A') is the state-action value of the next state and action.
The advantages and beneficial effects of the present invention are as follows:

1. In the mobile edge environment, the present invention jointly considers the delay factor that users care about and the migration cost factor that service providers care about, builds a mobility model and a state transition matrix that better match real scenarios on the basis of mobility prediction, and finally uses the value iteration algorithm to obtain migration decisions for different positions and different movement directions, indicating when a migration maximizes the user's benefit. Unlike other service migration strategies that always make the same migration decision at the same position, the migration decisions finally obtained by the present invention can differ at the same position depending on the user's previous movement direction, which better matches real scenarios.

2. Once a service in the mobile edge environment has been determined to need migration, the present invention jointly considers the interests of users and service providers, assigns link costs based on delay and network consumption, and uses a reinforcement learning algorithm to solve the adaptive migration path selection problem in a dynamic network environment. The migration path finally obtained by the present invention is updated dynamically in real time as the network link state changes and can provide a better migration path for the service.
Brief Description of the Drawings

Fig. 1 shows the value-iteration-based migration decision-making algorithm of a preferred embodiment of the present invention;

Fig. 2 shows the Sarsa-based dynamic path selection algorithm;

Fig. 3 is a flowchart of the service migration method based on reinforcement learning in mobile edge computing.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical solution of the present invention for solving the above technical problems is as follows:
As shown in Fig. 3, the present invention discloses a service migration method based on reinforcement learning in mobile edge computing, comprising the following steps:

S1, constructing the reward function r(s, a) from the location l_s of the server hosting the user's service, the user's current location l_t, and the load h_t of the server processing the task. The reward function represents the benefit the user obtains each time a migration decision is made and serves as a balanced expression of user service experience and migration cost;

S2, constructing the state transition matrix P from the user's current location l_t, the previous movement direction z_t, and whether migration occurs. The state transition matrix represents the state change produced after each migration decision, including changes of position and of movement direction;

S3, based on the reward function r(s, a) and the state transition matrix P, using the value iteration algorithm to solve the migration decision-making problem, thereby determining whether the user's service should be migrated for each position and movement direction;

S4, based on the delay consumption t and the network consumption p between routers, normalizing the two and assigning the link cost c;

S5, based on the normalized link cost c, using the Sarsa reinforcement learning algorithm to select the migration path and adaptively updating the link selection to adapt to link changes in the dynamic network.
In this embodiment, the method of constructing the reward function r(s, a) in step S1 from the location l_s of the server hosting the user's service, the user's current location l_t, and the load h_t of the server processing the task comprises the steps of:

(1) constructing the user satisfaction function c_1(s_t, a_t) from the distance d_t between the user and the server processing the task and the load h_t of that server:

$$c_1(s_t, a_t) = D - \mu_1 d_t - \mu_2 h_t$$

where D is the maximum service satisfaction the user can obtain, d_t is the user's distance from the server processing the task at time t, h_t is the load of that server at time t, and μ_1 and μ_2 are proportional coefficients expressing how strongly distance and load affect the user's service satisfaction; d_t is obtained as the Euclidean distance between the user's current position l_t = (x_t, y_t) and the position l_s = (x_s, y_s) of the server processing the task.

(2) constructing the migration cost function c_2(s_t, a_t) from the distance d_t between the user and the server processing the task:

$$c_2(s_t, a_t) = \mu_3 + \mu_4 d_t$$

where the migration cost is a linear function of the distance d_t, μ_3 is a constant cost, and μ_4 is the coefficient of the distance term.

(3) using the weighted sum of the user service satisfaction function and the migration cost function as the reward function r(s, a):

$$r(s_t, a_t) = \begin{cases} c_1(s_t, a_t) - a_t\, c_2(s_t, a_t), & d_t \le d_{max} \\ -M, & d_t > d_{max} \end{cases}$$

where a is the migration decision, a = 0 meaning no migration and a = 1 meaning migration; d_max is the maximum distance allowed for processing the service, beyond which a large penalty M is incurred.
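By way of illustration only, the reward construction above can be written as a short Python sketch. The numeric values of D, μ_1–μ_4, d_max, and M below are hypothetical placeholders and are not prescribed by this embodiment:

```python
import math

# Hypothetical coefficients chosen only for illustration
D, MU1, MU2 = 10.0, 0.5, 0.2   # max satisfaction, distance weight, load weight
MU3, MU4 = 1.0, 0.3            # constant and distance-dependent migration cost
D_MAX, M = 8.0, 100.0          # allowed service distance and out-of-range penalty

def distance(user_pos, server_pos):
    """Euclidean distance between the user position l_t and the server position l_s."""
    return math.dist(user_pos, server_pos)

def satisfaction(d_t, h_t):
    """c1(s_t, a_t) = D - mu1 * d_t - mu2 * h_t."""
    return D - MU1 * d_t - MU2 * h_t

def migration_cost(d_t):
    """c2(s_t, a_t) = mu3 + mu4 * d_t."""
    return MU3 + MU4 * d_t

def reward(user_pos, server_pos, h_t, a):
    """Reward r(s, a): satisfaction minus the migration cost when migrating (a = 1),
    with a large penalty -M once the user is farther than d_max from the server."""
    d_t = distance(user_pos, server_pos)
    if d_t > D_MAX:
        return -M
    return satisfaction(d_t, h_t) - a * migration_cost(d_t)
```

A call such as reward((0, 0), (3, 4), h_t=2.0, a=1) then returns the one-step reward for migrating when the user is five distance units from the serving node.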
In this embodiment, the method of constructing the state transition matrix in step S2 from the user's current location l_t, the previous movement direction z_t, and whether migration occurs comprises the steps of:

(1) recording the user's movement direction z_t at the previous moment and representing the user's current state by the current position l_t together with the previous movement direction z_t, i.e., s_t = (x_t, y_t, z_t).

(2) the different movement directions z_t affect the user's subsequent trajectory: in the next time slot the user keeps the movement direction z_t unchanged with a large probability p and moves to the corresponding next position, while with the remaining small probability the user changes to one of the other movement directions and moves to the corresponding position.

(3) based on the user's mobility model and the migration decision, determining the user's state transition probability P(s'|s, a), in which, after a migration (a = 1), the user and the server processing the task are at the same location, and the user keeps the same movement direction with probability p; when no migration is performed (a = 0), the user likewise keeps the same movement direction with probability p.
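As a sketch only, the mobility model can be sampled as follows in Python; the assumptions that the user moves on a grid with four candidate directions, that the remaining probability is split evenly between the two turning directions, and the value p = 0.7 are illustrative choices not fixed by this embodiment:

```python
import random

P_KEEP = 0.7  # hypothetical probability p of keeping the previous direction

# Illustrative four-direction grid model: unit moves and the two possible turns
DIRECTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
TURNS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def next_state(state, server_pos, a):
    """Sample the next user state s' = (x, y, z) given the migration decision a.
    A migration (a = 1) relocates the serving node to the user's current position."""
    (x, y), z = state
    if a == 1:
        server_pos = (x, y)                 # service co-located with the user
    if random.random() < P_KEEP:
        new_z = z                           # keep the previous direction with probability p
    else:
        new_z = random.choice(TURNS[z])     # otherwise turn to one of the other directions
    dx, dy = DIRECTIONS[new_z]
    return ((x + dx, y + dy), new_z), server_pos
```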
In this embodiment, the method of making migration decisions in step S3 from the reward function r(s, a) and the state transition matrix P using the value iteration algorithm comprises the steps of:

(1) randomly initializing the state value function v(s) of the user for the different positions and movement directions;

(2) based on the Bellman optimality equation, using the state value function values of the previous iteration to update those of the next iteration, with the specific formula:

$$v_{k+1}(s) = \max_{a}\Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s') \Big]$$

where v_{k+1}(s) is the state value function of state s at iteration k+1, r(s, a) is the reward obtained by taking action a in state s, P(s'|s, a) is the probability of reaching state s' by taking action a in state s, γ is the discount factor, and v_k(s') is the state value function of state s' at iteration k;

(3) repeating step (2) until the state value functions for all positions and movement directions converge.
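For illustration, a generic tabular value iteration over an enumerated state space can be sketched as below; the discount factor γ = 0.9 and the convergence threshold are hypothetical values, and reward_fn / trans_fn stand for the reward and transition probabilities defined above:

```python
def value_iteration(states, actions, reward_fn, trans_fn, gamma=0.9, tol=1e-6):
    """v_{k+1}(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) * v_k(s') ].
    trans_fn(s, a) returns a dict {s': probability}; reward_fn(s, a) returns r(s, a)."""
    v = {s: 0.0 for s in states}              # arbitrary initial state values
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward_fn(s, a)
                + gamma * sum(p * v[s2] for s2, p in trans_fn(s, a).items())
                for a in actions
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

def migration_policy(states, actions, reward_fn, trans_fn, v, gamma=0.9):
    """Extract, for every (position, direction) state, the decision a in {0, 1}
    that maximizes the one-step lookahead under the converged values v."""
    return {
        s: max(actions,
               key=lambda a: reward_fn(s, a)
               + gamma * sum(p * v[s2] for s2, p in trans_fn(s, a).items()))
        for s in states
    }
```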
In this embodiment, the method of assigning the link cost c in step S4 by normalizing the delay consumption t and the network consumption p between routers comprises the steps of:

(1) recording the delay consumption t and the network consumption p required for transmission on each link;

(2) normalizing the two quantities and assigning the link cost c as their weighted sum:

$$\bar{t}_i = \frac{t_i - t_{min}}{t_{max} - t_{min}}, \qquad \bar{p}_i = \frac{p_i - p_{min}}{p_{max} - p_{min}}, \qquad c_i = \omega_t \bar{t}_i + \omega_p \bar{p}_i$$

where t_i and p_i are the delay consumption and network consumption of link i, t_min and t_max are the minimum and maximum link delay consumption, p_min and p_max are the minimum and maximum link network consumption, and ω_t and ω_p are the weighting coefficients of delay consumption and network consumption, respectively.
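A minimal sketch of this min-max normalization and weighting is given below; the equal weights ω_t = ω_p = 0.5 are only an assumed example:

```python
def link_costs(delays, traffic, w_t=0.5, w_p=0.5):
    """Min-max normalize the per-link delay t_i and network consumption p_i,
    then combine them into c_i = w_t * t_norm + w_p * p_norm."""
    t_min, t_max = min(delays), max(delays)
    p_min, p_max = min(traffic), max(traffic)

    def norm(x, lo, hi):
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return [w_t * norm(t, t_min, t_max) + w_p * norm(p, p_min, p_max)
            for t, p in zip(delays, traffic)]
```

For example, link_costs([2, 5, 9], [30, 10, 20]) yields one cost per link in [0, 1], so links with both low delay and low network consumption receive the smallest c_i.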
In this embodiment, the method in step S5 of using the Sarsa reinforcement learning algorithm, based on the normalized link cost, to select the path and adaptively update the link selection to adapt to link changes in the dynamic network comprises the steps of:

(1) randomly initializing the information of the links connected to each router, including the delay consumption t and the network consumption p;

(2) transmitting the data from the original server to the target server along a randomly selected path;

(3) recording the delay consumption t and the network consumption p incurred during the data transmission, normalizing them, and weighting them to obtain the corresponding link cost c;

(4) each router selecting the link for data transmission according to the ε-greedy policy, recording the link cost of transmitting over that link to the next router, and updating its state-action Q-value table according to this data transmission, where the ε-greedy action selection rule and the state-action Q-value table update rule are, respectively:

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{m}, & a = a^{*} \\ \dfrac{\varepsilon}{m}, & a \ne a^{*} \end{cases}$$

$$Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma Q(S', A') - Q(S, A)\big)$$

where π(a|s) is the probability of selecting action a in state s, a* is the action that maximizes the Q value in the current state s, and m is the number of available actions; Q(S, A) is the state-action value of selecting each action in each state, α is the learning rate, γ is the discount factor, and Q(S', A') is the state-action value of the next state and action.
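The path selection loop can be sketched, for illustration, as the tabular Sarsa routine below. The representation of the network as a dictionary mapping each router to its neighboring routers, the use of the negative link cost as the reward R, and the hyperparameter values (α, γ, ε, number of episodes) are assumptions made only for this example:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps pick a random outgoing link, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_path_selection(graph, source, target, link_cost, episodes=500,
                         alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Sarsa over a routing graph: states are routers, actions are the
    neighboring routers, and the per-step reward is the negative link cost c.
    Assumes the target router is reachable from every router in the graph."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = source
        a = epsilon_greedy(Q, s, graph[s], eps)
        while s != target:
            s2 = a                          # taking link a moves the data to router a
            r = -link_cost(s, s2)           # smaller link cost -> larger reward
            if s2 == target:
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a2 = epsilon_greedy(Q, s2, graph[s2], eps)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```

Because the Q-value table keeps being updated as traffic flows, re-running the ε-greedy selection on the current table naturally adapts the chosen path when link delays or network consumption change.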
The present invention jointly considers multiple environmental factors when making migration decisions and selecting migration paths. Compared with existing service migration methods, it has the following main advantages: (1) it comprehensively considers multiple environmental factors, introducing the server load as a factor affecting the user's service experience and introducing the user's previous movement direction as a predictor that influences the user's subsequent movement, which better matches real scenarios; (2) it jointly considers the delay factor that users care about and the network consumption factor that service providers care about, and uses reinforcement learning to solve for an adaptive service migration path in a dynamic network environment.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The above embodiments should be understood as serving only to illustrate the present invention and not to limit its scope of protection. After reading the contents of the present description, a person skilled in the art can make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.
Claims (9)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111492744.6A | 2021-12-08 | 2021-12-08 | A service migration method based on reinforcement learning in mobile edge computing |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114339879A | 2022-04-12 |
| CN114339879B | 2025-06-24 |
Family
ID=81050194
Country Status (1)

| Country | Link |
|---|---|
| CN | CN114339879B (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2025-04-27 | TA01 | Transfer of patent application right | Applicant after: Jiangxi Hualianyuan Universe Digital Technology Co.,Ltd., Room 003, 13th Floor, Building 2, No. 9 Tianyou Avenue, High-speed Rail Economic Experimental Zone, Shangrao City, Jiangxi Province, 334000, China. Applicant before: Chongqing University of Posts and Telecommunications, Chongwen Road, Nanshan Street, Nanan District, Chongqing, 400065, China. |
| | GR01 | Patent grant | |